37 - Recap Clip 7.4: Value/Policy Iteration [ID:30440]

We looked at essentially two kinds of algorithms. One is value iteration, which hinges on the Bellman equation. The Bellman equation is a very simple observation: you can compute the utility of a state by taking the reward given by the state plus the discounted utility of the best action. Remember that actions here are non-deterministic, so the utility of an action is given as a sum over all the possible states we can end up in after taking it. Remember our example: for every action there are up to three states you can actually end up in, so the sum in the Bellman equation in this particularly simple example is a three-way sum.
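Written out, with the usual textbook notation (U for utility, R(s) for the reward of state s, \gamma for the discount factor, and P(s' \mid s, a) for the transition model; these symbols are assumed here, not shown in the clip), the Bellman equation is

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

where the max picks the best action and the sum averages over that action's non-deterministic outcomes, the three-way sum in our example.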

What this Bellman equation gives us is essentially an update function. We know what the rewards are, so we can compute ahead, and the idea behind the value iteration algorithm is simply to apply this update over and over and hope for the best. The algorithm is essentially the math from the last slide written down as a program: apply the update, check some termination criterion, some error bound we want to get below, and then we are done. A very simple idea, and surprisingly it works.
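As a minimal sketch of that loop, assuming a small tabular MDP handed in as plain dictionaries and a function actions(s) listing the available actions (these names, and the textbook termination bound epsilon * (1 - gamma) / gamma, are illustrative assumptions, not taken from the clip):

def value_iteration(states, actions, P, R, gamma=0.9, epsilon=1e-4):
    """P[s][a]: list of (probability, next_state) pairs; R[s]: reward of state s."""
    U = {s: 0.0 for s in states}          # start from all-zero utilities
    while True:
        U_new = {}
        delta = 0.0                       # largest change in this sweep
        for s in states:
            acts = actions(s)
            if acts:
                # utility of the best action: expectation over its outcomes
                best = max(sum(p * U[s2] for p, s2 in P[s][a]) for a in acts)
            else:
                best = 0.0                # terminal state, nothing to do
            # Bellman update: reward plus discounted best-action utility
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon * (1 - gamma) / gamma:   # error bound reached, stop
            return U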

And there is a good reason why it works; it is always the same reason with these iterations: we reach a fixed point because something contracts. In the background there is Banach's fixed point theorem: something gets smaller, and if you iterate a map that contracts, you end up at a unique fixed point. The only trick is to find out what contracts, and that is something we have looked at.
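What contracts is the distance between utility estimates in the maximum norm. Writing B for the Bellman update operator, the standard result (stated here in the textbook formulation, not quoted from the clip) is

\| B U - B U' \|_\infty \le \gamma \, \| U - U' \|_\infty

so for a discount factor \gamma < 1 every update shrinks the distance between any two utility vectors by at least a factor of \gamma, and Banach's theorem then guarantees a unique fixed point that the iteration converges to.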

And then there are, as you can imagine, various variants of that. One of them is policy iteration, which basically says: instead of looking only at the values, we compute policies from given values, and then we iterate on the policies rather than on the values. It is still very similar, still using the Bellman equation, but it converges a little faster in many cases.
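A sketch of policy iteration in the same style, again with assumed names, and with a simplified evaluation step (a fixed number of update sweeps instead of solving the linear system exactly, which is closer to modified policy iteration), just to show the alternation between evaluating a policy and improving it:

def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=30):
    """P[s][a]: list of (probability, next_state) pairs; R[s]: reward of state s."""
    pi = {s: actions(s)[0] for s in states if actions(s)}   # arbitrary initial policy
    U = {s: 0.0 for s in states}

    def q(s, a):
        # expected utility of taking action a in state s under the current U
        return sum(p * U[s2] for p, s2 in P[s][a])

    while True:
        # policy evaluation: a few sweeps of the Bellman update for the fixed policy pi
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * (q(s, pi[s]) if s in pi else 0.0) for s in states}
        # policy improvement: make pi greedy with respect to the evaluated utilities
        changed = False
        for s in pi:
            best = max(actions(s), key=lambda a: q(s, a))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:        # policy is stable, so we are done
            return pi, U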

Part of chapter: Recaps
Access: Open access
Duration: 00:04:42 min
Recording date: 2021-03-30
Uploaded: 2021-03-31 11:07:53
Language: en-US

Recap: Value/Policy Iteration
Main video on the topic in chapter 7, clip 4.
