Welcome back to deep learning, and today we want to discuss a couple of reinforcement learning approaches other than the policy iteration concept that you've seen in the previous video.
So let's have a look at what I've got for you today.
So we have other solution methods.
You've seen that the policy and value iteration methods we discussed earlier require updating the policy during learning to obtain better approximations of our optimal state value function.
So these are called on-policy algorithms because you need a policy and this policy is being
updated.
Additionally, we assume that the state transition and the reward are known.
So the probability density functions that produce the new states and the new reward are known.
If they are not, then you can't apply the previous concept.
So this is very important.
And of course, there are methods where you can then relax this.
So these methods mostly differ in how they perform the policy evaluation.
So let's look at a couple of those alternatives.
The first one that I want to show you is based on Monte Carlo techniques.
This applies only to episodic tasks. And here the idea is an off-policy method. So you learn the optimal state value function by following an arbitrary policy. It doesn't matter which policy you're using; it could even be multiple policies.
Of course, you still have the exploration-exploitation dilemma.
So you want to choose policies that really visit all of the states.
But you don't need information about the dynamics of the environment because you can simply
sample.
So you can run many of the episodic tasks and you try to reach all of the possible states.
If you do so, then you can generate those episodes using some policy.
And then you loop backwards over one episode and accumulate the expected future reward, because you have played the game until the end and know all of the rewards that have been obtained along the way. If a state has not been visited before in this episode, you append the accumulated return to a list for that state. And essentially, you use this list then to compute the update for the state value function. So you see, this is simply the average over the returns in the list for that specific state.
And this will allow you to update your state value.
And this way, you can then iterate in order to achieve the optimal state value function.
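To make this concrete, here is a minimal sketch of first-visit Monte Carlo evaluation in Python. The episode format, a list of (state, reward) pairs produced by running some policy until termination, as well as the function name and the discount factor gamma, are illustrative assumptions and not taken from the lecture.

    from collections import defaultdict

    def first_visit_mc(episodes, gamma=0.9):
        """Estimate state values from complete episodes (first-visit Monte Carlo)."""
        returns = defaultdict(list)   # observed returns per state
        V = defaultdict(float)        # state value estimates
        for episode in episodes:      # each episode: [(state, reward), ...] (assumed format)
            G = 0.0
            # loop backwards over the episode and accumulate the discounted return
            for t in range(len(episode) - 1, -1, -1):
                state, reward = episode[t]
                G = reward + gamma * G
                # first-visit check: record G only if the state does not occur earlier
                if all(s != state for s, _ in episode[:t]):
                    returns[state].append(G)
                    V[state] = sum(returns[state]) / len(returns[state])
        return V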
Now another concept is temporal difference learning.
This is an on-policy method.
But again, it does not need information about the dynamics of the environment.
So here the scheme is that you loop and follow a certain policy.
Then you take an action from the policy and observe the reward and the new state. And then you update your state value function using the previous state value function and some alpha that weighs the influence of the new observations: the new reward, plus the discounted old value estimate of the new state, minus the value of the old state.
So this way, you can generate updates.
And this actually converges to the optimal solution.
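As a rough sketch, the tabular TD(0) update described above could be written as follows. The env.reset()/env.step() interface and the policy function are assumptions made here for illustration; the essential line is the update V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    from collections import defaultdict

    def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
        """Tabular TD(0) evaluation; env and policy are assumed interfaces."""
        V = defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)                       # follow the given policy
                next_state, reward, done = env.step(action)  # observe reward and new state
                # TD(0): move V(state) towards the bootstrapped target
                V[state] += alpha * (reward + gamma * V[next_state] - V[state])
                state = next_state
        return V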
And a variant of this actually estimates the action-value function. This is then known as SARSA.
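For comparison, a sketch of the SARSA variant under the same assumed environment interface might look like this; here the table Q is indexed by state-action pairs, and the update uses the next action that the policy actually selects.

    from collections import defaultdict

    def sarsa(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
        """Tabular SARSA; env and the (Q-dependent) policy are assumed interfaces."""
        Q = defaultdict(float)        # indexed by (state, action) pairs
        for _ in range(num_episodes):
            state = env.reset()
            action = policy(Q, state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = policy(Q, next_state)
                # SARSA update uses the action the policy will actually take next
                Q[state, action] += alpha * (
                    reward + gamma * Q[next_state, next_action] - Q[state, action]
                )
                state, action = next_state, next_action
        return Q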