
Welcome back to deep learning and today we want to discuss a couple of other reinforcement

learning approaches than the policy iteration concept that you've seen in the previous video.

So let's have a look at what I've got for you today.

So we have other solution methods.

You've seen that the policy and value iteration methods we discussed earlier require the policy to be updated during learning in order to obtain better approximations of our optimal state value function.

So these are called on-policy algorithms because you need a policy and this policy is being

updated.

Additionally, we assume that the state transition and the reward are known.

So the probability density functions that produce the new states and the new reward are known.

If they are not, then you can't apply the previous concept.

So this is very important.

And of course, there are methods where you can then relax this.

So these methods mostly differ in how they perform the policy evaluation.

So let's look at a couple of those alternatives.

The first one that I want to show you is based on Monte Carlo techniques.

This applies only to episodic tasks.

And here the idea is an off-policy method.

So you learn the optimal state value by following an arbitrary policy.

It doesn't matter what policy you're using.

So it's an arbitrary policy, and it could even be multiple policies.

Of course, you still have the exploration-exploitation dilemma.

So you want to choose policies that really visit all of the states.

But you don't need information about the dynamics of the environment because you can simply

sample.

So you can run the episodic task many times and try to reach all of the possible states.

If you do so, then you can generate those episodes using some policy.

And then you loop backwards over one episode and you accumulate the expected future reward

because you have played the game until the end.

Then you can go backwards over this episode and accumulate the different rewards that

have been obtained.

And if a state was not yet visited in this episode, you append its return to a list.

And essentially, you then use this list to compute the update for the state value function.

So you see, this is simply the average over this list for that specific state.

And this will allow you to update your state value.

And this way, you can then iterate in order to achieve the optimal state value function.
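
To make this concrete, here is a minimal sketch of first-visit Monte Carlo state-value estimation in Python; the episode sampler, the policy argument, and all names are illustrative assumptions rather than the lecture's own code.

    from collections import defaultdict

    def mc_state_values(sample_episode, policy, num_episodes=1000, gamma=0.99):
        # First-visit Monte Carlo estimation of the state-value function.
        # sample_episode(policy) is assumed to return one finished episode as a
        # list of (state, reward) pairs, so the task has to be episodic.
        returns = defaultdict(list)   # all returns observed so far, per state
        V = defaultdict(float)        # current state-value estimates

        for _ in range(num_episodes):
            episode = sample_episode(policy)   # play one game to the end
            G = 0.0
            # loop backwards over the episode and accumulate the discounted return
            for t in reversed(range(len(episode))):
                state, reward = episode[t]
                G = reward + gamma * G
                # first-visit check: only record G at the earliest visit of the state
                if state not in [s for s, _ in episode[:t]]:
                    returns[state].append(G)
                    V[state] = sum(returns[state]) / len(returns[state])
        return V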

Now another concept is temporal difference learning.

This is an on-policy method.

But again, it does not need information about the dynamics of the environment.

So here the scheme is that you loop and follow a certain policy.

Then you take an action from the policy and observe the reward and the new state.

And then you update your state value function using the previous state value function plus some α that is used to weigh the influence of the new observation: the new reward, plus the discounted old state value of the new state, minus the old value of the current state.

So this way, you can generate updates.

And this actually converges to the optimal solution.
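
As a rough illustration of this update rule, here is a minimal tabular TD(0) sketch in Python; the environment object with reset() and step(), the policy function, and all names are assumptions for illustration rather than the lecture's own code.

    from collections import defaultdict

    def td0_state_values(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
        # Tabular TD(0): update V(s) from single observed transitions, so no
        # model of the environment dynamics is needed. env is a hypothetical
        # object with reset() and step(action) -> (next_state, reward, done).
        V = defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)                       # follow the policy
                next_state, reward, done = env.step(action)  # observe the outcome
                # move V(s) a little towards the target r + gamma * V(s')
                V[state] += alpha * (reward + gamma * V[next_state] - V[state])
                state = next_state
        return V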

And a variant of this actually estimates the action-value function.

This is then known as SARSA.
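
And as a sketch of this SARSA variant, again under the same assumed environment interface and with a hypothetical ε-greedy behaviour policy derived from Q:

    import random
    from collections import defaultdict

    def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
        # Tabular SARSA: on-policy estimation of the action-value function Q(s, a).
        # env is the same hypothetical interface as above; actions lists all
        # possible actions.
        Q = defaultdict(float)

        def eps_greedy(state):
            # explore with probability eps, otherwise act greedily w.r.t. Q
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        for _ in range(num_episodes):
            state = env.reset()
            action = eps_greedy(state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = eps_greedy(next_state)
                # the target uses the action actually taken next -> on-policy
                Q[(state, action)] += alpha * (
                    reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
                )
                state, action = next_state, next_action
        return Q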

Part of a video series:
Accessible via: Open Access
Duration: 00:07:28 min
Recording date: 2020-10-12
Uploaded on: 2020-10-12 20:56:26
Language: en-US

Deep Learning - Reinforcement Learning Part 4

This video discusses several other solution approaches to learning games, such as Monte Carlo techniques, temporal difference learning, Q-learning, and learning universal function approximators for reinforcement learning using the policy gradient.

For reminders to watch new videos, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
