9 - Machine Learning for Physicists [ID:52683]

Okay, so let's continue with our discussion of reinforcement learning.

So last time we introduced one of the oldest reinforcement learning algorithms, called policy gradient. The name comes from saying: I have a policy, a strategy that I want to optimize, and I do it by gradient descent.

Now today we are going to look at a different branch of reinforcement learning methods and

it's already visible a little bit here.

So this is again the AlphaGo paper, and last time I pointed out: look, there is your policy network. It takes in the image that depicts the state of the game board and outputs probabilities for taking certain actions, that is, for putting stones in different positions.

So that's the kind of policy, represented by a neural network, that we were discussing anyway.

But next to it, in this picture taken from that paper, you see something else, called a value network, and that's the kind of thing that we're going to discuss today.

So the big branch of reinforcement learning that I want to cover today is called Q-learning, where the Q stands for quality, and we will see why it's called that.

So this is an alternative to the policy gradient approach: instead of learning a policy, you learn a function that tells you how good a certain move is, and then of course you always want to take the best move.

So you introduce a quality function that predicts, in simple words, the expected future reward if you are in a given state and take a certain action: if you then run the game many, many times, what is the expected reward?

So that's the Q-function.

We will write down the definition in a moment.

But once you have such a Q-function, it is very easy to implement the best strategy: you always take the action which promises the biggest future reward.
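As a minimal sketch of that greedy choice (my own illustration in Python, not code from the lecture; the Q-values here are just made-up numbers), you simply take the argmax of the Q-values over the possible actions:

    import numpy as np

    def greedy_action(q_values):
        # q_values: one predicted future reward per possible action
        # the greedy strategy picks the action with the largest Q-value
        return int(np.argmax(q_values))

    # toy example with four actions: up, down, left, right
    q_values = np.array([1.2, 0.3, -0.5, 0.8])
    print(greedy_action(q_values))  # -> 0, i.e. "up" promises the biggest reward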

So to make it really clear, again robot and boxes.

The first thing we could ask (this is not yet the Q-function) is: what's the value of a given state?

So here I'm assuming the state is just the location, that's the simplest setting, and we ask what the value of that state is.

And obviously if you're right on the box, you can take the box and so this is a very

valuable state because you get a high reward.

If you're one or two steps away from a box, then this is not quite as valuable because

you still need to move towards the box and maybe you have a certain time limit or maybe

every step is costly, so then it's not quite as good but it's still pretty good.

So you can imagine that, qualitatively, a value function would look like this one, even if I haven't yet defined exactly what a value function is.
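To make this qualitative picture concrete, here is a small toy sketch (my own example, not code from the lecture) in which the value of a grid position is simply the box reward minus an assumed cost per step needed to reach the box:

    import numpy as np

    grid_size = 5
    box = (2, 3)        # hypothetical box position
    box_reward = 10.0   # assumed reward for picking up the box
    step_cost = 1.0     # assumed cost per step

    value = np.empty((grid_size, grid_size))
    for x in range(grid_size):
        for y in range(grid_size):
            steps_to_box = abs(x - box[0]) + abs(y - box[1])
            value[x, y] = box_reward - step_cost * steps_to_box

    print(value)  # highest value right on the box, falling off with distance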

Now you can make your analysis a little bit more fine-grained: instead of asking what's the value of a state, you say, I have a state, I imagine doing a certain action, and then what will happen in the future, how much reward will I get?

And then, if you consider the action of going up, of course the future reward will be best if you happen to be just below the box, because if you then go up you will pick up the box, so that's great.

If you're a little bit further away, it's still not terrible, but you get a lower reward because there are more steps that you need to take.

Now, you can imagine that this kind of plot exists for any action, so if you have four actions, going up, down, left and right, there would be four images of this kind, depicting four different pieces of the quality function.
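In code you could picture this as a table with one entry per position and per action; this is again only a sketch with assumed names, not the lecture's implementation, but it shows why there is one image of the grid per action:

    import numpy as np

    grid_size = 5
    n_actions = 4  # up, down, left, right
    Q = np.zeros((grid_size, grid_size, n_actions))

    # Q[x, y, a] is the expected future reward for being at (x, y) and taking action a;
    # each slice Q[:, :, a] is one of the four "images" of the quality function
    print(Q[:, :, 0].shape)  # (5, 5): the "go up" piece of the Q-function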

So that's the rough idea.

Now let's move on to the definitions.

So again, in words, the quality function is just the expected future reward.
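Written as a formula, this is the standard textbook definition (with a discount factor gamma; the lecture's exact notation may differ slightly): the Q-function is the expected sum of future rewards, given that you start in state s, take action a, and afterwards follow your policy pi,

    Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ a_{0} = a,\ \pi \right].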
