10 - Machine Learning for Physicists [ID:8163]

The following content has been provided by the University of Erlangen-Nürnberg.

Okay, hello, good evening. Welcome again. We are currently discussing reinforcement learning. That means learning a strategy for how to behave in some environment, even if you only get very rare rewards as the only feedback. So there is no teacher telling you what to do. And so I want to start today's lecture by reminding you of what we did last time, and then we will switch to a very, very simple example, which I think is even the simplest possible example for reinforcement learning. Okay, let us briefly go through the setting again.

The setting is shown here. We have some agent, which in the end will be our computer program, but as a physical object it could be a robot or a car. The agent is called that because it can do some things, even if it maybe only just moves around. And whenever it does so, it changes the overall state of its environment. At the same time, it is also able to observe the environment. So there are actions of the agent affecting the state of the environment, and the state of the environment being observed by the agent. The challenge is, as I said, that there is no teacher, so we don't do supervised learning, because maybe we don't even know ourselves what the correct action would be. If you think of a complicated game, maybe you don't know how to play this game optimally. So what you do then is use reinforcement learning, that is, you train, for example, a network to produce actions, to tell us which actions to take based on these rare rewards.

And this is the typical scenario that you should imagine to have something concrete in mind. You have some player, some character that moves around and sometimes collects a reward by doing certain things. In this very simple example, it does so just by passing across any of these green points that represent some treasures. And at the end of the game, maybe after 100 time steps, you are told the reward. Now you have to do something based on this reward: you have to decide how to change your actions in the future.

The simplest way to do this is called policy gradient, or REINFORCE. That's already pretty old; it's the simplest model-free general reinforcement learning technique. And the basic idea is surprisingly simple. First of all, it's a probabilistic model. So you have to decide on the probabilities of taking the various actions, and then you just throw dice, do the action, and repeat again and again. So it's a probabilistic model for your agent. And the purpose of a neural network in this context would be to take the current state of the environment and to predict all these probabilities for the actions.
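To make this picture a bit more tangible, here is a minimal sketch in Python of what such a policy network and the dice-throwing could look like. This is not the code from the lecture; the network size, the two-component state, and the four actions are assumptions chosen purely for illustration.

```python
# Minimal sketch of a policy network: state in, action probabilities out.
# All names, sizes, and the state encoding are assumptions for illustration.
import numpy as np
from tensorflow import keras

num_actions = 4   # up, down, left, right
state_dim = 2     # e.g. the (x, y) position of the player

policy_net = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(state_dim,)),
    keras.layers.Dense(num_actions, activation="softmax"),  # pi_theta(a|s)
])

state = np.array([[3.0, 5.0]])                    # current state s (batch of size 1)
probs = policy_net(state).numpy()[0].astype("float64")
probs /= probs.sum()                              # guard against float32 rounding
action = np.random.choice(num_actions, p=probs)   # "throw dice": sample one action
```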

So now then, how do you teach this network? Well, you take the reward at the end, and if it turns out to be high, if it has been a good trajectory, so to speak, then you try to make all the actions in this trajectory more likely in the future. You increase the probabilities of all the actions that were actually taken. This will also include some stupid actions, of course. But since those are more likely to occur in the trajectories that have a lower reward, overall you still do the right training in this way.

Okay, so to make it more concrete: what you are given is a state s, which could be this entire picture, this entire map. What you want to know are the probabilities for the various actions a, which in this case could be the four directions in which you could move. For each of these four directions, you want to calculate the probability of going along that direction. So this π_θ(a|s) is called the policy. It's a probability distribution over the actions, given the state. And the θ here just represents, generally, the parameters of the network, for example, that calculates this policy. Or maybe you don't even have a neural network and you parameterize it in a different way. But these are the things that you want to change, that you want to update during training.
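Written as a formula (this is just one common way to parameterize such a policy, assuming the network produces one raw output z_a(s; θ) per action, as in the softmax sketch above; the lecture may set it up differently):

$$
\pi_\theta(a \mid s) \;=\; \frac{e^{\,z_a(s;\theta)}}{\sum_{a'} e^{\,z_{a'}(s;\theta)}},
\qquad
\pi_\theta(a \mid s) \ge 0, \quad \sum_a \pi_\theta(a \mid s) = 1 .
$$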

Okay. And then we said that after you take the action, you land in a new state; maybe you just moved up one step. Possibly the transition to this new state even contains some probabilistic element, because maybe the environment itself is a little bit noisy. Or it could be purely deterministic. That doesn't really play a role for the concept.
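To make the environment side concrete as well, here is an equally hedged sketch of a grid-world transition; the grid size, the move encoding, and the optional noise probability are all assumptions, not the environment from the lecture.

```python
# Illustrative grid-world step: mostly deterministic, optionally a bit noisy.
# Actions 0..3 move up, down, left, right; with probability `noise` a random
# direction is taken instead, which makes the transition probabilistic.
import numpy as np

moves = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}
grid_size = 10

def env_step(state, action, noise=0.0):
    if np.random.rand() < noise:                   # optional stochastic element
        action = np.random.randint(4)
    dx, dy = moves[action]
    x = min(max(state[0] + dx, 0), grid_size - 1)  # clip to stay inside the grid
    y = min(max(state[1] + dy, 0), grid_size - 1)
    return (x, y)

new_state = env_step((3, 5), action=0)                # deterministic move "up"
new_state = env_step(new_state, action=3, noise=0.1)  # slightly noisy move "right"
```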

And then you can calculate the overall probability of a given trajectory τ, which consists of the whole sequence of states and actions that you take. And of course, you get it by just multiplying together the probabilities for the various steps: the probability of taking an action given a state, and the probability for the environment to then make a transition from the old state to the new one.
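Written out in the standard notation, with a trajectory τ = (s_0, a_0, s_1, a_1, …, s_T) and P(s'|s, a) the transition probability of the environment, this is

$$
P_\theta(\tau) \;=\; \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t).
$$

The earlier idea of making high-reward trajectories more likely then corresponds to the textbook REINFORCE estimate of the gradient of the expected reward, stated here only as a pointer, since the derivation presumably follows in the lecture:

$$
\nabla_\theta\, \mathbb{E}[R] \;=\; \mathbb{E}\!\left[\, R \sum_{t=0}^{T-1} \nabla_\theta \ln \pi_\theta(a_t \mid s_t) \right].
$$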
