The following content has been provided by the University of Erlangen-Nürnberg.
Okay, hello, good evening. Welcome again. We are currently discussing reinforcement
learning. That means learning a strategy for behaving in some environment, even if very rare
rewards are the only feedback you get. So there is no teacher telling you what to do.
And so I want to start today's lecture by reminding you of what we did last time,
and then we will switch to a very, very simple example, which I think is perhaps the simplest
possible example of reinforcement learning. Okay, let us briefly go through the setting again.
The setting is shown here. We have some agent, which in the end will be our computer program,
but as a physical object it could be a robot or a car. And it is called an agent because
it can do things, even if it maybe only moves around. And whenever it does so,
it changes the overall state of its environment. And at the same time, it's also able to observe
the environment. So there are actions of the agent affecting the state of the environment,
and the state of the environment being observed by the agent. The challenge is, as I said,
there is no teacher, so we don't do supervised learning because maybe we don't even know ourselves
what would be the correct action. If you think of a complicated game, maybe you don't know how to
play this game optimally. So what you do instead is use reinforcement learning. That is, you train,
for example, a network to produce actions, to tell us which actions to take based on these
rare rewards. And this is the typical scenario that you should imagine to have something concrete
in mind. You have some player, some character that moves around, and which sometimes collects
a reward by doing certain things. In this very simple example, it does so just by passing across any
of these green points, which represent some treasures. And at the end of the game, maybe after 100 time
steps, you are told the reward. And now you have to do something based on this reward. You have to
decide how to change your actions in the future. And the simplest way to do this is called policy
gradient, or REINFORCE. That's already a pretty old method. It's the simplest model-free, general
reinforcement learning technique. And the basic idea is surprisingly simple. The idea is this: first of all,
it's a probabilistic model. So you have to decide on the probabilities of taking various actions,
and then you just throw dice. And then you do the action, and then again and again. So it's a
probabilistic model for your agent. And the purpose of a neural network in this context would be to
take the current state of the environment and to predict all these probabilities for the actions.
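To have something to picture, here is a minimal sketch of such a policy network, assuming the state is a flattened grid map and there are exactly four actions; the sizes, the names, and the simple linear-plus-softmax parameterization are my own illustrative choices, not something fixed in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 100    # assumed: a 10x10 map flattened into a vector
NUM_ACTIONS = 4    # assumed: up, down, left, right

# theta: the parameters of the policy (here just one linear layer plus softmax)
theta = rng.normal(scale=0.01, size=(STATE_DIM, NUM_ACTIONS))

def policy(state, theta):
    """Return pi_theta(a | s): one probability per action, summing to one."""
    logits = state @ theta
    logits = logits - logits.max()      # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

def sample_action(state, theta):
    """'Throw the dice': draw one action according to the current policy."""
    probs = policy(state, theta)
    action = rng.choice(NUM_ACTIONS, p=probs)
    return action, probs
```

So the network, here just a single matrix theta, turns the current state into four probabilities, and the agent then samples its next move from them.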
So now then, how do you teach this network? Well, you take the reward in the end, and if it turns
out to be high, if it has been a good trajectory, so to speak, then you try to make all the actions
in this trajectory more likely in the future. You increase the probabilities for all the actions
that were actually taken. This will, of course, also include some stupid actions. But since such
actions occur more often in the trajectories that have a lower reward, where they are reinforced less,
overall you still do the right training this way. Okay, so to make it more concrete, what you are given is a state S,
which could be this entire picture, this entire map. What you want to know is the probabilities
for various actions A, which in that case could be the four directions in which you could move.
And for each of these four directions, you want to calculate the probability of going in that
direction. So this pi_theta(a | s) is called the policy. It's a probability distribution over
the actions given the state. And the theta here just represents generally the parameters of the
network, for example, that calculates this policy. Or maybe you don't even have a neural network,
you parameterize it in a different way. But these are the things that you want to change,
that you want to update during training. Okay. And then we said after you take the action,
you land in a new state. Maybe you just moved up one step. Possibly the transition to this new
state even contains some probabilistic element, because maybe the environment itself is a little
bit noisy. Or it could be purely deterministic. So that doesn't really play a role for the concept.
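At this point one can already sketch, in the same hedged style as above, the training idea described a moment ago: roll out one trajectory with the current policy, and then make every action that was actually taken more likely, scaled by the reward the trajectory received. The env_step function and the initial state are stand-ins for whatever environment you have, and the gradient line is specific to the simple linear softmax policy from the sketch above.

```python
def run_episode(theta, env_step, initial_state, max_steps=100):
    """Roll out one trajectory and return its states, actions, and final reward."""
    states, actions = [], []
    state, reward = initial_state, 0.0
    for _ in range(max_steps):
        action, _ = sample_action(state, theta)
        states.append(state)
        actions.append(action)
        state, reward = env_step(state, action)   # assumed environment interface
    return states, actions, reward

def reinforce_update(theta, states, actions, reward, learning_rate=0.01):
    """Increase the probability of every action taken, in proportion to the reward."""
    grad = np.zeros_like(theta)
    for state, action in zip(states, actions):
        probs = policy(state, theta)
        one_hot = np.zeros(NUM_ACTIONS)
        one_hot[action] = 1.0
        # gradient of log pi_theta(a | s) for the linear softmax policy
        grad += np.outer(state, one_hot - probs)
    return theta + learning_rate * reward * grad
```

Trajectories with a high reward push their actions up strongly, while trajectories with a low reward barely move them, which is exactly the averaging argument from above.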
And then you can calculate the overall probability for a given trajectory tau, which
consists of the whole sequence of states and actions that you take. And of course, you get it by just
multiplying together the probabilities for the various steps: the probability of taking an
action given a state, and the probability for the environment to then make a transition from the old state to the new one.
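Written out, with my own notation for the initial-state distribution p(s_0) and the horizon T, the product being described is:

\[
  P_\theta(\tau) \;=\; p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).
\]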