The following content has been provided by the University of Erlangen-Nürnberg.
OK, hello, good evening everyone. So today I wanted to finish what we said about reinforcement
learning. In particular, I want to remind you of Q-learning and then show you one example
where people have applied Q-learning. And then I want to switch to something different,
which is the connection between these neural networks and spin models and physics and how
that connection can be exploited in something that is called a restricted Boltzmann machine.
So there you will at least be reminded of your statistical physics lectures or maybe
you will for the first time learn about the Boltzmann weights.
Okay, but now first let's discuss Q-learning. I mentioned that the idea is that for each state our world, or our player, can be in, there's a function Q that tells us how good the different actions we could take are. And then of course
the optimal strategy, if you know the right Q, is just to take the best action, that is
the action with the highest Q. And the task of course that makes everything difficult
is how to learn this unknown function Q. Here's an example. If our player sits at this spot
and its goal is to collect as many of these green dots as possible, then probably the
Q function that tells us how good it is to move in the different directions would be
maximal for moving upward, because that's the easiest way to collect another green dot. And then we
said how in principle you would define such a Q function. You would define it as the expected
future reward given the current state and given the action that you take. This future
reward is either just the sum of all rewards from now until the end of the game, or maybe
you introduce this discount where you say I care most about immediate rewards and not
so much about later rewards, which makes it a little bit easier to train. And then we
found that there is such a thing as a recursive equation that tells us how this Q function
should be defined in principle. So the Q function for this particular step is the reward that
we would get plus the discount factor times the Q function of the next optimal step that
we take one step afterwards. Now this is not yet enough because it's just a recursive definition
and we don't know the right hand side, so we cannot calculate the left hand side. But
we can set up an iterative scheme that should converge eventually to the correct Q function.
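Written out in symbols (my own notation, just sketching what is being described here; the slides may use slightly different conventions), the definition of Q, the recursive relation, and the iterative update read roughly:

```latex
% Q as the expected discounted sum of future rewards (definition)
Q(s_t, a_t) = E\left[ r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \dots \right]

% Recursive relation: reward now plus the discounted Q of the best next step
Q(s_t, a_t) = E\left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right]

% Iterative update with a small learning rate alpha
% (the increment vanishes once Q is already correct)
Q_{\rm new}(s_t, a_t) = Q_{\rm old}(s_t, a_t)
   + \alpha \left[ r_t + \gamma \max_{a'} Q_{\rm old}(s_{t+1}, a') - Q_{\rm old}(s_t, a_t) \right]
```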
And this is the one that is shown here, that we take the old one and add to that a small
increment, where the increment is chosen such that it would be 0 if we are already at the
correct Q function, but if not, then it will take us towards the correct Q function. And
so this is an equation that you can work with. And in principle, if the state space and the
action space are sufficiently small, you could even explicitly have a table of all your Q
function values. For all possible S and all possible A, you just keep a long table in
the memory of your computer, and that's the Q function; you then always update this table according to this equation. In reality, maybe the state space is exponentially
large because the state may be a whole picture that you see on a screen, and then each pixel
can have different values, so the number of possible pictures is exponentially large,
so it is out of the question to try to store the Q function as a big table, because the table
would simply be too large. And then it's much better to represent the Q function by a neural
network. So the input of the neural network would be the current state, like a picture,
and the output would be the Q values for the different actions that you can take, assuming
that there are only very few actions, like going north, south, west, or east, for example.
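Just as an illustration (a minimal sketch of my own, not the network from the lecture; the picture size, number of actions, and layer width are arbitrary assumptions), representing the Q function by a small neural network could look like this: the input is the state, for example a flattened picture, and the output is one Q value per action.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels = 84 * 84    # size of the flattened screen picture (assumed)
n_actions = 4         # e.g. north, south, west, east
n_hidden = 128        # width of the hidden layer (assumed)

# Randomly initialized two-layer network: state -> hidden layer -> Q values.
W1 = rng.normal(0.0, 0.01, (n_hidden, n_pixels))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, (n_actions, n_hidden))
b2 = np.zeros(n_actions)

def q_network(state):
    """Return the vector of Q values, one entry per possible action."""
    hidden = np.tanh(W1 @ state + b1)
    return W2 @ hidden + b2

state = rng.random(n_pixels)             # a made-up "picture" standing in for the state
q_values = q_network(state)
best_action = int(np.argmax(q_values))   # greedy policy: take the action with highest Q
```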
Okay. So here, just to visualize what this means, imagine I'm walking around on a grid.
The red dot is where I want to go, then I will get a reward, and my task is to go to
this red dot in the minimum number of time steps, for example, and then I will get a
high reward. And I'm trying to plot the Q function for the action to go up as a function
of the state S, where the state S is just the lattice site on which I am currently.
And initially, after one update, I would have something like this: the Q function is nonzero only at the state which is immediately below the target state. There, the Q function for going up is very high.
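To make this grid example concrete, here is a minimal tabular Q-learning sketch of my own (the grid size, reward value, and hyperparameters are assumptions, not taken from the lecture): the player walks on a small lattice, gets a reward of 1 when it reaches the target site, and the Q table is updated with the rule from above. After training, the Q value for "up" at the site directly below the target should indeed be large.

```python
import numpy as np

rng = np.random.default_rng(1)

size = 5                                      # 5x5 lattice (assumed size)
target = (0, 2)                               # the "red dot" the player wants to reach
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right (row, column shifts)
alpha, gamma, eps = 0.5, 0.9, 0.1             # learning rate, discount, exploration

Q = np.zeros((size, size, len(actions)))      # one Q value per lattice site and action

for episode in range(500):
    s = (size - 1, int(rng.integers(size)))   # start somewhere on the bottom row
    for step in range(50):
        # epsilon-greedy choice: mostly the best action, sometimes a random one
        a = int(rng.integers(len(actions))) if rng.random() < eps else int(np.argmax(Q[s]))
        ns = (min(max(s[0] + actions[a][0], 0), size - 1),
              min(max(s[1] + actions[a][1], 0), size - 1))
        r = 1.0 if ns == target else 0.0
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s][a])
        s = ns
        if r > 0:                             # episode ends once the target is reached
            break

# Q for "going up" at the site directly below the target should now be close to 1.
print(Q[target[0] + 1, target[1], 0])
```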