Welcome back to deep learning. So today we want to discuss the basics of reinforcement learning.
We will look into how we can teach a system to play different games, and we will start with a first introduction to sequential decision making.
So here I have a couple of slides for you. So you see that the topic is reinforcement learning and we want to go ahead and talk about sequential decision making.
Later in this course we will also talk about reinforcement learning in all its details, and we will also look into deep reinforcement learning.
But today we will only look into sequential decision making.
Okay, sequential decision making. Well, we want to play a couple of games, and the simplest game that you can potentially think of is one where you just pull a couple of levers.
And if you try to formalize this then you end up in the so-called multi-armed bandit problem.
So let's do a couple of definitions. We need some actions and we formalize this as choosing an action A at time t from a set of actions capital A.
So this is a discrete set of possible actions that we can take. And if we choose an action, then this has some implications.
And if you choose that action A_t, then you will generate some reward R_t. But the relation between the action and the reward is probabilistic.
This means that there is an unknown probability distribution, possibly different for each action, that describes the actual relation between the action and the reward.
So if you think of your multi-armed bandit, you have a couple of slot machines, you pull one of the levers and then this generates some reward.
Maybe all of these slot machines are the same, but probably they are not.
So each arm that you could potentially pull has a different probability of generating some reward R t.
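To make this concrete, here is a minimal sketch of such a bandit in Python; the class name and the particular payout probabilities are made up for illustration and are not part of the lecture:

```python
import numpy as np

class MultiArmedBandit:
    """Toy k-armed bandit: each arm pays reward 1 with its own hidden probability."""

    def __init__(self, reward_probs):
        self.reward_probs = np.asarray(reward_probs)  # one unknown probability per arm

    def pull(self, action):
        """Pull lever `action` at time t and return a stochastic reward R_t."""
        return float(np.random.rand() < self.reward_probs[action])

# Three slot machines that look identical but pay out differently.
bandit = MultiArmedBandit([0.2, 0.5, 0.3])
reward = bandit.pull(action=1)
```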
Now you want to be able to pick an action and in order to do so we define a so-called policy.
So the policy is a way of formalizing how to choose an action, and the policy is essentially also a probability distribution that describes the likelihood of choosing each action.
The policy is essentially the way in which we want to influence the game. So the policy is something that lies in our hands, and we are free to define it.
And of course we want to make this policy optimal with respect to playing the game.
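As a small sketch of this idea (the probability values here are hypothetical), a policy over a discrete action set can simply be a probability vector that we sample from:

```python
import numpy as np

# A hypothetical policy over three actions: the probability of choosing each one.
policy = np.array([0.1, 0.7, 0.2])   # must sum to 1

# Choosing an action means sampling from this distribution.
action = np.random.choice(len(policy), p=policy)
```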
So what's the key element? What do we want to achieve? We want to achieve the maximum reward.
In particular, we do not just want the maximum reward in every single time step of the game; instead we want to maximize the expected reward over time.
So we produce an estimate of the reward that is going to be generated, and we compute a kind of mean value over it, because this allows us to estimate which actions produce what kind of rewards if we play this game over a long time.
So this is a difference to supervised learning, because here nobody tells us "do this action" or "do that action"; instead our training algorithm has to determine which actions to choose.
And obviously we can make mistakes; the aim is to choose the actions that will, over time, produce the maximum expected reward.
So it is not so important if we lose in one step, as long as on average we can still generate a high reward.
So the problem here is of course that the expected value of our reward is not known in advance.
So this is the actual problem of reinforcement learning: we have to estimate this expected reward and the associated probabilities.
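Written in standard bandit notation (which the lecture does not spell out explicitly at this point), the unknown quantity and the goal read roughly:

```latex
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]
\quad \text{(unknown)},
\qquad
\text{goal:} \quad \max_{\pi} \; \mathbb{E}_{A_t \sim \pi}\left[ R_t \right]
```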
So what we can do is form R as a one-hot encoded vector which reflects which action of A has actually caused the reward.
If we do so, we can estimate the expected reward of each action online using averaging. We introduce this as the function Q(a), the so-called action value function, which changes with every new piece of information that we observe.
So how can we do this? Well, there is an incremental way of computing our Q_t(a), and we can show this very easily.
We define Q_{t+1}(a) as the average over the time steps: we sum up the rewards obtained up to time t and, of course, divide by t.
Now we can show that this can be split up: we take the last element, R_t, out of the sum and let the remaining sum run only from one to t minus one.
If we do so, we can also introduce a factor of t minus one, because if we multiply by t minus one and divide by t minus one at the same time, the two cancel out to one.
So this is a perfectly valid step. Then we see that the right-hand part is essentially nothing other than Q_t(a), the previous estimate.
This is something that we have already determined, so we can rearrange a little bit, and in the last line you can see that the update is essentially Q_t(a) plus one over t times R_t minus Q_t(a).
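Written out, the derivation described above reads as follows (writing R_i for the reward obtained at step i):

```latex
Q_{t+1}(a)
  = \frac{1}{t}\sum_{i=1}^{t} R_i
  = \frac{1}{t}\Bigl( R_t + \sum_{i=1}^{t-1} R_i \Bigr)
  = \frac{1}{t}\Bigl( R_t + (t-1)\,\frac{1}{t-1}\sum_{i=1}^{t-1} R_i \Bigr)
  = \frac{1}{t}\bigl( R_t + (t-1)\, Q_t(a) \bigr)
  = Q_t(a) + \frac{1}{t}\bigl( R_t - Q_t(a) \bigr)
```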
So we have an incremental formulation of the update of our action value function, and this is very useful because we don't have to store all of the rewards; we can update the estimate iteratively.
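As a small illustration, here is how this incremental update could look in code; the array names Q and N are my own and not from the slides:

```python
import numpy as np

n_actions = 3
Q = np.zeros(n_actions)   # running action-value estimates Q_t(a)
N = np.zeros(n_actions)   # how often each action has been taken so far

def update(action, reward):
    """Incremental sample average: Q <- Q + (1/N) * (R - Q); no reward history needed."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```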
Now, if we go ahead, we can think about all of the different steps that we have to take in order to train such a system, and we end up with something that is called the exploration-exploitation dilemma.
So of course we want to maximize our reward, and we try to choose our policy such that we generate the maximum reward according to our action value function.
And this would work if you already have a good plan, so if you already know how things work.
So let's say you're looking for a pizza recipe and you have already seen a couple of pizzas that you like. Then you can produce the same pizzas again; you know the recipe is good, so you will get your reward.
And this would be deterministic if you use the so-called greedy action selection, where you always choose the action with the maximum value.
In other words, you always produce the same action given the same input. This leads to a problem: we also need to sample different actions, because while we are still exploring we don't yet know what the best configuration is.
So if we only follow the greedy action selection then we will never make any new observations. We will always produce the same pizza.
And maybe there is a much better pizza, a pizza that we like much better. We can only find this out if we explore.
So every now and then we have to try a new recipe, because otherwise we don't know how else we could mix the ingredients and possibly find a better pizza than the one we already had.
And this means that on occasion we have to make moves that will not produce the maximum reward. But from not getting a good reward we can at least learn that this particular move was not a very good one.
So when you train such a system, in the beginning you want to focus more on exploration, figuring out which moves are good and which are not so good. Later you can exploit more and more, focusing on strategies that have worked in certain situations.
So how can we do that? Well, a very simple form of sampling actions is the uniform random distribution, where you don't need any knowledge about the system.
You just pick actions randomly with equal probability: every action is equally likely, and the probability of each action is one over the cardinality of the set of possible actions.
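A tiny sketch of this selection rule, assuming three possible actions (the function name is my own):

```python
import numpy as np

n_actions = 3

def uniform_random_action():
    """Pick each of the |A| actions with probability 1/|A|, ignoring all estimates."""
    return np.random.randint(n_actions)
```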
Well, this might be good for exploration, but you will probably make a lot of mistakes. So there is something that is a little better: the so-called epsilon-greedy approach.
Here you choose actions according to a probability distribution in which, with probability 1 minus epsilon, you take the greedy action. So let's say epsilon is 0.1; then with 90% probability we choose the action that maximizes our action value function, and with the remaining probability epsilon we pick an action at random.
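A minimal sketch of epsilon-greedy selection, reusing the running estimates Q from the incremental update above and epsilon = 0.1 as in the example:

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon=0.1):
    """With probability 1 - epsilon exploit the current best estimate, otherwise explore."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: greedy action
```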