10 - Deep Learning

Today we want to talk about deep reinforcement learning. Before we do that, we want to make clear what reinforcement learning actually is. To outline what reinforcement learning is, we start with sequential decision making, then move towards reinforcement learning, and towards the end of this lecture we will talk about deep reinforcement learning. So there's quite a bit of theory to cover, and I think this will be a very interesting lecture.

So let's talk a little about sequential decision making.

So imagine you have a situation in which you can choose between several actions, and each of them returns a certain reward. You take an action a at a given time t from a set A of possible actions, so there is only a limited number of possible actions. Let's say you have one-armed bandits, four of them like in the picture here, and you can select one of these four bandits. You want to maximize your reward: your action would be to pull the lever, and then you win something or you don't. That win would be the reward. Every action has a different but unknown probability density function for generating a certain reward at time t.
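As a rough illustration of this setup (my own sketch, not from the lecture; the Gaussian reward model and the specific arm means are made-up assumptions), a set of bandits with hidden reward distributions could look like this:

```python
import numpy as np

class Bandit:
    """One slot machine arm with an unknown reward distribution.

    The Gaussian reward model and its parameters are illustrative
    assumptions, not taken from the lecture slides.
    """

    def __init__(self, mean, std=1.0, rng=None):
        self.mean = mean          # hidden from the player
        self.std = std
        self.rng = rng or np.random.default_rng()

    def pull(self):
        # The reward r_t is drawn from this arm's unknown distribution.
        return self.rng.normal(self.mean, self.std)

# Four one-armed bandits, as in the picture on the slide.
rng = np.random.default_rng(0)
bandits = [Bandit(mean, rng=rng) for mean in (0.2, 0.5, 0.1, 0.8)]
reward = bandits[3].pull()  # action: pull the lever of bandit 3
```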

So the action generates a reward, and in order to figure out which action you should choose, we can formalize choosing the action as a probabilistic process as well, where we have a policy. The policy is essentially a distribution over the actions, and if we follow a certain policy, it gives us hints on how to select a specific action. So the policy is used to select an action, every action then generates a certain reward, and the idea is that we want to maximize the expected reward over time.
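To make "the policy is a distribution over the actions" concrete, here is a minimal sketch (again my own illustration; the fixed probabilities are invented, whereas a learned policy would adapt them over time):

```python
import numpy as np

rng = np.random.default_rng(0)

# Policy pi: a probability distribution over the four actions.
# These numbers are illustrative assumptions, not from the slides.
pi = np.array([0.1, 0.2, 0.3, 0.4])

# Selecting an action means sampling from the policy.
action = rng.choice(len(pi), p=pi)
```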

So there will be some reward, and we want to maximize this reward over time. This is not directly supervised learning, because we don't have immediate feedback on which action to choose: a single action by itself is not necessarily the optimal one, since what we want to determine is a sequence of actions that is optimal. So we don't know this expected value in advance; this is the main problem that we have here. But we can, for example, form a one-hot encoded vector R which reflects which action from A caused a specific reward.

So with that I can then define a function Q, and the function Q is simply the accumulated reward over time. Q describes the reward over time for selecting certain actions, and it accumulates that reward over all past points in time. So Q_t, the function at a given time point t, is called the action value function, and it changes with every new piece of information: every time I take an action, I get a different reward, and the function effectively memorizes all of the previous rewards. So essentially we have to know all of these rewards, or keep track of all of them, in order to compute this action value function.
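The formula itself is on the slides rather than in the transcript, but the action value function described here is commonly written as a sample average of the rewards seen so far. A sketch, where r_i(a) denotes the entry of the one-hot-style reward vector credited to action a at step i (normalization conventions, by t or by the number of times a was selected, may differ from the slides):

```latex
% Action value function as the running mean of past rewards;
% r_i(a) is the reward credited to action a at step i
% (zero when a different action was taken).
Q_t(a) = \frac{1}{t} \sum_{i=1}^{t} r_i(a)
```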

Now this is quite a problem, because then we would have to store the whole sequence of rewards. But we can reformulate this. You can see that we can pull the last element of the sequence out of the sum; then you get this formulation here. Then you can introduce a factor of (t-1)/(t-1), which is just a factor of one. And if you look at this very closely, you can see that what appears in there is nothing else but Q_t of a.
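Spelled out, assuming the sample-average definition sketched above, the reformulation goes like this (with this indexing the term that reappears is the previous estimate Q_{t-1}(a), which is presumably what "Q_t of a" refers to with the slides' numbering):

```latex
\begin{align}
Q_t(a) &= \frac{1}{t}\sum_{i=1}^{t} r_i(a)
        = \frac{1}{t}\Bigl(r_t(a) + \frac{t-1}{t-1}\sum_{i=1}^{t-1} r_i(a)\Bigr)\\
       &= \frac{1}{t}\bigl(r_t(a) + (t-1)\,Q_{t-1}(a)\bigr)
        = Q_{t-1}(a) + \frac{1}{t}\bigl(r_t(a) - Q_{t-1}(a)\bigr)
\end{align}
```

The practical payoff of this recursion is that only the previous estimate and the newest reward are needed, not the whole reward history.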
