10 - Deep Learning [ID:12610]

Okay, welcome back to deep learning. Happy New Year everybody, it's 2020. We still have quite a few exciting lectures coming up this year, and today we want to start with deep reinforcement learning. So today we finally look into how we can train artificial intelligence systems to play games automatically. This is the idea of deep reinforcement learning.

In order to introduce that, we will first have a look at sequential decision making, because we are no longer facing a single decision: we want to make a sequence of decisions. In the beginning these decisions will be independent, but of course, when you're playing a game, every decision has an effect on future decisions and also on the future reward. Then we will introduce reinforcement learning. This is the part where we really start looking into decisions that depend on each other and how to learn them. For that we will introduce the Markov decision process; Markov decision processes are essentially the key to understanding the concepts of reinforcement learning. Then we talk about policy iteration and other solution methods, i.e. how to actually train those systems. And at the end of the lecture we want to talk about deep reinforcement learning, the deep learning version of reinforcement learning, where we will look at deep Q-learning, AlphaGo, and AlphaGo Zero. So that's the outline of today's talk.

First we start off very basic, very simple, with sequential decision making, and you'll see we'll add on step by step until we can really formulate an entire game as a learning process.

Now, for sequential decision making, we want to start with a very simple game, the so-called multi-armed bandit problem. Here we always have the setup that there is some action: an action a is taken at a time t from a set A. So there is a set of actions you can choose from, meaning there is only a limited number of moves you can make at every time step t, and for every action there is some reward. If we want to keep things simple, we can stay in the multi-armed bandit setting, where you essentially choose one of the arms of one of the slot machines, pull the lever, and then you get some reward. Because we don't know what happens inside the slot machine, every action has a different, but to us unknown, probability density function that generates some reward r at time t. So the reward depends on which action we take, and this is what we seek to maximize: in this case we want to make as much money as possible using those slot machines.
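To make this concrete, here is a minimal sketch, not from the lecture, of such a multi-armed bandit in Python: each arm has its own reward distribution whose parameters are hidden from the player, and pulling an arm simply returns a sample from that distribution. The Gaussian arms and the particular means are purely illustrative assumptions.

```python
import numpy as np

class MultiArmedBandit:
    """k slot machine arms, each with a reward distribution unknown to the player."""

    def __init__(self, true_means, reward_std=1.0, seed=0):
        self.true_means = np.asarray(true_means)  # hidden from the player
        self.reward_std = reward_std
        self.rng = np.random.default_rng(seed)

    @property
    def num_actions(self):
        return len(self.true_means)

    def pull(self, action):
        """Take action a_t (pull one arm) and observe a sampled reward r_t."""
        return self.rng.normal(self.true_means[action], self.reward_std)

# Illustrative instance with three arms; the player only ever sees the rewards.
bandit = MultiArmedBandit(true_means=[0.2, 1.0, 0.5])
r_t = bandit.pull(action=1)
```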

In order to do that, we have to define a policy, and the policy is what tells us which action to take. We can also formulate this as a probability density function: so now we have actions, we have rewards, and the policy is a probability density function that tells us which of the actions we want to take.
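In this toy setting, a policy is nothing more than a probability distribution over the available actions from which we sample. The small sketch below assumes such a distribution is given as a plain probability vector; the particular numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(policy_probs):
    """Draw an action a_t ~ pi(a); policy_probs must sum to one."""
    return rng.choice(len(policy_probs), p=policy_probs)

# A made-up policy over three actions: prefer arm 1, occasionally try the others.
pi = np.array([0.1, 0.7, 0.2])
a_t = sample_action(pi)
```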

Now we want to win, right? We want to collect the maximum reward over time. For any single action we cannot tell the future, but we can compute an expected value for that action, so we seek the action that maximizes the expected reward over time. That is what we want to take: the action that generates the maximum expected reward over time.
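Since the true expected reward of each action is unknown, one simple illustrative strategy is to average the rewards observed so far per action and then act greedily on those estimates. This is only a sketch under that assumption, reusing the hypothetical MultiArmedBandit class from above, not a method prescribed in the lecture.

```python
import numpy as np

def run_greedy(bandit, num_steps=1000, explore_steps=30):
    """Estimate E[r | a] by per-action sample averages, then pick the action
    with the highest estimate (greedy). The short random exploration phase is
    only there so every arm gets sampled at least a few times."""
    k = bandit.num_actions
    q_est = np.zeros(k)    # running estimate of the expected reward per action
    counts = np.zeros(k)   # how often each action has been taken
    rng = np.random.default_rng(0)
    total_reward = 0.0
    for t in range(num_steps):
        if t < explore_steps:
            a = int(rng.integers(k))      # explore: try arms at random
        else:
            a = int(np.argmax(q_est))     # exploit: greedy on current estimates
        r = bandit.pull(a)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
        total_reward += r
    return q_est, total_reward

q_est, total_reward = run_greedy(MultiArmedBandit(true_means=[0.2, 1.0, 0.5]))
```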

And here lies a difference to supervised learning: there is no immediate feedback on which action to choose. We just have this reward, and only after we observe those rewards can we actually see whether an action was good or not. In supervised learning you would formulate this in a different way: you would try to identify features and then classify into one or several classes. Here we don't do that; here we directly get the reward, and the reward tells us whether it was a good action or not. That's quite interesting, because with these rewards we can model a lot of situations, and we will see in the next couple of minutes how to do that. One big problem, of course, is that this expected value, the expected reward of each action, is not known in advance, so we somehow have to estimate it. So this is one of the key problems and we could form

Part of a video series
Access: open access
Duration: 01:25:55 min
Recording date: 2020-01-07
Uploaded: 2020-01-07 19:09:03
Language: en-US
Tags: breininger function reinforcement state atari policy value greedy bellman