20 - Artificial Intelligence II

Whoa, that was louder than I expected.

Then let's quickly recap the section on reinforcement learning we talked about last time.

This will be a quick recap, because basically most of what happens in reinforcement learning, except for the deep-dive details, which we're not going to go into anyway, is things we've already done with respect to Markov decision processes.

The main difference being that now, of course, we don't even know the transition model, and as a consequence we can't do things like offline solving for an optimal policy. All of that stuff is basically right out of the window anyway.

So what's the best we can do?

Well, the best we can do is basically just try a couple of things and somehow form an informed average of the utility of various things, depending on what we tried out before, i.e. on trials.

We will largely look at passive learning first, where we assume that we have a fixed policy, i.e. we know exactly what to do in every particular state, and then we just try to figure out how well that policy actually behaves with respect to the expected utility, on the basis of the transition model, which we're learning in the process anyway.

And the reason we're doing that is because subsequently we can do active learning by

just adding the act of acting to the algorithm, if that makes sense.

So, the usual Markov decision process stuff: what we do is we just run a bunch of trials.

We know what to do in every state, so we can literally just execute that policy from the starting state and see where we end up; since all of our transitions are probabilistic, we don't actually know in advance where we're going to end up.

Each of those trials we can take as a piece of evidence for the utility of every state we walk through in that particular trial.

And then we can do the naive thing, which is direct utility estimation: we take as the utility of a state the average, over all the trials in which we passed through that state, of the total reward we observed from that state onwards.

Okay, so we literally just see where we end up, look at what the total reward of that particular trial is, and then take as utility the average of all of those.

Fair enough.
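
As a minimal sketch, this is roughly what direct utility estimation looks like in code, assuming each trial is recorded as a list of (state, reward) pairs and a discount factor gamma; the names and data layout here are illustrative, not the lecture's own pseudocode.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average observed reward-to-go from s over all trials."""
    totals = defaultdict(float)   # sum of observed returns per state
    counts = defaultdict(int)     # number of times each state was observed

    for trial in trials:
        # Walk the trial backwards so the reward-to-go G can be reused step by step.
        G = 0.0
        for state, reward in reversed(trial):
            G = reward + gamma * G
            totals[state] += G
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}
```

Every visit to a state within a trial counts as a separate sample here, which matches treating each walked-through state as a piece of evidence.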

Obviously that does not make use of information we otherwise basically have, which is that the utilities of states are connected, namely by virtue of the probability of ending up in certain successor states given what we do in a particular state, and we know that this relation between the utilities of states obeys the Bellman equation.
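
Written out in the standard reward-on-states notation (an assumption here, not copied from the slides), the Bellman equation for a fixed policy π reads:

```latex
U^{\pi}(s) = R(s) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, U^{\pi}(s')
```

That is, the utility of a state is its own reward plus the discounted expected utility of whichever successor state the fixed policy's action leads to.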

So what we can do is adaptive dynamic programming, which is literally just using the Bellman equation directly to update the utilities of the particular states in each trial, rather than just naively taking the average value.

Naturally, that converges extremely slowly, but in the kind of setting where you would have to use reinforcement learning, basically everything converges slowly.
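
As a hedged sketch of what such a passive ADP learner can look like (illustrative names, a plain iterative policy-evaluation step instead of whatever exact solver the slides use, and the reward-on-states convention assumed):

```python
from collections import defaultdict

class PassiveADPAgent:
    """Passive ADP: learn P(s' | s, a) from counts, then evaluate the fixed policy."""

    def __init__(self, policy, gamma=0.9):
        self.policy = policy              # fixed policy: state -> action (terminals omitted)
        self.gamma = gamma
        self.U = defaultdict(float)       # current utility estimates
        self.R = {}                       # observed reward of each visited state
        self.N_sa = defaultdict(int)      # how often we took action a in state s
        self.N_s_sa = defaultdict(int)    # how often that led to successor s'
        self.prev = None                  # (state, action) of the previous step

    def observe(self, state, reward):
        """Process one percept of the current trial and return the next action."""
        if state not in self.R:
            self.R[state] = reward
        if self.prev is not None:
            s, a = self.prev
            self.N_sa[(s, a)] += 1
            self.N_s_sa[(s, a, state)] += 1
        self._policy_evaluation()
        action = self.policy.get(state)   # None at a terminal state ends the trial
        self.prev = None if action is None else (state, action)
        return action

    def _policy_evaluation(self, sweeps=20):
        """Approximate U^pi by sweeping the Bellman equation over the learned model."""
        for _ in range(sweeps):
            for s in self.R:
                a = self.policy.get(s)
                if a is None:             # terminal state: utility is just its reward
                    self.U[s] = self.R[s]
                    continue
                n = self.N_sa[(s, a)]
                if n == 0:                # no data for this state yet
                    continue
                expected = sum((self.N_s_sa[(s, a, s2)] / n) * self.U[s2]
                               for s2 in self.R)
                self.U[s] = self.R[s] + self.gamma * expected
```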

So here we have the resulting adaptive dynamic programming learning algorithm for passive reinforcement learning, and then what we do for active reinforcement learning is, instead of assuming that we have a fixed policy... is that a question?

Okay, yeah.

What is the use of calculating the utility for a fixed policy, you mean?

Literally just that we know how to do that so that we can subsequently modify it such

that we don't have a fixed policy anymore.

So what we do in active reinforcement learning is we literally take what we did for passive reinforcement learning and add the act of acting in there as well.

Apart from that it's literally the same algorithm.
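
Schematically, the only new ingredient is that the action is now chosen from the learned model rather than read off a fixed policy. A minimal sketch of that choice, assuming the same counts and utilities as in the passive sketch above and using a simple greedy-with-exploration rule as a stand-in for whatever exploration scheme the lecture actually uses:

```python
import random

def choose_action(state, actions, U, N_sa, N_s_sa, R, epsilon=0.1):
    """Pick the next action from the learned model: mostly greedy on expected
    utility, occasionally random so the agent keeps exploring."""
    if random.random() < epsilon:
        return random.choice(actions)

    def expected_utility(a):
        n = N_sa.get((state, a), 0)
        if n == 0:
            return float("inf")   # optimistic: prefer actions we have never tried
        return sum((N_s_sa.get((state, a, s2), 0) / n) * U.get(s2, 0.0)
                   for s2 in R)

    return max(actions, key=expected_utility)
```

In an active ADP agent, a choice like this would replace the fixed-policy lookup in the passive sketch; the optimistic value for untried actions makes the agent explore them before settling on the greedy choice.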
