Whoa, that was louder than I expected.
Then let's quickly recap the section on reinforcement learning we talked about last time.
This will be a quick recap, because most of what happens in reinforcement learning,
except for the deep-dive details which we're not going to go into anyway, is something we've
already done with respect to Markov decision processes.
The main difference is that now, of course, we don't even know the transition model, and
as a consequence we can't do things like offline solving for an optimal policy.
All of that is basically out the window.
So what's the best we can do?
Well, the best we can do is try a couple of things and form an informed
average of the utility of the various states depending on what we tried
out before, i.e. on trials.
We will largely look at passive learning first, where we assume that we have a fixed policy,
i.e. we know exactly what to do in every particular state, and then we just try to figure out how
well that policy actually behaves with respect to the expected utility, on the basis of the
transition model which we're learning in the process anyway.
And the reason we're doing that is that subsequently we can do active learning by
just adding the act of acting to the algorithm, if that makes sense.
So, the usual Markov decision process setup: what we do is just run a bunch of trials.
We know what to do in every state, so we can literally just execute that policy from
the starting state and see where we end up, since all of our transitions are probabilistic anyway;
we don't actually know where we're going to end up.
Each trial we can take as a piece of evidence for the utility of every state we walk through
in that particular trial.
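As a rough sketch of what running such a trial looks like in code (the trial format, the transition-table representation, and all names here are illustrative assumptions, not something from the lecture):

```python
import random

# Minimal sketch: execute a fixed policy from a start state and record the
# (state, reward) sequence of one trial. The environment is given as a
# transition table {(state, action): {successor: probability}} and a reward
# table {state: reward}; these representations are assumptions.

def run_trial(policy, transitions, rewards, start, terminals):
    trajectory = []
    state = start
    while True:
        trajectory.append((state, rewards[state]))
        if state in terminals:
            return trajectory
        action = policy[state]
        successors, probs = zip(*transitions[(state, action)].items())
        # Transitions are probabilistic, so we sample where we actually end up.
        state = random.choices(successors, weights=probs)[0]
```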
And then we can do the naive thing, which is direct utility estimation: we take as the
utility of a state the expected utility of that state given the trials we've done, i.e.
just the average, over all trials in which we passed through that particular state, of the
total reward observed from that state onward.
Okay, so we literally just run trials, look at the observed total reward from each state
in a particular trial, and then take as utility the average of all of those.
Fair enough.
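A minimal sketch of direct utility estimation along these lines, assuming trials are recorded as lists of (state, reward) pairs as in the sketch above (the discount factor gamma and all names are assumptions):

```python
from collections import defaultdict

# Direct utility estimation: the utility estimate of a state is the average
# observed reward-to-go over all visits to that state across all trials.

def direct_utility_estimates(trials, gamma=1.0):
    returns = defaultdict(list)
    for trajectory in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so the reward-to-go accumulates correctly.
        for state, reward in reversed(trajectory):
            reward_to_go = reward + gamma * reward_to_go
            returns[state].append(reward_to_go)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}
```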
Obviously that does not make use of information we otherwise have, namely that the
utilities of states are connected, by virtue of the probability of ending up in certain
successor states given what we do in a particular state, and that this relation between
the utilities of states obeys the Bellman equation.
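For reference, the Bellman equation for a fixed policy $\pi$ (written here in standard MDP notation with discount factor $\gamma$; the exact notation is my choice, not a quote from the lecture) reads:

$U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^{\pi}(s')$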
So what we can do is adaptive dynamic programming, which is literally just using the Bellman
equation directly to update the utilities of the particular states in each trial, rather
than just naively taking the average value.
That converges extremely slowly, naturally, but in the kind of setting where you would
have to use reinforcement learning, basically everything converges slowly.
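A minimal sketch of a passive adaptive dynamic programming learner in this spirit, assuming the agent observes one (state, reward, next state) transition at a time under its fixed policy and re-solves the fixed-policy Bellman equations by simple iteration (class and method names are assumptions, not the lecture's pseudocode):

```python
from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, policy, gamma=0.9):
        self.policy = policy
        self.gamma = gamma
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.rewards = {}                                     # s -> observed R(s)
        self.U = defaultdict(float)                           # utility estimates

    def observe(self, state, reward, next_state):
        """Record one observed transition under the fixed policy, then re-solve."""
        self.rewards[state] = reward
        self.counts[(state, self.policy[state])][next_state] += 1
        self._policy_evaluation()

    def _policy_evaluation(self, iterations=50):
        # Repeatedly apply the Bellman equation for the fixed policy,
        #   U(s) = R(s) + gamma * sum_s' P(s' | s, pi(s)) * U(s'),
        # with P estimated from the observed transition counts.
        for _ in range(iterations):
            for state, reward in self.rewards.items():
                succ = self.counts[(state, self.policy[state])]
                total = sum(succ.values())
                expected = sum(n / total * self.U[s2] for s2, n in succ.items())
                self.U[state] = reward + self.gamma * expected
```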
So here we have the resulting adaptive dynamic programming learning algorithm for passive
reinforcement learning, and then what we do for active reinforcement learning is, instead
of assuming that we have a fixed policy... is that a question?
Okay, yeah.
What is the use of calculating the utility for a fixed policy, you mean?
Literally just that we know how to do that, so that we can subsequently modify it such
that we don't have a fixed policy anymore.
So what we do in active reinforcement learning is literally take what we did for passive
reinforcement learning and add the act of acting in there as well.
Apart from that, it's literally the same algorithm.
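A sketch of the one extra step that active reinforcement learning adds, namely choosing the action itself, here greedily with respect to the current model and utility estimates plus a bit of random exploration (epsilon-greedy is one common choice; all names are assumptions and mirror the passive sketch above):

```python
import random

def choose_action(state, actions, counts, U, gamma, reward, epsilon=0.1):
    # With small probability, explore a random action.
    if random.random() < epsilon:
        return random.choice(actions)

    def q_value(action):
        succ = counts[(state, action)]
        total = sum(succ.values())
        if total == 0:
            return reward                    # untried action: fall back to R(s)
        return reward + gamma * sum(n / total * U[s2] for s2, n in succ.items())

    # Otherwise exploit: pick the action that looks best under the learned
    # transition counts and the current utility estimates.
    return max(actions, key=q_value)
```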