Okay, so let's start.
This sounds loud.
Well, okay.
We are in the last chapter on reasoning under uncertainty.
Remember, before yesterday we looked at things like Markov chains, dynamic Bayesian networks,
hidden Markov models and so on.
And if you think about what we've been doing, they're just the modeling component,
not the decision component.
Whenever we want to make decisions, we want to do this maximizing expected utility stuff.
All of that happens here.
That's what we're about now.
Some kind of a Markov chain or HMM or whatever happens here and kind of deals with the percepts
and how the world changes over time and so on.
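Since this maximizing-expected-utility principle is what everything below builds on, here is the standard MEU rule as a quick reminder; the notation, with evidence e and outcome states s', is my gloss and not something taken from the slides:

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a}\, \mathrm{EU}(a \mid e)
      \;=\; \operatorname*{arg\,max}_{a} \sum_{s'} P(s' \mid a, e)\, U(s')
```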
And I wish I had colored chalk, but I don't.
So I'm going to kind of make a circle.
We still have the Markov something-or-other.
But the important thing now is making decisions.
That's something all of these areas actually share.
Search is all about making decisions: deciding, in this case typically heuristically, what the next move is, or the plan that we're going to pursue.
We did it kind of the easy way last semester.
We can add in uncertainty and utility, which gives us something we're not covering in this
course, but which you're very well equipped to understand by now.
And then we have what we will do next: Markov decision problems, MDPs, and partially observable MDPs, POMDPs.
So that's the new thing here.
And the example was this little four-by-three world, whose key feature is that instead of
directly having a utility, we sum up rewards, under actions that are non-deterministic.
In this case, we still have full observability, which means we're still up here.
OK?
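As a concrete, hedged sketch of that four-by-three world: I am assuming the standard textbook numbers here, namely that the agent moves in the intended direction with probability 0.8 and slips 90 degrees to either side with probability 0.1 each, with a small negative reward per non-terminal step and terminal squares worth +1 and -1; the exact values on the slides may differ.

```python
# A minimal sketch of the 4x3 world's stochastic action model, assuming the
# standard textbook setup (0.8 intended direction, 0.1 per sideways slip).
# Bumping into a wall or the blocked square leaves the agent where it is.

WIDTH, HEIGHT = 4, 3
BLOCKED = {(1, 1)}                         # the single obstacle (0-indexed column, row)
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}   # assumed terminal rewards
STEP_REWARD = -0.04                        # assumed per-step reward elsewhere

MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
            "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def step(state, direction):
    """Deterministic single step; stay put if the move would leave the grid."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in BLOCKED or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return state
    return nxt

def transition(state, action):
    """P(s' | s, a) as a dict: 0.8 intended direction, 0.1 for each side."""
    probs = {}
    left, right = SIDEWAYS[action]
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def reward(state):
    """Per-state reward; the utility of a run is the sum of these over time."""
    return TERMINALS.get(state, STEP_REWARD)
```

For instance, transition((0, 0), "Up") gives {(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}, because slipping left from the bottom-left corner bumps into the wall.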
POMDPs come next.
So we're doing the slightly easier thing first as a warm-up exercise.
Here's the definition.
We have states.
We have transition models.
We have actions, just as before.
But we also have a reward function that, per action, gives you a reward or a punishment.
And given that, the utility is not something that is given per state; before, we could get at that, say, via Ramsey's theorem, by looking at preferences between states.
Here the utility actually comes from the accumulated rewards, which changes the whole thing.
That's the generalization we will look at now.
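Written compactly, in my own notation rather than the slides', the definition just given looks as follows; note that the reward signature varies between textbooks, with R(s), R(s, a), and R(s, a, s') all in common use:

```latex
\text{MDP} \;=\; \langle S, A, T, R \rangle, \qquad
T(s, a, s') \;=\; P(s' \mid s, a), \qquad
R : S \times A \to \mathbb{R}
```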
And we've seen that this is an interesting area, because it kind of leads to interesting
agent behaviors, without having to change the overall mechanisms, we hope.
OK?
So utilities over time.
The idea really is that instead of having utilities of states, we have to consider utilities
over state sequences, which meshes very well with the notion of a sequential environment,
which basically has the most general notion of time.
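To make "utilities over state sequences" concrete, the usual choice is the additive, possibly discounted, form shown below; the discount factor gamma is an assumption on my part and may only be introduced later in the lecture:

```latex
U([s_0, s_1, s_2, \ldots]) \;=\; R(s_0) + \gamma R(s_1) + \gamma^{2} R(s_2) + \cdots
\;=\; \sum_{t \ge 0} \gamma^{t} R(s_t), \qquad 0 < \gamma \le 1
```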