24.2. Utilities over Time

Right. Now, how can we properly treat these kinds of agents? The idea will, of course, be to take the individual rewards, assign a utility to the sequence as a whole, and then reason with preferences again. By now you are used to preferences, I assume, and you are also somewhat used to utilities. The interesting part will be how we assign meaningful utilities to sequences of states, such that maximizing one of those utility functions actually yields an optimal policy in the sense that we want it to. So basically, we need to be able to prefer some sequences of states over other sequences of states.

One rather natural assumption we can make there is stationarity: if I have two reward sequences that both start with the same initial reward (it does not have to be a single reward, it could be a whole shared prefix, but that case reduces to this one), then it is safe to assume that if I prefer the first sequence to the second, I should also prefer the first sequence's continuation to the second's, and the other way around. A different way to put this: what I am going to do in the future, and how useful it is to do that, does not depend on how I got to where I am now. Basically, if I am hungry and thinking about whether to eat this or to eat that, it should not matter why I am hungry; it just matters that I am hungry.
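To make that condition concrete, here is a minimal sketch (my own illustration, not from the lecture) that states stationarity as a property of a utility function over reward sequences, using plain additive rewards and made-up example sequences as the assumed ingredients:

```python
def utility(rewards):
    """Additive utility of a finite reward sequence (an assumed, illustrative choice)."""
    return sum(rewards)

def prefers(seq_a, seq_b):
    """True if sequence A is (weakly) preferred to sequence B under `utility`."""
    return utility(seq_a) >= utility(seq_b)

# A shared first reward r and two different continuations (made-up numbers).
r = 5
a = [1, 0, 2]
b = [0, 3, 0]

# Stationarity: the preference with the shared prefix attached agrees with
# the preference between the bare continuations.
assert prefers([r] + a, [r] + b) == prefers(a, b)
```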

As I said with additive rewards and so on, we are going to modify this slightly, because it turns out that if you have stationary preferences in this sense of the word, then there are only two ways to compute utilities on sequences of rewards that respect them. And in fact, one of those is reducible to the other, so really you only have one way: take the reward for each of the states in your sequence and put a discount factor in front of it. The discount factor is often read as telling you how much you value instant gratification over long-term gratification. That is probably a useful way to think about it, but it should be taken with a grain of salt. The idea is that the reward I get immediately by doing some action counts for more than the reward I will get, I don't know, next week, after doing the next seven steps in my sequence. And why can we reduce additive rewards, where I do not have any discount factor at all, to the discounted case? That should be rather trivial. Yes? Yeah, exactly: just pick a discount factor of one. So what I called two ways to combine rewards is really just one, with one notable special case.
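As a small sketch of what that looks like (assuming the rewards of a state sequence are simply given as a list, which is my own simplification): the discounted utility is a weighted sum, and the additive case is just gamma = 1.

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ... for a finite sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0.5, 1.0, 2.0]                        # made-up example rewards

# With gamma < 1, immediate rewards count for more than later ones.
print(discounted_utility(rewards, gamma=0.5))    # 0.5 + 0.5 + 0.5 = 1.5

# Additive rewards are the special case gamma = 1.
print(discounted_utility(rewards, gamma=1.0))    # 0.5 + 1.0 + 2.0 = 3.5
```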

Now, of course, this is a problem, at least if I assume that time keeps going and does not just stop at some point. Why? Because if I want a policy that can look infinitely far into the future, then what I end up with are infinite sequences of rewards, or infinite sequences of actions, or infinite sequences of states. And if I want to compute a utility on top of that, I run into the problem that I get infinite series, and infinite series do not necessarily converge, which is somewhat annoying.
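A quick numerical sketch of the issue, assuming a constant reward of 1 per step (my own toy choice): with gamma = 1 the partial sums grow without bound, while with gamma < 1 they approach the geometric-series limit 1 / (1 - gamma).

```python
def partial_sum(gamma, steps, reward=1.0):
    """Partial sum of the series reward + gamma*reward + gamma^2*reward + ..."""
    return sum(gamma**t * reward for t in range(steps))

for steps in (10, 100, 1000):
    print(steps, partial_sum(1.0, steps), round(partial_sum(0.9, steps), 4))
# gamma = 1.0: the sums 10, 100, 1000 keep growing.
# gamma = 0.9: the sums approach 1 / (1 - 0.9) = 10.
```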

There are three ways to go about fixing that. The first is rather trivial: I just say that time stops in ten years, or something like that. That is a reasonable assumption, because the situations I want to model are usually only relevant for some given time interval, to put it like that. This is called the finite horizon assumption, so keep that keyword in mind, because we are going to use it a lot.
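A sketch of how a finite horizon avoids the convergence problem (the horizon length and the reward stream here are assumptions for illustration): the utility becomes a finite sum over at most the first T steps, no matter how long the sequence is.

```python
def finite_horizon_utility(rewards, horizon, gamma=1.0):
    """Sum the (optionally discounted) rewards of the first `horizon` steps only."""
    return sum(gamma**t * r for t, r in enumerate(rewards[:horizon]))

endless = [1.0] * 10_000                             # stand-in for an arbitrarily long stream
print(finite_horizon_utility(endless, horizon=10))   # 10.0, always finite
```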

Another alternative is to add absorbing states. We have had those in our example: think of the terminal states +1 and -1 as absorbing states, in the sense that once our agent ends up in one of those two states, the whole model is basically finished. The problem there is that if this is supposed to solve our problem of assigning utilities to possibly infinite sequences, then we have to be able to guarantee that our agent will always end up in one of those terminal states, no matter what sequence of actions it takes, or we only consider those policies and sequences that actually do end up there. And that is not always the case: for example, if my reward for just staying alive is positive, you have already noticed that the agent might never actually end up in one of those terminal states.
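To make the absorbing-state idea concrete, here is a minimal sketch under assumptions of my own (a small random-walk chain with absorbing ends worth -1 and +1 and a made-up living reward, not the lecture's grid example): as long as the agent is guaranteed to reach an absorbing state, the reward sequence, and hence its additive utility, stays finite.

```python
import random

ABSORBING = {0: -1.0, 4: +1.0}   # absorbing end states of a 5-state chain
LIVING_REWARD = -0.04            # assumed per-step reward, chosen only for illustration

def run_episode(start=2, max_steps=10_000):
    """Move left or right at random until an absorbing state is reached;
    the resulting reward sequence is finite whenever absorption happens."""
    state, rewards = start, []
    for _ in range(max_steps):
        if state in ABSORBING:
            rewards.append(ABSORBING[state])
            break
        rewards.append(LIVING_REWARD)
        state += random.choice((-1, +1))
    return rewards

episode = run_episode()
print(len(episode), sum(episode))   # a finite sequence with a finite additive utility
```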
