2 - 26.2. Approximations of Bayesian Learning [ID:30389]

Okay, but what else can we do? Obviously, we want to get rid of the summation. That is our problem: the sums over all hypotheses are just too long. So if we are content to make predictions with respect to the most probable hypothesis only, we can do something else: we find the most probable hypothesis and use it directly. Instead of summing over all hypotheses to get the full picture, we take just the most probable one and hope that this gives us good values. We have to lose somewhere, and here we are gaining big. So we turn the whole thing into an optimization problem: all we have to find is this most probable hypothesis, which we can do by a maximization argument, maximizing over all hypotheses given the data.
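In symbols, with $d$ the observed data and $X$ the quantity we want to predict (standard textbook notation, not verbatim from the slide), the move is from the full Bayesian prediction to its one-hypothesis approximation:

$$P(X \mid d) \;=\; \sum_i P(X \mid h_i)\, P(h_i \mid d) \;\;\approx\;\; P(X \mid h_{\mathrm{MAP}}), \qquad h_{\mathrm{MAP}} \;=\; \operatorname*{argmax}_h P(h \mid d) \;=\; \operatorname*{argmax}_h P(d \mid h)\, P(h).$$

The second equality is just Bayes' rule with the constant $P(d)$ dropped, since it does not affect the maximizer.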

We can do that either by maximizing the product $P(d \mid h)\,P(h)$ directly or, even better, by maximizing its logarithm. Why is that good? Well, we still have a multiplication in there, and we do not like multiplication; we like addition much better. If we wrap a log around it, we only have to add. And this does not destroy anything: when we maximize, we can wrap any strictly monotonically increasing function around the objective, and the log is such a function. That is just a rescaling; the maximizer stays the same. What we cannot do is wrap a non-monotonic function around it. So what we are going to do, and this is the standard trick, is go to log maximization.
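As a minimal sketch of what this looks like in code, here is the candy example in Python, assuming the usual five bag hypotheses with the textbook priors (0.1, 0.2, 0.4, 0.2, 0.1) and lime proportions (0, 1/4, 1/2, 3/4, 1); the names are mine:

    import math

    priors    = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h_i)
    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

    def log_posterior(i, n_limes):
        # Unnormalized log posterior: log P(d | h_i) + log P(h_i).
        # A hypothesis with P(lime) = 0 is ruled out by any lime observation.
        if lime_prob[i] == 0.0 and n_limes > 0:
            return float("-inf")
        loglik = n_limes * math.log(lime_prob[i]) if n_limes > 0 else 0.0
        return loglik + math.log(priors[i])

    n = 10  # observed data: 10 lime candies in a row

    # MAP: a single maximization instead of a sum over all hypotheses.
    h_map = max(range(5), key=lambda i: log_posterior(i, n))

    # Compare the MAP prediction with the full Bayesian prediction for "lime next".
    w = [math.exp(log_posterior(i, n)) for i in range(5)]
    total = sum(w)
    p_bayes = sum((wi / total) * lime_prob[i] for i, wi in enumerate(w))
    print(f"h_MAP = h{h_map + 1}: P(lime) = {lime_prob[h_map]:.3f}; "
          f"Bayes: P(lime) = {p_bayes:.3f}")

After ten limes this picks h5, giving a MAP prediction of 1.000 against a full Bayesian prediction of roughly 0.973, so the shortcut is already close.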

The observation is that if we have enough samples, this MAP learning turns out to be approximately Bayesian. Go back to the earlier picture: if it is somehow correct, if it shows us what is actually going on, then we will converge on a single most likely hypothesis, while all the others eventually die down. So Bayesian learning itself predicts that after a while one best hypothesis carries essentially all the weight. And if we must approximate, then assuming that after a while we have a good hypothesis, namely the most probable one, is actually a good idea.
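You can watch the others die down numerically. A small sketch, reusing the candy hypotheses from above and feeding in more and more lime candies (so the true bag is h5):

    priors    = [0.1, 0.2, 0.4, 0.2, 0.1]    # same candy hypotheses as above
    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]

    for n in [0, 2, 4, 6, 8, 10]:
        # Unnormalized posterior after n limes: P(h_i) * P(lime | h_i)^n.
        post = [p * t ** n for p, t in zip(priors, lime_prob)]
        z = sum(post)
        post = [x / z for x in post]
        print(f"n={n:2d}  " + "  ".join(f"h{i+1}: {x:.3f}"
                                        for i, x in enumerate(post)))

The posterior mass of h5 climbs towards 1 while the rest decays, which is exactly why predicting with the MAP hypothesis alone becomes Bayesian in the limit.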

And there is something on top: if you have a deterministic hypothesis, then you are in a deterministic situation, as you very often are in science. If you are trying to predict or learn the current from the voltage and the resistance, then there is a deterministic, non-statistical relationship among those quantities (Ohm's law, I = V/R). In the limit, the MAP hypothesis actually becomes the true deterministic hypothesis. So we are in the best of both worlds: we become approximately fully Bayesian, and in deterministic cases we actually find the right one.

And for computer scientists, of course, getting rid of these huge sums in favor of an optimization problem is something we like. Optimization is easy: you just do gradient descent or something like that. Or, if you have symbolic solutions, you can do it by partial differentiation and so on.
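To make the "partial differentiation" route concrete, here is a standard worked case (the one-parameter setup is my illustration; with a uniform prior the log prior is constant, so maximizing the log posterior reduces to maximizing the log likelihood): let each candy be lime with unknown probability $\theta$, and suppose we observed $\ell$ limes and $c$ cherries. Then

$$\log P(d \mid h_\theta) \;=\; \ell \log \theta + c \log(1 - \theta), \qquad \frac{\partial}{\partial \theta}\Bigl[\ell \log \theta + c \log(1 - \theta)\Bigr] \;=\; \frac{\ell}{\theta} - \frac{c}{1 - \theta} \;=\; 0 \;\Longrightarrow\; \theta \;=\; \frac{\ell}{\ell + c}.$$

No sum over hypotheses anywhere: one differentiation gives the answer in closed form.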

There is a wrinkle, which I would like to come back to, just to make you aware that there is a connection here. You remember that when we were combating overfitting a while back, we had the idea of putting a kind of size term into the optimization problem, so that during optimization we would also bias our search towards simple, small solutions; that is essentially an implementation of Occam's razor. That was called regularization. In regularization we had a special case where we basically used the information content of a hypothesis. We typically needed a proportionality factor to align the scales of the two objectives, optimizing for a good solution and optimizing for simplicity. The idea was that if we could express both of them on the same scale, then we would not need this factor, which we did not know how to pick anyway.
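Schematically (the names here are schematic, not the lecture's notation), the regularized objective was

$$\operatorname*{argmin}_h \;\bigl[\, \mathrm{error}(h, d) + \lambda \cdot \mathrm{complexity}(h) \,\bigr],$$

with exactly that factor $\lambda$ as the part we never knew how to choose.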

That brought us to the idea of minimum description length: if we can just measure the information content of the solution and of the hypothesis, say by turning both of them into Turing machine programs, then both live on the same scale, and this is theoretically beautiful. Somewhat surprisingly, it also works in practice. And since we are taking logs here anyway, we are not that far off from using minimum description length.
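The connection is in fact an identity (standard; the base-2 logs, which count bits, are my notational choice):

$$h_{\mathrm{MAP}} \;=\; \operatorname*{argmax}_h \bigl[\log_2 P(d \mid h) + \log_2 P(h)\bigr] \;=\; \operatorname*{argmin}_h \bigl[\, -\log_2 P(d \mid h) \;-\; \log_2 P(h) \,\bigr],$$

where $-\log_2 P(h)$ is the number of bits needed to encode the hypothesis and $-\log_2 P(d \mid h)$ the number of bits needed to encode the data given the hypothesis. Maximizing the log posterior is literally minimizing a description length, with no $\lambda$ in sight, because both terms are measured in bits.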

There is indeed a variant of learning that uses exactly the same ideas as we had before for regularization; it is called minimum description length (MDL) learning. In our example with the candies, it predicts exactly the right thing. We are minimizing the information content of the hypothesis, and the idea is exactly as always: small or simple hypotheses are to be preferred.

Part of chapter: Chapter 26. Statistical Learning
Access: open access
Duration: 00:14:58 min
Recording date: 2021-03-30
Uploaded: 2021-03-30 17:28:02
Language: en-US

Maximum A Posteriori Approximation and Maximum Likelihood Approximation for Bayesian Learning.
