Recap Clip 8.6: Using Information Theory (Part 2)

We're still in the introductory phase of machine learning, and we are essentially looking at a problem that plagues all learning paradigms: the problem of overfitting versus underfitting. Machine learning is an optimization process, and as an agent we are essentially trying to optimize what we do with respect to an external performance measure. At the moment we're looking at machine learning from examples.

Depending on what kind of examples we get, and since they only partially mirror the underlying processes we really want to learn, there is always the tendency that instead of learning the underlying processes, we just learn the examples we've seen so far. Which of course is only fair, because that's the only thing we're given. The question is: can we do something about overfitting? Can we do something about underfitting?

Underfitting we can normally solve relatively easily by looking at more examples, if we have them. Overfitting can also sometimes be cured by more examples, because they will naturally generalize the behavior. But we also want to do something about it actively: we sometimes want to generalize our solutions by looking at the solutions themselves. That sometimes helps.

We looked at one example of that, which was decision tree pruning. Remember, decision tree learning found us a nice decision tree; the question is, can we make it better?

The idea here is that we go through the terminal nodes of that tree and, for every one of them, look at whether it offers us enough information gain. And since we can already compute the information gain, really the only question is: how much is enough to consider a node irrelevant?
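As a reminder of the quantity we are talking about, here is a minimal sketch of how the information gain of an attribute test could be computed for Boolean classification. The function names and the (positive, negative) count representation are illustrative assumptions, not something taken from the lecture.

```python
import math

def entropy_pn(p, n):
    """Entropy (in bits) of a sample with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def information_gain(p, n, branches):
    """Information gain of an attribute test.

    p, n      -- positive/negative example counts reaching the node
    branches  -- list of (p_k, n_k) counts for each value of the tested attribute
    """
    remainder = sum((pk + nk) / (p + n) * entropy_pn(pk, nk) for pk, nk in branches)
    return entropy_pn(p, n) - remainder

# Example: a node with 6 positive / 6 negative examples, split into three branches.
print(information_gain(6, 6, [(4, 0), (2, 2), (0, 4)]))  # informative split, gain > 0
print(information_gain(6, 6, [(2, 2), (2, 2), (2, 2)]))  # gain 0: the split looks random
```

A gain of (close to) zero is exactly the "looks random" case that the significance test below is meant to detect.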

Statistics has an answer for us, in the form of standard significance tests. The idea of these significance tests is that we use the information gain as a measure: we compare the information gain that a particular terminal node gives us to the information gain we would expect under the null hypothesis, namely that everything is just random. And if it is sufficiently near that, for some value of "sufficiently near", we can say this node doesn't give us enough information, so we throw it out. That way we can make our decision trees smaller, and of course smaller decision trees make fewer decisions, so possibly they generalize better. That's the idea.
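To make that concrete, here is a minimal sketch of such a bottom-up pruning pass, under the assumption that the tree is represented as a small nested structure. The names (Leaf, Test, prune, is_significant) are illustrative, not from the lecture, and the significance predicate itself is only sketched further below.

```python
from collections import Counter

class Leaf:
    def __init__(self, label):
        self.label = label

class Test:
    def __init__(self, attribute, children, labels):
        self.attribute = attribute   # attribute tested at this node
        self.children = children     # dict: attribute value -> subtree
        self.labels = labels         # class labels of the examples reaching this node

def prune(node, is_significant):
    """Bottom-up pruning: replace insignificant terminal test nodes by leaves.

    is_significant(node) -> bool is a hypothetical predicate, e.g. the
    chi-squared test sketched further below.
    """
    if isinstance(node, Leaf):
        return node
    node.children = {v: prune(child, is_significant) for v, child in node.children.items()}
    # A terminal test node: after pruning below, all of its children are leaves.
    if all(isinstance(c, Leaf) for c in node.children.values()):
        if not is_significant(node):
            # The split is statistically indistinguishable from chance: throw it out
            # and keep only the majority label of the examples reaching this node.
            return Leaf(Counter(node.labels).most_common(1)[0][0])
    return node
```

The point is only the shape of the procedure: we only ever consider test nodes whose children are all leaves, and we replace an insignificant one by the majority label of the examples that reach it.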

And then there are some standard tricks of the trade. For instance, looking at the deviations from what we would expect in a sum-of-squared-errors way: we know something about how such a sum of squared errors should be distributed, and comparing our quantity to that distribution gives us a measure.
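Presumably the quantity meant here is the total deviation used in standard chi-squared pruning of decision trees (as in the textbook treatment by Russell and Norvig). For a node with p positive and n negative examples, split by the tested attribute into v branches with counts (p_k, n_k), it reads:

```latex
% Expected counts in branch k if the attribute were irrelevant (null hypothesis):
\hat{p}_k = p \cdot \frac{p_k + n_k}{p + n},
\qquad
\hat{n}_k = n \cdot \frac{p_k + n_k}{p + n}

% Total deviation of the actual counts from these expectations:
\Delta = \sum_{k=1}^{v}
  \left(
    \frac{(p_k - \hat{p}_k)^2}{\hat{p}_k}
    + \frac{(n_k - \hat{n}_k)^2}{\hat{n}_k}
  \right)
```

Under the null hypothesis, Delta follows a chi-squared distribution with v - 1 degrees of freedom; that is the known distribution being compared against.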

And by some statistics voodoo we know that if this quantity is large enough, then a statistician would say it's significant. That's the idea.
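Purely as an illustration of that "voodoo", here is what such a significance check could look like in code, assuming scipy is available; the 5% level is a common textbook choice, not something fixed in the lecture.

```python
from scipy.stats import chi2

def is_significant(p, n, branches, alpha=0.05):
    """Chi-squared significance test for one attribute split.

    p, n     -- positive/negative counts at the node
    branches -- list of (p_k, n_k) counts per attribute value
    alpha    -- significance level (5% here)
    """
    total = p + n
    delta = 0.0
    for pk, nk in branches:
        size = pk + nk
        if size == 0:
            continue
        p_hat = p * size / total   # expected positives under the null hypothesis
        n_hat = n * size / total   # expected negatives under the null hypothesis
        if p_hat > 0:
            delta += (pk - p_hat) ** 2 / p_hat
        if n_hat > 0:
            delta += (nk - n_hat) ** 2 / n_hat
    # Under the null hypothesis, delta ~ chi-squared with (v - 1) degrees of freedom.
    critical = chi2.ppf(1 - alpha, df=len(branches) - 1)
    return delta > critical
```

A predicate along these lines (adapted to pull the per-branch counts out of a tree node) is what the prune sketch above expects as its is_significant argument.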

What we're seeing here, and what we've been seeing the whole semester, is that AI has been on a huge shopping tour everywhere: into logic, into probabilities, into statistics, ...
