We're still in the introductory phase of machine learning, and we're essentially looking at a problem that plagues all learning paradigms: the problem of overfitting versus underfitting.
Machine learning is an optimization process: as an agent, we're essentially trying to optimize what we do with respect to an external performance measure.
At the moment we're looking at machine learning from examples.
The examples we get only partially mirror the underlying processes we really want to learn. So there's always a tendency that, instead of learning the underlying processes, we just learn the examples we've seen so far. Which is of course only fair, because those examples are the only thing we're given.
The question is can we do something about overfitting?
Can we do something about underfitting?
Underfitting we can normally solve relatively easily by looking at more examples, if we have them. Overfitting can also sometimes be cured by more examples, because they will naturally generalize the behavior.
But we want to do something about that actively.
So we sometimes want to generalize our solutions just by looking at them in and of themselves.
That sometimes helps.
We looked at one example of that, which was decision tree pruning. Remember, decision tree learning has found us a nice decision tree; the question is whether we can make it better.
The idea here is that we'll go through the terminal nodes of that tree and for every
one of those look at whether it offers us enough information gain.
And the only real question there, since we can already compute the information gain, is how much gain is enough before we consider a node irrelevant.
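For concreteness, here is a minimal sketch of how that information gain could be computed for a Boolean classification; the function names and the (pos, neg) count representation are illustrative choices, not from the lecture:

    import math

    def entropy(pos, neg):
        # Entropy (in bits) of a sample with `pos` positive and `neg` negative examples.
        total = pos + neg
        if total == 0 or pos == 0 or neg == 0:
            return 0.0
        p = pos / total
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def information_gain(pos, neg, splits):
        # `splits` holds one (pos_k, neg_k) pair of counts per child of the node.
        total = pos + neg
        remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
        return entropy(pos, neg) - remainder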
Statistics has an answer for us, by way of standard significance tests.
And the idea for these significance tests is to use the information gain as a measure: we compare the information gain that a particular terminal node gives us to the information gain we would expect under the null hypothesis, namely that everything is just random. And if the node's gain is sufficiently near that of randomness, for some value of "sufficiently near", we can say this node doesn't give us enough information, so we throw it out.
That makes our decision trees smaller, and of course smaller decision trees make fewer decisions, so possibly they generalize better.
That's the idea.
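As a sketch of that pruning decision, assuming Boolean classification and using scipy's chi-squared distribution; the 5% significance level and all names here are illustrative assumptions, not the lecture's notation:

    from scipy.stats import chi2

    def should_prune(splits, alpha=0.05):
        # `splits`: one (pos_k, neg_k) pair of example counts per child
        # of a terminal test node.
        pos = sum(p for p, n in splits)
        neg = sum(n for p, n in splits)
        delta = 0.0
        for p_k, n_k in splits:
            size = p_k + n_k
            # Under the null hypothesis (the attribute is irrelevant), each
            # child keeps the parent's ratio of positive to negative examples.
            exp_p = pos * size / (pos + neg)
            exp_n = neg * size / (pos + neg)
            if exp_p > 0:
                delta += (p_k - exp_p) ** 2 / exp_p
            if exp_n > 0:
                delta += (n_k - exp_n) ** 2 / exp_n
        # delta is chi-squared distributed with (children - 1) degrees of
        # freedom under the null hypothesis; a small delta means the node's
        # gain looks like noise.
        return delta < chi2.ppf(1 - alpha, df=len(splits) - 1)

If should_prune returns True, the node's information gain is indistinguishable from randomness at the chosen level, and we throw the node out.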
And then there are some standard tricks of the trade. For instance, we look at the deviations in a sum-of-squared-errors way: we know something about how such a sum of squared deviations should be distributed, and comparing our quantity against that distribution gives us a measure. And by some statistics voodoo we know that if this quantity is of a certain size, then a statistician would say the deviation is significant.
That's the idea.
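Spelled out in the usual textbook notation (an assumption on my part; the clip does not write the formula down): with $p_k$ and $n_k$ the observed positive and negative counts under child $k$ of a node with totals $p$ and $n$, the expected counts under the null hypothesis and the deviation statistic are

$$\hat{p}_k = p \cdot \frac{p_k + n_k}{p + n}, \qquad \hat{n}_k = n \cdot \frac{p_k + n_k}{p + n}$$

$$\Delta = \sum_{k=1}^{\nu} \left( \frac{(p_k - \hat{p}_k)^2}{\hat{p}_k} + \frac{(n_k - \hat{n}_k)^2}{\hat{n}_k} \right)$$

and under the null hypothesis $\Delta$ is $\chi^2$-distributed with $\nu - 1$ degrees of freedom, so a sufficiently large $\Delta$ is what the statistician calls significant.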
What we're seeing here, and indeed what we've been seeing the whole semester, is that AI has been on a huge shopping tour: into logic, into probabilities, into statistics.
Recap: Using Information Theory (Part 2). The main video on this topic is in chapter 8, clip 6.