Okay, this seems to be working.
Okay, hello everybody.
As mentioned yesterday, Professor Kuhl isn't here this week, so I will have to do.
I'll take a bit more time today; I think I was a bit quick yesterday. I tend to do that, especially if I'm not
the one who actually made the slides.
So I figured I'll start by going over everything we did yesterday again and just summarising
the key results.
Okay, so first off, model selection.
When we do learning, we face a trade-off with respect to the complexity of the models we produce.
The more complex and specific the model, the more likely it is to actually fit all of the data that we have.
The downside is that the more precise the model, the more likely it is that we're going to overfit on the data.
There are various ways to address that problem.
The first applies if our hypothesis space can be ordered by some notion of complexity:
for example, by the degree of the polynomial if we do polynomial fitting, or by the depth of the
decision tree or the number of nodes if we do decision tree learning.
Then we can iterate over that notion of size, do learning for every individual size, keep
track of both the training and the validation errors, run that until the training error has
converged, and then go back and pick the hypothesis where the validation error is lowest.
Here's how that works with k-fold cross-validation.
We partition the data set into k folds, iterate over the notion of size, do training for each size,
keep track of the training errors and the validation errors, and then, once the training error has
converged, we go back and pick the model with the minimal validation error.
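As a rough sketch of that procedure in Python (this is an illustration, not code from the lecture; it assumes the learner is a factory that takes a size and returns an estimator with sklearn-style fit/predict, and that X and y are NumPy arrays):

import numpy as np

def k_fold_splits(n_examples, k, seed=None):
    # Partition example indices into k roughly equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_examples), k)

def cross_validate(learner, X, y, size, k=5):
    # Average training and validation error of learner(size) over k folds.
    folds = k_fold_splits(len(X), k)
    train_errs, val_errs = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = learner(size).fit(X[train_idx], y[train_idx])
        train_errs.append(np.mean(model.predict(X[train_idx]) != y[train_idx]))
        val_errs.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    return np.mean(train_errs), np.mean(val_errs)

def model_selection(learner, X, y, sizes, k=5, tol=1e-3):
    # Grow the size, track both errors, stop once the training error has
    # (roughly) converged, then return the size with the smallest validation error.
    history, prev_train = [], None
    for size in sizes:
        train_err, val_err = cross_validate(learner, X, y, size, k)
        history.append((size, train_err, val_err))
        if prev_train is not None and abs(prev_train - train_err) < tol:
            break
        prev_train = train_err
    best_size = min(history, key=lambda entry: entry[2])[0]
    return best_size, history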
Here's one example where we would do that with decision trees on the restaurant example.
We iterate over the size of the tree, using breadth-first search for the decision tree
construction, and run that until the training error has converged, which apparently happens
around generation 10; then we go back and pick the one where the validation error is minimal,
in this case after generation 7.
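As a concrete instance of the same idea, here is a sketch with scikit-learn decision trees, iterating over the maximum depth; note that scikit-learn's tree learner is not the breadth-first construction from the slides, and the restaurant data isn't available here, so a built-in dataset stands in:

from sklearn.datasets import load_breast_cancer  # stand-in for the restaurant data
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

best_depth, best_score = None, -1.0
for depth in range(1, 11):  # iterate over the size of the tree (here: maximum depth)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()  # k-fold validation accuracy
    if score > best_score:
        best_depth, best_score = depth, score
print(f"best depth: {best_depth}, cross-validated accuracy: {best_score:.3f}")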
That's one way to do things, by the way if there are any questions just raise your hands.
Alternatively, so far we've mostly just minimized the error rate. That of course works, but it
also means that we treat all errors that our hypothesis produces the same.
Usually that's not exactly what we want; usually some errors are more egregious than others.
So the alternative to just counting errors and taking their fraction is to introduce
a loss function that lets us measure more precisely how egregious a given error is.
That allows us, for example, to distinguish between false positives and false negatives in
cases where false positives are worse than false negatives, so we can penalize one
of them more strongly than the other.
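For instance, a loss that charges more for false positives than for false negatives could look like the following sketch; the particular costs are made up for illustration:

def asymmetric_loss(y_true, y_pred, fp_cost=5.0, fn_cost=1.0):
    # Per-example loss for 0/1 labels: correct predictions cost nothing,
    # false positives cost fp_cost, false negatives cost fn_cost.
    if y_pred == y_true:
        return 0.0
    return fp_cost if y_pred == 1 else fn_cost

# e.g. summed over a set of predictions:
truth = [1, 0, 1, 0]
preds = [1, 1, 0, 0]
total = sum(asymmetric_loss(t, p) for t, p in zip(truth, preds))  # 5.0 + 1.0 = 6.0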
Some popular loss functions are the absolute value loss and, almost equivalently, the squared
error loss. The squared error loss has the advantage that, if you think of it in terms of
derivatives, its derivative is not just a constant but is actually linear, i.e. if we do things
like gradient descent we penalize larger discrepancies more strongly than smaller ones.
In the following we will almost always just use the squared error loss as a reasonable default.
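Written out (this isn't on the slides, just the standard definitions, with $y$ the true value and $\hat{y}$ the prediction), the two losses and their derivatives with respect to the prediction are

L_1(y, \hat{y}) = |y - \hat{y}|, \qquad L_2(y, \hat{y}) = (y - \hat{y})^2,

\frac{\partial L_1}{\partial \hat{y}} = -\operatorname{sign}(y - \hat{y}), \qquad \frac{\partial L_2}{\partial \hat{y}} = -2\,(y - \hat{y}),

so the gradient of the squared error grows linearly with the discrepancy, while the gradient of the absolute loss has constant magnitude.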