16 - Artificial Intelligence II [ID:47307]

Okay, this seems to be working.

Okay, hello everybody.

As mentioned yesterday, Professor Kuhl isn't here this week, so I will have to do.

I'll do a bit more today, I think; I was a bit quick yesterday, and I tend to do that, especially if I'm not the one who actually made these slides. So I figured I'll start by just going over everything we did yesterday again and summarising the key results.

Okay, so first off, model selection.

When we do learning, we have a trade-off with respect to the complexity of the models we produce. The more complex, and thus more specific, the model is, the more likely it is to actually fit all of the data that we have. The trade-off is that the more precise the model is, the more likely we are to overfit on the data.

There are various ways to solve that problem.

The first works if our hypothesis space is such that we can order it by some notion of complexity, for example by the degree of the polynomial if we do polynomial fitting, or by the depth of the decision tree or the number of nodes if we do something like decision tree learning. Then what we can do is just iterate over that notion of size, do learning for every individual size, keep track of both the training and the validation errors, run that until the training error has converged, and then go back and pick the hypothesis where the validation error is the lowest.
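
To make that loop concrete, here is a minimal Python sketch of the idea; the callables train_for_size, training_error and validation_error are hypothetical placeholders for whatever learner and error measure are actually used, and convergence is checked with a simple tolerance.

```python
def model_selection(train_for_size, training_error, validation_error,
                    max_size=50, tol=1e-4):
    """Iterate over model sizes, track errors, and return the hypothesis
    with the lowest validation error once training error has converged."""
    results = []              # (size, hypothesis, train_err, val_err)
    prev_train_err = None
    for size in range(1, max_size + 1):
        h = train_for_size(size)            # learn a hypothesis of this size
        tr = training_error(h)
        va = validation_error(h)
        results.append((size, h, tr, va))
        # stop once the training error has (roughly) stopped improving
        if prev_train_err is not None and abs(prev_train_err - tr) < tol:
            break
        prev_train_err = tr
    # go back and pick the hypothesis with minimal validation error
    best_size, best_h, _, best_va = min(results, key=lambda r: r[3])
    return best_size, best_h, best_va
```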

Here's how that works if we do k-fold cross-validation. We partition the data set into k folds, iterate over some notion of size, do training for each of them, keep track of the training and validation errors, and then, once the training error has converged, we go back and pick the model with the minimal validation error.
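
Here is a small sketch of how the error estimates for one model size could be computed with k-fold cross-validation; again, the train and error callables are hypothetical placeholders, not part of the lecture material.

```python
import random

def k_fold_errors(data, k, train, error, seed=0):
    """Average training and validation error over a k-fold partition:
    each fold is held out once as validation data."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]      # k roughly equal parts
    train_errs, val_errs = [], []
    for i in range(k):
        held_out = folds[i]
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h = train(rest)                          # learn on the other k-1 folds
        train_errs.append(error(h, rest))
        val_errs.append(error(h, held_out))      # evaluate on the held-out fold
    return sum(train_errs) / k, sum(val_errs) / k
```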

Here's one example where we would do that with decision trees on the restaurant example. So we iterate over the size of the tree, doing breadth-first search for the decision tree construction, and run that until the training error has converged, which apparently it does at around generation 10; then we go back and pick the one where the validation error is minimal, in this case at generation 7.
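
Purely as an illustration of the same procedure with an off-the-shelf library (scikit-learn is my choice here, not something the lecture uses, and the restaurant data set is replaced by a synthetic one), selecting the tree depth by cross-validated score could look roughly like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# stand-in data set; the lecture's restaurant examples are not reproduced here
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores_by_depth = {}
for depth in range(1, 11):                      # iterate over tree size (depth)
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # mean validation accuracy over a 5-fold cross-validation
    scores_by_depth[depth] = cross_val_score(clf, X, y, cv=5).mean()

best_depth = max(scores_by_depth, key=scores_by_depth.get)
print(best_depth, scores_by_depth[best_depth])
```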

That's one way to do things. By the way, if there are any questions, just raise your hands.

Alternatively, what can we do? So far we've mostly just minimized the error rate. That of course works, but it also means that we treat all errors that our hypothesis produces the same. Usually that's not exactly what we want; usually some errors are more egregious than others.

So the alternative to just counting errors and taking the fraction of those is to introduce a loss function that allows us to measure more precisely how egregious a given error is. That allows us, for example, to distinguish between false positives and false negatives in cases where false positives are worse than false negatives, so we can penalize one more strongly than the other.
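
As a tiny sketch of such a loss function (the concrete costs 5.0 and 1.0 are made-up example values), penalizing false positives more heavily than false negatives could look like this:

```python
def weighted_binary_loss(y_true, y_pred, fp_cost=5.0, fn_cost=1.0):
    """Loss for one binary prediction; the costs are illustrative values
    that penalize false positives five times as hard as false negatives."""
    if y_pred == 1 and y_true == 0:
        return fp_cost          # false positive
    if y_pred == 0 and y_true == 1:
        return fn_cost          # false negative
    return 0.0                  # correct prediction costs nothing

def empirical_loss(hypothesis, data, loss=weighted_binary_loss):
    """Average loss of a hypothesis over labelled examples (x, y)."""
    return sum(loss(y, hypothesis(x)) for x, y in data) / len(data)
```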

Some popular loss functions are the absolute-value loss and, almost equivalently, the squared-error loss. The squared-error loss has the advantage that, if you think of it in terms of derivatives, the derivative is not just a constant but is actually linear, i.e. if we do things like gradient descent we are going to penalize larger discrepancies more strongly than smaller ones. In the following we will almost always just use the squared-error loss as a reasonable default.
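
To see the point about the derivatives, compare the two losses and their gradients with respect to the prediction; this is a small illustrative sketch, not lecture code:

```python
def absolute_loss(y, y_hat):
    return abs(y - y_hat)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss_grad(y, y_hat):
    # d/dy_hat |y - y_hat|: constant magnitude (undefined at y == y_hat),
    # so every error pushes the prediction equally hard
    return 1.0 if y_hat > y else -1.0

def squared_loss_grad(y, y_hat):
    # d/dy_hat (y - y_hat)^2 = -2 (y - y_hat): linear in the residual,
    # so larger errors push harder, which gradient descent exploits
    return -2.0 * (y - y_hat)

for residual in (0.1, 1.0, 10.0):
    print(residual,
          absolute_loss_grad(residual, 0.0),    # always magnitude 1
          squared_loss_grad(residual, 0.0))     # grows with the residual
```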
