Okay, so welcome everybody. It's great to see so many of you here. So I finally made
it back to the lecture. I had a couple of health issues, but everything is back to
normal now, so I can give the lecture again. Today's topic is going to be deep
learning and regularization. So far you have seen how a neural network is constructed,
how you can stack different layers on top of each other, you have seen the universal
approximation theorem, you have discussed the different activation functions, how to
compose such a network, and how to train it using the backpropagation algorithm. Today
we want to look a bit into regularization. So
we have this local gradient descent procedure, and we want to make sure that we don't run into
trouble during training; regularization is an important and very widely applied technique
for that. So let's discuss this issue a bit. Say we have some data, and in the
top left here you see a rather simple case. This case is actually linearly separable, so
we would even be able to solve it with a perceptron. But unfortunately, real-life data
typically does not look this way, and we cannot easily separate the classes; it is
more complicated. There is measurement noise, there are ambiguous examples,
and there are cases where it is really hard to distinguish the classes. So what you can
typically expect is that there is no linear boundary that will separate your classes in
the feature space. So what do you do? Well, you somehow have to separate the classes, and one thing
that you could do is just assume a linear decision boundary, right? So you
just draw a line, and this is very convenient because the line has very few parameters
and is rigid, so we can separate the two classes with it. But you will realize that the
line may not be a very good model to separate those two classes, because it will,
for example, easily misclassify the points down here, so maybe you want to choose
something else. The nice thing about this line, however, is that with so few parameters
you can estimate this kind of model very robustly. What else can you do? You can of course increase your
model complexity: you can choose a model that can describe many, many different ways of
separating those two classes, and you may end up with something like this one. See,
you now have a decision boundary that separates the two classes perfectly.
The problem, of course, is that we introduce a lot of variance. You see that the
boundary is not very stable, and if I increase the complexity of my decision boundary,
I can essentially always find a model that perfectly separates my training data. So I can
find these very complex decision boundaries with a perfect fit. This would result in a zero
loss, because both classes are perfectly separated, but we don't know how well this
will actually carry over to a new data set. If you now draw new samples from the same
distribution (assume, say, that each class follows a Gaussian distribution), then this
decision boundary may not be that good after all. So what you want to have in the end is something
that makes sense: a decision boundary that is stable and that will
generalize, so that if we drew a new set of points, we would still be able to use this
decision boundary on unseen test data. I mean, good performance on the training
data is great, but in the very end you want to use your model on data you have not seen
before, and therefore you need something that generalizes well. This kind of problem
can also be described in a more general way, which leads to the bias-variance decomposition.
The bias-variance decomposition helps us describe how to trade off
model complexity against a good fit to our training data. So now let's consider a regression
problem: we have some true function h of x, observed with noise, and we assume
zero-mean Gaussian noise. We now want to find a model y hat that is estimated from some
data set D. We can describe this process using the expected loss, and the expected loss
for a single point can be written as the squared difference between the true function and
the estimate. You see this is essentially an L2 loss: we subtract the two and take the
power of two, so this is our L2 loss.
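Taking the expectation of this squared error over data sets D gives the classic decomposition E_D[(y_hat(x; D) - h(x))^2] = (E_D[y_hat(x; D)] - h(x))^2 + E_D[(y_hat(x; D) - E_D[y_hat(x; D)])^2], i.e. squared bias plus variance (with an additional sigma^2 noise term when we compare against noisy targets instead of h itself). A small simulation can make this concrete. The sketch below assumes a sinusoidal true function, Gaussian noise, and polynomial least-squares fits of two different degrees; all of these concrete choices are illustrative, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # hypothetical "true" function (an assumption for this sketch)
    return np.sin(2 * np.pi * x)

sigma = 0.3        # std. dev. of the zero-mean Gaussian observation noise
x_query = 0.3      # single point x at which we evaluate the expected loss
n_sets, n_points = 200, 20

results = {}
for degree in (1, 9):
    preds = np.empty(n_sets)
    for s in range(n_sets):
        # draw a fresh training set D from the same distribution each time
        x = rng.uniform(0.0, 1.0, n_points)
        y = h(x) + rng.normal(0.0, sigma, n_points)
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        preds[s] = np.polyval(coeffs, x_query)
    # Monte Carlo estimates of the decomposition terms at x_query
    bias_sq = (preds.mean() - h(x_query)) ** 2
    variance = preds.var()
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The rigid degree-1 fit plays the role of the straight line from before (high bias, low variance), while the degree-9 fit plays the role of the overly complex decision boundary (low bias, high variance), which is exactly the trade-off the classification pictures illustrated.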
Presenters
Accessible via: Open access
Duration: 01:05:49 min
Recording date: 2019-11-19
Uploaded: 2019-11-19 19:19:02
Language: en-US