Okay, so welcome everybody. It's great to see so many of you here. So I finally made
it back to the lecture. I had a couple of health issues, but everything is back to
normal now, so I can give the lecture again. Today's topic is going to be deep
learning and regularization. So far you have seen how a neural network is constructed,
how you can stack different layers on top of each other, you have seen the universal
approximation theorem, you have discussed the different activation functions, how to
compose such a network, and how to train it using the backpropagation algorithm. Today
we want to look a bit into regularization. So
we have this local gradient descent procedure, and we want to make sure that we don't run into
trouble during training; regularization is an important and very widely applied technique
for that. So let's discuss this issue a bit. Say we have some data, and in the
top left here you see a rather simple case. This case is actually linearly separable, so
we would even be able to solve it with a perceptron. But unfortunately, real-life data
typically does not look this way, and we cannot easily separate the classes; it is
more complicated. There is measurement noise, there are ambiguous examples,
and there are cases where it is really hard to distinguish the classes. So what you can
typically expect is that there is no linear boundary that will separate your classes in
the feature space. So what do you do? Well, you somehow have to separate the classes, and one thing
that you could do is just assume a linear decision boundary, right? So you
just draw a line, and this is very convenient because the line has very few parameters
and is rigid, so we can separate the two classes with it. But you will realize that the
line may not be a very good model to separate those two classes, because it will,
for example, easily misclassify the points down here, so maybe you want to choose
something else. The nice thing about this line, however, is that with so few parameters
you can estimate this kind of model very robustly. What else can you do? You can of course increase your
model complexity: you can choose a model that can describe many, many different ways of
separating those two classes, and you may end up with something like this one. See,
you now have a decision boundary that separates the two classes perfectly.
The problem, of course, is that we introduce a lot of variance. You see that the
boundary is not very stable, and if I increase the complexity of my decision boundary,
I can essentially always find a model that perfectly separates my training data. So I can
find these very complex decision boundaries with a perfect fit. This would result in a zero
loss, because both classes are perfectly separated, but we don't know how well this
will actually carry over to a new data set. If you now draw new samples from the same
distribution (assume, say, that each class follows a Gaussian distribution), then this
decision boundary may not be that good after all. So what you want to have in the end is something
that makes sense: a decision boundary that is stable and that will
generalize, so that if we drew a new set of points, we would still be able to use this
decision boundary on unseen test data. I mean, good performance on the training
data is great, but in the very end you want to use your model on data you have not seen
before, and therefore you need something that generalizes well. This kind of problem
can also be described in a more general way, which leads to the bias-variance decomposition.
The bias-variance decomposition helps us describe how to trade off
model complexity against a good fit to our training data. So now let's consider a regression
problem: we have some true function h of x, observed with noise, and we assume
zero-mean Gaussian noise. We now want to find a model y hat that is estimated from some
data set D. We can describe this process using the expected loss, and the expected loss
for a single point can be written as the squared difference between the true function and
the estimate. You see this is essentially an L2 loss: we subtract the two and take the
power of two, so this is our L2 loss.
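Taking the expectation of this squared error over data sets D gives the classic decomposition E_D[(y_hat(x; D) - h(x))^2] = (E_D[y_hat(x; D)] - h(x))^2 + E_D[(y_hat(x; D) - E_D[y_hat(x; D)])^2], i.e. squared bias plus variance (with an additional sigma^2 noise term when we compare against noisy targets instead of h itself). A small simulation can make this concrete. The sketch below assumes a sinusoidal true function, Gaussian noise, and polynomial least-squares fits of two different degrees; all of these concrete choices are illustrative, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # hypothetical "true" function (an assumption for this sketch)
    return np.sin(2 * np.pi * x)

sigma = 0.3        # std. dev. of the zero-mean Gaussian observation noise
x_query = 0.3      # single point x at which we evaluate the expected loss
n_sets, n_points = 200, 20

results = {}
for degree in (1, 9):
    preds = np.empty(n_sets)
    for s in range(n_sets):
        # draw a fresh training set D from the same distribution each time
        x = rng.uniform(0.0, 1.0, n_points)
        y = h(x) + rng.normal(0.0, sigma, n_points)
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        preds[s] = np.polyval(coeffs, x_query)
    # Monte Carlo estimates of the decomposition terms at x_query
    bias_sq = (preds.mean() - h(x_query)) ** 2
    variance = preds.var()
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The rigid degree-1 fit plays the role of the straight line from before (high bias, low variance), while the degree-9 fit plays the role of the overly complex decision boundary (low bias, high variance), which is exactly the trade-off the classification pictures illustrated.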
Presenters
Accessible via: Open access
Duration: 01:05:49 min
Recording date: 2019-11-19
Uploaded: 2019-11-19 19:19:02
Language: en-US