Okay, let's start. Welcome everyone to today's lecture on deep learning. I'm filling in for
Professor Meier today. My name is Katharina and today we will talk about regularization.
So far you've heard about the motivation for neural networks, the motivation more specifically
for CNNs, what is a loss function, what are the different parts of a network including
convolutional layers, fully connected layers, activation functions, pooling layers, etc.
and how to optimize the network with respect to a certain loss function. So we should be
more or less all set to start training our network, given a huge amount of training
data. However, generally in a real world setting we only have a very limited amount of training
data and we need to prevent powerful networks from overfitting to our training data such
that they generalize well to unobserved data. So today we will talk about the theoretical
background behind regularization. So how can we prevent our network from actually overfitting?
We will get into classical techniques, talk about normalization and strategies that are
more targeted to networks including dropout and initialization and then go to more advanced
topics such as transfer learning and multi-task learning. So let's start with
an introduction to regularization. So assume that we have two classes and samples from
two classes and in this case we can actually see that they are pretty well separated and
pretty well separable. So you can think of a number of decision boundaries that separate
these two classes. However, generally our data doesn't look as nice. Instead it looks
more something like this. For example, because of sensor noise, measurement noise, because
of mislabeled data or because the data is ambiguous in a certain sense. Remember that
these ImageNet examples where we have ambiguous or even wrong labels. So we still want to
find a sensible decision boundary between those two. Now, depending on the model that
we choose, we can, for example, observe underfitting. So our model is not strong enough to separate
the two classes in a sensible way. Instead it finds a decision boundary that is too simple
to separate them. On the other end of the spectrum we can also have a very powerful
model that can adapt to our training data very well. And in this case this is represented
by this wiggly line that you see here. You should remember from pattern recognition that
in this case we don't expect our model to generalise very well to unobserved data and
that this wiggly line will look quite different depending on the training samples that we
actually have. So this is what we call overfitting: the model adapts too closely to our training data. What we
are actually looking for is a sensible boundary that discriminates the two classes. It shouldn't
be too closely adapted to the training data, but it shouldn't underfit the underlying distribution either.
So to look at this from a more theoretical perspective, we now go to a regression problem
where we have a true underlying distribution and noise that is added to it.
Now we have a model that is estimated based on a data set and we can define the expected
loss as described here. So we have a mean squared loss and calculate the expectation
value over the data sets that we use for training, that is, over the different
instantiations of our model. We can then use the bias-variance decomposition to decompose
this expectation value into three parts. On one hand we have the bias part, the variance
part, and on the other hand we have the irreducible error. The term irreducible error
already tells us that we don't really have to take care of this part: we can't reduce
it anyway, so we can simply ignore it in the following observations. Then the bias is quite interesting in the sense
that it involves both the expectation value of our model and the true distribution
that underlies it. So it represents the mean difference between our model and the
underlying distribution. In the variance, on the other hand, we
don't care about the true distribution at all but we are just looking at the variance
of different model instantiations. So how much do models trained on different data sets vary?
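The decomposition discussed above can be written out explicitly. The notation here is an assumption, since the lecture slides are not shown: $f$ is the true function, $\hat f_{\mathcal D}$ is the model trained on data set $\mathcal D$, and the observations are $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$.

```latex
\mathbb{E}_{\mathcal{D},\varepsilon}\!\left[\bigl(y - \hat f_{\mathcal{D}}(x)\bigr)^{2}\right]
= \underbrace{\bigl(f(x) - \mathbb{E}_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)]\bigr)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\bigl(\hat f_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)]\bigr)^{2}\right]}_{\text{variance}}
+ \underbrace{\sigma^{2}}_{\text{irreducible error}}
```

The bias term compares the average model to the true function, the variance term measures how much individual models scatter around that average, and the noise variance is the part no model choice can remove.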
Presenters:
Available via: Open access
Duration: 01:04:36 min
Recording date: 2018-05-09
Uploaded on: 2018-05-09 16:09:04
Language: en-US