Welcome back to our deep learning lecture. Today we want to talk a bit more about regularization.
The first topic that I want to introduce is the idea of normalization. So why could unnormalized data be a problem? Let's look at some original data here. A typical approach is that you subtract the mean and then you also normalize the variance. This is very useful because the inputs then lie in an expected range. Of course, if you apply such a normalization, you want to estimate its parameters only on the training data set.
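As a minimal sketch of this preprocessing step (the arrays X_train and X_test are hypothetical placeholders, not from the lecture):

```python
import numpy as np

# Hypothetical data standing in for your own training and test sets.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 10))
X_test = rng.normal(loc=5.0, scale=3.0, size=(200, 10))

# Estimate mean and standard deviation on the training set only ...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8   # small epsilon avoids division by zero

# ... and apply the same transform to both training and test data.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```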
Another approach is not just to normalize before the network, but to also normalize within the network. This leads to the concept of batch normalization. The idea is to introduce new layers with trainable parameters gamma and beta that rescale the output of the layer. At the input of such a layer you measure the mean and the standard deviation of the batch: you compute the current mean and standard deviation over the mini-batch and use them to normalize the input activations. As a result, the activations have zero mean and unit standard deviation. This is calculated on the fly throughout the training process, and then you scale the output of the layer with the trainable weights beta and gamma, so that it is rescaled appropriately for the next layer.
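A small sketch of what such a training-time forward pass could look like for fully connected activations; the function name and shapes are illustrative assumptions:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass on a mini-batch x of shape
    (batch_size, num_features); gamma and beta are trainable vectors."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit standard deviation
    return gamma * x_hat + beta              # rescale for the next layer

# Hypothetical usage on a batch of 32 samples with 4 features.
x = np.random.randn(32, 4) * 2.0 + 1.0
gamma = np.ones(4)
beta = np.zeros(4)
y = batch_norm_train(x, gamma, beta)
```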
Of course, this on-the-fly normalization can only be done during training. For test time, you compute the mean and the standard deviation of each batch normalization layer once over the entire training set after training has finished, and you keep them constant for all future applications of the network. Note that mu and sigma are vectors with exactly the same dimension as the activation vector.
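At test time the layer could then look like this; running_mu and running_var stand for the fixed statistics estimated over the training set (many frameworks instead track running averages during training, which is a detail the lecture does not cover):

```python
import numpy as np

def batch_norm_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """At test time the stored statistics are used instead of the
    batch statistics, so the layer becomes a fixed affine transform."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```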
Typically you pair batch normalization with convolutional layers. In this case it is slightly different: you use a scalar mu and a scalar sigma for every channel, so the statistics are shared across the batch and the spatial dimensions.
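As an illustrative sketch, assuming a feature map of shape (batch, channels, height, width):

```python
import numpy as np

# Hypothetical feature map: 8 samples, 16 channels, 32x32 spatial resolution.
x = np.random.randn(8, 16, 32, 32)

# One scalar mean and variance per channel, pooled over batch and space.
mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 16, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)
```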
Now you may ask why batch normalization actually helps. There is overwhelming practical evidence that it is a really useful tool, and it was originally motivated by the internal covariate shift: the ReLU is not zero-centered, initialization and the input distribution may not be normalized, and therefore the distribution of a layer's inputs shifts over time. In deeper nets this effect can even be amplified. As a result, the layers constantly have to adapt, which leads to slow training. If you look closely at these observations, however, there is very little concrete evidence to support this theory. At NeurIPS 2018 there was a paper that showed how batch normalization helps optimization.
The authors showed that batch normalization is effective even if you deliberately reintroduce an internal covariate shift after the batch normalization layer; even in these situations batch normalization still helped. They could show experimentally that the method smooths the loss landscape, which supports the optimization, and that it improves the Lipschitzness of the loss function, i.e., it lowers the Lipschitz constant. The Lipschitz constant bounds the highest slope that occurs in the function, so improving the Lipschitzness essentially means that very steep slopes no longer occur. This is in line with smoothing the loss landscape.
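In symbols, a function f is L-Lipschitz if the slope between any two points is bounded by the constant L:

```latex
% Lipschitz continuity of the loss f with constant L:
% the change in f is at most L times the change in its argument.
\[
  \lvert f(x) - f(y) \rvert \le L \,\lVert x - y \rVert
  \quad \text{for all } x, y .
\]
```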
They could even provide a proof of this property in the paper. So batch normalization improves stability with respect to hyperparameters, initialization, convergence, and similar properties. By the way, this can also be observed for ℓp normalization.
So batch normalization is the way to go, and there are generalizations of it: you can calculate mu and sigma over the activations of a batch, over a layer, over the spatial dimensions, over a group of channels, or over the weights of a layer, and we might have missed some variations. So this is a very powerful tool.
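As a rough sketch of where these variants pool their statistics (again assuming a feature map of shape (batch, channels, height, width); the group size is an arbitrary choice):

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)   # hypothetical (batch, channels, H, W) tensor

# Batch norm:    statistics per channel, pooled over batch and space.
mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)
# Layer norm:    statistics per sample, pooled over channels and space.
mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)
# Instance norm: statistics per sample and channel, pooled over space only.
mu_in = x.mean(axis=(2, 3), keepdims=True)
# Group norm:    statistics per sample and channel group (here 4 groups of 4).
mu_gn = x.reshape(8, 4, 4, 32, 32).mean(axis=(2, 3, 4), keepdims=True)
```

Weight normalization, also mentioned above, instead normalizes the weight vectors of a layer rather than the activations.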