19 - Deep Learning - Plain Version 2020 [ID:21114]

Welcome back to our deep learning lecture. Today we want to talk a bit more about regularization. The first topic that I want to introduce is the idea of normalization. So why could unnormalized data be a problem? Let's look at some original data here. A typical approach is that you subtract the mean and then you also normalize the variance. This is very useful because then we are in an expected range regarding the input. Of course, if you do these normalizations, you want to estimate them only on the training data set.
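
As a small illustration (not code from the lecture, just a minimal NumPy sketch with made-up arrays), input normalization with statistics estimated on the training set only could look like this:

```python
import numpy as np

# Toy stand-in data; in practice these would be your real training and test sets.
X_train = np.random.randn(1000, 3) * 5.0 + 2.0
X_test = np.random.randn(200, 3) * 5.0 + 2.0

# Estimate the normalization statistics on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8   # small epsilon to avoid division by zero

# Apply the same statistics to both training and test data.
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std
```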

Another approach is not just to normalize before the network, but to normalize within the network as well. This leads to the concept of batch normalization. The idea is to introduce new layers with parameters gamma and beta, which are used to rescale the output of the layer. At the input of the layer you start measuring the mean and the standard deviation of the batch. So what you do is you compute the current mean and standard deviation over the mini-batch and then use them to normalize the activations of the input. As a result you have zero mean and unit standard deviation. This is calculated on the fly throughout the training process, and then you scale the output of the layer with the trainable weights beta and gamma, so that it is scaled appropriately as desired for the next layer.
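
To make the training-time computation concrete, here is a minimal sketch (my own illustration, not code from the lecture) of batch normalization for a fully connected layer; gamma, beta, and eps are illustrative names:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for activations of shape
    (batch_size, num_features); gamma and beta are trainable."""
    mu = x.mean(axis=0)                     # mean over the mini-batch
    var = x.var(axis=0)                     # variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta             # rescale as desired for the next layer
```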

This nice feature can of course only be done during training. When you move on towards test time, you compute the mean and the standard deviation of the batch normalization layer once for the entire training set after you have finished training, and you keep them constant for all future applications of the network. Of course, mu and sigma are vectors, and they have exactly the same dimension as the activation vector.
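
At test time the same layer is then applied with fixed statistics. A minimal sketch, assuming mu and var have already been computed once over the training set:

```python
import numpy as np

def batch_norm_test(x, gamma, beta, mu, var, eps=1e-5):
    """Test-time batch normalization: mu and var are fixed vectors with the
    same dimension as the activation vector, computed once after training."""
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```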

Typically you pair this with convolutional layers. In this case the batch normalization is slightly different: you use a scalar mu and a scalar sigma for every channel. So there is a slight difference if you use it with convolutions.
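
For convolutional feature maps, the only change is the axes over which the statistics are computed. A hedged sketch, assuming the common (batch, channels, height, width) layout:

```python
import numpy as np

def batch_norm_conv_train(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for feature maps of shape (N, C, H, W):
    one scalar mean and variance per channel, computed over the batch
    and spatial dimensions; gamma and beta are per-channel."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```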

You may ask why batch normalization actually helps. There is overwhelming practical evidence that it is a really useful tool, and it was originally motivated by the internal covariate shift. The problem that the ReLU is not zero-centered was identified; hence initialization and input distribution may not be normalized, and therefore the input distribution shifts over time. In deeper nets you can even get an amplified effect. As a result the layers constantly have to adapt, and this leads to slow training. Now if you look at these observations closely, there is very little concrete evidence to support this theory.

At NeurIPS 2018 there was a paper that showed how batch normalization helps optimization. The authors showed that batch normalization is effective even if you reintroduce an internal covariate shift right after the batch normalization layer. So even in these situations batch normalization helped. They could show experimentally that the method smooths the loss landscape, which supports the optimization, and that it also improves the Lipschitzness of the loss function. The Lipschitz constant is the highest slope that occurs in the function, so if you improve the Lipschitzness, it essentially means that high slopes no longer occur. This is in line with smoothing the loss landscape.
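
For reference, the standard definition of Lipschitz continuity (not a formula from the slides) reads:

```latex
% f is L-Lipschitz if, for all x and y,
\| f(x) - f(y) \| \le L \, \| x - y \|
% The Lipschitz constant is the smallest such L, i.e. the highest slope of f;
% improved Lipschitzness means this constant becomes smaller.
```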

They could even provide a proof of this property in the paper. So batch normalization improves stability with respect to hyperparameters, initialization, convergence, and similar properties. By the way, this can also be observed for Lp normalization.

So batch normalization is the way to go. There are generalizations of batch normalization: you can calculate mu and sigma over the activations of a batch, over a layer, over the spatial dimensions, over a group, or over the weights of a layer, and we might have missed some variations. So this is a very powerful tool.
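
As an illustration of how several of these variants differ only in the axes over which mu and sigma are computed, here is a hedged NumPy sketch for a (batch, channels, height, width) tensor; the group size is just an example value, and the weight-based variant is not shown:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Zero mean, unit variance over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(8, 32, 16, 16)     # (batch, channels, height, width)

bn = normalize(x, axes=(0, 2, 3))      # batch norm: statistics per channel
ln = normalize(x, axes=(1, 2, 3))      # layer norm: statistics per sample
inorm = normalize(x, axes=(2, 3))      # instance norm: per sample and channel

# Group norm: split the channels into groups and normalize within each group.
groups = 4
gn = normalize(x.reshape(8, groups, 32 // groups, 16, 16),
               axes=(2, 3, 4)).reshape(x.shape)
```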

Part of a video series:
Accessible via: Open Access
Duration: 00:09:06 min
Recording date: 2020-10-12
Uploaded on: 2020-10-12 14:16:31
Language: en-US

Deep Learning - Regularization Part 3

This video discusses normalization techniques such as batch normalization and self-normalizing units, and explains the concepts of dropout and DropConnect.

For reminders about new videos, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
