Welcome to our deep learning lecture. Today we want to talk about regularization a bit more.
The first topic that I want to introduce covers some ideas on normalization.
The stuff that works best is really simple.
So why could unnormalized data be a problem? Let's look at some original data here.
A typical approach is to subtract the mean, and then you can also normalize the variance.
This is very useful because the inputs then lie in an expected range.
Of course, if you do such a normalization, you should estimate the mean and variance on the training data only, and then apply the same transformation to the validation and test data.
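As a minimal sketch of this idea (using NumPy and made-up data purely for illustration):

```python
import numpy as np

# Made-up data just for illustration: rows are samples, columns are features.
X_train = np.random.randn(1000, 16) * 5.0 + 3.0
X_test = np.random.randn(200, 16) * 5.0 + 3.0

# Estimate mean and standard deviation on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8  # guard against zero variance

# Apply the same transformation to both training and test data.
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std
```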
So there's also normalization within networks, and this is achieved by batch normalization.
The idea in batch normalization is that you add a new layer with two new parameters, gamma and beta.
And at the input of the layer, you start measuring the mean and the standard deviation of the batch.
So what you do is you compute the current mean and standard deviation over the mini-batch.
And then you use that to normalize the activations of the input to this layer to have a zero mean and a unit standard deviation.
So this is calculated on the fly throughout the training process.
At the output of the layer, you then scale and shift with the trainable parameters gamma and beta,
so that the activations are scaled appropriately, as desired for the next layers.
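Written out, this is the standard batch normalization transform over a mini-batch of size m; the small constant epsilon is not mentioned explicitly here, but is part of the usual formulation for numerical stability:

```latex
\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2, \qquad
\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
```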
So this nice feature can, of course, only be done during training.
When you then move on to test time, you compute the mean and the standard deviation in the batch normalization layer once after you have finished training,
and you keep them constant for all future applications of the network.
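A rough NumPy sketch of this train/test distinction is shown below. Note that instead of a single pass over the data after training, this sketch tracks a running average during training, which is how many implementations realize the same idea; the momentum value and function name are my own choices:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.9, eps=1e-5):
    """Batch normalization for activations x of shape (batch, features)."""
    if training:
        # Normalize with the statistics of the current mini-batch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Keep running estimates that are frozen and reused at test time.
        running_mean[:] = momentum * running_mean + (1 - momentum) * mean
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        # At test time, the stored statistics stay constant.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```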
Of course, mu and sigma are vectors and they have exactly the same dimension as the activation vector.
If you pair this with convolutional layers, batch normalization is slightly different: you use one scalar mu and one scalar sigma for every channel.
So there's a slight difference if you use it in convolutional layers.
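For a feature map in an assumed (batch, channels, height, width) layout, a sketch of these per-channel statistics could look like this:

```python
import numpy as np

x = np.random.randn(8, 32, 28, 28)  # (batch, channels, height, width)

# One scalar mean and one scalar standard deviation per channel,
# computed over the batch and both spatial dimensions.
mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 32, 1, 1)
sigma = x.std(axis=(0, 2, 3), keepdims=True)  # shape (1, 32, 1, 1)

x_hat = (x - mu) / (sigma + 1e-5)
```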
So you may argue, why does batch normalization actually help?
And there's overwhelming practical evidence that it's really a useful tool.
This has been originally motivated by the internal covariate shift.
For example, the problem that the ReLU is not zero-centered was identified.
Also, depending on the initialization, the input distribution of a layer may not be normalized,
and therefore this input distribution shifts over time.
In deeper nets, this effect is even amplified.
As a result, the layers constantly have to adapt, and this leads to slow learning.
Now, if you look at these observations closely, there is very little concrete evidence to support this theory.
We are happy that it works better than any competing method, but that doesn't mean that we think we are done.
At NeurIPS 2018, there was a paper on how batch normalization helps optimization.
There, the authors showed that batch normalization is effective even if you reintroduce an internal covariate shift right after the batch normalization layer.
So even in these situations, batch normalization helped.
They could show experimentally that the batch normalization smooths the loss landscape, which is very nice for the optimization.
It also improved the Lipschitzness of the loss function and of its gradient.
The Lipschitz constant is the highest slope that occurs in the function.
And if you improve the Lipschitzness, it essentially means that high slopes no longer occur.
So this is in line with smoothing the loss landscape.
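More formally, a function f is L-Lipschitz if

```latex
\| f(\mathbf{x}) - f(\mathbf{y}) \| \leq L \, \| \mathbf{x} - \mathbf{y} \| \quad \text{for all } \mathbf{x}, \mathbf{y},
```

so a smaller Lipschitz constant L means that the loss cannot change arbitrarily fast, which matches the intuition of a smoother loss landscape.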
And they even could provide a proof for this property in the paper, which is very interesting.
There is nothing more practical than a good theory.
So batch normalization improves stability with respect to hyperparameters, initialization, and convergence. Similar properties, by the way, can also be observed for Lp normalization.
So that's also one way to go.
There are generalizations of batch normalization.
You can calculate mu and sigma over the activations of a batch, over a layer, over the spatial dimensions, over a group of channels, or over the weights of a layer.
There's many different variations of batch normalization.
And that's also a very powerful tool.
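As a small illustration of how these variants differ only in the axes over which mu and sigma are computed (NumPy, assumed NCHW layout; the axis choices correspond to what is commonly called batch, layer, instance, and group normalization, while weight normalization acts on the weights rather than the activations and is not shown):

```python
import numpy as np

x = np.random.randn(8, 32, 28, 28)  # (batch, channels, height, width)

def normalize(x, axes, eps=1e-5):
    mu = x.mean(axis=axes, keepdims=True)
    sigma = x.std(axis=axes, keepdims=True)
    return (x - mu) / (sigma + eps)

bn = normalize(x, axes=(0, 2, 3))   # over the batch and space, per channel (batch norm)
ln = normalize(x, axes=(1, 2, 3))   # over all features of one sample (layer norm)
inorm = normalize(x, axes=(2, 3))   # over space, per sample and channel (instance norm)
# Group norm: split the 32 channels into 4 groups of 8 and normalize within each group.
gn = normalize(x.reshape(8, 4, 8, 28, 28), axes=(2, 3, 4)).reshape(x.shape)
```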
What is clear to me is that engineers and companies and labs and grad students will continue to tune architectures and explore all kinds of tweaks to make the current state of the art ever slightly better.
Another tool that is also very effective is the self-normalizing neural network.