20 - Deep Learning - Regularization Part 4 [ID:16895]

Welcome back to deep learning. So today we want to look at a couple of initialization techniques

that will come in really handy throughout your work with deep learning networks. So you may

wonder: why does initialization matter? If you have a convex function, it actually doesn't matter at all

because you follow the negative gradient direction and you will always find the global minimum. So no

problem for convex optimization. However, many of the problems that we are dealing with are

non-convex, and a non-convex function may have different local minima. Now, if I start at this point, you

can see that I reach one local minimum by the optimization, but if I were to start at this other point,

you can see that I would end up with a very different local minimum. So for non-convex problems,

initialization is actually a big deal, and neural networks with non-linearities are in general

non-convex. So what can be done? Well, of course you have to work with some initialization, and for the

biases this is quite easy: you can initialize them to zero. This is very typical. Keep in mind

that if you're working with ReLUs, you may want to start with a small positive constant instead, because this

helps with the dying ReLU issue. For the weights, you need to be

random to break the symmetry. We already had this problem with dropout, where we need additional

regularization in order to break the symmetry. It would be especially bad to initialize them with

zeros, because then the gradient is zero. So this is something that you don't want to do. Similar to

the learning rate, their variance influences the stability of the learning process. So small uniform

or Gaussian random values work. Now you may wonder how we can calibrate those variances. Let's suppose we

have a single linear neuron with weights w and input x, and remember that the capital letters here

mark them as random variables. Then you can see that the output y hat is w times x, so this is the linear

combination of the respective inputs plus some bias, and now we are interested in the variance of

y hat. If we assume that w and x are independent, then the variance of every product can actually be

computed as the squared expected value of x times the variance of w, plus the squared expected

value of w times the variance of x, plus the product of the variances of the two

random variables. Now if we require w and x to have zero mean, this simplifies the whole issue,

because the means are zero, so the terms with the expected values vanish and our variance is simply

the product of the two variances. Now we assume that the x n and w n are independent and identically

distributed. In this special case we can then see that essentially the n here scales our variances.
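
As a rough numerical check of the statement above, the following sketch samples zero-mean, independent Gaussian weights and inputs and compares the empirical variance of the summed products with N times the product of the two variances. The function name `simulate_output_variance`, the Gaussian choice, and the concrete numbers are assumptions made for illustration and are not from the lecture. The bias is left out, on the assumption that it is a constant and therefore does not change the variance.

```python
import random
import statistics


def simulate_output_variance(n_inputs, std_w, std_x, n_trials=5000, seed=0):
    """Empirical variance of y = sum_n w_n * x_n for zero-mean i.i.d. Gaussians."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_trials):
        # One forward pass of a single linear neuron without bias.
        y = sum(rng.gauss(0.0, std_w) * rng.gauss(0.0, std_x)
                for _ in range(n_inputs))
        outputs.append(y)
    return statistics.pvariance(outputs)


n_inputs, std_w, std_x = 100, 0.5, 2.0
empirical = simulate_output_variance(n_inputs, std_w, std_x)
predicted = n_inputs * std_w**2 * std_x**2  # N * Var(w) * Var(x)
print(f"empirical variance: {empirical:.1f}, predicted: {predicted:.1f}")
```

With these settings the two printed numbers should be close to each other, and doubling `n_inputs` should roughly double both.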

So it actually depends on the number of inputs that you have to your layer, and this number

scales the variance together with the variance of your w n. So you see that the weights are very important, and effectively

the more weights you have, the more the variance is scaled. As a result, we can then work with

Xavier initialization. So we calibrate the variances for the forward pass: we initialize

with a zero-mean Gaussian and simply set the standard deviation to the square root of one over fan-in, where fan-in

is the input dimension of the weights. So we simply scale the variance to be one over the number

of input dimensions. In the backward pass, however, we would need the same effect backwards, so we would

have to set the standard deviation to the square root of one over fan-out, where fan-out is the output dimension of

the weights. So you just average those two and compute a new standard deviation, the square root of two over the sum of fan-in and fan-out. This

initialization is named after the first author of reference 21. Well, what else can be done? There's

He initialization, which also takes into account that the assumption of linear neurons is a problem. So

in reference 12 they showed that for ReLUs it's better to actually use the square root of 2 over

fan-in as the standard deviation. So this is a very typical choice for initializing the weights randomly.
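
The two rules just described can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation from references 21 or 12. The helper names `xavier_std`, `he_std`, and `init_weights`, the layer sizes, and the row-per-output matrix layout are hypothetical. For the Xavier rule, the sketch assumes the common form in which the variance is two over the sum of fan-in and fan-out.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Std. dev. with variance 2 / (fan_in + fan_out), i.e. one over the mean of both."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Std. dev. sqrt(2 / fan_in), the choice described for ReLU units."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, seed=0):
    """Sample a fan_out x fan_in matrix (list of rows) from a zero-mean Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


fan_in, fan_out = 256, 128  # illustrative layer shape
w_xavier = init_weights(fan_in, fan_out, xavier_std(fan_in, fan_out))
w_he = init_weights(fan_in, fan_out, he_std(fan_in))
# Biases start at zero; with ReLUs a small positive constant is an alternative.
biases = [0.0] * fan_out
```

Deep learning frameworks usually ship their own initializers, so a hand-written version like this is mainly useful for seeing how the standard deviation depends on the layer shape.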

Other conventional initial choices are that you do L2 regularization, you use dropout with a

probability of 0.5 for fully connected layers and use it selectively in convolutional

neural networks, and you do mean subtraction, batch normalization, and He initialization. So this is

the very typical setup. Okay, so what other tricks of the trade do we have left? One important

technique is transfer learning. Transfer learning is typically used in situations where

you have little data. One example is medical data sets, where you typically have very little data

available. So the idea is to reuse models, for example ones trained on ImageNet. You can even

reuse things that have been trained on a different task for the same data. You can also use different

data for the same task, or you can even use different data on a different task. So now the question is

Part of a video series:

Deep Learning - Plain Version

Presenters

Prof. Dr. Andreas Maier

Accessible via

Open access

Duration

00:09:26 min

Recording date

2020-05-31

Uploaded on

2020-05-31 18:26:36

Language

en-US

This video discusses initialization techniques and transfer learning.

Further Reading:
A gentle Introduction to Deep Learning
