20 - Deep Learning - Plain Version 2020 [ID:21115]

Welcome back to deep learning. So today we want to look at a couple of initialization

techniques that will come in really handy throughout your work with deep learning networks.

So you may wonder why does initialization matter if you have convex functions? Actually

it doesn't matter at all because you follow the negative gradient direction and you will

always find the global minimum. So no problems for convex optimization. However many of the

problems that we are dealing with are non-convex. A non-convex function may have different local

minima. If I start at this point, you can see that the optimization reaches one local minimum. But if I were to start at this other point, you can see that I would end up with a very different

local minimum. So for non-convex problems initialization is actually a big deal. Neural

networks with non-linearities are in general non-convex. So what can be done? Well of course

you have to work with some initialization. For the biases, you can start quite simply: initialize them to zero. This is very typical, but keep in mind that if you're working with rectified linear units, you may want to start with a small positive constant instead. This is better because of the dying ReLU issue. For the weights, you need random values to break the symmetry.

We already encountered this problem: with dropout, we saw that we need additional regularization in order to break the symmetry. Also, it would be especially bad to initialize the weights with zeros, because then the gradient is zero. So this is something that you don't want to do. Similar to the learning rate, the variance of the initial weights influences the stability of the learning process. Small uniform or Gaussian random values work well.
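As a minimal sketch of such a starting point, assuming a single fully connected layer in NumPy (the layer sizes and the constant 0.1 are made-up example values, not prescriptions from the lecture):

import numpy as np

rng = np.random.default_rng()
n_in, n_out = 256, 128   # example layer sizes

# small zero-mean Gaussian weights break the symmetry between neurons
W = rng.normal(loc=0.0, scale=0.01, size=(n_out, n_in))

b = np.zeros(n_out)            # biases initialized to zero ...
b_relu = np.full(n_out, 0.1)   # ... or to a small positive constant for ReLU layers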

Now you may wonder how we can calibrate those variances. Let's suppose we have a single linear neuron with weights w and input x.

Remember the capital letters here mark them as random variables. Then you can see that

the output is w times x. So this is a linear combination of the respective inputs plus

some bias. Now we're interested in the variance of y hat. If we assume that W and X are independent, then the variance of each product can be computed as the squared expected value of X times the variance of W, plus the squared expected value of W times the variance of X, plus the product of the two variances.
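Written out as a formula (this is the standard identity for the variance of a product of independent random variables, stated here only as a reminder):

\[
\operatorname{Var}(WX) \;=\; \mathbb{E}[X]^2 \operatorname{Var}(W) \;+\; \mathbb{E}[W]^2 \operatorname{Var}(X) \;+\; \operatorname{Var}(W)\,\operatorname{Var}(X)
\]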

Now if we require X and W to have zero mean, this simplifies the whole issue: the squared means vanish, so the variance of the product is simply the product of the two variances. Now we assume that the inputs and weights are independent and identically distributed.

In this special case, we can then see that essentially n is scaling our variance, where n is the number of inputs to the layer. So the variance of the output scales with n times the variance of the weights, and you can see that the weights are very important: effectively, the more inputs and weights you have, the more the variance is scaled.
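Putting these steps together (zero-mean, i.i.d. weights and inputs, with n denoting the fan-in), the derivation can be summarized as:

\[
\operatorname{Var}(\hat{y}) \;=\; \operatorname{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big) \;=\; \sum_{i=1}^{n} \operatorname{Var}(w_i x_i) \;=\; n \,\operatorname{Var}(W)\,\operatorname{Var}(X)
\]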

As a result, we can then work with Xavier initialization. Here we calibrate the variances for the forward pass: we initialize with a zero-mean Gaussian and simply set the variance to one over fan-in, where fan-in is the input dimension of the weights. So we scale the variance to be one over the number of input dimensions. In the backward pass, however, we would need the same effect in the other direction, so we would have to scale the variance with one over fan-out, where fan-out is the output dimension of the weights. So you just average the two requirements, which gives a variance of two over (fan-in plus fan-out), and compute the corresponding standard deviation. This initialization is named after the first author of [21].
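As an illustration, here is a minimal NumPy sketch of this initialization (the function name and layer sizes are made-up examples, not code from the lecture):

import numpy as np

def xavier_init(fan_in, fan_out, rng=None):
    # zero-mean Gaussian with Var(W) = 2 / (fan_in + fan_out),
    # the compromise between 1/fan_in (forward pass) and 1/fan_out (backward pass)
    rng = rng if rng is not None else np.random.default_rng()
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

W = xavier_init(fan_in=256, fan_out=128)   # example: layer with 256 inputs, 128 outputs
b = np.zeros(128)                          # biases again start at zero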

Well, what else can be done? There is also He initialization, which identifies the assumption of linear neurons as a problem. In [12] they showed that for ReLUs it is better to use the square root of two over fan-in as the standard deviation. This is a very typical choice for initializing the weights randomly.
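Again as a small NumPy sketch under the same assumptions as above, the only change compared to the Xavier variant is the scaling of the standard deviation:

import numpy as np

def he_init(fan_in, fan_out, rng=None):
    # zero-mean Gaussian with Var(W) = 2 / fan_in, i.e. std = sqrt(2 / fan_in),
    # which compensates for ReLUs setting roughly half of the activations to zero
    rng = rng if rng is not None else np.random.default_rng()
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

W_relu = he_init(fan_in=256, fan_out=128)  # weights for a ReLU layer with 256 inputs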

Then there are other typical choices: you use L2 regularization, you use dropout with a probability of 0.5 for fully connected layers, and you use it only selectively in convolutional neural networks. In addition, you do mean subtraction on the inputs, batch normalization, and He initialization. This is the very typical setup, sketched below.
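To make this concrete, here is a minimal, hypothetical PyTorch sketch of such a setup (the class name SmallConvNet, the layer sizes, and hyperparameters such as weight_decay=1e-4 are made-up examples, not values from the lecture):

import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),        # batch normalization
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                     # dropout only in the fully connected part
            nn.Linear(32 * 16 * 16, num_classes),  # assumes 32x32 input images
        )
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He initialization
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallConvNet()
# L2 regularization is realized via weight decay; mean subtraction would happen
# in the data preprocessing, before the inputs reach the network.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)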

So which other tricks do we have left? One important technique is transfer learning. Transfer learning is typically used in all situations where you have little data. One example is medical data. There you typically

Part of a video series:

Accessible via: Open access

Duration: 00:09:00 min

Recording date: 2020-10-12

Uploaded on: 2020-10-12 13:56:35

Language: en-US

Deep Learning - Regularization Part 4

This video discusses initialization techniques and transfer learning.

For reminders to watch new videos, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
