14 - Deep Learning - Activations, Convolutions, and Pooling Part 2 [ID:16882]

Welcome back to part two of Activation Functions and Convolutional Neural Networks.

Now we want to continue talking about the activation functions and the new ones used in deep learning.

One of the most famous examples is the rectified linear unit, the ReLU.

We have already encountered the ReLU, and the idea is simply to set the negative half space to 0 and the positive half space to x.

This then results in derivatives of 1 for the entire positive half space and 0 everywhere else.

So this is very nice because this way we get a good generalization due to the piecewise linearity.

There's also a significant speed-up: the function can be evaluated very quickly because we don't need the exponential functions

that are typically a bit slow on the implementation side, and we don't have the vanishing gradient problem

because the derivative is 1 over the entire positive half space, so there are large areas with a non-vanishing gradient.

Still, one drawback remains: the ReLU is not zero-centered. This has not been solved with the rectified linear unit.
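
As a small illustration, here is a minimal NumPy sketch of the ReLU and its derivative as defined above (the function names are my own):

```python
import numpy as np

def relu(x):
    # Negative half space is set to 0, positive half space stays x.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Derivative is 1 on the entire positive half space and 0 everywhere else.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```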

However, these ReLUs were a big step forward, and with ReLUs you could for the first time build deeper networks,

and by deeper networks I mean networks that have more than three hidden layers.

Typically, in classical machine learning, neural networks were limited to approximately three layers,

because already at this point you get the vanishing gradient problem: the lower layers never saw any of the gradient

and therefore never updated their weights.

So ReLUs enabled the training of deep nets without unsupervised pre-training.

You could already build deep nets before, but you had to do unsupervised pre-training,

and here you could train from scratch, directly just putting your data in, which was a big step forward.

And also the implementation is very easy because the first derivative is one if the unit is activated and just zero otherwise.

So there's no second order effect.

One problem still remains: the dying ReLUs. If the weights and biases are trained to yield negative results for x,

then you simply always end up with a zero derivative and the ReLU always produces a zero output,

and this means that the ReLU no longer contributes to your training process or to the learned feature space.

So it simply stops at this point.

So no updates are possible because of the zero derivative, and this precious ReLU suddenly always outputs zero and can no longer be trained.

This also happens quite frequently if you have too high a learning rate.

So you may want to be careful with too high learning rates, and there are a couple of ways out of this, which we will discuss in the next couple of videos.
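
To make this concrete, here is a toy NumPy illustration of a dying unit; the numbers and variable names are made up for demonstration:

```python
import numpy as np

# Toy single ReLU unit: y = relu(w * x + b).
# A strongly negative bias pushes the pre-activation below zero for every input,
# so the ReLU derivative, and with it the weight gradient, is zero everywhere.
w, b = 0.5, -10.0
x = np.linspace(-1.0, 1.0, 5)

pre_activation = w * x + b                      # all values are negative
relu_grad = (pre_activation > 0).astype(float)  # all zeros
grad_w = relu_grad * x                          # chain rule: dL/dw contains relu'(wx+b) * x

print(pre_activation)  # [-10.5  -10.25 -10.    -9.75  -9.5 ]
print(grad_w)          # [0. 0. 0. 0. 0.]  -> no update, the unit stays dead
```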

One way to alleviate the problem is to use not just the ReLU but something that is called the leaky ReLU or the parametric ReLU.

And the approach here is that you do not set the negative half space to zero, but to a scaled version of the input.

So you take alpha times x, and if you set alpha to be a small number, then you have a very similar effect as with the ReLU,

but you don't end up with the dying ReLU problem, because the derivative on the negative half space is never zero but alpha.

And for the leaky ReLU alpha is typically set to values like 0.01.

The parametric ReLU is a further extension, and here you make alpha a trainable parameter.

So you can actually learn for every activation function how large alpha should be; this is called the PReLU.
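
A minimal NumPy sketch of the leaky ReLU along these lines (function names are my own; for the PReLU, alpha would additionally be updated by backpropagation):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative half space is scaled by a small fixed alpha instead of being set to zero.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Derivative is 1 on the positive half space and alpha (never 0) on the negative one.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 2.0])
print(leaky_relu(x))             # [-0.03 -0.01  2.  ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.  ]
```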

There are also exponential linear units, and here the idea is that for the negative half space you use a smooth function that slowly decays.

And you can see here that we set it to alpha times (exp(x) - 1).

This then results in derivatives of 1 on the positive half space and alpha times exp(x) on the negative half space.

So this is also an interesting way to get a saturating effect here while we have no vanishing gradient on the positive half space,

and it also reduces the shift in activations because we can also get negative output values.
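
A minimal NumPy sketch of the ELU and its derivative as just described (function names are my own):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Positive half space: x.  Negative half space: alpha * (exp(x) - 1),
    # a smooth curve that saturates at -alpha and allows negative outputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=1.0):
    # Derivative is 1 on the positive half space and alpha * exp(x) on the negative one.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(elu(x))             # approx. [-0.95 -0.63  0.5   2.  ]
print(elu_derivative(x))  # approx. [ 0.05  0.37  1.    1.  ]
```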

If you choose this variant of the exponential linear unit, the scaled exponential linear unit (SELU), you add an additional scaling lambda.

And if you have inputs with zero mean and unit variance, you can choose alpha and lambda with the two values reported here,

and they will preserve a zero mean and unit variance.

So the SELU also gets rid of the problem of the internal covariate shift.

So it's an alternative variant of the ReLU, and the nice property is that if you have this zero-mean, unit-variance input,

then it will remain at the same scale and you don't have the internal covariate shift.
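
A minimal NumPy sketch of the SELU; the two constants below are the values reported in the SELU paper by Klambauer et al. (2017), which I am assuming are the ones shown on the slide:

```python
import numpy as np

# Constants from the SELU paper (Klambauer et al., 2017); they are chosen such that
# zero-mean, unit-variance inputs keep zero mean and unit variance after the activation.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # SELU(x) = lambda * ELU(x, alpha): a scaled exponential linear unit.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.random.randn(100_000)  # roughly zero mean, unit variance
y = selu(x)
print(y.mean(), y.std())      # both stay close to 0 and 1, respectively
```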

Another thing that we can do about internal covariate shift is batch normalization and this is something that we'll talk about in a couple of videos.

Okay, what other activation functions are there? There is maxout, which learns the activation function.

There are radial basis functions that can be employed.

There is softplus, which is the logarithm of one plus e to the power of x, and which was found to be less efficient than the ReLU.
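
For completeness, a small NumPy sketch of softplus; the max/log1p rearrangement is just a common numerically stable way to evaluate log(1 + e^x):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), a smooth approximation of the ReLU; rewritten as
    # max(x, 0) + log(1 + e^(-|x|)) to avoid overflow in exp for large x.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

x = np.array([-5.0, 0.0, 5.0])
print(softplus(x))  # approx. [0.0067 0.6931 5.0067]
```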

This is actually getting ridiculous, isn't it? So what should we use now?

