Welcome back to part two of activation functions and convolutional neural networks.
We now want to continue talking about activation functions, in particular the newer ones used in deep learning.
One of the most famous examples is the rectified linear unit (ReLU).
We have already encountered the ReLU earlier: the idea is simply to set the negative half-space to zero and keep x on the positive half-space, i.e. f(x) = max(0, x).
This results in a derivative of one on the entire positive half-space and zero everywhere else.
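As a minimal sketch (not taken from the lecture itself), the ReLU and its derivative can be written in a few lines of NumPy; the function names are just illustrative choices.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): the negative half-space is set to zero,
    # the positive half-space is passed through unchanged.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # f'(x) = 1 for x > 0 and 0 everywhere else
    # (at exactly x = 0 the derivative is conventionally set to 0 here).
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # zero on the negative half-space, x on the positive one
print(relu_derivative(x))  # zero on the negative half-space, one on the positive one
```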
This is very nice because we get good generalization this way.
Due to the piecewise linearity there is also a significant speed-up: the function can be evaluated very quickly because we do not need exponential functions, which are typically somewhat slow to compute.
Furthermore, we do not have the vanishing gradient problem, because the derivative does not saturate but stays at one over the entire positive half-space.
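To make this concrete, here is a small comparison (a sketch, not from the lecture) of the sigmoid derivative, which saturates for large inputs, with the ReLU derivative, which stays at one on the positive half-space.

```python
import numpy as np

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)) becomes tiny for large |x|.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid_derivative(x))   # roughly [4.5e-05, 0.25, 4.5e-05] -> saturates
print((x > 0).astype(float))   # ReLU derivative: [0, 0, 1] -> exactly one for x > 0
```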
A drawback is that the output is not zero-centered, and this problem is not solved by the rectified linear unit.
However, ReLUs are still a very big step forward.
With ReLUs you could, for the first time, build deeper networks, and by deeper networks I mean networks with more than three hidden layers.
In classical machine learning, neural networks were typically limited to approximately three layers, because already at this point you get the vanishing gradient problem: the lower layers barely receive any gradient and therefore never update their weights.
So ReLUs enabled the training of deep networks without unsupervised pre-training.
You could already build deep networks before, but you had to do pre-training.
With ReLUs you can train from scratch, directly putting your data in, which was a big step forward.
The implementation is also very easy, because the first derivative is one if the unit is activated and zero otherwise, so there are no second-order effects.
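In the backward pass this means the incoming gradient is simply masked; a hypothetical sketch, continuing the NumPy example from above:

```python
import numpy as np

def relu_backward(grad_output, x):
    # Because f'(x) is either 1 (unit active) or 0 (unit inactive),
    # backpropagation through a ReLU is just an element-wise mask:
    # the gradient passes through where x > 0 and is blocked elsewhere.
    return grad_output * (x > 0)

x = np.array([-1.0, 0.5, 2.0])
grad_output = np.array([0.3, 0.3, 0.3])
print(relu_backward(grad_output, x))  # gradient survives only where x > 0
```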
One problem still remains: dying ReLUs.
If the weights and biases are trained such that the input x to the unit is always negative, you simply always end up with a zero derivative.
The ReLU then always produces a zero output, which means the unit no longer contributes to the training process.
Training simply stops for that unit: no updates are possible because of the zero derivative, and this precious ReLU is suddenly always zero and can no longer be trained.
This also happens quite frequently if the learning rate is set too high, so you want to be careful when setting learning rates.
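One practical way to notice the problem, sketched here under the assumption that you can access the pre-activations of a layer for a batch of data, is to count the units that never activate:

```python
import numpy as np

def dead_relu_fraction(pre_activations):
    # pre_activations: array of shape (batch_size, num_units) holding
    # the inputs to the ReLU for one batch.
    # A unit is "dead" for this batch if it is never positive, i.e. its
    # ReLU output and derivative are zero for every sample.
    never_active = np.all(pre_activations <= 0, axis=0)
    return never_active.mean()

rng = np.random.default_rng(0)
z = rng.normal(loc=-3.0, scale=1.0, size=(128, 64))  # strongly negative pre-activations
print(dead_relu_fraction(z))  # most units appear dead for this batch
```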
There are a couple of ways out of this, which we will discuss in the next couple of videos.
One way to alleviate the problem is to use not a plain ReLU but what is called a leaky or parametric ReLU.
The approach here is that you do not set the negative half-space to zero, but scale it by a small number: you use alpha times x and set alpha to a small value.
Then you get an effect very similar to the ReLU, but you do not end up with the dying ReLU problem, as the derivative on the negative half-space is never zero but alpha.
A typical setting is alpha = 0.01.
The parametric ReLU (PReLU) is a further extension in which alpha is made a trainable parameter.
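A minimal NumPy sketch of both variants, with a hand-written gradient for the trainable alpha of the parametric ReLU (assuming a single shared alpha; frameworks also offer per-channel variants):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x on the positive half-space, alpha * x on the negative one;
    # the derivative is 1 for x > 0 and alpha otherwise, so it is never zero.
    return np.where(x > 0, x, alpha * x)

def prelu_backward(grad_output, x, alpha):
    # Parametric ReLU: alpha itself is a trainable parameter.
    # Gradient w.r.t. the input: 1 for x > 0, alpha otherwise.
    grad_x = grad_output * np.where(x > 0, 1.0, alpha)
    # Gradient w.r.t. alpha: the forward pass is alpha * x for x <= 0
    # and does not depend on alpha for x > 0.
    grad_alpha = np.sum(grad_output * np.where(x > 0, 0.0, x))
    return grad_x, grad_alpha

x = np.array([-2.0, -0.5, 1.0])
print(leaky_relu(x))                                  # [-0.02, -0.005, 1.0]
print(prelu_backward(np.ones_like(x), x, alpha=0.25)) # input gradient and alpha gradient
```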