Welcome back to part two of Activation Functions and Convolutional Neural Networks.
Deep learning is like a component of a bigger system.
Now we want to continue talking about activation functions and the new ones used in deep learning. One of the most famous examples is the rectified linear unit, the ReLU. We have already encountered the ReLU, and the idea is simply to set the negative half space to zero and the positive half space to x. This results in a derivative of one for the entire positive half space and zero everywhere else. This is very nice because the piecewise linearity gives us good generalization, and there is a significant speed-up: the function can be evaluated very quickly because we don't need exponential functions, which are typically a bit slow on the implementation side. We also don't have the vanishing gradient problem, because the derivative is large over a wide range of inputs. Still, a drawback is that it's not zero-centered.
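As a minimal sketch of what was just described (plain NumPy, with function names chosen here for illustration), the ReLU and its derivative look like this:

```python
import numpy as np

def relu(x):
    # Negative half space is set to zero, positive half space stays x.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Derivative is 1 wherever the unit is active (x > 0) and 0 everywhere else.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```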
This zero-centering issue has not been solved by the rectified linear unit. Still, ReLUs were a big step forward: with ReLUs you could, for the first time, build deeper networks, and by deeper networks I mean networks with more than three hidden layers. In classical machine learning, neural networks were typically limited to approximately three layers, because already at that point you get the vanishing gradient problem, and the lower layers never see any gradient and therefore never update their weights. So ReLUs enabled the training of deep nets without unsupervised pre-training. You could already build deep nets before, but you had to do unsupervised pre-training; with ReLUs you could train from scratch, directly putting your data in, which was a big step forward. The implementation is also very easy, because the first derivative is one if the unit is activated and zero otherwise, so there is no second-order effect. One problem still remains: the dying ReLUs. If the weights and biases are trained such that the input x to a unit is always negative, then you always end up with a zero derivative and the ReLU always produces a zero output. This means that the ReLU no longer contributes to learning the feature space; it simply stops at this point. No updates are possible because of the zero derivative, and this precious ReLU is suddenly always zero and can no longer be trained.
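As a small numerical sketch of the dying ReLU effect (the weight and bias values here are made up for illustration):

```python
import numpy as np

# A single ReLU unit whose bias was pushed far into the negative range during training.
w = np.array([0.5, -0.3])
b = -10.0

inputs = np.random.randn(1000, 2)          # typical inputs on a unit scale
pre_activation = inputs @ w + b            # stays well below zero for all samples
output = np.maximum(0.0, pre_activation)   # ReLU output: all zeros
active = (pre_activation > 0)              # where the derivative would be nonzero

print(output.max())   # 0.0 -> the unit is "dead"
print(active.sum())   # 0   -> no gradient flows back, so the weights never recover
```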
Dying ReLUs also happen quite frequently if the learning rate is too high, so you may want to be careful with high learning rates. There are a couple of ways out of this, which we will discuss in the next couple of videos. One way to alleviate the problem is to use not just the ReLU but something that is called the leaky ReLU or the parametric ReLU.
The approach here is that you don't set the negative half space to zero but to a scaled-down version of the input: you take alpha times x and set alpha to a small number. Then you get a very similar effect as with the ReLU, but you don't end up with the dying ReLU problem, because the derivative is never zero but alpha. For the leaky ReLU, alpha is typically set to a value like 0.01. The parametric ReLU is a further extension where you make alpha a trainable parameter, so you can actually learn for every activation function how large alpha should be; this system is called the PReLU.
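A minimal NumPy sketch of the leaky ReLU along these lines (alpha fixed to 0.01 as mentioned above; for the PReLU, alpha would instead be a trainable parameter updated by backpropagation):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive half space: x; negative half space: a small slope alpha * x.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Derivative is 1 for x > 0 and alpha (never zero) otherwise,
    # so the unit always receives some gradient and cannot die.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))             # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.   1.  ]
```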
There are also exponential linear units (ELUs). Here, the idea is to find a smooth function for the negative half space that slowly saturates: we set it to alpha times (exponential of x minus one), which results in a derivative of one for the positive half space and alpha times exponential of x for the negative half space. So this is an interesting way to get a saturating effect while avoiding the vanishing gradient, and it also reduces the shift in activations because we can get negative output values as well.
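A minimal NumPy sketch of the ELU and its derivative, following the formulas just described (alpha = 1 is only a common default choice here):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Positive half space: x; negative half space: alpha * (exp(x) - 1),
    # which smoothly saturates towards -alpha for very negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=1.0):
    # Derivative: 1 for x > 0 and alpha * exp(x) otherwise.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(elu(x))             # approx. [-0.95  -0.632  0.5    2.   ]
print(elu_derivative(x))  # approx. [0.05   0.368  1.     1.  ]
```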
If you choose the scaled variant of the exponential linear unit, the SELU, you add an additional scaling factor lambda. If you have inputs with zero mean and unit variance, you can choose alpha and lambda as the two specific values reported for the SELU (approximately alpha = 1.6733 and lambda = 1.0507), and the activations will then preserve zero mean and unit variance. So the SELU also gets rid of the problem of internal covariate shift. It is a variant of the ELU, and the nice property is that if you have a zero-mean, unit-variance input, it will remain on that scale, so you don't have the internal covariate shift problem.
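A minimal NumPy sketch of the SELU with the constants from the self-normalizing neural networks paper; the mean/variance check at the end is only a rough illustration of the self-normalizing behaviour:

```python
import numpy as np

ALPHA = 1.6732632423543772   # SELU constants reported for
LAMBDA = 1.0507009873554805  # self-normalizing neural networks

def selu(x):
    # Scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# Rough check: zero-mean, unit-variance inputs stay roughly on that scale.
x = np.random.randn(1_000_000)
y = selu(x)
print(round(y.mean(), 3), round(y.var(), 3))  # values close to 0 and 1
```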