14 - Deep Learning - Activations, Convolutions, and Pooling Part 2 [ID:14456]

Welcome back to part two of Activation Functions and Convolutional Neural Networks.

Deep learning is like a component of a bigger system.

Now we want to continue talking about the activation functions and the new ones used

in deep learning. One of the most famous examples is the rectified linear unit, the ReLU. The ReLU we have already encountered, and the idea is simply to set the negative half space to zero and the positive half space to x. This results in a derivative of one for the entire positive half space and zero everywhere else. This is very nice because we get good generalization due to the piecewise linearity, and there is a significant speed-up: the function can be evaluated very quickly, because we don't need the exponential functions that are typically a bit slow on the implementation side. We also don't have the vanishing gradient problem, because there are really large areas where the derivative of this function is high. Still, a drawback is that it's not zero-centered; this has not been solved by the rectified linear unit.
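As a minimal sketch of what was just described (my own illustration, not code from the lecture), the ReLU and its derivative can be written in a few lines of NumPy; the function names are purely illustrative.

```python
import numpy as np

def relu(x):
    # Negative half space -> 0, positive half space -> x
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 for the positive half space, 0 everywhere else
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # -> [0. 0. 0. 0.5 2.]
print(relu_grad(x))  # -> [0. 0. 0. 1. 1.]
```

Note that no exponentials are involved, which is exactly why the evaluation is so cheap.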

However, these ReLUs were a big step forward, and with ReLUs you could for the first time build deeper networks, and by deeper networks I mean networks that have more hidden layers than three. In classical machine learning, neural networks were typically limited to approximately three layers, because already at this point you get the vanishing gradient problem: the lower layers never saw any of the gradient and therefore never updated their weights. So ReLUs enabled the training of deep nets without unsupervised pre-training. You could already build deep nets before, but you had to do unsupervised pre-training; here you could train from scratch directly, just putting your data in, which was a big step forward. Also, the implementation is very easy, because the first derivative is one if the unit is activated and just zero otherwise, so there is no second-order effect.

One problem still remains: the dying ReLU. If the weights and biases are trained such that the result for x is always negative, then you always end up with a zero derivative and the ReLU always produces a zero output. This means that the ReLU no longer contributes anything to your training process or to the feature space; it simply stops at this point. No updates are possible because of the zero derivative, and this particular ReLU is then always zero and can no longer be trained.
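To make the dying-ReLU problem concrete, here is a small illustrative sketch (my own example, with made-up numbers): once the pre-activation of a unit is negative for all inputs, its gradient is zero and gradient descent never moves that unit's weights again.

```python
import numpy as np

# A single ReLU unit y = relu(w * x + b) with a strongly negative bias.
w, b = 1.0, -10.0
x = np.array([0.5, 1.0, 2.0])   # all inputs are small, so w * x + b < 0
lr = 0.1                        # learning rate

for step in range(3):
    z = w * x + b               # pre-activations: all negative
    y = np.maximum(0.0, z)      # ReLU output: all zero
    dz = (z > 0).astype(float)  # ReLU derivative: all zero
    grad_w = np.mean(dz * x)    # upstream loss gradient omitted for brevity
    grad_b = np.mean(dz)
    w -= lr * grad_w            # no change at all: the unit is "dead"
    b -= lr * grad_b
    print(step, w, b, y)
```

A single overly large update can push a unit into exactly this regime, which is why a too high learning rate is dangerous here.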

This also happens quite frequently if you have a too high learning rate, so you may want to be careful with too high learning rates. There are a couple of ways out of this, which we will discuss in the next couple of videos. One way to alleviate the problem is to use not just the ReLU but something that is called the leaky ReLU or the parametric ReLU. The approach here is that you do not set the negative half space to zero, but to a scaled, small version of the input: you take alpha times x and set alpha to a small number. Then you have a very similar effect as with the ReLU, but you don't end up with the dying ReLU problem, because the derivative is never zero but alpha. For the leaky ReLU, alpha is typically set to values like 0.01. The parametric ReLU is a further extension: here you make alpha a trainable parameter, so you can actually learn for every activation function how large alpha should be. This is then called the PReLU.
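Both variants can be sketched as follows (again only an illustration; 0.01 is the typical leaky-ReLU choice mentioned above, and for the PReLU alpha would be a learnable parameter rather than a fixed constant).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha * x on the negative half space, x on the positive half space
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Derivative: 1 for x > 0, alpha otherwise -- never exactly zero
    return np.where(x > 0, 1.0, alpha)

# For a parametric ReLU (PReLU), alpha itself is trained: its gradient is
# d(output)/d(alpha) = x for x <= 0 and 0 for x > 0, so it can be updated
# by backpropagation like any other weight.
x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))       # -> [-0.02 -0.005 0.5 2.]
print(leaky_relu_grad(x))  # -> [0.01 0.01 1. 1.]
```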

There are also exponential linear units (ELUs). Here the idea is to use for the negative half space a smooth function that slowly decays: we set it to alpha * (exp(x) - 1), and this then results in the derivatives 1 for the positive half space and alpha * exp(x) for the negative half space. This is also an interesting way to get a saturating effect without a vanishing gradient, and it also reduces the shift in activations, because we can also get negative output values.
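A minimal ELU sketch, mirroring the definition above (alpha = 1.0 is just a common default here, not a value taken from the lecture):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x on the positive half space; alpha * (exp(x) - 1) on the negative one,
    # which saturates towards -alpha and allows negative outputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # Derivative: 1 for x > 0, alpha * exp(x) otherwise
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(elu(x))       # negative inputs map to values close to -alpha
print(elu_grad(x))  # gradient shrinks smoothly but never becomes exactly zero
```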

If you choose this variant of the exponential linear unit, you add an additional scaling lambda, and if you have inputs with zero mean and unit variance, you can choose alpha and lambda with the two values as reported here, and they will preserve the zero mean and unit variance. So the SELU also gets rid of the problem of the internal covariate shift. It is an alternative variant of the ELU, and the nice property is that if you have this zero-mean, unit-variance input, then it will remain on the same scale and you don't have the internal covariate shift problem.
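The scaled ELU (SELU) can be sketched like this; the two constants below are the ones published in the self-normalizing networks paper by Klambauer et al. (2017), which I assume are the values referred to here.

```python
import numpy as np

ALPHA = 1.6732632423543772   # alpha from the SELU paper
LAMBDA = 1.0507009873554805  # lambda (the additional scaling) from the SELU paper

def selu(x, alpha=ALPHA, lam=LAMBDA):
    # Scaled ELU: lam * x for x > 0, lam * alpha * (exp(x) - 1) otherwise
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# With zero-mean, unit-variance (roughly Gaussian) input, the output stays
# approximately zero-mean and unit-variance -- the self-normalizing property.
x = np.random.randn(1_000_000)
y = selu(x)
print(y.mean(), y.std())   # close to 0 and 1, respectively
```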

Part of a video series:

Accessible via: Open access

Duration: 00:11:56 min

Recording date: 2020-04-28

Uploaded on: 2020-04-29 00:36:17

Language: en-US

Deep Learning - Activations, Convolutions, and Pooling Part 2

This video presents a variety of modern activation functions, including the quest for new ones.

Video References:
Lex Fridman's Channel

Further Reading:
A gentle Introduction to Deep Learning

Tags

introduction backpropagation artificial intelligence deep learning machine learning pattern recognition Feedforward Networks Gradient descent activation functions