
Welcome back to part two of activation functions and convolutional neural networks.

Now we want to continue talking about activation functions, in particular the newer ones used in deep learning.

One of the most famous examples is the rectified linear unit, the ReLU. We have already encountered the ReLU earlier; the idea is simply to set the negative half space to zero and the positive half space to x. This results in a derivative of one over the entire positive half space and zero everywhere else.
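
As a small illustration that is not from the lecture itself, a minimal NumPy sketch of the ReLU and its piecewise-constant derivative could look like this:

```python
import numpy as np

def relu(x):
    # Zero on the negative half space, identity on the positive half space
    return np.maximum(0.0, x)

def relu_derivative(x):
    # One wherever the unit is active (x > 0), zero everywhere else
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```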

This is very nice because it gives good generalization. Due to the piecewise linearity there is a significant speed-up: the function can be evaluated very quickly, because we don't need the exponential functions that are typically a bit slow on the implementation side. We also don't have the vanishing gradient problem, because the derivative of this function is high over really large areas of the input.

A drawback is that the ReLU is not zero-centered, and this has not been solved with the rectified linear unit. Still, ReLUs were a very big step forward: with ReLUs you could, for the first time, build deeper networks, and by deeper networks I mean networks with more than three hidden layers.

In classical machine learning, neural networks were typically limited to approximately three layers, because already at this depth you get the vanishing gradient problem: the lower layers never see any of the gradient and therefore never update their weights.
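
To get a feeling for the numbers, here is a hedged back-of-the-envelope sketch using the classical sigmoid (not taken from the lecture): its derivative is at most 0.25, and backpropagation multiplies one such factor per layer, so the gradient reaching the lowest layers shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 for x = 0

# One derivative factor per layer, so even in the best case the gradient
# that reaches the lowest layers shrinks geometrically with depth.
for depth in [3, 5, 10]:
    print(depth, 0.25 ** depth)   # 0.015625, ~0.00098, ~9.5e-07
```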

So ReLUs enabled the training of deep networks without unsupervised pre-training. You could already build deep networks before, but you had to do pre-training; with ReLUs you could train from scratch, directly putting your data in, which was a big step forward. Also, the implementation is very easy, because the first derivative is one if the unit is activated and zero otherwise, so there are no second-order effects.

One problem still remains: dying ReLUs. If the weights and biases are trained such that the result for x is always negative, then you always end up with a zero derivative. The ReLU then always produces a zero output, which means the unit no longer contributes to the training process. Training simply stops at this point, and no updates are possible because of the zero derivative: this precious ReLU is suddenly always zero and can no longer be trained. This happens quite frequently if the learning rate is set too high, so you want to be careful when setting learning rates.
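
As a purely hypothetical numeric sketch (the weights, bias, and inputs below are made up), consider a single ReLU unit whose bias has been pushed far into the negative range, for instance by one overly large update:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical unit whose bias ended up strongly negative
w, b = np.array([0.5, -0.3]), -10.0

for x in [np.array([1.0, 2.0]), np.array([-3.0, 0.5]), np.array([4.0, 4.0])]:
    z = w @ x + b                 # pre-activation is negative for every input
    grad = 1.0 if z > 0 else 0.0  # ReLU derivative at z
    print(z, relu(z), grad)       # output 0 and gradient 0: no update can flow back
```

Because the derivative is zero for every input the unit sees, gradient descent can never move it back into the active region.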

There are a couple of ways out of this, which we will discuss in the next couple of videos.

One way to alleviate the problem is to use not just the plain ReLU but something called the leaky or parametric ReLU. The approach here is that you don't set the negative half space to zero but to a scaled-down version of the input: you take alpha times x and choose alpha to be a small number. Then you get a very similar effect as with the ReLU, but you don't run into the dying ReLU problem, since the derivative is never zero but alpha. A typical setting for alpha is 0.01. The parametric ReLU is a further extension: here you make alpha a trainable parameter.
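
Again as a hedged sketch with the value alpha = 0.01 mentioned above, the leaky ReLU and its derivative could be written like this; in the parametric variant, alpha would simply become a learnable parameter (PyTorch, for instance, provides this as torch.nn.PReLU):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # The negative half space is scaled by a small alpha instead of being set to zero
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # The derivative is 1 for x > 0 and alpha otherwise -- never exactly zero
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))             # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.   1.  ]
```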

Part of a video series

Accessible via: Open Access

Duration: 00:12:22 min

Recording date: 2020-10-11

Uploaded: 2020-10-11 19:56:19

Language: en-US

Deep Learning - Activations, Convolutions, and Pooling Part 2

This video presents a variety of modern activation functions, including the quest to find new ones.

For reminders to watch new videos, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
