4 - Mathematical Basics of Artificial Intelligence, Neural Networks and Data Analytics I [ID:36465]

Thank you.

So before the break, we had this as the last slide, showing that optimization in a highly nonlinear, multidimensional environment is something different from finding a minimum as fast as possible on a simple range.

The normal classical lectures on unconstrained optimization tend to follow this outline here, whereas in the neural-network environment we have to have such pictures in our mind.

Nevertheless, there is a lot of literature on trying to improve the learning behavior.

I will pass through this only quickly, not in all the details.

One very old method, included in practically all software packages, is so-called momentum backpropagation. The idea is as follows.

You do not take only the weight shift that we have computed before in the pattern-by-pattern learning; you also add the old weight shift from the previous step, multiplied by a factor alpha. This is called the momentum, or memory, part.

So what does it mean if I have such a slope here between here and here? It means that more and more of the old steps are included, but not the full step from before, only part of it; this alpha is always smaller than one. And if you do so, then you have an exponentially decaying sum of the old gradients included here.
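Written out (a sketch in my own notation, not copied from the slide: \eta is the learning rate, g(k) the gradient at step k, \Delta w(k) the weight shift), the momentum update rule reads

    \Delta w(k) = -\eta\, g(k) + \alpha\, \Delta w(k-1)
                = -\eta \sum_{j=0}^{k} \alpha^{j}\, g(k-j), \qquad 0 < \alpha < 1,

which is exactly the exponentially decaying sum of old gradients mentioned above (assuming \Delta w(-1) = 0).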

So you can have two extreme cases. Let us assume that the gradients all go in the same direction, which means the gradient at step k is the same as the average gradient. Then you can take the g out of the bracket, and you simply have the geometric sum of the powers of alpha, which is one over one minus alpha. Therefore, if the gradient direction is always the same, the applied weight shift delta w is equal to this expression here.
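As a worked check of this limit case, in the same notation as above: if g(k-j) = g for all j, the geometric series gives

    \Delta w(k) = -\eta\, g \sum_{j=0}^{k} \alpha^{j} \;\longrightarrow\; -\frac{\eta}{1-\alpha}\, g \quad (k \to \infty),

so the effective learning rate is amplified by the factor 1/(1-\alpha).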

What happens if you have a very strong fluctuation, so that the gradient essentially flips its sign from one step to the next? This means that g(k) fluctuates against the previous gradient.

So if this goes forward, backward, forward, backward, what happens then? Because of the powers of alpha, you get an alternating behavior: a minus sign, a plus sign, a minus sign, and so on. Nevertheless, the size of the gradient is the same, so you can put it in front of the sum. And if you put these alternating terms together, it ends up with the denominator one plus alpha. This means that, depending on whether you have more fluctuation or more trend, you have a different effective learning rate.
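Again as a sketch in the same notation: if the gradient flips its sign every step, g(k-j) = (-1)^{j}\, g, then

    \Delta w(k) = -\eta\, g \sum_{j=0}^{k} (-\alpha)^{j} \;\longrightarrow\; -\frac{\eta}{1+\alpha}\, g \quad (k \to \infty),

so the effective learning rate is damped by the factor 1/(1+\alpha).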

But this factor is the same for all the different weights in the weight vector, so momentum backpropagation handles them all in the same way.
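A minimal Python sketch of such a momentum step (the names lr, momentum and velocity are mine, not from the lecture); note that the two scalars act identically on every component of the weight vector:

    import numpy as np

    def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
        # velocity accumulates an exponentially decaying sum of past gradients;
        # the scalars lr and momentum are shared by every weight component.
        velocity = momentum * velocity - lr * grad
        return w + velocity, velocity

    # hypothetical usage with a two-component weight vector
    w, v = np.zeros(2), np.zeros(2)
    w, v = momentum_step(w, grad=np.array([0.5, -0.1]), velocity=v)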

And if you look at the toolboxes of nowadays, you will find something which is called Adam; I think most of them have included it.

Adam as an idea is a mixture of the RMSprop adaptive learning, which Tieleman and Hinton reinvented in 2012, and the momentum learning which we have discussed before. If you put both sides together, you get such an engineering learning rule from Kingma and Ba.

So the idea here is that you have an adaptive learning rate, and they try to combine the effects from here and here.
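As a rough sketch only (not the authors' reference implementation; the default constants are the ones suggested in the Kingma and Ba paper), an Adam step could look like this in Python. The point is that the denominator adapts the step size separately for every single weight:

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # first moment: momentum-like decaying average of gradients
        m = beta1 * m + (1 - beta1) * grad
        # second moment: RMSprop-like decaying average of squared gradients
        v = beta2 * v + (1 - beta2) * grad**2
        # bias correction for the first steps (t starts at 1)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # per-weight adaptive step size
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v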

And please have in mind, when people promote that this is the best possible idea, they...
