Thank you.
So before the break, we had this as a last slide, showing that optimization in a highly nonlinear, multidimensional environment is something different from finding a minimum as fast as possible in a simple range.
And the normal, classical unconstrained optimization lectures follow more the outline shown here, whereas in the neural network environment we have to keep such pictures in mind.
Nevertheless, there is a lot of literature on trying to improve the learning behavior. I will pass through this only quickly, not in all the details.
One very old method, included in practically all software packages, is so-called momentum backpropagation. The idea is as follows.
You do not take only the weight shift that we have computed before in the pattern-by-pattern learning; you also add the old weight shift from the previous step, weighted with a factor alpha. This is called the momentum, or memory, part.
So what does it mean if I have such a rule here? It means that more and more of the old steps are included, but not the full step from before, only a part of it, because this alpha is always smaller than one.
And if you do so, then you have an exponentially decaying sum of the old gradients included here.
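Written out, as a hedged reconstruction of the rule on the slide (eta for the learning rate, alpha for the momentum factor, and g(k) for the gradient at step k are my notation), the update and its unrolled form would be:

```latex
\Delta w(k) = -\eta\, g(k) + \alpha\, \Delta w(k-1)
            = -\eta \sum_{j=0}^{k} \alpha^{j}\, g(k-j)
```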
Now, you can have two extreme cases. Let us assume that the gradients all point in the same direction, which means the gradient at step k is the same as the average gradient. Then you can take the g out of the bracket and you simply have the geometric sum of the powers of alpha, which is one over one minus alpha.
So therefore, if the gradient always points in the same direction, the applied weight change delta w is the gradient scaled by exactly this factor.
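For this first extreme case, with all gradients equal to the same g (a sketch in the same notation as above), the geometric series gives:

```latex
\Delta w = -\eta \left(\sum_{j=0}^{\infty} \alpha^{j}\right) g = -\frac{\eta}{1-\alpha}\, g
```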
What happens if you have a very strong fluctuation? This is the other extreme case: the gradient g(k) is fluctuating, flipping between one gradient and its opposite.
So if this is going forward, backward, what happens then? Then, because of the powers of alpha, you have an alternating behavior: a minus sign, a plus sign, a minus sign, and so on. Nevertheless, the size of the gradient is the same each time.
You can put it in front, and with this alternating sign you can put the sum together, and it ends up with the denominator one plus alpha. This means that, depending on whether you have more fluctuation or more trend, you get a different effective learning rate.
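For the alternating case, with g(k-j) = (-1)^j g, the same sum picks up the alternating signs (again a sketch under the notation assumed above):

```latex
\Delta w = -\eta \left(\sum_{j=0}^{\infty} (-\alpha)^{j}\right) g = -\frac{\eta}{1+\alpha}\, g
```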
But this factor is the same for all the different weights in the weight vector. So momentum backpropagation handles all weights in the same way.
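As a minimal sketch of that point (the names eta and alpha and the default values are my own assumptions, not taken from the slide), a momentum step applies one scalar eta and one scalar alpha to every component of the weight vector:

```python
import numpy as np

def momentum_step(w, grad, delta_w_prev, eta=0.1, alpha=0.9):
    """Momentum backpropagation: the same scalar eta and alpha for the whole weight vector."""
    delta_w = -eta * grad + alpha * delta_w_prev  # identical factors for every weight component
    return w + delta_w, delta_w
```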
And if you look at today's toolboxes, you'll find something called Adam. I think most of them have included it.
The idea of Adam is a mixture of the adaptive learning rate approach, which Tieleman and Hinton reinvented in 2012, and the momentum learning which we have discussed before. If you put both sides together, you get this engineering learning rule from Kingma and Ba.
So the idea here is that you have an adaptive learning rate, and they try to combine the effects of both approaches.
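As a rough sketch of how such a rule combines a momentum-like average with a per-weight adaptive learning rate (following the published Kingma and Ba formulation; the hyperparameter names lr, beta1, beta2, eps are the usual ones, not taken from the slide):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-like average m, adaptive per-weight scaling from v."""
    m = beta1 * m + (1 - beta1) * grad            # decaying mean of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * grad**2         # decaying mean of squared gradients (adaptive part)
    m_hat = m / (1 - beta1**t)                    # bias correction for the first steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # effective learning rate differs per weight
    return w, m, v
```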
And please keep in mind, when people promote this as the best possible idea, they ...