26 - Pattern Recognition [PR] - PR 22 [ID:23523]

Welcome back to pattern recognition. Today we want to look a bit more into optimization, and our topic will be the actual update direction. We've seen the gradient descent methods in the previous video, and today we want to think about how to pick a particular update direction.

In these steepest descent methods, and even in the normalized ones, we should consider which update direction we actually want to choose. What we want is the largest decrease in the linear approximation of f, and we can constrain the candidate directions to the unit ball of an Lp norm. We then search for the update direction as the minimizer, over all u of length one according to our norm, of the inner product of the gradient with u. If we select the update direction by this kind of optimization, then the steepest direction is not necessarily simply the negative gradient direction; it may depend on the chosen norm. So the negative gradient is not necessarily the best choice for the search direction.
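Written out (a compact restatement of the above in standard notation, not copied from the slides), the norm-constrained steepest descent direction is

\[
\Delta x \;=\; \arg\min_{u} \left\{ \nabla f(x)^{\mathsf{T}} u \;:\; \|u\| = 1 \right\},
\]

so the chosen norm \(\|\cdot\|\) directly determines which update direction wins.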

Let's look at a bit of linear analysis. We consider the first-order Taylor approximation of f(x + u) around the selected position x: f(x + u) is approximately f(x) plus the inner product of the gradient of f at x with the direction u. This inner product of the gradient and u is the directional derivative of f at x in direction u. The vector u is a descent direction if its inner product with the gradient vector is negative, i.e., if this inner product is smaller than zero.
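In symbols (again just restating the spoken argument):

\[
f(x + u) \;\approx\; f(x) + \nabla f(x)^{\mathsf{T}} u ,
\]

and u is a descent direction whenever the directional derivative is negative, i.e.

\[
\nabla f(x)^{\mathsf{T}} u \;<\; 0 .
\]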

This gives rise to a new steepest descent method. As before, we have the function and an initial estimate, but now we also have a norm, and we initialize k to zero. The first thing we do is compute the update direction: here it is not simply the negative gradient, but the steepest descent direction according to our norm. Then we do the 1-D line search, update x with the resulting step size, and iterate until we have converged, that is, until there is only little change in x.
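To make the procedure concrete, here is a minimal Python sketch of this generic scheme. This is my own illustration rather than code from the lecture; the function names (steepest_descent, l2_direction), the backtracking line search, and the use of NumPy are assumptions.

    import numpy as np

    def steepest_descent(f, grad, x0, direction_fn, tol=1e-6, max_iter=1000):
        """Generic steepest descent: the update direction comes from
        direction_fn, which encodes the chosen norm."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            u = direction_fn(g)                  # steepest descent direction w.r.t. the norm
            t = 1.0                              # simple backtracking 1-D line search along u
            while f(x + t * u) > f(x) + 1e-4 * t * (g @ u) and t > 1e-12:
                t *= 0.5
            x_new = x + t * u
            if np.linalg.norm(x_new - x) < tol:  # converged: only little change in x
                return x_new
            x = x_new
        return x

    def l2_direction(g):
        """Under the L2 norm the steepest descent direction is the
        normalized negative gradient."""
        return -g / np.linalg.norm(g)

    # Example: minimize the quadratic f(x) = x^T x starting from (3, -2)
    x_min = steepest_descent(lambda x: x @ x, lambda x: 2.0 * x,
                             [3.0, -2.0], l2_direction)
    print(x_min)   # close to [0, 0]

Only direction_fn changes when we switch norms; the line search and the convergence check stay exactly the same.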

Now let's have a look at different norms, starting with the unit ball of the L2 norm. Here the negative gradient direction is indicated together with our unit ball. We now consider all directions u on this unit ball and seek the direction u with the largest projection of the negative gradient direction onto u, and if you think about it, this is exactly the negative gradient direction itself. So for the case of the L2 norm, the statement that the negative gradient direction is the direction of steepest descent is actually true.
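For the L2 case this can also be verified algebraically with the Cauchy–Schwarz inequality (a one-line argument that is not spelled out in the video):

\[
\nabla f(x)^{\mathsf{T}} u \;\ge\; -\,\|\nabla f(x)\|_2 \, \|u\|_2 \;=\; -\,\|\nabla f(x)\|_2
\qquad \text{for } \|u\|_2 = 1,
\]

with equality exactly for \(u = -\nabla f(x) / \|\nabla f(x)\|_2\), the normalized negative gradient.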

Now let's consider other norms, and we'll start with the L1 norm. In the L1 norm we have a different unit ball; it looks like this (a diamond in two dimensions). Again we vary u over the boundary of the unit ball and look for the direction that produces the maximum projection of the negative gradient onto it, and we see that it lies here: this is the longest projection of the negative gradient direction onto the unit ball of the L1 norm. This is interesting, because we end up exactly on one of the coordinate axes. In a second example, where the negative gradient points in a different direction, we again end up exactly on one of our coordinate axes.

So steepest descent for the L1 norm selects in each iteration the component of the gradient with the maximum absolute value and then decreases or increases that coordinate, depending on the sign of the selected component. If we define i to be the index of the gradient component with maximum absolute value, and take the corresponding basis vector e_i of our d-dimensional coordinate system, then the steepest descent direction is given by the minimization over u that we have seen previously, now with respect to the L1 norm, and it turns out to be simply minus the sign of the derivative of the function with respect to exactly this coordinate, times e_i.
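In formulas (restating the statement above in standard notation):

\[
i \;=\; \arg\max_{j} \left| \frac{\partial f(x)}{\partial x_j} \right| ,
\qquad
\Delta x \;=\; \arg\min_{\|u\|_1 = 1} \nabla f(x)^{\mathsf{T}} u
\;=\; -\,\operatorname{sign}\!\left( \frac{\partial f(x)}{\partial x_i} \right) e_i ,
\]

where \(e_i\) is the i-th canonical basis vector of the d-dimensional space.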

So this essentially means that steepest descent using the L1 norm results in coordinate descent: coordinate descent is essentially steepest descent with respect to the L1 norm (a small numerical check of this follows below). What happens if we take other norms? Let's look ...
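To make the coordinate descent claim concrete, here is a small, self-contained numerical check in Python. The gradient values are made up purely for illustration, and the brute-force sampling is my own device, not something shown in the video.

    import numpy as np

    # Gradient at some point (made-up values, just for illustration)
    g = np.array([0.4, -2.5, 1.1])

    # Analytic L1 steepest descent direction: the signed coordinate axis of
    # the component with the largest absolute value
    i = np.argmax(np.abs(g))
    u_analytic = np.zeros_like(g)
    u_analytic[i] = -np.sign(g[i])

    # Brute force: sample many directions on the boundary of the L1 unit ball
    # and keep the one with the smallest inner product with the gradient
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(100_000, 3))
    samples /= np.abs(samples).sum(axis=1, keepdims=True)   # L1-normalize
    u_best = samples[np.argmin(samples @ g)]

    print(u_analytic, g @ u_analytic)   # [0. 1. 0.] and -2.5, i.e. -max_j |g_j|
    print(u_best, g @ u_best)           # approaches, but never beats, the analytic value

No randomly sampled direction on the L1 unit ball achieves a smaller inner product with the gradient than the signed coordinate axis, which is exactly the coordinate descent step described above.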

Part of a video series:
Accessible via: Open Access
Duration: 00:16:10 min
Recording date: 2020-11-11
Uploaded on: 2020-11-11 13:57:32
Language: en-US

In this video, we see how different norms alter the descent direction during optimization.

This video is released under CC BY 4.0. Please feel free to share and reuse.

For reminders about new videos, follow us on Twitter or LinkedIn. Also, join our network for information about talks, videos, and job offers in our Facebook and LinkedIn groups.

Music Reference: Damiano Baldoni - Thinking of You
