Welcome back to Pattern Recognition. Today we want to look a bit more into optimization, and the topic will be the actual update direction. We've seen gradient descent methods in the previous video, and today we want to think about how to pick a particular update direction.
In steepest descent methods, even the normalized ones, we should consider which update direction we actually want to choose. What we want is the largest decrease in the linear approximation of f, and we can constrain the candidate directions to the unit ball of an Lp norm. So we search for the update direction as the minimizer, over all u of unit length according to our norm, of the inner product of the gradient with u. If we select the update direction this way, we observe that the steepest direction is not necessarily the negative gradient direction; it depends on the chosen norm. So the negative gradient is not always the best choice for the search direction.
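Written out (as a side note not spoken verbatim in the lecture, using Δx to denote the steepest descent direction), the direction we search for is

\[
\Delta x = \arg\min_{u} \left\{ \nabla f(x)^{T} u \;:\; \|u\| = 1 \right\},
\]

i.e. the unit-norm direction along which the linear approximation of f decreases the most.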
Let's look at how to make this precise with a linear approximation. We consider the first-order Taylor approximation of f(x + u) around the selected position x: f(x + u) is approximately f(x) plus the inner product of the gradient of f at x with the direction u. This inner product of the gradient and u is the directional derivative of f at x in the direction u. The vector u is a descent direction if its inner product with the gradient is negative, i.e., smaller than zero.
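As a compact summary of this step (notation added here for clarity), the first-order Taylor approximation and the resulting descent condition read

\[
f(x + u) \approx f(x) + \nabla f(x)^{T} u, \qquad \nabla f(x)^{T} u < 0 \;\Rightarrow\; u \text{ is a descent direction.}
\]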
This gives rise to a generalized steepest descent method. We again have the function and the initial estimate, but now we also choose a norm, and we initialize k to zero. The first step is to compute the update direction; here it is not simply the negative gradient, but the steepest descent direction according to our norm. Then we do a 1-D line search along this direction, update with the resulting step size, and iterate until we are converged, i.e., there is only little change in x.
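To make the loop concrete, here is a minimal Python sketch of this generalized steepest descent method, assuming a backtracking (Armijo) line search for the 1-D step size search; all function names, parameters, and the quadratic example are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def l2_direction(grad):
    # Steepest descent direction for the L2 norm: the normalized negative gradient.
    return -grad / np.linalg.norm(grad)

def backtracking_step(f, x, grad, d, alpha=1.0, rho=0.5, c=1e-4):
    # Shrink the step size until the sufficient decrease (Armijo) condition holds.
    while f(x + alpha * d) > f(x) + c * alpha * grad @ d:
        alpha *= rho
    return alpha

def steepest_descent(f, grad_f, x0, direction=l2_direction, tol=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        grad = grad_f(x)
        if np.linalg.norm(grad) < 1e-12:        # already at a stationary point
            break
        d = direction(grad)                     # norm-dependent update direction
        alpha = backtracking_step(f, x, grad, d)
        x_new = x + alpha * d                   # step along the chosen direction
        if np.linalg.norm(x_new - x) < tol:     # converged: only little change in x
            return x_new
        x = x_new
    return x

# Usage on a simple quadratic bowl:
f = lambda x: 0.5 * (3.0 * x[0] ** 2 + x[1] ** 2)
grad_f = lambda x: np.array([3.0 * x[0], x[1]])
print(steepest_descent(f, grad_f, [2.0, -1.5]))   # approaches [0, 0]
```

The direction function is passed in as a parameter so that the same loop can later be reused with directions derived from other norms.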
Now let's have a look at different norms, starting with the unit ball of the L2 norm. Here we have indicated the negative gradient direction and the unit ball. Among all directions u on this ball, we seek the one with the largest projection of the negative gradient onto u, and you can see that this is exactly the negative gradient direction itself. So for the L2 norm, the statement that the negative gradient is the direction of steepest descent is actually true.
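For the L2 norm this can also be verified directly (a short side calculation, not spoken in the lecture): the minimizer of the directional derivative over the L2 unit ball is the normalized negative gradient,

\[
\arg\min_{\|u\|_{2} = 1} \nabla f(x)^{T} u = -\frac{\nabla f(x)}{\|\nabla f(x)\|_{2}},
\]

which follows from the Cauchy-Schwarz inequality.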
Now let's consider other norms, starting with the L1 norm. The L1 norm has a different unit ball, the diamond shape shown here. Again we vary u over the boundary of this unit ball and look for the direction that produces the maximum projection of the negative gradient onto it; we see that it lies at a corner of the diamond. This is the longest projection of the negative gradient direction onto the unit ball of the L1 norm, and it is interesting because we end up exactly on one of the coordinate axes. In a second example, with the negative gradient pointing in a different direction, we again end up on one of the coordinate axes. So steepest descent with the L1 norm selects in each iteration the component of the gradient with the maximum absolute value, and then decreases or increases along that coordinate depending on the sign of the selected component.
component with maximum absolute value then we can find some base vectors EI in our d
dimensional space that corresponds to this coordinate system and then the steepest descent
direction is simply given as the minimization over u as we've seen previously with respect
to the L1 norm and this then turns out to be simply the sign of the derivative of the
function with respect to exactly this coordinate. So this essentially means that steepest gradient
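As a small illustrative sketch (the names and the example call are assumptions, not from the lecture), this selection rule is only a few lines of Python; the returned direction could be plugged into the generic steepest descent loop sketched earlier in place of the L2 direction.

```python
import numpy as np

def l1_direction(grad):
    # L1 steepest descent: pick the coordinate with the largest absolute
    # partial derivative and move against its sign, i.e. a coordinate descent step.
    i = np.argmax(np.abs(grad))              # index of the largest gradient component
    d = np.zeros_like(grad, dtype=float)
    d[i] = -np.sign(grad[i])                 # move along +/- e_i, against the sign
    return d

print(l1_direction(np.array([0.2, -3.0, 1.5])))   # -> [ 0.  1.  0.]
```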
So steepest descent using the L1 norm results in coordinate descent; coordinate descent is essentially steepest descent with respect to the L1 norm. What happens if we take other norms? We will look at that next.
In this video, we see how different norms alter the direction of the gradient during the optimization.
This video is released under CC BY 4.0. Please feel free to share and reuse.
Music Reference: Damiano Baldoni - Thinking of You