Welcome back to deep learning. So let's continue with our lecture. Today we want to talk a bit
about loss functions and optimization. I want to look into a couple more optimization problems,
and one that we've already seen is the perceptron case.
You remember that we were minimizing a sum over all of the misclassified samples. We chose this
formulation because it lets us get rid of the sign function and only look at the samples that are
actually misclassified. Also note that we don't use class labels 0 and 1 here but -1 and +1, because
this allows us to multiply with the class label. For misclassified samples this product will always
be negative. We then add a negative sign at the very beginning such that we always end up with a
positive value, and the smaller this positive value, the smaller our loss will be. So we seek to
minimize this function. The sign function does not appear in this criterion because we found an
elegant way to formulate the loss without it. If it were in, we would run into problems, because
then we would only count the number of misclassifications and not differentiate whether a sample
is far away from the decision boundary or close to it. We would simply end up with a count, and
if we look at the gradient, it would essentially vanish everywhere.
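Written out, the criterion we just described and the naive counting variant look roughly as follows
(the notation here is mine: $\hat{y}_m$ denotes the classifier output for sample $m$, $\mathcal{M}$
the set of misclassified samples, and $M$ the total number of samples):

\[
L_{\text{perceptron}} \;=\; -\sum_{m \in \mathcal{M}} y_m \, \hat{y}_m
\qquad \text{vs.} \qquad
L_{0/1} \;=\; \sum_{m=1}^{M} \tfrac{1}{2}\bigl(1 - \operatorname{sign}(y_m \, \hat{y}_m)\bigr)
\]

The counting loss on the right is piecewise constant, so its gradient with respect to the parameters
vanishes almost everywhere, while the criterion on the left grows with the distance of a misclassified
sample to the decision boundary and therefore provides a usable descent direction.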
So it's not an easy optimization problem: we don't know in which direction to go, so we can't find
a good descent direction. What did we do about this last time? Well, we somehow need to relax this,
and there are ways to fix it. One way to go ahead is to use the so-called hinge loss. With the hinge
loss we relax this zero-one function into something that behaves linearly on a large domain. The idea
is that we essentially use a line that hits the x-axis at one and the y-axis also at one. If we do it
this way, we can simply rewrite this using the max function. The hinge loss is then a sum over all
samples, where a sample contributes zero as soon as its value y subscript m times y hat subscript m
is larger than one. To get there, we rewrite the right-hand part a little and take one minus
y subscript m times y hat subscript m. Here you can see that we keep the same idea as before: if a
sample lies on the wrong side of the boundary, the product y subscript m times y hat subscript m is
negative, so one minus this product becomes large, and strongly misclassified samples produce large
loss values. We also got rid of the problem of having to find the set of misclassified samples
capital M. Now we can take the full set of samples, because everything that fulfills the constraint
is automatically clamped to zero by the max function and does not influence the loss. Thus it's a
very interesting way of formulating the same problem: implicitly, only the misclassified samples
(and samples close to the decision boundary) contribute to this loss function. It can be shown that
the hinge loss is a convex approximation of the misclassification loss that we considered earlier.
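As a minimal sketch (my own variable names; y is assumed to hold labels in {-1, +1} and y_hat the
real-valued classifier outputs), the hinge loss described above could be computed like this:

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Sum of max(0, 1 - y_m * y_hat_m) over all samples."""
    margins = y * y_hat                      # positive if classified on the correct side
    return np.sum(np.maximum(0.0, 1.0 - margins))

# Two samples: one safely classified, one misclassified
y = np.array([1.0, -1.0])
y_hat = np.array([2.0, 0.5])
print(hinge_loss(y, y_hat))                  # 0 + (1 - (-0.5)) = 1.5
```

Note how the clamping by np.maximum implements the implicit selection of the relevant samples, so
the explicit set of misclassifications capital M is no longer needed.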
One big issue with this kind of optimization problem is, of course, the gradient. The hinge loss has
a kink, so the function is not differentiable at the point where y subscript m times y hat subscript m
equals one. It is unclear what the derivative should be there, and now you could say: okay, I can't
compute the derivative of this function, so I'm doomed. Luckily, subgradients save the day.
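To make the kink concrete, write t for the product y subscript m times y hat subscript m and look at
the pointwise hinge function and its one-sided derivatives (a small worked example in my own notation):

\[
h(t) = \max(0,\, 1 - t),
\qquad
h'(t) =
\begin{cases}
-1, & t < 1\\
\phantom{-}0, & t > 1
\end{cases}
\]

At t equal to one the left and right derivatives disagree, so the ordinary derivative is undefined
there; however, any slope between minus one and zero still gives a line that never rises above h,
which is exactly the idea we formalize next.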
So let's introduce this concept, and in order to do so, we first have a look at convex, differentiable functions.
For those we can say that, for any point x, we can find a lower bound on f of x: it is given by
f of x zero plus the gradient of f at x zero multiplied with the difference between x and x zero.
So let's look at a graph to illustrate this concept. If you look at this function here, you can see
that I can take any point x zero and compute the gradient, which in this case is simply the tangent
that is constructed at that point. By doing so, you will see that every point on the tangent is a
lower bound to the entire function. It doesn't matter where I take this point: if I follow the
tangent, I am always constructing a lower bound. Now this kind of definition is much better suited
for our purposes.
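In formulas, this lower-bound property of a convex, differentiable function f is the standard
first-order condition (written here in my notation):

\[
f(x) \;\geq\; f(x_0) + \nabla f(x_0)^{\top} (x - x_0) \quad \text{for all } x,
\]

i.e. the tangent constructed at x zero never lies above the function.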
So let's expand on the gradient and go in the direction of subgradients. With subgradients we define
something that keeps this property but is not necessarily a gradient: a vector g is a subgradient of
a convex function f at x zero if the same property holds, i.e., if we take f of x zero plus g
multiplied with the difference between x and x zero, we always obtain a lower bound on f of x. The
nice thing is that this relaxes the requirement of being able to compute a gradient. There can be
multiple g's that fulfill this property, so g is not required to be unique. The set of all these
subgradients is then called the subdifferential, so the subdifferential at x zero is the set of all
subgradients at that point.
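As an illustration only (a minimal sketch of my own, not necessarily the exact update rule used later
in the lecture), one valid subgradient of the hinge loss can be plugged into a plain descent update;
at the kink I simply pick zero, which is one admissible element of the subdifferential:

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """One valid subgradient of max(0, 1 - y * w^T x) with respect to w."""
    if y * np.dot(w, x) < 1.0:          # misclassified or inside the margin
        return -y * x
    return np.zeros_like(w)             # flat region, and a valid choice at the kink

def subgradient_step(w, X, Y, lr=0.1):
    """One subgradient-descent step on the summed hinge loss."""
    g = sum(hinge_subgradient(w, x, y) for x, y in zip(X, Y))
    return w - lr * g

# Tiny usage example with two 2-D samples
X = np.array([[1.0, 2.0], [2.0, -1.0]])
Y = np.array([1.0, -1.0])
w = np.zeros(2)
for _ in range(50):
    w = subgradient_step(w, X, Y)
print(w)                                 # both samples end up exactly on the margin here
```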