Welcome everybody to deep learning. Today we want to continue talking about loss functions and optimization, and we want to go into a bit more detail on these interesting problems. So let's first talk about the loss functions.
Loss functions are generally task-dependent: for different tasks you choose different loss functions. The two most important tasks that we are facing are regression and classification. In classification you want to estimate a discrete variable for every input. This means that in this two-class problem here on the left you essentially want to decide whether a dot is blue or red, so you need to model a decision boundary. In regression the idea is that you want to model a function that explains your data. So you have some input variable, let's say x2, and you want to predict x1 from it. You compute a function that will produce the appropriate value of x1 for any given x2. Here in this example
you can see this is a line fit. So far we have talked about activation functions, the last activation, softmax, the cross-entropy loss, and how we combine them. Obviously there is a difference between the last activation function of a network and the loss function. The last activation function is applied to each individual sample xm of the batch, and it is present at both training and testing time. So the last activation function becomes part of the network and remains there; it produces the output, i.e. the prediction, and generally produces a vector. The loss function, in contrast, combines all m samples and labels, and in their combination they produce a loss that describes how good the fit is. It is only present during training, and this loss is generally a scalar value. Interestingly, many of these loss functions can be put in
a probabilistic framework, and this leads us to maximum likelihood estimation. In maximum likelihood estimation, just as a reminder, we consider everything to be probabilistic: we have a set of observations, capital X, that consists of the individual observations. Then we have the associated labels, which also stem from some distribution; these labels are y1 to ym. And then of course we need a conditional probability density function that describes how y and x are related. In particular, we can compute the probability of y given some observation x, which will be very useful if we want to decide on a specific class, for example. Now we have to somehow model this data set.
The samples are drawn from some distribution, and the joint probability for the given data set can then be expressed as a product over the individual conditional probabilities. If the samples are independent and identically distributed, you can simply write this as a large product over the entire training data set. So you end up with this product over all m samples, where it's just a product of the conditionals.
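To make this concrete, here is a minimal NumPy sketch (not from the lecture; the probability values are made up for illustration): under the i.i.d. assumption, the joint probability of the training set is just the product of the per-sample conditional probabilities p(ym | xm).

```python
import numpy as np

# Hypothetical per-sample conditional probabilities p(y_m | x_m),
# e.g. the probability the model assigns to each sample's true label.
per_sample_probs = np.array([0.9, 0.8, 0.95, 0.7])

# Under the i.i.d. assumption, the joint probability of the whole
# training set is the product over all m samples.
joint_prob = np.prod(per_sample_probs)
print(joint_prob)
```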
This is useful because we can then determine the best parameters by maximizing the joint probability over the entire training data set, which means evaluating this large product. Now this large product has a couple of numerical problems: since it multiplies many small or large factors, it can underflow or overflow very quickly. So it is useful to transform the entire problem into the log domain, and this works because the logarithm is a monotonic transformation and does not change the position of the maximum. We can additionally apply a negative sign and thereby flip the maximization into a minimization, and we can show that this is equivalent: instead of looking at the likelihood function, we look at the negative log-likelihood function, and then our large product becomes a sum over all the observations of the negative logarithms of the conditional probabilities.
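The numerical argument can be seen directly in a small sketch (again not from the lecture; the probabilities are randomly generated for illustration): with many samples, the raw product underflows to zero in floating point, while the sum of negative logs stays perfectly usable.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample conditional probabilities for 10,000 samples.
probs = rng.uniform(0.1, 0.9, size=10_000)

product = np.prod(probs)       # underflows to exactly 0.0 in float64
nll = -np.sum(np.log(probs))   # negative log-likelihood: finite and well-behaved
print(product, nll)
```

Because the log is monotonic, minimizing `nll` finds the same parameters that maximizing the product would, without the underflow.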
Now we can look at the univariate Gaussian model, so we are one-dimensional again, and we can model this with a normal distribution where we choose the output of our network ŷ(x) as the expected value and 1/β as the variance, i.e. β is the precision. If we do so, we find the following formulation: p(y | x) = √(β/(2π)) · exp(−β (y − ŷ(x))² / 2). Okay so let's go ahead and put this into our
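Spelling out the step this leads to (writing ŷ(x_m) for the network's prediction on sample m), plugging the Gaussian model into the negative log-likelihood gives:

```latex
-\log \prod_{m=1}^{M} p(y_m \mid x_m)
= -\sum_{m=1}^{M} \log\!\left( \sqrt{\tfrac{\beta}{2\pi}}
    \exp\!\left( -\tfrac{\beta}{2}\, (y_m - \hat{y}(x_m))^2 \right) \right)
= \frac{\beta}{2} \sum_{m=1}^{M} (y_m - \hat{y}(x_m))^2
  - \frac{M}{2} \log \frac{\beta}{2\pi}
```

The last term is constant with respect to the network parameters, so minimizing the negative log-likelihood under a Gaussian model is equivalent to minimizing the sum of squared errors, i.e. the L2 loss.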
Access: open
Duration: 00:14:48 min
Recorded: 2020-05-29
Uploaded: 2020-05-29 19:26:36
Language: en-US
Deep Learning - Loss and Optimization Part 1
This video explains how to derive L2 Loss and Cross-Entropy Loss from statistical assumptions. Highly relevant for the oral exam!
Further Reading:
A gentle Introduction to Deep Learning