10 - Deep Learning - Loss and Optimization Part 1 [ID:16870]

Welcome everybody to deep learning. So today we want to continue talking about loss functions and optimization, and we want to go into a bit more detail on these interesting problems. So let's talk first about the loss functions.

So loss functions are generally used for different tasks, and for different tasks you'd have different loss functions. The two most important tasks that we are facing are regression and classification. In classification you want to estimate a discrete variable for every input. This means that in this two-class problem here on the left you essentially want to decide whether a point belongs to the blue or to the red dots, so you need to model a decision boundary. In regression the idea is that you want to model a function that explains your data. So you have some input variable, let's say x2, and you want to predict x1 from it. You compute a function that will produce the appropriate value of x1 for any given x2; here in this example you can see a line fit.
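To make the regression example concrete, here is a minimal NumPy sketch of such a line fit, predicting x1 from x2 by least squares; the data and variable names are made up for illustration and are not from the lecture.

```python
import numpy as np

# Synthetic data: x1 depends roughly linearly on x2, plus some noise.
rng = np.random.default_rng(0)
x2 = np.linspace(0.0, 1.0, 100)
x1 = 2.0 * x2 + 0.5 + 0.1 * rng.standard_normal(100)

# Least-squares line fit: find slope a and intercept b that minimize the squared error.
A = np.stack([x2, np.ones_like(x2)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x1, rcond=None)

x1_pred = a * x2 + b                      # predictions of the fitted line
l2_loss = np.mean((x1 - x1_pred) ** 2)    # mean squared error of the fit
print(a, b, l2_loss)
```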

So we already talked about activation functions, the last activation, the softmax, the cross-entropy loss, and how we combine them. And obviously there's a difference between the last activation function in a network and the loss function. The last activation function is applied to the individual samples x_m of a batch, and it is present at training and at testing time. So the last activation function becomes part of the network, remains there, produces the output or the prediction, and generally produces a vector. The loss function, in contrast, combines all M samples and their labels, and from this combination it produces a loss that describes how good the fit is. This loss is generally a scalar value, it is only present during training, and you only need it during training time.
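As a minimal sketch of this distinction (NumPy, with made-up names and data), the softmax is applied per sample and returns a probability vector, while the cross-entropy loss combines all M samples and labels into a single scalar:

```python
import numpy as np

def softmax(logits):
    """Last activation: applied per sample, returns a probability vector."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_loss(probs, labels):
    """Loss: combines all M samples and labels into one scalar value."""
    m = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(m), labels] + 1e-12))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2,  0.3]])     # network outputs for M = 2 samples
labels = np.array([0, 1])                 # ground-truth class indices

probs = softmax(logits)                   # one vector per sample (train and test time)
loss = cross_entropy_loss(probs, labels)  # single scalar (training time only)
print(probs, loss)
```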

Interestingly, many of these loss functions can be put into a probabilistic framework, and this then leads us to maximum likelihood estimation. In maximum likelihood estimation, just as a reminder, we consider everything to be probabilistic. So we have a set of observations, capital X, that consists of the individual observations x1 to xM. Then we have the associated labels, which also stem from some distribution; the labels are y1 to yM. And then of course we need a conditional probability density function that describes how y and x are related. In particular, we can compute the probability of y given some observation x, which will be very useful if we want to decide for a specific class, for example.
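Written out in symbols, a rough sketch of this setup looks as follows; the parameter symbol θ for the network weights is my notation here, not necessarily the lecture's.

```latex
\begin{gather*}
X = \{x_1, \dots, x_M\}, \qquad Y = \{y_1, \dots, y_M\} \\
p(y \mid x;\, \theta) \quad \text{(conditional density relating $y$ and $x$)} \\
\hat{y} = \arg\max_{y} \; p(y \mid x;\, \theta) \quad \text{(deciding for a specific class)}
\end{gather*}
```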

Now we have to somehow model this data set. The samples are drawn from some distribution, and the joint probability of the given data set can then be written in terms of the individual conditional probabilities. If the samples are independent and identically distributed, this joint probability factorizes over the entire training data set, so you end up with a large product over all M samples, where each factor is just the conditional probability of one sample.
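As a sketch in formulas, under the i.i.d. assumption the likelihood of the whole training set is (again with θ denoting the model parameters):

```latex
p(Y \mid X;\, \theta) \;=\; \prod_{m=1}^{M} p(y_m \mid x_m;\, \theta)
```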

This is useful because we can then determine the best parameters by maximizing this joint probability over the entire training data set, which means evaluating this large product. Now, this large product has a couple of numerical problems: if we multiply many very small or very large values, the result underflows or overflows very quickly. So it is attractive to transform the entire problem into the log domain, and this works because the logarithm is a strictly monotonically increasing transformation and therefore doesn't change the position of the maximum. We can additionally put a negative sign in front and flip the maximization into a minimization, and we can show that this is equivalent: instead of maximizing the likelihood function we minimize the negative log-likelihood, and our large product suddenly turns into a sum over all observations of the negative logarithms of the conditional probabilities.
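In symbols, a sketch of this equivalence reads:

```latex
\hat{\theta}
  \;=\; \arg\max_{\theta} \, \prod_{m=1}^{M} p(y_m \mid x_m;\, \theta)
  \;=\; \arg\min_{\theta} \left( -\sum_{m=1}^{M} \log p(y_m \mid x_m;\, \theta) \right)
```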

Now we can look at the univariate Gaussian model, so now we are one-dimensional again, and we can model the conditional probability with a normal distribution, where we choose the output of our network ŷ as the expected value and one over some variable β as the variance, so β plays the role of a precision. If we do so, we arrive at the following formulation: √β / √(2π) times the exponential of minus β times the label minus the prediction squared, divided by 2. Okay, so let's go ahead and put this into our negative log-likelihood.
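As a sketch of where this substitution leads, the negative log-likelihood of the Gaussian model reduces to a sum of squared errors plus a constant:

```latex
\begin{align*}
-\sum_{m=1}^{M} \log p(y_m \mid x_m;\, \theta)
  &= -\sum_{m=1}^{M} \log\!\left( \frac{\sqrt{\beta}}{\sqrt{2\pi}}
       \exp\!\left( -\frac{\beta}{2}\,(y_m - \hat{y}_m)^2 \right) \right) \\
  &= \frac{\beta}{2} \sum_{m=1}^{M} (y_m - \hat{y}_m)^2
     \;+\; \frac{M}{2}\,\log\frac{2\pi}{\beta}
\end{align*}
```

The second term does not depend on the network output, so minimizing the negative log-likelihood amounts to minimizing the sum of squared errors, i.e. the L2 loss.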

Part of a video series:
Access: Open Access
Duration: 00:14:48 min
Recording date: 2020-05-29
Uploaded on: 2020-05-29 19:26:36
Language: en-US

Deep Learning - Loss and Optimization Part 1

This video explains how to derive L2 Loss and Cross-Entropy Loss from statistical assumptions. Highly relevant for the oral exam!

Further Reading:
A gentle Introduction to Deep Learning

Tags

Perceptron, Introduction, artificial intelligence, deep learning, machine learning, pattern recognition