Increase the representational power that we can achieve with a couple of neurons.
So you've seen that we can map, for example, a decision tree entirely with only two hidden layers.
So this was very nice.
And then we started looking into the backpropagation algorithm, that is, how this network is actually trained.
We've seen that backpropagation is essentially a method of very efficiently computing the gradient with respect to the weights.
We've seen this local representation, how the different streams pass through the neuron,
what the gradients look like with respect to the local function, and how they are passed through the network.
And then we've seen the full view, where we only looked at essentially linear problems,
where we had matrix representations, and you could see that the derivatives with respect to the different sets of parameters
in our objective function then led very nicely to the derivatives of the different layers in the network.
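The layer-wise derivatives described above can be sketched numerically. This is an illustrative two-layer linear example with a squared-error loss; the names, shapes, and random data are assumptions for illustration, not the lecture's exact notation:

```python
import numpy as np

# Illustrative sketch: a two-layer linear network y_hat = W2 @ (W1 @ x)
# with a squared-error loss. Backpropagation chains the local derivatives
# from the loss backwards through the layers.

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # input vector (assumed shapes)
y = rng.normal(size=(2, 1))          # target
W1 = rng.normal(size=(4, 3))         # first-layer weights
W2 = rng.normal(size=(2, 4))         # second-layer weights

# Forward pass
h = W1 @ x                           # hidden activations (linear)
y_hat = W2 @ h                       # prediction
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: upstream gradient is passed layer by layer
g = y_hat - y                        # dLoss/dy_hat
dW2 = g @ h.T                        # gradient for the second layer
g = W2.T @ g                         # gradient flowing into the hidden layer
dW1 = g @ x.T                        # gradient for the first layer

# Sanity check of one entry against a numerical derivative
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (0.5 * np.sum((W2 @ (W1p @ x) - y) ** 2) - loss) / eps
print(abs(num - dW1[0, 0]))          # should be very small
```

The numerical check at the end is the usual way to convince oneself that the chained matrix derivatives are correct.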
So now we've seen that we want to somehow optimize this.
We know how to compute the gradient, and now we want to look a little more into how to construct loss functions and how to actually optimize.
So these will be today's two topics.
So the loss functions, we essentially have two typical tasks that we are dealing with.
So this is the typical pattern recognition approach: either you want to classify,
which means that you want to find a decision boundary, and this decision boundary is then able to separate your data into classes.
We typically show examples with two classes, but generally we can also apply this to multi-class problems, and we will see that in a bit.
And then the other approach is that we want to regress.
So we have one variable and want to predict another variable,
or we have a set of variables and want to predict another set of variables.
This is the typical regression problem.
So here on the left-hand side, you see the classification and the boundary, and on the right-hand side, you see this regression.
So I have this variable x2, and I want to predict x1; these are my observations,
and I want to find a model that is able to translate one into the other.
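The regression setting can be sketched with a simple least-squares fit; the data, model, and noise level here are assumed for illustration:

```python
import numpy as np

# Sketch of the regression setting: observe pairs (x2, x1) and fit a
# simple linear model x1 ≈ a * x2 + b by least squares.

rng = np.random.default_rng(1)
x2 = np.linspace(0.0, 1.0, 20)
x1 = 2.0 * x2 + 0.5 + 0.05 * rng.normal(size=20)   # noisy observations

# Design matrix with a column of ones for the intercept
A = np.stack([x2, np.ones_like(x2)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x1, rcond=None)
print(a, b)   # close to the true slope 2.0 and intercept 0.5
```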
And obviously, there are different loss functions that are more or less suited for specific tasks, which we will discuss shortly.
So there's an important difference between the loss function and the last activation function.
So you remember we had these neurons, and every neuron had an activation function.
For this activation function you could take sigmoids, or the rectified linear unit that has this kind of hinge shape,
and they are applied every time.
So this activation function is always there, and it's applied on individual samples of the batch, and it's present at training and testing time.
So the activation function doesn't change. It's a part of the network.
So it's part of the prediction.
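The two activation functions mentioned above can be written down directly; this is a minimal sketch using their standard definitions:

```python
import numpy as np

# Activation functions are applied per neuron, per sample, at both
# training and test time; they are a fixed part of the network.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified linear unit: the hinge shape, zero for negative inputs
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))   # smooth squashing into (0, 1)
print(relu(z))      # [0. 0. 3.]
```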
The loss function is only present when we do the training.
The loss function tells us how good a prediction is, and it combines all of our samples in the training data,
and we only use it in the training to see how good a set of parameters is.
So we can assess the quality of a set of parameters,
and this then allows us to compute the gradient of the loss with respect to those parameters.
So the loss function generally produces a scalar value that is a kind of quality measure.
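In contrast to the per-sample activation, the loss reduces the whole batch to one scalar. A minimal sketch, using mean squared error as an assumed example:

```python
import numpy as np

# The loss combines all samples into a single scalar quality measure;
# it is only used during training.

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)   # one scalar for the whole batch

y_pred = np.array([0.9, 0.2, 0.8])
y_true = np.array([1.0, 0.0, 1.0])
print(mse_loss(y_pred, y_true))  # a single number assessing the parameters
```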
Okay, so let's go a bit back and consider maximum likelihood estimations.
So generally we can describe our problem with a conditional probability density function,
and we have a general set of training observations, x.
There are m different observations in the training set, and each one is a vector of input data;
we don't specify the dimension here, so it is arbitrary, but fixed.
And then we also have associated labels, and here we also denote them as vectors,
because we consider them one-hot encoded, as discussed in the last lecture.
So we have essentially one entry for every class that we wish to predict.
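The one-hot encoding of the labels can be sketched in a few lines; the helper name is my own for illustration:

```python
import numpy as np

# One-hot encoding: one entry per class, 1 at the true class, 0 elsewhere.

def one_hot(label, num_classes):
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(2, 4))  # class 2 of 4 -> [0. 0. 1. 0.]
```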
Now the probability of observing a certain label, given an observation, we can express as P of y given x,
and the joint probability of all our labels can then be computed as the product of the individual probabilities
if they are independent and identically distributed.
So we assume here that all of our observations and labels are independent and identically distributed.
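Under this i.i.d. assumption, the likelihood factorizes into a product over samples; in practice one works with the log-likelihood, which turns the product into a sum. A small sketch with assumed example probabilities:

```python
import numpy as np

# Under i.i.d., the joint probability of the labels is the product of the
# per-sample probabilities p(y_i | x_i). Taking the log turns this into a
# sum, which is the numerically preferable objective.

p = np.array([0.9, 0.8, 0.95])        # assumed per-sample probabilities

joint = np.prod(p)                    # product over all samples
log_lik = np.sum(np.log(p))           # equivalent objective, as a sum

print(joint, log_lik)                 # log(joint) equals the log-likelihood
```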
Presenters
Accessible via
Open access
Duration
01:07:21 min
Recording date
2018-04-25
Uploaded on
2018-04-25 16:49:09
Language
en-US