Welcome back to deep learning. In this short video we want to look at some basic functions of neural networks. In particular, we want to look at the softmax function and at some ideas on how we would train deep networks.
Now, so far we have described the ground truth labels by minus one and plus one.
But of course we could also have some classes zero and one.
This is really only a matter of definition as long as we only decide between two classes.
But if you want to go into more complex cases, you want to be able to classify multiple classes.
So in this case you probably want to have an output vector.
Here you have essentially one dimension per class, so the vector has K dimensions, where K is the number of classes.
You can then define a ground truth representation as a vector that has all zeros except for one position and this is the true class.
So this is also called one-hot encoding because all entries of the vector are zero and only a single position contains a one.
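As a small illustration (not from the lecture itself), here is a minimal sketch of one-hot encoding in Python with NumPy, using a hypothetical class index:

```python
import numpy as np

def one_hot(true_class, num_classes):
    """Return a vector of zeros with a single one at the true class index."""
    y = np.zeros(num_classes)
    y[true_class] = 1.0
    return y

# Hypothetical example: class 2 out of K = 4 classes
print(one_hot(2, 4))  # [0. 0. 1. 0.]
```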
Now you train a classifier that produces such a vector, and with this estimated vector y hat you can then go ahead and do the classification.
So it essentially guesses an output probability for each of the classes.
For multi-class problems in particular, this has been shown to be a more effective approach.
The problem is that you want to have a kind of probabilistic output between zero and one,
but we typically have some input vector x that could be arbitrarily scaled.
So in order to produce now our predictions we employ a trick.
The trick is that we use the exponential function as it will map everything into a positive space.
Now you also want to make sure that the maximum value that can be attained is exactly one.
So you compute the sum over the exponentials of all input elements, and then you divide each individual exponential by this sum.
This always scales the outputs to the zero-one range.
The resulting vector will also have the property that if you sum up all elements it will equal to one.
These are two of the axioms of probability introduced by Kolmogorov.
So this allows us to treat the output of the network always as a kind of probability.
If you look into the literature, and also into software packages, the softmax function is sometimes known as the normalized exponential function.
It's the same thing.
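As my own illustration (not code from the lecture), a minimal NumPy sketch of this normalization could look like this:

```python
import numpy as np

def softmax(x):
    """Map an arbitrarily scaled input vector to a probability vector."""
    # Subtracting the maximum is a common numerical-stability trick;
    # it does not change the result because the shift cancels in the ratio.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

scores = np.array([-1.0, 0.5, 3.0])   # hypothetical network outputs
probs = softmax(scores)
print(probs, probs.sum())             # entries in (0, 1), summing to 1
```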
Now let's look into an example.
So let's say this is our input to our neural network.
You see the small image on the left.
Now you introduce labels for this four-class problem.
Then you have some arbitrary input, shown here in the column x superscript k.
These values range from minus 3.44 to 3.01.
This is not so great.
Let's use the exponential function.
Now everything is mapped into positive numbers and there's quite a difference now between the numbers.
So we need to rescale them.
And now you can see that the highest probability is of course returned for heavy metal in this case.
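To make the two steps concrete, here is a small sketch of this computation. The exact values from the slide are not reproduced here, so the numbers below are only hypothetical values in the same range:

```python
import numpy as np

# Hypothetical scores for the four classes, roughly in the range mentioned
# on the slide (-3.44 to 3.01); these are not the actual slide values.
scores = np.array([-3.44, -0.8, 1.2, 3.01])
exp_scores = np.exp(scores)                # step 1: map everything to positive numbers
probs = exp_scores / np.sum(exp_scores)    # step 2: rescale so the outputs sum to one
print(probs)                               # the largest probability goes to the largest score
```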
So let's go ahead and also talk a bit about loss functions.
So the loss function tells us how good the prediction of the network is.
A very typical one is the so-called cross entropy loss.
The cross entropy is computed between two probability distributions.
So if you have your ground truth distribution and the one that you're estimating,
then you can compute the cross entropy in order to determine how well they align with each other.
Then you can also use this as a loss function.
Here we can use the property that all elements of the ground truth vector are zero except for the true class.
So we only have to determine the negative logarithm of y hat subscript k, where k is the true class.
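As a hedged sketch of this (my own illustration, assuming NumPy and a one-hot ground truth vector), the general cross entropy and its one-hot simplification could look like this:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """General cross entropy between two probability vectors."""
    return -np.sum(y_true * np.log(y_pred + eps))

def cross_entropy_one_hot(true_class, y_pred, eps=1e-12):
    """For a one-hot ground truth this reduces to -log(y_hat[true_class])."""
    return -np.log(y_pred[true_class] + eps)

y_true = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot ground truth
y_pred = np.array([0.1, 0.2, 0.6, 0.1])   # hypothetical softmax output
print(cross_entropy(y_true, y_pred))       # both give the same value
print(cross_entropy_one_hot(2, y_pred))
```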