7 - Deep Learning - Feedforward Networks Part 2 [ID:13483]

Welcome to deep learning. So in this little video we want to go ahead and look into some basic functions of neural networks. In particular, we want to look into the softmax function and into some ideas about how we could potentially train deep networks. We have a search technique, which is just local search by gradient descent, to try to find a program that is running on these recurrent networks such that it can solve some interesting problems such as speech recognition, machine translation, and so on. Okay, so let's start.

Activation functions for classification. So far we have described the ground truth by the labels minus one and plus one, but of course we could also use the classes zero and one. This is really only a matter of definition as long as we only decide between two classes. But if you want to go to more complex cases, you want to be able to classify multiple classes. In this case you probably want to have an output vector with essentially one dimension per class k, where capital K is the number of classes. You can then define a ground truth representation as a vector that has all zeros except for one position, and that position is the true class.
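As a quick illustration, here is a minimal sketch of how such a ground truth vector could be constructed; the function name and the concrete class index are made up for this example:

```python
import numpy as np

def ground_truth_vector(true_class, num_classes):
    """Build a vector of all zeros with a single one at the true class."""
    y = np.zeros(num_classes)
    y[true_class] = 1.0
    return y

# Hypothetical example: four classes, the true class has index 2.
print(ground_truth_vector(2, 4))  # [0. 0. 1. 0.]
```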

So this is also called one-hot encoding, because all of the other entries of the vector are zero and only a single position contains a one. But I think this is very difficult for normal people to understand; they would not know what they're looking at. And now you try to compute a classifier that will produce such a vector, and with this vector y hat you can then go ahead and do the classification. So it's essentially like guessing an output probability for each of the classes. In particular for multi-class problems, this has been shown to be a more effective way of approaching these problems. Now the problem is that you want to have a kind of probabilistic output

between zero and one, but we typically have some arbitrary input vector x, and that could be arbitrarily scaled. So in order to produce our predictions, we employ a trick, and the trick is that we use the exponential function. This is very nice because the exponential function maps everything into positive space, and now you want to make sure that the maximum that can be achieved is exactly one. So you do that for all of your classes: you take all input elements, apply the exponential function to them, and sum them up. This sum is the largest value that any single element can attain after the conversion, and you divide every exponentiated input by this number. This always scales the outputs to the range between zero and one, and it has the property that if you sum up all elements of the resulting vector, they equal one.
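Written out as a formula, this is the softmax function. With $x_k$ denoting the k-th input element and $K$ the number of classes, the k-th output can be written as:

$$\hat{y}_k = \frac{e^{x_k}}{\sum_{j=1}^{K} e^{x_j}}, \qquad 0 < \hat{y}_k < 1, \qquad \sum_{k=1}^{K} \hat{y}_k = 1.$$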

This is very nice because these are two of the axioms of a probability distribution as introduced by Kolmogorov. So this allows us to always treat the output of the network as a kind of probability. And that was my 1987 diploma thesis, which was all about that. If you look into the literature, or also in software examples, the softmax function is sometimes also known as the normalized exponential function; it's the same thing. Now let's look at an example.

So let's say this is the input to our neural network; you see this small image on the left. Now you introduce labels for this three-class problem. Wait, there's something missing. It's a four-class problem. So you introduce labels for this four-class problem, and then you have some arbitrary inputs that are shown here in the column x_k. They are scaled from minus 3.44 to 3.91. This is not so great, so let's use the exponential function. Now everything is mapped to positive numbers, and there is quite a difference between the numbers. So we need to rescale them, and you can see that the highest probability is of course returned for heavy metal in this image.
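To make this concrete in code, here is a minimal softmax sketch in Python. Only the smallest and largest scores (minus 3.44 and 3.91) are stated in the example; the two middle values below are hypothetical placeholders for illustration:

```python
import numpy as np

def softmax(x):
    """Normalized exponential: map arbitrary scores to a probability vector."""
    e = np.exp(x - np.max(x))  # subtracting the max keeps the exponentials numerically stable
    return e / e.sum()

# Scores x_k for the four classes; -3.44 and 3.91 are from the example,
# the two middle values are hypothetical placeholders.
scores = np.array([-3.44, 0.5, 1.2, 3.91])
probs = softmax(scores)
print(probs)        # the largest score gets by far the largest probability
print(probs.sum())  # sums to one
```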

So let's go ahead and also talk a bit about loss functions. A loss function is a function that tells you how good the prediction of a network is. A very typical one is the so-called cross-entropy loss, and it is the cross entropy computed between two probability distributions: the ground truth distribution and the one that you are estimating. You can compute the cross entropy between them in order to determine how well they align with each other, and you can then use this as a loss function.
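In symbols, with y denoting the ground truth distribution and y hat the estimated one, the cross-entropy loss can be written as:

$$L(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$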

Here we can use the property that all elements of the ground truth vector are zero except for the one at the true class. So we only have to determine the negative logarithm of y hat k, where k is the index of the true class.
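As a small sketch of this simplification, assuming the softmax probabilities from the earlier code example (rounded) and a hypothetical true class index, the loss reduces to a single negative logarithm:

```python
import numpy as np

def cross_entropy_one_hot(y_hat, true_class):
    """Cross-entropy loss with a one-hot ground truth:
    only the predicted probability of the true class contributes."""
    return -np.log(y_hat[true_class])

# Hypothetical values: roughly the softmax output of the earlier sketch,
# with the true class assumed to be the last one.
y_hat = np.array([0.0006, 0.0300, 0.0605, 0.9089])
print(cross_entropy_one_hot(y_hat, 3))  # -log(0.9089) ≈ 0.096
```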

Part of a video series:
Accessible via: Open Access
Duration: 00:12:04 min
Recording date: 2020-04-18
Uploaded on: 2020-04-18 10:06:06
Language: en-US

Deep Learning - Feedforward Networks Part 2

This video introduces the topics of activation functions, loss, and the idea of gradient descent.

Video References:
Lex Fridman's Channel

Music Reference:
The One They Fear - The Dawn

Further Reading:
A Gentle Introduction to Deep Learning

 

Tags

artificial intelligence, deep learning, machine learning, pattern recognition, Feedforward Networks, Gradient descent, activation functions