24 - Pattern Recognition [PR] - PR 20 [ID:23066]

Welcome everybody to pattern recognition. So today we want to look into multilayer perceptrons

that are also called neural networks. So we'll give a brief sketch of the ideas of neural

networks.

Okay so let's have a look at multilayer perceptrons. Note that here we talk only about very basic concepts. If you're interested in neural networks we have an entire class

on deep learning where we talk about all the details. So here we will stay rather on the

surface. You may know that neural networks are extremely popular because they also have

this physiological motivation. We've seen that the perceptron essentially computes a weighted sum of the incoming elements, an inner product plus a bias, and you could say that this has some relation to neurons because neurons are connected via axons

to other neurons and they are essentially getting the electrical activations from those

other neurons. They are collecting them and once the inputs are greater than a certain

threshold then the neuron is activated and you typically have this zero or one response

so it's either activated or not and it doesn't matter how strong the actual activation is.
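This all-or-none behavior can be sketched in a few lines of Python; the weights, bias, and threshold below are made-up illustrative values, not anything from the slides.

```python
# Sketch of a single threshold unit (Rosenblatt-style perceptron):
# weighted sum of the inputs plus a bias, then an all-or-none step response.

def step(net, threshold=0.0):
    """All-or-none response: 1 if the net input exceeds the threshold, else 0."""
    return 1 if net > threshold else 0

def perceptron(x, w, b):
    """Inner product of inputs and weights plus bias, passed through the step."""
    net = sum(xi * wi for xi, wi in zip(x, w)) + b
    return step(net)

# The output does not depend on how far above threshold the net input is:
print(perceptron([1.0, 0.5], [0.4, 0.6], -0.5))   # net = 0.2  -> 1
print(perceptron([10.0, 5.0], [0.4, 0.6], -0.5))  # net = 6.5  -> 1 (same output)
print(perceptron([0.1, 0.1], [0.4, 0.6], -0.5))   # net = -0.4 -> 0
```

Note how the second call produces the same output as the first, even though its net input is much larger.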

If you are above the threshold then you have an output. If you're not there's simply no

output. Now you have these neurons and we don't talk about biological ones here but

we will talk about the mathematical ones based on the perceptron and then we can go ahead

and arrange them in layers and layers on top of each other and we essentially have some

input neurons where we simply have the input feature vector and some bias that we're here

indicating with one and this is then passed in a fully connected approach. So we are essentially

connecting everything with everything and we have then hidden layers and they're hidden

because we somehow cannot observe what is really happening with them. We can only observe

that if we have a given input sample and we know the weights then we can actually compute

what is happening there. If we don't have that then generally we don't see what is happening

but we only see the output at the very end and the output then is observable again and

we have typically a desired output and this desired output can then actually be used to

compare this to the output of the network which allows us then to construct a training

procedure. Note that we are not only doing sums of the input elements that are weighted

but what's also very important is this non-linearity. So we kind of need to model this all or none

response function and we've seen that Rosenblatt originally was using the step function. Of

course we could also use linear functions, but as we will see towards the end of this video, with linear functions everything would essentially collapse down

to a single big matrix multiplication. So actually in every layer if they are fully

connected then you're essentially computing a matrix multiplication of the activations

of the previous layer with the next one. So this can be modeled simply as a matrix. Now

what is not modeled typically as a matrix is the activation function and the activation

function is applied element-wise. The step function was Rosenblatt's original choice, but in later classical approaches the following two functions were very commonly used: the sigmoid function, i.e. the logistic function, and as an alternative the hyperbolic tangent, which has some advantages with respect to the optimization.
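The two classical activation functions just mentioned can be sketched in a few lines of Python; the sample net inputs are arbitrary illustrative values.

```python
import math

# The two classical activation functions, applied element-wise to net inputs.

def sigmoid(x):
    """Logistic function: maps the net input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    """Derivative of the logistic function, needed later for backpropagation."""
    s = sigmoid(x)
    return s * (1.0 - s)

# tanh maps to (-1, 1); its zero-centered output helps the optimization.
nets = [-2.0, 0.0, 2.0]
print([round(sigmoid(n), 3) for n in nets])    # [0.119, 0.5, 0.881]
print([round(math.tanh(n), 3) for n in nets])  # [-0.964, 0.0, 0.964]
```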

So we can now write down our units of these networks as essentially sums over the previous

layers. So we have some yi that is the output of the previous layer so we indicate this

here with l minus one. This is then multiplied with some wij and we also have this bias w0j

in the current layer l and this sum is essentially constructing the output already but the net

output is then also run through the activation function f and f is one of the choices that

we see above here so this is introducing the non-linearity that is then producing the output

of the layer l in the respective neuron. Now you want to be able to train this and this

is typically done using the back propagation algorithm. This is a supervised learning procedure

and backpropagation helps you to compute the gradients. So backpropagation is actually

not the learning algorithm itself but it's the way of computing the gradient. Here we

propose to use gradient descent, and we can essentially do that by updating the weights following the gradient, which we determine by taking the partial derivative of a loss function, here denoted as epsilon, with respect to all of the weights of that particular layer. You can see, if

you think in layers you will find that all the weights of the respective layer get very

similar update rules and therefore they can be summarized and we will do that in the following

couple of slides. So typically you choose an error function; here we choose the mean squared error. There are many other loss functions available, and if you want to see the details about other common loss functions I can really recommend the deep learning lecture. Here you see we have some target tk and then we compare it to the output of the last

layer and take the square of that and sum up essentially this for all of the outputs

and this gives us the mean square error. Now how can we compute the updates? Well we need

to compute the partial derivatives with respect to the weights that we want to update. We will start with the output layer, so here you essentially have the mean square error and we then compute

the partial derivative with respect to the actual last layer weights. If we want to do

that we see we have to apply the chain rule so we are computing the partial derivative

of the error with respect to the net and then the partial derivative of the net with respect

to the weight that we want to update. If we look at this in more detail we can already

see that the partial derivative of the net with respect to the weights is simply the

output of the previous layer. So this is the yj of the layer l minus one which is essentially

the layer before the last layer and then we introduce this delta k of l which is essentially

the so called sensitivity. Now the sensitivity we can look at in a little bit more detail

here we need to compute the partial derivative of the error with respect to the net and again

if we want to look at that we can apply the chain rule so we compute the partial derivative

of the error with respect to the last layer output, and then the partial derivative of the last layer output with respect to the net in that layer. Then you can see

that we can essentially write this up as the target value minus the last layer output, multiplied with f prime, which denotes the derivative of the activation function

of the net at the layer l. So this gives us the sensitivity in the last layer. So this

was fairly easy. Now let's look at what happens if we go one layer deeper, so we now go for the weights of the l minus one layer, the last hidden layer. If we want to

do that we essentially follow the same line and we take the mean square error and compute

the partial derivative with respect to the weights of the last hidden layer. Now we can

see that we apply again the chain rule so we compute the partial derivative of the mean

square error with respect to the last hidden layer outputs and then we can see that we

need to compute the partial derivative of the last hidden layer outputs with respect

to the net and again the net with respect to the weights. Well computing the partial

derivative of the net with respect to the weights will simply give us the input from

the previous layer so this is simply yi of the previous layer so this is l minus two

then again computing the partial derivative of the l minus one layer output with respect

to the net is going to give us the derivative of the activation function of the net and

we still have to compute now the partial derivative of the mean square error with respect to the

output of the l minus one layer. So we can see now that this partial derivative of the

mean square error with respect to the l minus one layer is a little bit more complex because

the mean square error is essentially the difference between all the outputs and the targets, taken to the power of two and summed up. So in order to compute the next step we apply

again the chain rule so we are computing the partial derivative with respect to the square

and we see this gives the first term so this is the sum over all of the activations in

the last layer compared to the target values and then we still have to multiply this with

the partial derivative of yk of the layer l with respect to the yj of the layer l minus

one. Now again we apply the chain rule here so this then gives us the partial derivative

of yk of layer l with respect to the net of layer l and then we still get the partial

derivative of the net at layer l with respect to the inputs that are created in the l minus

one layer in yj. Now we can again see that we can determine these partial derivatives.

This is again the derivative of the activation function evaluated at the net of layer l, and this is then multiplied with the next partial derivative, and again the partial derivative of the net with

respect to the previous layer is simply the weights. So if we write this up again as sensitivities

you can see that we essentially have the sum over all the sensitivities multiplied with

the respective weights that we get here as the mean square error back propagated to the

l minus one layer. Now we can use this term again and compute the final gradient update

and here we see then that we get for the sensitivity in any layer l it is simply the derivative

of the activation function of the net times the sum over all the weights multiplied with

the sensitivities of the next layer. So we can essentially then formulate this as a kind

of update rule, and the update, if we want to follow the gradient of the weights, is then given as eta times the sensitivities of that layer times the input of the previous layer.
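The node-level rules just derived can be sketched numerically. Below is a minimal Python sketch, assuming sigmoid activations and a tiny network with two hidden and two output units; all weights, activations, and targets are illustrative values, not anything from the slides.

```python
import math

# Numeric sketch of the sensitivity (delta) rules for a network with two
# hidden units (layer l-1) and two sigmoid output units (layer l).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(net):
    # f'(net) for the logistic function: f(net) * (1 - f(net))
    s = sigmoid(net)
    return s * (1.0 - s)

eta = 0.1                 # learning rate
y_hidden = [0.6, 0.4]     # outputs y_j of the hidden layer l-1
net_hidden = [0.2, -0.3]  # net inputs of the hidden units
w_out = [[0.5, -0.4],     # w_out[j][k]: weight from hidden unit j to output k
         [0.3, 0.8]]
t = [1.0, 0.0]            # desired targets t_k

# Forward: net and output of the last layer.
net_out = [sum(y_hidden[j] * w_out[j][k] for j in range(2)) for k in range(2)]
y_out = [sigmoid(n) for n in net_out]

# Output-layer sensitivity: delta_k = (t_k - y_k) * f'(net_k).
delta_out = [(t[k] - y_out[k]) * d_sigmoid(net_out[k]) for k in range(2)]

# Hidden-layer sensitivity: delta_j = f'(net_j) * sum_k w_jk * delta_k.
delta_hidden = [d_sigmoid(net_hidden[j])
                * sum(w_out[j][k] * delta_out[k] for k in range(2))
                for j in range(2)]

# Weight update for the last layer: delta_w_jk = eta * delta_k * y_j.
dw_out = [[eta * delta_out[k] * y_hidden[j] for k in range(2)]
          for j in range(2)]
```

The same recursion extends to deeper layers: delta_hidden simply takes the role of the "next layer" sensitivities one layer further down.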

So we kind of get this recursive formulation in order to compute those updates. Now this

is looking at all of the weights on node level and looking at this on node level is extremely

tedious and I think this is a bit hard to follow. This is why I also brought to you

the layer abstraction concept and in the layer abstraction concept we now simply express

the different layers as matrix multiplications and here in this academic example I'm also

skipping over the activation function. If you want to include the activation functions, then you also need to apply them element-wise after each matrix multiplication; this would then really give you a full net. Here we essentially have three matrices

multiplied to each other. Technically they could also be collapsed to a single matrix

but in order to show you the workflow of the derivations I will use this example. So if

you do that and go ahead then we also need our mean square error. This is now simply the squared L2 norm of the difference between the desired output and the input passed through the network.
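This loss and its gradients can be sketched with plain matrix operations. The following minimal Python sketch skips the activations, as in the academic example, and all matrices are illustrative values; it shows how the backward pass reuses the upstream error term layer by layer.

```python
# Three linear layers as matrix products, with gradients obtained by
# backpropagating through the chain x W1 W2 W3.

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

x = [[1.0, 2.0]]               # input as a 1x2 row vector
W1 = [[0.1, 0.2], [0.3, 0.4]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
W3 = [[1.0], [1.0]]
y = [[1.0]]                    # desired target, denoted y on the slide

# Forward pass, storing the intermediate layer outputs.
f1 = matmul(x, W1)
f2 = matmul(f1, W2)
f3 = matmul(f2, W3)

# Backward pass: reuse the upstream error term layer by layer.
g = sub(f3, y)                  # dL/df3 (up to the factor 2 of the squared norm)
dW3 = matmul(transpose(f2), g)  # dL/dW3 = f2^T g
g = matmul(g, transpose(W3))    # backpropagate: dL/df2 = g W3^T
dW2 = matmul(transpose(f1), g)  # dL/dW2 = f1^T g
g = matmul(g, transpose(W2))    # dL/df1 = g W2^T
dW1 = matmul(transpose(x), g)   # dL/dW1 = x^T g
```

Each further layer multiplies the error term with one more transposed weight matrix, which is exactly the recursive backpropagation described above.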

So this is essentially x times w1 times w2 times w3 minus the desired output, which we denote here as y. So this is exactly the mean square error. Note that

I'm mixing a bit the notation but of course this is an advanced class and I think you

can now understand that the y is no longer the output of the net but here it is the desired

target. So if we now go ahead we need to compute the gradients. I'm not going to work through this example and really compute all of the gradients, but what I want to show you is the general

sketch how the back propagation is then actually applied. So if you want to compute the gradients

then the first thing that you do is you compute the forward pass through the entire network

and then you need to compute the updates with respect to the different layers. So of this

loss function here you compute the partial derivative of the loss function with respect

to w3 w2 and w1. Now w3 is the last layer so what you get in terms of update for w3

is the partial derivative of the loss function with respect to f3 of x. So this is essentially

the output of the net and then you multiply this with the partial derivative of f3 with

respect to w3 and I'm writing the actual computations down here so that you can actually follow

this gradient and now if I go back one layer then I'm essentially reusing the result of

the previous layer partially because if I want to backpropagate for another step then

I essentially need for each of those layers the partial derivative with respect to the

inputs and the partial derivative with respect to the weights. So in the w2 layer this is

then going to be the partial derivative of f3 with respect to f2 and this is simply going

to be w3 transpose and then you also need the partial derivative of f2 with respect

to w2 which is then the input from the previous layer transposed and now if I want to compute

the complete gradient I'm essentially multiplying the partial derivatives following this red

path here and then I get the gradient update for that particular weight. So the deeper

I go the more often I have to multiply with the partial derivatives that are needed to

backpropagate to that particular position and if I want to go to the very first layer

you see that I have in total four multiplicative steps in order to compute the update for the

first layer. We have more of this example in the deep learning lecture so there we go

into more detail also about backpropagation and how to compute the individual steps but

I think this figure here and the node wise computation is sufficient at this point and

you see now that we actually have a physiological background for neural networks. We didn't

talk about the details of the neurons, the synapses, and the action potentials, but

I will give you references if you're interested in the biological background and this then

gave rise to a topology of multi-layer perceptrons that we arrange in layers and of course in

every layer we apply neuron wise activation functions and if we then want to compute updates

we can use the back propagation algorithm in order to compute the actual gradients and

then we can try gradient descent methods in order to actually perform the training. So

in the next video we want to talk a bit about optimization strategies in more detail and

we will look into some ideas how we can perform optimization and I still have some further

reading for you. In particular, if you're interested in the physiology I can recommend these two references; they are very nice if you want to get some understanding of the actual biology that is associated with biological neural networks. So I hope you liked this small video

and I'm looking forward to seeing you in the next video. Thank you very much for watching

and bye bye.

Part of a video series
Accessible via: Open access
Duration: 00:19:20 min
Recording date: 2020-11-08
Uploaded on: 2020-11-08 13:47:16
Language: en-US

In this video, we have a short introduction to the multi-layer perceptron.

This video is released under CC BY 4.0. Please feel free to share and reuse.


Music Reference: Damiano Baldoni - Thinking of You
