Welcome, everybody, to Pattern Recognition. Today we want to look into multilayer perceptrons, which are also called neural networks. We will give a brief sketch of the ideas behind neural networks.
Okay, so let's have a look at multilayer perceptrons. Note that we will only talk about very basic concepts here. If you are interested in neural networks, we have an entire class on deep learning where we discuss all the details, so here we will stay rather on the surface. You may know that neural networks are extremely popular, also because they have
this physiological motivation. We have seen that the perceptron essentially computes a weighted sum of its inputs, an inner product plus a bias, and you could say that this has some relation to neurons: neurons are connected via axons to other neurons, and they essentially collect the electrical activations coming from those other neurons. Once the collected inputs exceed a certain threshold, the neuron is activated, and you typically get this zero-or-one response, so it is either activated or not, and it does not matter how strong the actual activation is.
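As a small sketch of such an all-or-none unit (the weights, bias, and inputs here are made up purely for illustration):

```python
import numpy as np

def perceptron_unit(x, w, bias):
    """A single perceptron-style unit: inner product of inputs and
    weights plus a bias, followed by a hard threshold, so the
    response is all-or-none (0 or 1)."""
    activation = np.dot(w, x) + bias
    return 1 if activation > 0 else 0

# made-up weights and inputs, purely for illustration
w = np.array([0.5, -0.3, 0.8])
bias = -0.2
x_strong = np.array([1.0, 0.0, 1.0])  # net input above the threshold
x_weak = np.array([0.1, 1.0, 0.0])    # net input below the threshold
```

Note that scaling an above-threshold input up further does not change the output; only crossing the threshold matters.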
If you are above the threshold, then you have an output; if you are not, there is simply no output. Now we have these neurons, and we will not talk about the biological ones here, but
we will talk about the mathematical ones, based on the perceptron. We can then go ahead and arrange them in layers stacked on top of each other. We have some input neurons that simply hold the input feature vector, plus a bias that we indicate here with a one, and this is passed on in a fully connected fashion, so we essentially connect everything with everything. Then we have hidden layers, and they are called hidden because we cannot directly observe what is happening inside them. If we have a given input sample and we know the weights, we can of course compute what happens there, but otherwise we only see the output at the very end. The output is observable again, and we typically have a desired output, which can be compared to the output of the network; this comparison allows us to construct a training procedure. Note that we are not only computing weighted sums of the input elements,
but what is also very important is this non-linearity. We somehow need to model this all-or-none response, and we have seen that Rosenblatt originally used the step function. Of course, we could also use linear activation functions, but if we did, we will see towards the end of this video that everything would essentially collapse into a single big matrix multiplication. In every fully connected layer you are essentially computing a matrix multiplication of the activations of the previous layer with the weights of the next one, so this part can be modeled simply as a matrix. What is typically not modeled as a matrix is the activation function, which is applied element-wise. The step function was Rosenblatt's approach, but in later, classical approaches the following two functions were very commonly used: the sigmoid, i.e. the logistic function, and as an alternative the hyperbolic tangent, which has some advantages with respect to the optimization.
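A minimal sketch of these classical activation functions and their derivatives (the function names are my own choices, not notation from the slides):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, convenient during backpropagation
    because it reuses the forward value: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    """Derivative of the hyperbolic tangent (np.tanh is the activation)."""
    return 1.0 - np.tanh(x) ** 2

def step(x):
    """Rosenblatt's original step function (not differentiable at 0,
    which is why the smooth alternatives are preferred for training)."""
    return np.where(x > 0, 1.0, 0.0)
```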
So we can now write down the units of these networks essentially as sums over the previous layer. We have some y_i, the output of the previous layer, which we indicate here with the superscript l minus one. This is multiplied with some weight w_ij, and we also have the bias w_0j in the current layer l. This sum essentially constructs the net output, which is then also run through the activation function f, where f is one of the choices we see above; this introduces the non-linearity that then produces the output of layer l in the respective neuron. Now we want to be able to train this, and this
is typically done using the backpropagation algorithm. This is a supervised learning procedure, and backpropagation helps you to compute the gradients. So backpropagation is actually not the learning algorithm itself; it is the way of computing the gradient. Here we use gradient descent, and we can essentially do that by updating the weights following the gradient with respect to the weights. We determine this via the partial derivative of a loss function, here denoted as epsilon, taken with respect to all the weights of that particular layer. If you think in layers, you will find that all the weights of the respective layer get very similar update rules, so they can be summarized, and we will do that in the following couple of slides. Typically you choose an error function; here we choose the mean squared error, but many other loss functions are also available. If you want to see all the details about other common loss functions, I can really recommend the lecture Deep Learning. Here you see we have some target t_k; we compare it to the output of the last layer, take the square of the difference, and essentially sum this up over all of the outputs, which gives us the mean squared error. Now how can we compute the updates? Well, we need
to compute the partial derivatives with respect to the weights that we want to update. We start with the output layer: we take the mean squared error and compute the partial derivative with respect to the weights of the last layer. If we want to do that, we have to apply the chain rule, so we compute the partial derivative of the error with respect to the net, and then the partial derivative of the net with respect to the weight that we want to update. Looking at this in more detail, we can already see that the partial derivative of the net with respect to the weights is simply the output of the previous layer, so this is the y_j of layer l minus one, which is the layer before the last layer. We then introduce this delta_k of layer l, the so-called sensitivity. The sensitivity we can look at in a little more detail: here we need to compute the partial derivative of the error with respect to the net, and again we apply the chain rule, so we compute the partial derivative of the error with respect to the last layer's output, and then the partial derivative of the last layer's output with respect to the net in that layer. You can then see that we can write this as the target value minus the last layer's output, multiplied by f prime, the derivative of the activation function, evaluated at the net of layer l. This gives us the sensitivity in the last layer. So this
was fairly easy. Now let's look at what happens if we go one layer deeper, to the weights of the l minus one layer, the last hidden layer. We essentially follow the same line: we take the mean squared error and compute the partial derivative with respect to the weights of the last hidden layer. Again we apply the chain rule: we compute the partial derivative of the mean squared error with respect to the last hidden layer's outputs, then the partial derivative of those outputs with respect to the net, and again the net with respect to the weights. Computing the partial derivative of the net with respect to the weights simply gives the input from the previous layer, so this is y_i of layer l minus two. Computing the partial derivative of the l minus one layer's output with respect to its net gives the derivative of the activation function at that net. What remains is the partial derivative of the mean squared error with respect to the output of the l minus one layer. This term is a little more complex, because the mean squared error is essentially the sum of the squared differences between all the outputs and the targets. To compute the next step, we apply the chain rule again: differentiating the square gives the first term, the sum over all the activations in the last layer compared to the target values, and we still have to multiply this by the partial derivative of y_k of layer l with respect to y_j of layer l minus one. Applying the chain rule once more, this gives the partial derivative of y_k of layer l with respect to the net of layer l, times the partial derivative of the net at layer l with respect to the input y_j produced in the l minus one layer. Now we can again determine these partial derivatives.
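Since this chain is easy to lose track of in spoken form, here is the same computation written out compactly. The superscript and index conventions are my reading of the spoken description, and, as in the lecture, the factor of two and the sign from differentiating the square are absorbed into the sensitivities:

```latex
\frac{\partial \varepsilon}{\partial y_j^{(l-1)}}
= \sum_k \frac{\partial \varepsilon}{\partial y_k^{(l)}}
         \frac{\partial y_k^{(l)}}{\partial \mathrm{net}_k^{(l)}}
         \frac{\partial \mathrm{net}_k^{(l)}}{\partial y_j^{(l-1)}}
= \sum_k \underbrace{\big(t_k - y_k^{(l)}\big)\,
                     f'\big(\mathrm{net}_k^{(l)}\big)}_{\delta_k^{(l)}}\,
         w_{jk}^{(l)}
```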
This is again the derivative of the activation function, evaluated at the net of layer l, and it is multiplied by the next partial derivative; again, the partial derivative of the net with respect to the previous layer's output is simply the weights. If we write this up again in terms of sensitivities, you can see that we essentially get the sum over all sensitivities multiplied by the respective weights, which is the mean squared error backpropagated to the l minus one layer. We can use this term again to compute the final gradient update, and we see that the sensitivity in any layer l is simply the derivative of the activation function at the net, times the sum over all weights multiplied by the sensitivities of the next layer. We can then formulate this as a kind of update rule: the gradient step for the weights is given as eta times the sensitivities of that layer times the input of the previous layer. So we get a recursive formulation for computing those updates. Now, all of this looks at the weights on the node level, which is extremely tedious and, I think, a bit hard to follow. This is why I also brought to you
the layer abstraction concept, where we simply express the different layers as matrix multiplications. In this academic example I am also skipping the activation functions; if you wanted to include them, you would apply them element-wise after each matrix multiplication, and this would then really give you a full net. Here we essentially have three matrices multiplied with each other. Technically they could also be collapsed into a single matrix, but in order to show you the workflow of the derivations I will use this example. If we go ahead, we also need our mean squared error. This is now simply the squared L2 norm of the difference between the desired output and the respective input passed through the network.
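As a hedged sketch of this setup (the shapes, variable names, and random toy data are my own assumptions, and the activation functions are omitted exactly as in the academic example), the forward pass and the backpropagated gradients of this loss can be written in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy shapes: input x is 1x4, three linear layers, desired output y is 1x2
x = rng.standard_normal((1, 4))
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((5, 3))
W3 = rng.standard_normal((3, 2))
y = rng.standard_normal((1, 2))

# forward pass: keep the intermediate activations for the backward pass
f1 = x @ W1                    # output of layer 1
f2 = f1 @ W2                   # output of layer 2
f3 = f2 @ W3                   # output of layer 3 (the network output)
loss = np.sum((f3 - y) ** 2)   # squared L2 norm

# backward pass: each step reuses the partial result of the layer above
d_f3 = 2.0 * (f3 - y)          # d loss / d f3
g_W3 = f2.T @ d_f3             # d loss / d W3
d_f2 = d_f3 @ W3.T             # backpropagate through layer 3
g_W2 = f1.T @ d_f2             # d loss / d W2
d_f1 = d_f2 @ W2.T             # backpropagate through layer 2
g_W1 = x.T @ d_f1              # d loss / d W1
```

Note how the gradient for each weight matrix combines the transposed input from the layer below with the error signal propagated down from above; the deeper the layer, the more factors the error signal has accumulated.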
So this is essentially x times W1 times W2 times W3, minus the desired output that we denote here as y; this is exactly the mean squared error. Note that
I am mixing the notation a bit here, but this is an advanced class, and I think you can see that y is no longer the output of the net; here it is the desired target. If we now go ahead, we need to compute the gradients. I will not work through this example and really compute all of the gradients; what I want to show you is the general sketch of how backpropagation is actually applied. If you want to compute the gradients, the first thing you do is compute the forward pass through the entire network, and then you compute the updates with respect to the different layers: of this loss function you compute the partial derivative with respect to W3, W2, and W1. Now W3 is the last layer, so what you get as the update for W3 is the partial derivative of the loss function with respect to f3 of x, which is essentially the output of the net, multiplied by the partial derivative of f3 with respect to W3. I am writing the actual computations down here so that you can actually follow
this gradient. Now, if I go back one layer, I am partially reusing the result of the previous layer, because if I want to backpropagate for another step, I need, for each of those layers, the partial derivative with respect to the inputs and the partial derivative with respect to the weights. In the W2 layer, the partial derivative of f3 with respect to f2 is simply going to be W3 transpose, and you also need the partial derivative of f2 with respect to W2, which is the input from the previous layer, transposed. If I want to compute the complete gradient, I essentially multiply the partial derivatives following this red path, and then I get the gradient update for that particular weight. So the deeper I go, the more often I have to multiply by the partial derivatives needed to backpropagate to that particular position; if I want to go to the very first layer, you see that I have in total four multiplicative steps to compute its update. We have more of this example in the Deep Learning lecture, where we go into more detail about backpropagation and how to compute the individual steps, but I think this figure and the node-wise computation are sufficient at this point, and
you see now that we actually have a physiological background for neural networks. We did not talk about the details of the neurons, the synapses, and the action potentials, but I will give you references if you are interested in the biological background. This then gave rise to the topology of multilayer perceptrons, which we arrange in layers; in every layer we apply neuron-wise activation functions, and if we want to compute updates, we can use the backpropagation algorithm to compute the actual gradients and then try gradient descent methods to actually perform the training. In the next video we want to talk a bit more about optimization strategies, and we will look into some ideas for how to perform the optimization. I also have some further reading for you: in particular, if you are interested in the physiology, I can recommend these two references; they are all very nice if you want to get some understanding of the actual biology associated with biological neural networks. So I hope you liked this small video, and I am looking forward to seeing you in the next one. Thank you very much for watching, and bye-bye.
Presenters
Accessible via: Open Access
Duration: 00:19:20 min
Recording date: 2020-11-08
Uploaded: 2020-11-08 13:47:16
Language: en-US
In this video, we have a short introduction to the multi-layer perceptron.
This video is released under CC BY 4.0. Please feel free to share and reuse.
Music Reference: Damiano Baldoni - Thinking of You