9 - Deep Learning - Plain Version 2020 [ID:21051]

Welcome back, everybody, to Deep Learning. Today we want to talk about feed-forward networks again, and in this fourth part the main focus will be on layer abstraction. So far we have talked about individual neurons and nodes, but this gets really complex for larger networks. Therefore we want to introduce the layering concept into our computation of the gradients as well. This is really useful because we can then talk directly about gradients of entire layers and don't need to go through all of the individual nodes.

So how do we express this? Let's recall what a single neuron is doing. The single neuron essentially computes an inner product of its weights with the input. By the way, we are not writing the bias explicitly in this notation: we extend the input vector by an additional element that is set to one. This allows us to absorb the bias into the inner product, as shown on the slide. This is really nice because then you can see that the output ŷ is just an inner product.
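The slide itself is not reproduced in this transcript, but written out, the bias trick described here amounts to something like the following (x and w are the original input and weight vectors, b is the bias; the tilde notation is introduced here only for illustration):

\[
\hat{y} = w^{T} x + b = \tilde{w}^{T} \tilde{x},
\qquad
\tilde{x} = \begin{pmatrix} x \\ 1 \end{pmatrix},
\qquad
\tilde{w} = \begin{pmatrix} w \\ b \end{pmatrix}
\]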

Now think about the case where we have m neurons, which means that we get outputs ŷ₁ up to ŷ_m, and each of them is an inner product. If you bring this into vector notation, you can see that the vector ŷ is nothing other than the matrix W multiplied with x. So a fully connected layer is nothing other than a matrix multiplication.
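Spelled out, this is the matrix form the lecture refers to; the stacking notation below is mine, not from the slide, with the rows of W being the individual weight vectors:

\[
\hat{y} =
\begin{pmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_m \end{pmatrix}
=
\begin{pmatrix} w_1^{T} x \\ \vdots \\ w_m^{T} x \end{pmatrix}
=
\begin{pmatrix} w_1^{T} \\ \vdots \\ w_m^{T} \end{pmatrix} x
= W x
\]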

With this fully connected layer we can essentially represent arbitrary connections and topologies. We then apply a point-wise non-linearity to obtain the non-linear effect. The nice thing about the matrix notation is, of course, that we can now describe the derivatives of the entire layer using matrix calculus.
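As a small illustration of this layer view, here is a minimal NumPy sketch of the forward pass. The layer sizes, the random weights, and the choice of a sigmoid as the point-wise non-linearity are assumptions for the example, not taken from the lecture:

import numpy as np

def sigmoid(z):
    # point-wise non-linearity applied to every element of z (example choice)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])   # input vector (bias already folded in if desired)
W = np.random.randn(2, 3)       # one weight vector per neuron: 2 neurons, 3 inputs
y_hat = sigmoid(W @ x)          # fully connected layer = matrix multiplication + non-linearity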

For our fully connected layer, we then get the following configuration: three elements for the input and a weight vector for every neuron. Let's say we have two neurons; then we get two weight vectors, and we multiply them with x. In the forward pass, we determine ŷ for the entire module with a single matrix multiplication. If we want to compute the gradients, we need exactly two partial derivatives, the same ones we already mentioned: the partial derivative with respect to the weights W and the partial derivative with respect to the input x.
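In other words, with a loss L arriving from the layer above, the two quantities the layer has to provide follow from the chain rule (this decomposition is standard and not spelled out explicitly in the transcript):

\[
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W},
\qquad
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial x}
\]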

So how do we compute this? Well, our layer is ŷ = Wx, so there is a matrix multiplication in the forward pass. First we need the derivative with respect to the weights, which means we essentially have to compute a matrix derivative. The derivative of ŷ with respect to W is simply xᵀ. So if a loss gradient comes into our module, the update to our weights is this loss vector multiplied with xᵀ. We have a loss vector and xᵀ, which essentially means we have an outer product: one is a column vector and the other one is a row vector because of the transpose. If you multiply the two, you end up with a matrix, so the partial derivative with respect to W always results in a matrix.

For the bottom row, we need the partial derivative of ŷ with respect to x. This is also something you can find in the Matrix Cookbook, by the way; it is a very useful resource, I'll provide the reference in the description of this video, and you will find all kinds of matrix derivatives in it. If you do that, you can see that for the above equation the partial derivative with respect to x is Wᵀ. Now you multiply Wᵀ with the incoming loss vector, and a matrix times a vector is a vector again. This is the vector that you need to pass on to the next layer in the backpropagation process.
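Putting the forward and backward pass of this fully connected layer together in NumPy; the concrete numbers and the incoming loss gradient are made up for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])          # input: 3 elements
W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.0, -0.1]])      # 2 neurons, one weight vector per row

y_hat = W @ x                          # forward pass: matrix multiplication

dL_dy = np.array([0.5, -1.0])          # loss gradient coming into the module (assumed values)
grad_W = np.outer(dL_dy, x)            # loss column vector times x^T -> 2x3 matrix for the weight update
grad_x = W.T @ dL_dy                   # W^T times the loss vector -> vector passed on in backpropagation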

Okay, let's look at some examples: a simple example first and then a multi-layer example afterwards. The simple example uses the same network as before, a network without any non-linearity, ŷ = Wx. Now we need a loss function, and here we don't take the cross entropy but the L2 loss, which is based on a very common vector norm. It simply takes the output of the network, subtracts the desired output, and computes the L2 norm of the difference.
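A small NumPy sketch of this L2 loss; the factor 1/2 and the squared norm are a common convention that gives a clean gradient (the transcript itself only says "computes the L2 norm"), and the numbers are made up:

import numpy as np

y_hat = np.array([0.2, 0.7])           # network output W x (example values)
y = np.array([0.0, 1.0])               # desired output
L = 0.5 * np.sum((y_hat - y) ** 2)     # L2 loss: squared norm of the difference, scaled by 1/2
dL_dy = y_hat - y                      # gradient of the loss w.r.t. the network output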

Part of a video series

Access: Open Access

Duration: 00:17:02 min

Recording date: 2020-10-10

Uploaded: 2020-10-10 09:26:19

Language: en-US

Deep Learning - Feedforward Networks Part 4

This video explains backpropagation at the level of layer abstraction.


Further Reading:
A gentle Introduction to Deep Learning 

