Welcome back everybody to Deep Learning. Today we want to talk again about feed-forward networks, and in this fourth part the main focus will be on layer abstraction. So far we have talked about individual neurons and nodes, but this gets really complex for larger networks. We therefore want to introduce the layering concept also into our computation of the gradients. This is really useful because we can then talk directly about gradients of entire layers and don't have to go through every individual node. So how do we express this?
Let's recall what a single neuron is doing. A single neuron essentially computes an inner product between its weights and its inputs. By the way, in this notation we do not write the bias separately: we extend the input vector by an additional element that is fixed to one. This allows us to absorb the bias into the inner product, as shown on the slide. This is really nice, because then you can see that the output ŷ is nothing more than an inner product ŷ = wᵀx.
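To make this bias trick concrete, here is a minimal NumPy sketch (the numbers and variable names are my own, not from the slides): appending a constant one to the input lets the bias act as just another weight, and the neuron output becomes a single inner product.

    import numpy as np

    x = np.array([0.5, -1.2, 3.0])       # original input
    w = np.array([0.2, 0.4, -0.1])       # weights
    b = 0.7                              # bias

    # Bias trick: append a constant 1 to x and the bias to w.
    x_aug = np.append(x, 1.0)            # [x, 1]
    w_aug = np.append(w, b)              # [w, b]

    y_hat = w_aug @ x_aug                # single inner product
    assert np.isclose(y_hat, w @ x + b)  # identical to the explicit bias form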
Now think about the case that we have m neurons, which means that we get outputs ŷ_1 up to ŷ_m, and all of them are inner products. If you bring this into vector notation, you can see that the vector ŷ is nothing other than the matrix W multiplied with x: a fully connected layer is simply a matrix multiplication. So we can essentially represent arbitrary connections and topologies using this fully connected layer. On top of that we can apply a point-wise non-linearity to get the non-linear effect. The nice thing about the matrix notation is, of course, that we can now describe the derivatives of the entire layer using matrix calculus.
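As a small illustration of this (again just a sketch with made-up dimensions), stacking the weight vectors of m neurons as the rows of a matrix W turns the entire layer into one matrix-vector product, followed by a point-wise non-linearity:

    import numpy as np

    m, n = 4, 3                               # 4 neurons, 3 inputs
    rng = np.random.default_rng(0)
    W = rng.normal(size=(m, n + 1))           # one weight row per neuron (incl. bias column)
    x = np.append(rng.normal(size=n), 1.0)    # augmented input [x, 1]

    y_hat = W @ x                             # all m inner products at once
    assert np.allclose(y_hat, [W[i] @ x for i in range(m)])  # same as per-neuron products

    a = np.tanh(y_hat)                        # point-wise non-linearity, e.g. tanh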
So for our fully connected layer, we would then get the following configuration: three elements for the input and one weight vector for every neuron. Let's say we have two neurons; then we get two weight vectors, which we stack into a matrix and multiply with x. In the forward pass, we determine ŷ for the entire module with this one matrix multiplication. If we want to compute the gradients, we need exactly two partial derivatives, the same ones we already mentioned: the partial derivative with respect to the weights W, which we need for the weight update, and the partial derivative with respect to the inputs x, which we need to pass the error on to the preceding layer.
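For this configuration the shapes already tell most of the story; the following short sketch (placeholder numbers of my own) only records the dimensions of the quantities we are about to derive:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])   # three inputs
    W = np.ones((2, 3))             # two neurons, so two weight rows
    y_hat = W @ x                   # forward pass, shape (2,)

    # In the backward pass we will need:
    #   dL/dW with the same shape as W, i.e. (2, 3)
    #   dL/dx with the same shape as x, i.e. (3,)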
So how do we compute this? Well, our layer is ŷ = Wx, so there is a matrix multiplication in the forward pass. First we need the derivative with respect to the weights, which means we have to compute a matrix derivative. The derivative of ŷ with respect to W is simply x transpose. So if a loss vector comes into our module from the layer above, the update to our weights is this loss vector multiplied with x transpose. A loss vector times x transpose is essentially an outer product: one is a column vector and the other one is a row vector because of the transpose, and if you multiply the two, you end up with a matrix. So the partial derivative with respect to W will always result in a matrix.
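A quick numerical check of this outer-product structure (a sketch; e stands for the loss vector arriving from the layer above):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])   # input of the layer
    e = np.array([0.5, -1.0])       # incoming loss vector, one entry per neuron

    dL_dW = np.outer(e, x)          # e times x transpose: column vector times row vector
    print(dL_dW.shape)              # (2, 3), a matrix with the same shape as W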
If you look at the bottom row, you need the partial derivative of ŷ with respect to x. This is also something you can find in the Matrix Cookbook, by the way; it is a very useful tool, I'll provide the reference in the description of this video, and you will find all kinds of matrix derivatives in it. If you do that, you can see that for the above equation the partial derivative with respect to x is W transpose. Now you have W transpose multiplied again with the loss vector, and a matrix times a vector is a vector again. This is the vector that you pass on to the preceding layer in the backpropagation process.
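Putting the two derivatives together, a minimal fully connected layer with forward and backward pass could look like the following sketch (the wrapping into a class is my own choice; the lecture only states the two formulas):

    import numpy as np

    class FullyConnected:
        # Forward: y_hat = W x.  Backward: dL/dW = e x^T, dL/dx = W^T e.
        def __init__(self, W):
            self.W = W

        def forward(self, x):
            self.x = x                        # cache the input for the backward pass
            return self.W @ x

        def backward(self, e):                # e = loss vector from the layer above
            self.dW = np.outer(e, self.x)     # gradient used to update the weights
            return self.W.T @ e               # gradient passed to the preceding layer

    layer = FullyConnected(np.array([[1.0, 0.0, -1.0],
                                     [2.0, 1.0,  0.5]]))
    y_hat = layer.forward(np.array([1.0, 2.0, 3.0]))
    dx = layer.backward(np.array([0.5, -1.0]))   # a vector of length 3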
Okay, let's look at an example. We start with a simple example and then move on to a multi-layer example. The simple example is the same network as before: a network without any non-linearity, just ŷ = Wx. Now we need a loss function, and here we don't take the cross entropy but the L2 loss, which is based on a very common vector norm. It simply takes the output of the network, subtracts the desired output, and computes the L2 norm of that difference.
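To see the whole chain once in code, here is a sketch of this simple example; I use the common convention L = ½‖Wx − y‖², whose gradient with respect to the network output is simply ŷ − y (the exact scaling used on the slide may differ):

    import numpy as np

    W = np.array([[1.0, -0.5],
                  [0.3,  2.0]])
    x = np.array([1.0, 2.0])
    y = np.array([0.0, 3.0])                 # desired output

    y_hat = W @ x                            # linear network, no non-linearity
    loss = 0.5 * np.sum((y_hat - y) ** 2)    # L2 loss

    e = y_hat - y                            # loss vector dL/dy_hat
    dL_dW = np.outer(e, x)                   # gradient for the weight update
    dL_dx = W.T @ e                          # gradient passed further back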
Deep Learning - Feedforward Networks Part 4
This video explains backpropagation at the level of layer abstraction.
Further Reading:
A gentle Introduction to Deep Learning