Welcome back to deep learning and today we want to talk a little bit more about recurrent neural networks and in particular look into the training procedure.
Recurrent neural networks? You can write them down in five lines of pseudocode.
Okay, so how does RNN training work? Let's look at a simple example and we start with a character level language model.
So we want to learn a character probability distribution from an input text and our vocabulary is going to be very easy.
It's going to be the letters H, E, L and O and we'll encode them as one-hot vectors which then gives us for example for H the vector 1 0 0 0.
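As a small illustration (the variable and function names here are my own sketch, not from the lecture), the one-hot encoding for this four-letter vocabulary could be set up like this:

```python
# Minimal sketch of one-hot encoding for the vocabulary {h, e, l, o}.
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_index = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    """Return the one-hot vector for a single character."""
    v = np.zeros(len(vocab))
    v[char_to_index[char]] = 1.0
    return v

print(one_hot('h'))  # [1. 0. 0. 0.]
```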
Now we can go ahead and train an RNN on the sequence "hello" and we should learn that, given "h" as the first input, the network should generate the sequence "hello".
Now the network needs to know previous inputs when presented with an L because it needs to know whether it needs to generate an L or an O.
It's the same input but two different outputs so you have to know the context.
Let's look at this example and here you can already see how the decoding takes place.
So we put the inputs into the input layer, again as one-hot encoded vectors, then we produce the hidden state h_t with the matrices that we've seen previously, and produce the outputs.
And you can see now we feed in the different letters and this then produces some outputs that can then be mapped via the one-hot encoding back to letters.
So this gives us essentially the possibility to run over the entire sequence and produce the desired outputs. Now, for the training, the problem is how we can determine all of these weights.
And of course we want to optimize these weights with respect to predicting the correct output.
And this all can be achieved with the backpropagation through time algorithm, and the idea is that we train on the unfolded network.
So here's a short sketch on how to do this.
So the idea is that we unfold the network so we compute the forward pass for the full sequence and then we can apply the loss.
So we essentially then back propagate over the entire sequence such that even things that happen in the very last state can have influence on the very beginning.
So we compute the backward pass through the full sequence to get the gradients and then the weight update.
So for one update with backpropagation through time, I have to unroll the complete network that is generated by the input sequence; then I can compare the output that was created with the desired output and compute the update.
So let's look at this in a little bit more detail.
The forward pass is of course just the computation of the hidden states and the output.
So we know that we have some input sequence x_1 to x_T, where T is the sequence length.
Now we iteratively compute u_t, which is the linear part before the respective activation function, and then apply the activation function to get our new hidden state h_t.
Then we compute o_t, which is essentially the linear part before the sigmoid function, and then we apply the sigmoid to produce ŷ_t, which is essentially the output of our network.
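A minimal sketch of this unrolled forward pass in NumPy, assuming tanh as the hidden activation and a sigmoid output as in the lecture (the function and matrix names W_xh, W_hh, W_hy are illustrative, not the lecture's reference code):

```python
# Sketch of the RNN forward pass over a full sequence, caching the
# hidden states and outputs so they can be reused for BPTT later.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xs, W_xh, W_hh, W_hy, b_h, b_y, h0):
    """Run the RNN over the whole input sequence xs."""
    hs, ys = {-1: h0}, {}
    for t, x in enumerate(xs):
        u = W_xh @ x + W_hh @ hs[t - 1] + b_h  # linear part before activation
        hs[t] = np.tanh(u)                     # new hidden state h_t
        o = W_hy @ hs[t] + b_y                 # linear part before the sigmoid
        ys[t] = sigmoid(o)                     # output y hat at time t
    return hs, ys
```

Calling `forward` with one-hot inputs for "h", "e", "l", "l" then yields one output vector per time step, which can be compared against the desired next characters.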
If we do so then we can unroll the entire network and produce all of the respective information that we need to then actually compute the update for the weights.
So you see, you really can write this down in five lines of pseudocode.
Now, backpropagation through time essentially starts from a loss function, and this loss function sums up the losses that we already know from our previous lectures over the actual observations at every time step t.
So we can, for example, take the cross entropy, where we compare the predicted output with the ground truth, and then we compute the gradient of the loss function in a similar way as we already know it, in order to get the parameter update for our parameter vector theta, which is composed of the three weight matrices, the two bias vectors, and the hidden state vector h.
So the update of the parameters can then also be done using a learning rate in a very similar way as we have been doing this throughout the entire class.
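As a small illustration (the function names and the elementwise cross entropy formulation are my own sketch, not the lecture's notation), the summed sequence loss and the plain gradient-descent update could look like this:

```python
# Sketch: cross entropy summed over all time steps, plus one
# gradient-descent update step with learning rate eta.
import numpy as np

def sequence_loss(ys_hat, ys_true, eps=1e-12):
    """Sum the per-step cross entropies over the whole sequence."""
    return sum(-np.sum(y * np.log(y_hat + eps))
               for y_hat, y in zip(ys_hat, ys_true))

def sgd_step(theta, grad, eta=0.1):
    """One update for any parameter matrix or vector in theta."""
    return theta - eta * grad
```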
Now the question is of course how do we get those derivatives and the idea is now to go back in time through the entire network.
So what do we do? Well, we start at time t = T and then iteratively compute the gradients going down to t = 1.
So just keep in mind that our ŷ_t was produced by the sigmoid of o_t, which is composed of the output weight matrix and the bias. So if we want to compute the partial derivative with respect to o_t, then we need the derivative of the sigmoid function at o_t times the partial derivative of the loss function with respect to ŷ_t.
Now you can see that the gradient with respect to W_hy is given as the gradient of o_t times h_t transpose, and the gradient with respect to the bias is simply given as the gradient of o_t.
Okay, so the gradient of h_t now depends on two elements: the output o_t, which is influenced by the hidden state, and the next hidden state h_{t+1}.
So we can get the gradient of h_t as the partial derivative of h_{t+1} with respect to h_t, transposed, times the gradient of h_{t+1}, and then we still have to add the partial derivative of o_t with respect to h_t, transposed, times the gradient of o_t.
And this can then be expressed as the weight matrix W_hh transpose times the derivative of the tangens hyperbolicus, evaluated at W_hh h_t + W_xh x_{t+1} + b_h, multiplied elementwise with the gradient of h_{t+1}, plus W_hy transpose times the gradient of o_t.
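Written out as one formula (a reconstruction from the spoken description; the elementwise tanh derivative is collected in a diagonal matrix so that everything becomes a matrix product):

```latex
\nabla_{h_t} L
= W_{hh}^{\top}\,\operatorname{diag}\!\left(\tanh'\!\left(W_{hh} h_t + W_{xh} x_{t+1} + b_h\right)\right) \nabla_{h_{t+1}} L
+ W_{hy}^{\top}\,\nabla_{o_t} L
```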
So you can see that we can express this gradient entirely with matrix operations, and now you already have all the updates related to the hidden state.
Now we also want to compute the updates for the other weight matrices.
So let's see how this goes. We have now established the way of computing the derivative with respect to our h_t.
So now we can already propagate through time.
So for each t we essentially get one element in the sum.
And because we can compute the gradient ht we can now get the remaining gradients.
And in order to compute h_t, you see that we need the tangens hyperbolicus of u_t, which contains the remaining weight matrices.
So we essentially get the derivatives with respect to the two missing matrices and the bias by multiplying the gradient of h_t with the derivative of the tangens hyperbolicus at u_t.
And then, depending on which matrix you want to update, you additionally multiply with h_{t-1} transpose or with x_t transpose; for the bias you don't need to multiply with anything extra.
So these are essentially the ingredients that you need in order to compute the remaining update lines.
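Putting these ingredients together, a minimal sketch of the backward pass through time could look as follows. This assumes a forward pass that cached the hidden states `hs` and outputs `ys` (the names mirror the forward-pass sketch above and are illustrative, not the lecture's reference code); the output gradient uses the common (ŷ − y) simplification of sigmoid/softmax with cross entropy:

```python
# Sketch of backpropagation through time: walk backwards over the
# sequence, accumulating the gradients of all shared weight matrices.
import numpy as np

def backward(xs, targets, hs, ys, W_hh, W_hy):
    dW_xh = np.zeros((hs[0].size, xs[0].size))
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)
    db_h = np.zeros(hs[0].size)
    db_y = np.zeros(ys[0].size)
    dh_next = np.zeros(hs[0].size)      # gradient flowing in from step t+1
    for t in reversed(range(len(xs))):
        # gradient at the output: the common (y_hat - y) simplification
        # of cross entropy combined with a sigmoid/softmax output
        do = ys[t] - targets[t]
        dW_hy += np.outer(do, hs[t])
        db_y += do
        # gradient at the hidden state: local output path plus the
        # path coming back from the next hidden state
        dh = W_hy.T @ do + dh_next
        # go through the tanh: tanh'(u_t) = 1 - h_t**2
        g = (1.0 - hs[t] ** 2) * dh
        dW_xh += np.outer(g, xs[t])
        dW_hh += np.outer(g, hs[t - 1])
        db_h += g
        dh_next = W_hh.T @ g            # propagate to step t-1
    return dW_xh, dW_hh, dW_hy, db_h, db_y
```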
What we see now is that we can compute the gradients but they are dependent on t.
Now the question is how do we get the gradient for the sequence?
And now what we see is that the network that emerges in the unrolled state is essentially a network with shared weights, so the gradient for each shared weight is simply the sum of its contributions over all time steps.
Accessible via: Open Access
Duration: 00:15:58 min
Recording date: 2020-05-26
Uploaded: 2020-05-27 00:46:42
Language: en-US
Deep Learning - Recurrent Neural Networks Part 2
This video discusses training of simple RNNs using the backpropagation through time algorithm.
Video References:
Lex Fridman's Channel
Further Reading:
A gentle Introduction to Deep Learning