Welcome back to deep learning! Today we want to talk a little bit more about recurrent neural networks and, in particular, look into the training procedure.

Okay, so how does RNN training work? Let's look at a simple example: we start with a character-level language model.

So we want to learn a character probability distribution from an input text, and our vocabulary is going to be very simple.

It consists of the letters H, E, L, and O, and we will encode them as one-hot vectors, which then gives us, for example, for H the vector (1, 0, 0, 0).
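
To make this concrete, here is a minimal NumPy sketch of the one-hot encoding; ordering the letters as H, E, L, O is an assumption that matches the vector shown for H:

```python
import numpy as np

# Hypothetical vocabulary ordering; the video only fixes the four letters H, E, L, O.
vocab = ['H', 'E', 'L', 'O']
char_to_index = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    """Encode a single character as a one-hot column vector."""
    vec = np.zeros((len(vocab), 1))
    vec[char_to_index[char]] = 1.0
    return vec

print(one_hot('H').ravel())  # [1. 0. 0. 0.]
```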

Now we can go ahead and train an RNN on the sequence HELLO, and we should learn that, given H as the first input, the network should complete the sequence HELLO.

Now the network needs to remember previous inputs when presented with an L, because it needs to know whether to generate another L or an O.

It is the same input but two different outputs, so the network has to know the context.

Let's look at this example and here you can already see how the decoding takes place.

So we essentially feed the inputs into the input layer, again as one-hot encoded vectors.

Then we produce the hidden state h_t with the matrices that we have seen previously and compute the outputs.

And you can see, now we feed in the different letters, and this produces outputs that can then be mapped back to letters via the one-hot encoding.

So this gives us essentially the possibility to run over the entire sequence and produce the desired outputs.
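
As a rough sketch of such a run over the sequence, the following NumPy snippet uses randomly initialized weights (so the predicted letters are meaningless until training), performs one recurrent step per input letter, and maps each output back to a letter via the index of its largest entry; the hidden size and the tanh/sigmoid choices anticipate the forward pass described below and are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['H', 'E', 'L', 'O']
hidden_size = 8                                        # arbitrary choice for this sketch

# Randomly initialized weights; training (discussed below) would adjust them.
W_xh = 0.1 * rng.standard_normal((hidden_size, len(vocab)))
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))
W_hy = 0.1 * rng.standard_normal((len(vocab), hidden_size))
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((len(vocab), 1))

def rnn_step(x_t, h_prev):
    """One recurrent step: new hidden state and output activation."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_hat = 1.0 / (1.0 + np.exp(-(W_hy @ h_t + b_y)))  # sigmoid output
    return h_t, y_hat

h = np.zeros((hidden_size, 1))
for char in 'HELL':
    x = np.zeros((len(vocab), 1))
    x[vocab.index(char)] = 1.0                         # one-hot input
    h, y_hat = rnn_step(x, h)
    print(char, '->', vocab[int(np.argmax(y_hat))])    # map the output back to a letter
```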

And now for the training, well the problem is how can we determine all of these weights?

And of course we want to optimize these weights with respect to predicting the correct next character.

And this can all be achieved with the backpropagation through time algorithm, and the idea is that we train on the unfolded network.

So here's a short sketch on how to do this.

So the idea is that we unfold the network: we compute the forward pass for the full sequence, and then we can apply the loss.

We then essentially backpropagate over the entire sequence, such that even things that happen in the very last state can have an influence on the very beginning.

So we compute the backward pass through the full sequence to get the gradients and then the weight update.

So for one update with backpropagation through time, I have to unroll the complete network that is then generated by the input sequence.

And then I can compare the output that was created with the desired output and compute the update.
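
As a minimal illustration of this procedure, the following PyTorch sketch unrolls the forward pass over the sequence, applies the loss, and lets autograd perform the backward pass through the unrolled graph instead of the hand-derived gradients discussed later in this video; the hidden size, learning rate, number of iterations, and the use of a softmax cross-entropy (rather than the elementwise sigmoid output) are assumptions made for brevity:

```python
import torch

torch.manual_seed(0)
vocab_size, hidden_size = 4, 8          # hidden size is an arbitrary choice for this sketch

# Trainable parameters, named after the matrices used later in the video.
W_xh = (0.1 * torch.randn(hidden_size, vocab_size)).requires_grad_()
W_hh = (0.1 * torch.randn(hidden_size, hidden_size)).requires_grad_()
W_hy = (0.1 * torch.randn(vocab_size, hidden_size)).requires_grad_()
b_h = torch.zeros(hidden_size, 1, requires_grad=True)
b_y = torch.zeros(vocab_size, 1, requires_grad=True)
params = [W_xh, W_hh, W_hy, b_h, b_y]

def forward(inputs):
    """Unrolled forward pass over the full input sequence."""
    h = torch.zeros(hidden_size, 1)
    logits = []
    for x_t in inputs:
        h = torch.tanh(W_hh @ h + W_xh @ x_t + b_h)    # new hidden state
        logits.append(W_hy @ h + b_y)                  # linear output for this step
    return torch.stack(logits).squeeze(-1)

# Training pair "HELL" -> "ELLO" with the encoding H=0, E=1, L=2, O=3.
x_idx = torch.tensor([0, 1, 2, 2])
y_idx = torch.tensor([1, 2, 2, 3])
inputs = torch.nn.functional.one_hot(x_idx, vocab_size).float().unsqueeze(-1)

for _ in range(200):
    loss = torch.nn.functional.cross_entropy(forward(inputs), y_idx, reduction='sum')
    loss.backward()                                    # backward pass through the unrolled graph
    with torch.no_grad():
        for p in params:                               # plain gradient-descent update
            p -= 0.1 * p.grad
            p.grad = None
```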

So let's look at this in a little bit more detail.

The forward pass is of course just the computation of the hidden states and the output.

So we know that we have some input sequence x_1 to x_T, where T is the sequence length.

And now, for each time step, we compute u_t, which is the linear part before the respective activation function.

And then we apply the activation function to get our new hidden state h_t.

Then we compute o_t, which is essentially the linear part before the sigmoid function.

And then we apply the sigmoid to produce ŷ_t, which is essentially the output of our network.
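
Written out as formulas, the forward pass just described reads as follows (a sketch using the matrix and bias names W_xh, W_hh, W_hy, b_h, b_y that appear later in this video, with tanh as the hidden activation):

```latex
\begin{align*}
u_t       &= W_{hh} h_{t-1} + W_{xh} x_t + b_h  && \text{linear part before the activation} \\
h_t       &= \tanh(u_t)                          && \text{new hidden state} \\
o_t       &= W_{hy} h_t + b_y                    && \text{linear part before the sigmoid} \\
\hat{y}_t &= \sigma(o_t)                         && \text{output of the network}
\end{align*}
```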

If we do so, then we can unroll the entire network and produce all of the respective information that we need to then actually compute the update for the weights.

Now, for backpropagation through time, we essentially first need a loss function.

And the loss function essentially sums up the losses that we already know from our previous lectures, but now we sum them over the actual observations at every time step t.

So we can, for example, take cross entropy.

Then we compare the predicted output with the ground truth.

And then we compute the gradient of the loss function in a similar way as we already know it.

Here we want to get the parameter update for our parameter vector θ, which is composed of the three weight matrices, the two bias vectors, and the initial hidden state vector.

So the update of the parameters can then be done using a learning rate, in a very similar way as we have been doing throughout the entire class.
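
As a compact summary, with the cross-entropy as the example per-time-step loss, η as the learning rate, and θ collecting the three weight matrices, the two bias vectors, and the initial hidden state:

```latex
\begin{align*}
L(\hat{y}, y) &= \sum_{t=1}^{T} L_t(\hat{y}_t, y_t),
  \qquad L_t(\hat{y}_t, y_t) = -\sum_{k} y_{t,k} \log \hat{y}_{t,k} \quad \text{(cross-entropy)} \\
\theta &\leftarrow \theta - \eta \, \nabla_{\theta} L(\hat{y}, y)
\end{align*}
```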

Now the question is, of course, how do we get those derivatives? And the idea is now to go back in time through the entire network.

So what do we do? Well, we start at time t = T and then iteratively compute the gradients going backwards down to t = 1.

So just keep in mind that our ŷ_t was produced by the sigmoid of o_t, which is computed from the hidden state h_t using the weight matrix W_hy and the bias b_y.

So if we want to compute the gradient with respect to o_t, then we need the derivative of the sigmoid function at o_t times the partial derivative of the loss function with respect to ŷ_t.

Now you can see that the gradient with respect to W_hy is given as the gradient of o_t times h_t transposed.

And the gradient with respect to the bias b_y is simply given as the gradient of o_t.
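
In formulas, these gradients on the output side read as follows (⊙ denotes the elementwise product; the gradients for W_hy and b_y are accumulated over all time steps):

```latex
\begin{align*}
\nabla_{o_t} L    &= \sigma'(o_t) \odot \frac{\partial L}{\partial \hat{y}_t} \\
\nabla_{W_{hy}} L &= \sum_{t=1}^{T} (\nabla_{o_t} L)\, h_t^{\top} \\
\nabla_{b_y} L    &= \sum_{t=1}^{T} \nabla_{o_t} L
\end{align*}
```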

Okay, so the gradient with respect to h_t now depends on two elements: the output o_t, which is influenced by h_t, and the next hidden state h_{t+1}.

So we can get the gradient of h_t as the partial derivative of h_{t+1} with respect to h_t, transposed, times the gradient of h_{t+1}. And then we still have to add the partial derivative of o_t with respect to h_t, transposed, times the gradient of o_t.

And this can then be expressed as W_hh transposed times the derivative of the hyperbolic tangent, evaluated at W_hh h_t + W_xh x_{t+1} + b_h, multiplied with the gradient of h_{t+1}, plus W_hy transposed times the gradient of o_t.
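
Writing this out as one formula (a sketch in which diag(·) builds a diagonal matrix from the elementwise tanh derivative; at the last step t = T only the second term remains, since there is no h_{T+1}):

```latex
\begin{align*}
\nabla_{h_t} L
  &= \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top} \nabla_{h_{t+1}} L
   + \left(\frac{\partial o_t}{\partial h_t}\right)^{\!\top} \nabla_{o_t} L \\
  &= W_{hh}^{\top}\, \mathrm{diag}\!\left(\tanh'\!\left(W_{hh} h_t + W_{xh} x_{t+1} + b_h\right)\right) \nabla_{h_{t+1}} L
   + W_{hy}^{\top}\, \nabla_{o_t} L
\end{align*}
```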

So you can see that we can also implement this gradient computation entirely with matrix operations.

And with this, you already have all the gradients for the hidden states.

Now we also want to compute the updates for the other weight matrices.

So let's see how this goes. We have now essentially established the way of computing the derivative with respect to h_t.

Part of a video series

Accessible via: Open Access
Duration: 00:15:33 min
Recording date: 2020-10-12
Uploaded on: 2020-10-12 19:26:19
Language: en-US

Deep Learning - Recurrent Neural Networks Part 2

This video discusses training of simple RNNs using the backpropagation through time algorithm.

Further Reading:
A gentle Introduction to Deep Learning
