33 - Deep Learning - Recurrent Neural Networks Part 3

Welcome back to deep learning. Today I want to show you an alternative solution to the vanishing gradient problem in recurrent neural networks.

They had to publish it because I was right.

As you already noticed, to handle long temporal contexts we will talk about long short-term memory units.

LSTMs were introduced by Hochreiter and Schmidhuber and published in 1997.

They were designed to solve the vanishing gradient problem for long-term dependencies.

The main idea is to introduce gates that control writing to and accessing the memory in additional state cells.

So let's have a look into the LSTM unit.

You see here that one main feature is that we now have essentially two things that could be considered a hidden state.

We have the cell state C and we have the hidden state H.

Again we have some input X, then we have quite a few activation functions, and we somehow combine them

and in the end we produce some output YT.

But this unit is much more complex than what we've seen previously in the Elman cell.

Okay, so what are the main features?

The LSTM gets some input XT, then it produces a hidden state.

It also has the cell state, which we'll look at in a little more detail in the next couple of slides, to produce the output YT.

Now we have several gates and the gates essentially are used to control the flow of information.

There's a forget gate and this is used to forget old information in the cell state.

Then we have the input gate, and this essentially decides which new input goes into the cell state,

and from this we then compute the updated cell state and the updated hidden state.

So let's look into the workflow.

We have the cell state after each time point T and the cell state undergoes only linear changes,

so there's no activation function.

You see there's only a multiplication and an addition on the path of the cell state.

So the cell state can flow through the unit unchanged and the cell state can be constant for multiple time steps.

Now we want to operate on the cell state and we do that with several gates and the first one is going to be the forget gate.

The key idea here is that we want to forget information from the cell state

and in another step we then want to think about how to actually put new information in the cell state that is going to be like memorizing things.

So the forget gate FT controls how much of the previous cell state is forgotten

and you can see FT is computed by a sigmoid function, so it's somewhere between 0 and 1

and it's essentially computed by a matrix multiplication with the concatenation of the hidden state and XT, plus some bias.

And this is then multiplied to the cell state so we decide which parts of the cell state vector to forget and which ones to keep.
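Written out, the forget step just described corresponds to the standard LSTM formulation; the weight matrix W_f, bias b_f, and the concatenation [h_{t-1}, x_t] follow the common notation and are assumed to match the slides:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

and the forget gate is then applied pointwise to the previous cell state as f_t \odot C_{t-1}.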

Now we also need to put in new information and for the new information we have to somehow decide what information to input into the cell state.

So here we need two activation functions, one that we call IT that is also produced by a sigmoid activation function,

again a matrix multiplication of the hidden state concatenated with the input plus some bias and the sigmoid function as non-linearity.

Remember this value is going to be between 0 and 1 so you could argue that IT is kind of selecting something.

Then we have some C tilde, which is a kind of update state that is produced by the tangens hyperbolicus,

and this takes as input a weight matrix WC multiplied with the concatenation of hidden and input vector, plus some bias bC.

So essentially we have this IT that is then multiplied with the intermediate cell state C tilde,

and we could say that the tangens hyperbolicus is producing some new cell state, and then we select via IT which of these indices should be added to the current cell state.

So we multiply the newly produced C tilde with IT and add the result to the cell state CT.
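In the same notation as before, the input gate and the intermediate (candidate) cell state described here can be written as:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)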

Now we update the cell state: as we've just seen, the complete update is a pointwise multiplication of the previous cell state with the forget gate,

and then we add the elements of the intermediate cell state C tilde that have been selected by IT, again with a pointwise multiplication.

So you see the update of the cell state is completely linear only multiplications and additions.
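Putting the two gates together, the complete cell state update, which is purely linear in the cell state, reads:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t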

Now we still have to produce the hidden state and the output, and as we've seen in the Elman cell, the output of our network only depends on the hidden state.

So we first update the hidden state with another nonlinearity that is then multiplied with a transformation of the cell state.

This gives us the new hidden state and from the new hidden state we produce the output with another nonlinearity.

So you see these are the update equations so we produce some OT which is essentially a proposal for the new hidden state by a sigmoid function

and then we multiply it with the tangens hyperbolicus that is generated from the cell state in order to select which elements are actually produced.

This gives us the new hidden state and with the new hidden state we can then pass it through another nonlinearity in order to produce the output.
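In equations, with the output gate written as o_t and, as noted next, without extra transformation matrices on the hidden state update and the output:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

h_t = o_t \odot \tanh(C_t)

y_t = \sigma(h_t)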

You can see here, by the way, that for the update of the hidden state and the production of the new output we omitted the transformation matrices.
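To summarize the whole unit, here is a minimal NumPy sketch of one LSTM forward step following the equations above. The parameter names (W_f, W_i, W_c, W_o, W_y) and the explicit output transformation W_y are illustrative assumptions rather than the lecture's exact notation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM forward step; p holds weight matrices W_* and biases b_*."""
    z = np.concatenate([h_prev, x_t])             # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])    # intermediate (candidate) cell state
    c_t = f_t * c_prev + i_t * c_tilde            # purely linear cell state update
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                      # new hidden state
    y_t = sigmoid(p["W_y"] @ h_t + p["b_y"])      # output (W_y, b_y are assumed extras)
    return y_t, h_t, c_t

# Usage example with random parameters: hidden size 4, input size 3.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
shapes = {
    "W_f": (n_h, n_h + n_x), "b_f": (n_h,),
    "W_i": (n_h, n_h + n_x), "b_i": (n_h,),
    "W_c": (n_h, n_h + n_x), "b_c": (n_h,),
    "W_o": (n_h, n_h + n_x), "b_o": (n_h,),
    "W_y": (n_h, n_h), "b_y": (n_h,),
}
p = {name: rng.standard_normal(shape) for name, shape in shapes.items()}
y, h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), p)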
