Good morning everybody.
Last lecture, are you excited?
The only correct answer.
Okay, then let's quickly recap what we talked about Tuesday so we can get to the interesting
stuff.
Recurrent neural networks, the idea is pretty simple.
We just have one layer which basically generates two outputs: the one that we're actually interested in, and a hidden vector which we concatenate with the next input and just feed back in.
So at each step we have a combination of the actual input x_t and the previous hidden state z_{t-1} of this particular layer.
If we want to train that, we basically just unroll the network over the entire input sequence and then do backpropagation through the whole thing.
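As a minimal sketch of that recurrence and the unrolling (with made-up dimensions, not necessarily the notation from the slides), it could look something like this in PyTorch:

```python
import torch

# Minimal sketch of the recurrence described above (hypothetical dimensions).
# z_t = tanh(W_x x_t + W_z z_{t-1} + b): the hidden vector z is fed back in
# together with the next input x_t.
d_in, d_hidden = 8, 16
W_x = torch.randn(d_hidden, d_in, requires_grad=True)
W_z = torch.randn(d_hidden, d_hidden, requires_grad=True)
b   = torch.zeros(d_hidden, requires_grad=True)

def rnn_step(x_t, z_prev):
    return torch.tanh(W_x @ x_t + W_z @ z_prev + b)

# "Unrolling" over a sequence: the same step is applied at every time step,
# and backpropagation through time just runs back through this whole loop.
xs = [torch.randn(d_in) for _ in range(20)]   # toy input sequence
z = torch.zeros(d_hidden)
outputs = []
for x_t in xs:
    z = rnn_step(x_t, z)
    outputs.append(z)

loss = outputs[-1].sum()   # placeholder loss
loss.backward()            # gradients flow back through all 20 steps
```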
The annoying thing, obviously, is that this means an RNN can a priori only look into the past of the sequence that we feed in.
So if we want to also be able to make choices based on future inputs for a particular position in the sequence we're interested in, the typical solution is to use bidirectional RNNs,
which is just a fancy way of saying we take two RNNs, one that goes forward in time, one
that runs backward in time, and at every step concatenate the outputs of both of them for
that particular input.
That gives us bidirectional RNNs.
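A hedged sketch of that idea, again with hypothetical sizes, just to make the "two RNNs plus concatenation" point concrete:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a bidirectional RNN is literally two RNNs, one reading
# the sequence forward and one reading it backward, with their outputs
# concatenated at every time step.
d_in, d_hidden, T = 8, 16, 20
fwd = nn.RNNCell(d_in, d_hidden)
bwd = nn.RNNCell(d_in, d_hidden)

xs = torch.randn(T, d_in)                 # toy sequence of length T

h_f = torch.zeros(1, d_hidden)
forward_states = []
for t in range(T):                        # forward in time
    h_f = fwd(xs[t].unsqueeze(0), h_f)
    forward_states.append(h_f)

h_b = torch.zeros(1, d_hidden)
backward_states = [None] * T
for t in reversed(range(T)):              # backward in time
    h_b = bwd(xs[t].unsqueeze(0), h_b)
    backward_states[t] = h_b

# Output at step t: concatenation of both directions' states for that input.
outputs = [torch.cat([forward_states[t], backward_states[t]], dim=-1)
           for t in range(T)]
print(outputs[0].shape)                   # torch.Size([1, 32])
```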
Now we still have this slightly annoying problem of vanishing gradients, which is just a fancy term for the fact that if I do backpropagation through sufficiently many layers, the gradients become smaller and smaller, and at some point the learning effect massively diminishes in the earlier layers of the network.
And if I have an RNN which I unroll over an entire sequence, conceptually for backpropagation that just means I have a whole bunch of layers, and the gradients vanish during the backpropagation process.
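A tiny toy illustration of that shrinking effect (not an example from the lecture, just repeated tanh steps standing in for the unrolled layers):

```python
import torch

# Push a value through many tanh "layers" and look at the gradient that
# arrives back at the input: each step multiplies the gradient by a factor
# smaller than one, so the early layers barely receive any learning signal.
x = torch.tensor(1.0, requires_grad=True)
h = x
for _ in range(50):          # 50 unrolled steps
    h = torch.tanh(0.5 * h)

h.backward()
print(x.grad)                # a tiny number, roughly on the order of 0.5**50
```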
So one possible solution for that is LSTMs, which are just a fancy technique of basically replacing the simple hidden vector that I reuse at every time step with something smarter, namely an additional cell state vector which, instead of just being multiplied the way neural networks usually work, is updated additively, to avoid the gradients vanishing.
Conceptually, the way that works is I have these three gates, quote unquote, the forget gate, the input gate, and the output gate, which basically govern which components of the current state vector I keep, which ones I drop, which ones I modify based on the current time step, and so on and so forth.
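One common way to write a single LSTM step is sketched below, using the standard gate formulation (the symbols and exact form on the slides may differ; biases are omitted for brevity):

```python
import torch

# Hedged sketch of one LSTM step. f, i, o are the forget, input, and output
# gates; c is the additively updated cell state, h the hidden output.
d_in, d_hidden = 8, 16

def lstm_step(x_t, h_prev, c_prev, params):
    W_f, W_i, W_o, W_c = params          # each maps [x_t; h_prev] -> d_hidden
    z = torch.cat([x_t, h_prev])
    f = torch.sigmoid(W_f @ z)           # which components of c to keep
    i = torch.sigmoid(W_i @ z)           # which new information to let in
    o = torch.sigmoid(W_o @ z)           # which components to expose as output
    c_tilde = torch.tanh(W_c @ z)        # candidate update
    c = f * c_prev + i * c_tilde         # additive update: gradients survive
    h = o * torch.tanh(c)
    return h, c

params = [torch.randn(d_hidden, d_in + d_hidden) for _ in range(4)]
h, c = torch.zeros(d_hidden), torch.zeros(d_hidden)
h, c = lstm_step(torch.randn(d_in), h, c, params)
```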
Then sequence-to-sequence models: the typical application, which is very instructive when we just want to understand how these things work, is neural machine translation, i.e. feed in a sentence in one language and expect to get the same sentence out in a different language.
But of course, conceptually the whole thing works for any kind of problem where I need to convert some input sequence into some output sequence and I don't have a clear one-to-one correspondence between the individual elements of the input sequence and the output sequence.
And here one nice technique is the encoder-decoder architecture, which is: I take two LSTMs, i.e. two recurrent neural networks, one of which only serves to encode the input sequence; after I have fed the entire sequence in, that gives me some hidden vector. Then I take a different LSTM model as the decoder, which gets fed the hidden state of the encoder network and then generates output steps until I have the sequence that I'm actually interested in.
So it's basically just a concatenation of two LSTM networks, one that takes care of encoding the input and one that takes care of generating the output.
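A rough sketch of that two-LSTM setup, with hypothetical sizes, a made-up start symbol, greedy decoding, and no attention, just to show where the encoder's final state goes:

```python
import torch
import torch.nn as nn

# Encoder-decoder sketch: the encoder's final hidden state is handed to a
# separate decoder LSTM, which then produces output steps one by one.
d_emb, d_hidden, vocab_out, max_len = 32, 64, 100, 15
encoder  = nn.LSTM(d_emb, d_hidden, batch_first=True)
decoder  = nn.LSTM(d_emb, d_hidden, batch_first=True)
out_proj = nn.Linear(d_hidden, vocab_out)     # hidden state -> output token scores
out_emb  = nn.Embedding(vocab_out, d_emb)     # embeds the previously produced token

src = torch.randn(1, 10, d_emb)               # toy embedded input sequence (batch=1)
_, (h, c) = encoder(src)                      # keep only the final hidden state

token = torch.zeros(1, 1, dtype=torch.long)   # assume index 0 is a start symbol
outputs = []
for _ in range(max_len):                      # fixed length here; a real model
    step, (h, c) = decoder(out_emb(token), (h, c))  # would stop at an end symbol
    token = out_proj(step).argmax(dim=-1)     # greedy choice of the next token
    outputs.append(token.item())

print(outputs)
```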