I guess due to the traffic jam on the highway or on the way to Erlangen Süd, they are stuck.
So let's begin with this lecture anyway.
I'm Vincent Christlein, the second-to-last author here on this long author list.
I was actually responsible for this slide set, and since Andreas Maier is unfortunately sick, I will give this presentation today.
And it's going to be about unsupervised deep learning.
So the motivation: so far in the lecture we had huge data sets, for example the ImageNet data set, which goes up to millions of images.
We have many variations in the objects, but very few modalities.
In contrast, in medical imaging we have a problem here.
We typically have only very small data sets of maybe 30 to 100 patients, and that's already quite a lot.
There are studies and PhD theses which have only three patients, and that is very few.
And unfortunately, typical deep learning methods then don't apply to such data sets.
But here we will try to find methods which can use additional data.
So we have a high variation within one complex object, for example a chest X-ray, but in contrast many modalities.
So it can look very different – we have a high variation here.
And we actually have quite a lot of scans: 5 million CT scans alone in the year 2014, which is roughly 65 CT scans per 1,000 inhabitants.
I guess some of you have already had a CT scan; I had at least one.
But it is highly sensitive data. There is an ongoing trend to make this data available, but currently that's not the case – and maybe that's good, although for science and research it is of course bad.
Many of you probably don't want to show what illnesses you had or have.
And the difficulty here: even where more data is available, it is quite difficult and expensive to obtain annotated training data, because the annotation has to be done by physicians, who are really the experts in this domain.
And so some solutions exist.
So one solution is weakly supervised learning.
Weakly supervised learning means that you have, for example, a detection or a segmentation task, but you only have a rough boundary or bounding box of the object, and then you can use weakly supervised learning to actually obtain a fine-grained segmentation.
This can be done, for example, with methods that look at where the network is basically looking, and then you get the object localization for free, kind of.
And this is done, for example, here in these examples, in the image example.
So if you train a classifier for brushing teeth and you always feed it the full images, in the end the classifier will learn what the toothbrush actually is.
But you see it also reacts a little bit to the hand, because the hand is always holding a toothbrush.
And here also for cutting trees: it also responds to the helmet, maybe you can see that, on the face here.
Maybe you can see that. I'm not so sure. I don't have a mouse here to point with.
Okay. Anyway.
So it has also some drawbacks here.
And then we can do semi-supervised learning.
That means we have a large dataset, but only a part of it is actually labeled.
And then we have the task of unsupervised learning, where we have no labeled data at
all.
And today we will mainly talk about the last two parts, but more on the unsupervised part and not so much on the semi-supervised one.
We will see one trick for how to make it semi-supervised, but semi-supervised learning is its own area of research.
So one application of label-free learning is, for example, dimensionality reduction.
If you have data on a manifold, you want to reduce the data in such a way that you can still work with it in a meaningful manner.
If you now apply PCA here, you would basically fail to get a meaningful dimensionality reduction.
If you, however, use a nonlinear PCA, for example, then this could work.
Or here, I think this is probably an Isomap embedding or something.
This can, of course, also be used to initialize a neural network.
So we first train our neural network in an unsupervised fashion, and then use the first part of the network, for example up to some layer in the middle here, as an initialization for a supervised task.
Then we have a quite good pre-trained network, and we just change the last layer, or the last few layers, according to our classes.
And yeah, this is a kind of transfer learning, of course.
You can, of course, also do this in a supervised manner: you take a network pre-trained on ImageNet, but that one was trained fully supervised, in contrast to the network depicted here on top, which was trained unsupervised.
And maybe you also just want to cluster: you have your trained features, and now you want to cluster them, i.e. find which samples are similar to each other and which are dissimilar.
And we have applications as generative models – quite a big part of this lecture will also be about generative models, where we try to generate realistic samples.
So in this example, we want to make images that really look like they come from the data set.
And I guess none of you could tell that the samples on the right were actually made up by the algorithm.
So there is no supervision here; it just comes out.
If we have missing data in this generation task, we basically fall into the category of semi-supervised learning.
And we can also do cool stuff, like image to image translation.
We will see that later what that means.
And we can also simulate possible futures, but this goes more in the direction of reinforcement learning.
I don't know if that lecture has already taken place – okay.
So yeah, generative networks, GANs, and reinforcement learning are actually connected to each other to some degree.
Okay.
So today, we will talk about label-free learning, and we have three topics here.
First, restricted Boltzmann machines – they are outdated.
Yeah, we can say that directly.
But from a historical point of view, we wanted to include them.
They were the building blocks of deep belief networks.
So you stack multiple restricted Boltzmann machines, or RBMs for short, together, and then you build up a deep belief network.
They were basically the first models which brought deep learning back, which made it so popular.
Hinton et al. came up with that.
And then we have autoencoders.
The other big grandfather of deep learning – well, grandfather always sounds strange.
But Bengio mainly came up with that.
Autoencoders also do nonlinear dimensionality reduction, and you can extend them to be a generative model, for example with variational autoencoders, which part of the lecture will also be about.
And then at the last part, we will talk about generative adversarial networks, a quite neat
way to generate very good-looking images.
And that's also currently the most widely used generative model we have.
The concept of generative adversarial training can also be used for segmentation, reconstruction, semi-supervised learning, and more.
The basic idea is two networks fighting against each other, and that's a really nice idea.
And now we have many applications in this domain.
OK, let's first talk about restricted Boltzmann machines.
So in restricted Boltzmann machines, we have some visible units.
This is our input data.
Could be an image.
Typically an image in our case, but it could of course also be speech or another signal.
And then we have hidden units.
You can think of that like in a neural network, where we also have hidden layers.
But these are not meant as hidden layers here; these are latent variables.
So they learn a representation of the input.
And the big question is now, how do we actually learn this representation?
So here in this lecture, we assume that we have a binary, so also called Bernoulli visible
units, V, and binary hidden units, H.
And now comes a little bit of math, but we keep this chapter on restricted Boltzmann machines quite short, because they are not so relevant anymore.
Okay, so the whole thing is that we have here a Boltzmann distribution.
This is the first term, yeah?
And it comes from physics, it's an energy-based model, and we have this joint probability
function P with the visible units V and H. So this is our Boltzmann distribution.
And it looks, if you have a look at the equation, it looks quite a lot like a softmax function,
yeah?
It's very similar to that.
So we have a function Z here.
Maybe I have a laser pointer even, and I can point there.
It's a little bit strange if you kind of point on that.
So the Z is just a normalization constant here, and then we have an exponential term,
yeah?
And then we have some biases: b and c, these are our biases, and we have the connection matrix from our visible layer to the hidden layer – this is the W.
And be careful: this is not a fully connected feed-forward layer in the usual sense.
So it models the input layer in a stochastic manner, and this is the big difference here.
And it's trained totally unsupervised.
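For reference, the Boltzmann distribution of a binary RBM in standard notation (a sketch; the exact notation on the slides may differ slightly):

$$ p(v, h) = \frac{1}{Z}\, e^{-E(v, h)}, \qquad E(v, h) = -b^\top v - c^\top h - v^\top W h, \qquad Z = \sum_{v, h} e^{-E(v, h)} $$

Here Z is the normalization constant, b and c are the visible and hidden biases, and W is the connection matrix between visible and hidden units.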
And how do we now get basically this joint probability distribution?
Oh, no, okay, sorry.
Now first of all, the training of RBMs.
We can visualize this as a bipartite graph.
And basically you can say that an RBM is a Markov random field with hidden variables.
And we want to find our W such that the Boltzmann distribution is high for low energy states
and low for high energy states.
And the learning is based on gradient descent on the negative log-likelihood, but we now have a distribution here which is quite complicated to handle directly.
What we have here is a Markov random field, and in Markov random fields you typically do a Markov chain Monte Carlo estimation.
And first of all, we build up our log-likelihood.
So the log-likelihood of our parameters given v is based on our Boltzmann distribution – yeah, please?
This is related to this energy term E here.
Where do we have that? So this is this one: E of v and h, this is what is meant by that.
This comes from physics – it is the energy of the state.
Okay.
Okay.
And then we can compute the log-likelihood here.
We just plug it in and take the logarithm of that.
So this is the same equation here.
And the denominator you can bring to the right as a minus term because of the logarithm.
So this is just a normal derivation of that equation.
And now we have a quite lengthy derivation of the derivative here.
And if you really want to see it, how that evolves, please have a look at this reference.
It goes over five, six lines or something.
So we wanted to spare that here and make it short.
So the short version is: we have a data term, which we can compute, but we also have a model term.
And the model term is unfortunately intractable.
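In standard notation, the gradient of the log-likelihood splits into exactly these two expectations (a sketch of the result whose lengthy derivation is skipped here; θ stands for any of W, b, c):

$$ \frac{\partial \log \mathcal{L}(\theta \mid v)}{\partial \theta} \;=\; -\sum_{h} p(h \mid v)\,\frac{\partial E(v, h)}{\partial \theta} \;+\; \sum_{v', h} p(v', h)\,\frac{\partial E(v', h)}{\partial \theta} $$

The first sum is the tractable data term; the second sum, running over all visible and hidden configurations, is the intractable model term.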
And here the sampling comes into play.
So here we need a Markov chain Monte Carlo technique, basically.
And we do this by Gibbs sampling.
And this Gibbs sampling part is part of the contrastive divergence algorithm.
And contrastive divergence is actually not very complicated, yeah.
So you take any arbitrary sample as your visible unit.
You start with that.
It's of course good if that's a typical sample, yeah.
And then we set the binary states of the hidden units, typically via a sigmoid function.
So the probability that a hidden unit is one, given our visible unit – our input training sample – is computed by a sigmoid of the current visible values together with our bias c and our weights W, and the binary state is sampled from that probability.
And then we use Gibbs sampling here.
You can run this k times, but in practice it has been shown that k equal to one is totally sufficient, so you only sample once.
You make this reconstruction then only once.
And the reconstruction means we reconstruct our visible unit by computing the probability that the visible unit is one, given now our hidden state; this again is modeled by a sigmoid function.
Here we plug in what we already computed for the hidden units h.
Then we compute this and resample h, so we get a new estimate of it – that's step two again.
We could do this multiple times, but in practice once is sufficient.
And then we have the derivatives with respect to our weights, our bias b, and our c term.
These are approximate derivatives, not real derivatives in that sense.
They are built from the difference between our original sample and our reconstruction of v and h, and multiplied with a certain learning rate this gives our weight update.
Then we can update the weights as usual – but this is not really gradient descent; there is no backpropagation involved here.
And the same holds for the bias and for the c term.
Is that so far clear?
I mean, it's contrastive divergence.
It's not very super complicated.
But the whole theory behind it is actually quite complicated.
So we spare that here quite a lot in the lecture.
You could hold a complete lecture only on RBMs, but that's not what we wanted here.
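To make the recipe concrete, here is a minimal CD-1 update for a binary RBM as a NumPy sketch; the variable shapes, the learning rate, and the random generator are my assumptions, not the lecture's exact notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (k=1) step for a binary RBM.
    v0: (n_visible,) binary training sample,
    W:  (n_visible, n_hidden) weights, b: visible bias, c: hidden bias."""
    # positive phase: sample binary hidden states given the data
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: reconstruct visible units, then recompute hidden probabilities
    pv1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)
    # approximate gradients: data term minus reconstruction ("model") term
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```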
Okay.
So and then as said in the beginning, we can now stack multiple RBMs on top of each other.
And typically they are trained then iteratively.
So you train first the restricted Boltzmann machine of layer one.
And then you put another layer on top, so the output of RBM number one is the input for RBM number two.
And you train them basically layer by layer.
I don't know how deep the first deep belief network was.
I think it was only two layers.
And the last layer, you could also fine tune then, for example, with back propagation for
a classification task.
And yeah, and this was one of the first successful deep architectures and kind of sparked the
deep learning renaissance.
But nowadays I have not seen any publications on them for several years.
I used them at the very beginning of my time at the lab.
We used them for finding good features of clouds: we wanted to predict the energy output of solar panels.
So if clouds move in front of the sun and cover it, we of course expect a low energy output from our solar panels.
We used RBMs to compute features for the clouds.
But nowadays we would not use that anymore.
Okay, now comes something which is still used quite a lot in practice: autoencoders.
An autoencoder is a kind of special case of a feed-forward neural network.
We have an encoder, a network F, which takes our input X, and we get an output Y – these would be the activations Y.
And the question is: how do we get a loss on that hidden layer Y?
There's a quite simple trick.
We just say: okay, we want to reconstruct our input with a decoder network.
So we have a decoder here, again a neural network, which comes afterwards, takes our hidden layer as input and reconstructs our input as its output.
And then the loss can be, for example, mean squared error between the input and the output.
And the autoencoder tries to learn here an approximation of the identity.
But if we would learn the identity, it would be not so good.
So here just some loss functions.
Yeah, we want to minimize, we want to maximize the log likelihood, so minimize the negative
log likelihood.
And you can use the squared L2 norm, so mean squared error.
Or, if you have input samples which are zero or one, so Bernoulli distributed, then you can also use the binary cross entropy here.
By the way, this term also works if you don't have exactly zero or one but values between zero and one.
But then be aware that this term will never become zero: even if you put in exactly the same value for x and its reconstruction x', you won't get zero here.
That's just because cross entropy works like that.
Okay.
So you don't want to learn the identity in the middle.
If you had as many neurons in the hidden layer as in the input, the network would of course just pass the input through directly.
And this is not what you want, the identity.
If it learned the identity, it would just forward-propagate the input; it would solve the task perfectly, but our features would still be meaningless.
So we have to prevent this, and we can do this simply by reducing the number of neurons in the hidden layer.
If we have an autoencoder with linear layers and a squared L2 norm, then it resembles a PCA.
If we have nonlinear layers instead, for example with ReLU or whatever, then it is a generalization of PCA to the nonlinear domain.
And that's quite cool, actually, if you think about it: you can simulate a PCA with an autoencoder.
You can then use it as an initialization, for example, and of course also do nonlinear PCA.
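As a small illustration, here is a minimal undercomplete autoencoder with nonlinear layers and an MSE reconstruction loss, sketched in PyTorch; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

# Minimal undercomplete autoencoder: 784 -> 32 -> 784 (sizes are assumptions).
class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                     nn.Linear(128, n_code))
        self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(),
                                     nn.Linear(128, n_in))

    def forward(self, x):
        y = self.encoder(x)          # bottleneck code, the "hidden layer Y"
        return self.decoder(y), y    # reconstruction and code

model = Autoencoder()
x = torch.rand(16, 784)              # dummy batch
x_hat, code = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction loss
loss.backward()
```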
If your middle layer is now not smaller than your input, you have to enforce in some other way that we don't learn the identity here, and you can do that by adding a regularization term.
But you have to be a little bit careful here.
You don't apply the regularization on the weights, but on the activations.
This is quite important.
As regularization, you can use the L1 norm or a KL divergence term.
It also works if you regularize with the L2 norm, but then of course the units won't necessarily be sparse anymore.
For sparse activations, you typically use the L1 norm.
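A minimal sketch of such a sparsity regularizer in PyTorch – note that the L1 penalty is applied to the code activations, not to the weights; the sizes and the weighting factor are assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # code not smaller than input (sizes assumed)
decoder = nn.Linear(256, 784)

x = torch.rand(16, 784)
code = encoder(x)
x_hat = decoder(code)

sparsity_weight = 1e-3                                     # assumed hyperparameter
# reconstruction loss plus L1 penalty on the activations (not the weights!)
loss = nn.functional.mse_loss(x_hat, x) + sparsity_weight * code.abs().mean()
loss.backward()
```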
Okay.
Then just some important terms here, so that you know that autoencoders can of course also be built with CNNs.
The layers can be convolutional layers, and these are then called convolutional autoencoders.
Another interesting thing is that autoencoders, since we have this reconstruction term here, can be used for denoising.
If you put some noise on the input and, as reconstruction target, you require that the output is again the original sample X, then the network will learn to denoise the input.
So looking at this equation again: you take the input, put some noise on it, and for the output you require that it equals the original input.
And then the network learns how to denoise the input.
That's also quite nice.
And of course, you can use here different noise models, like here, Gaussian distributed
noise.
You could also randomly set elements to zero.
This is then kind of like a dropout, so on the input.
If you just set some to zero, this is a form of dropout on the input.
And this is then, of course, also some form of regularization.
For basically every autoencoder you train, you typically do that, because it's a very good form of regularization.
It's not a good denoiser, though.
Please don't use autoencoders for image denoising; there are much better methods, although in theory you could.
But because it somehow reduces the noise, the name denoising autoencoder stuck.
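A minimal sketch of the denoising setup in PyTorch: corrupt the input, but compute the reconstruction loss against the clean original; the noise levels and sizes are assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)

x = torch.rand(16, 784)                                 # clean input
x_noisy = x + 0.1 * torch.randn_like(x)                 # additive Gaussian noise ...
# x_noisy = x * (torch.rand_like(x) > 0.2).float()      # ... or randomly zero inputs (dropout-like)

x_hat = decoder(encoder(x_noisy))                       # reconstruct from the corrupted input
loss = nn.functional.mse_loss(x_hat, x)                 # target is the clean original
loss.backward()
```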
Okay.
We can also use this as a generative model.
Denoising autoencoders implicitly estimate the underlying data generation process: given a noise model C(x̂ | x), the denoising autoencoder learns the probability distribution p(x | x̂).
And so we can just sample from that.
The intuition is: if X is a typical sample and we iteratively apply the noise and the denoising, we will hopefully reproduce the sample very often.
For the estimator, this is also a form of Markov chain sampling, alternating between the corruption process and the denoising model.
Yeah.
So you can do this with Markov chain sampling in this form, but variational autoencoders, which we will discuss very soon, are much more common.
Instead of using the sampling trick, you can use variational inference to come up with this distribution.
Okay.
Some more term.
We have stacked autoencoders, and in the original form they were trained iteratively.
So we have an autoencoder which has one hidden layer in the middle, and you use the features it generates as input for a second autoencoder.
This is like the RBM was trained, and here we do exactly the same with the autoencoder.
Of course, nowadays you wouldn't do that; instead you would just make the whole autoencoder larger and put more layers into it, which has kind of the same effect – or is even better, because then you also fine-tune the original layers.
Okay.
And, yeah, through backpropagation, that's no problem to do that.
Okay.
So now let's come to a specific case of autoencoders, which is very influential, and they are called
variational autoencoders.
It's, by the way, new for this deep learning lecture – last semester we did not have it in – but it's definitely a must for deep learning, so it has to be in.
In traditional autoencoders, we compute a deterministic feature vector describing the attributes of the input in latent space.
For example – I don't know if you can read it – the smile attribute is 0.99, the skin tone is 0.85.
So we have a deterministic feature vector here, and the decoder then hopefully generates the input again.
Of course, it won't look quite as nice, a little bit blurry and so on, but it works.
And the key difference now in variational autoencoders is that we use a variational
approach to learn this latent representation, and so we have a probabilistic manner which
describes the observation in latent space.
And if we want to describe something in a probabilistic manner, that typically means
we have some probability distribution per latent sample, per latent variable.
So each latent attribute is modeled as a probability distribution, so typically a Gaussian, yeah,
with a mean and a covariance.
And this allows to model uncertainty in the input data.
And this is quite nice.
So here, for example, we have, okay, the smile factor of Mona Lisa is quite difficult, so
it has a quite big Gaussian shape here in contrast to the other faces where it's more
clear if they are laughing or if they are sad.
So now the question is again: how do we get these probability distributions?
In the decoding phase, we now sample from this latent space; each time we sample, we get a different value – not always the mean, but somewhere around it.
And we hope that if we draw different samples from this latent representation for the same input, we always get the same or a very similar output.
And yeah, so we represent it as a probability distribution which enforces a continuous smooth
latent space representation.
So all the similar latent space vectors should correspond somewhat to the similar reconstruction
as said.
Okay, and so the assumption is that we have a hidden variable, latent variable Z, that
generates an observation X.
So and training a variational autoencoder is basically the question of determining this
distribution of Z.
So Z here has means and variances, yeah, if you assume Gaussians.
And the problem is that computing this posterior p(z | x) is usually intractable.
Maybe I can draw this here directly.
Does that work?
Oh yeah, cool.
So, if we have something like that, we can rewrite it using Bayes' rule: p(z | x) = p(x | z) p(z) / p(x).
The numerator would not be the problem here.
But the evidence p(x) we only get by marginalizing over all z: p(x) = ∫ p(x | z) p(z) dz.
So the terms in the numerator we can basically get, but the problem is this denominator term: if we have a high-dimensional latent vector z, we would need to marginalize over the whole latent space.
And this is typically intractable.
So somehow we have to find now this probability distribution.
And the trick which now comes into play is called variational inference.
The idea is: okay, we cannot compute this term directly, but we replace it by a tractable distribution Q.
For Q, we just assume that it comes from some normal distribution, some Gaussian.
And we then only require that Q is very similar to the true posterior.
And how do you get two distributions similar?
Maybe someone knows?
Okay, so you can get them similar if you minimize the KL divergence with these two probability
distributions.
KL divergence is a measurement of how similar two distributions are.
And if you minimize that, then you basically get this form.
And this is exactly here written.
So we minimize the KL divergence between our true posterior and our approximation, our tractable distribution.
This is equivalent to maximizing the reconstruction likelihood minus another term; because the remaining KL term is greater than or equal to zero, what we train here is a lower bound – it's called the variational lower bound – and we try to maximize it.
It consists of the reconstruction likelihood and the KL divergence between our tractable distribution and the prior of z, and this we can compute.
In this way, we are forcing our tractable distribution to be similar to the prior distribution p(z).
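For reference, in standard notation the variational lower bound (ELBO) that we maximize reads (a sketch; the slide notation may differ slightly):

$$ \log p(x) \;\geq\; \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big) $$

The first term is the reconstruction likelihood, the second is the KL divergence between our tractable distribution q(z | x) and the prior p(z).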
So yeah, as I said, typically you're choosing a Gaussian distribution here.
And it boils down to estimating mean and variances.
And the cool thing is we can use neural networks to do that for us.
So we have a neural network mapping from X to our latent representation Z.
And then we have another neural network which goes from our latent space representation Z back to the reconstruction of our input, so that this gets mapped correctly.
So now we have not only the mean squared error, but also the Kullback-Leibler divergence term in it.
Written out, our loss function now has the log-likelihood term and the KL divergence term.
And now we have a problem here.
And the problem is if you practically want to implement that, you will recognize, huh,
how do I actually back propagate through a sampling process?
Because this seems to be impossible, right?
So how can I get through the sampling process of z?
So here we have a sampling process, this one here.
This is questionable, yeah?
And here comes a trick, and the trick is called reparametrization trick.
And this is a quite neat trick.
So in our form, currently what we have, we have some deterministic nodes.
We have a mean and a variance which we train by our network.
And before that, the encoder comes.
Now we have to sample through our, yeah, kind of random node here, through our distribution
to get the decoder.
But how do we train that then in practice?
Because this is a random node.
How do we do that?
And the trick is that we push the random sampling out of the backpropagation path by reparametrization.
So we reparametrize z as a function of the mean and the variance term: we multiply the standard deviation by a sample from the random process and add the mean.
In this way, we of course don't backpropagate through the random node – we cannot – but we can backpropagate normally so that the mean, sigma, z and so on get updated.
And I think this is quite cool trick.
So this reparametrization trick.
And in this way, we get a deterministic back propagation.
Also be aware: the raw network output could be negative, but sigma must not be, so in practice you compute the logarithm of sigma and then take the exponential of that again.
So if you do it in practice, there are some more tricks you have to apply.
But this is the main trick, the reparametrization trick.
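Putting the pieces together, here is a minimal VAE training step in PyTorch with the reparametrization trick and the closed-form KL term for a Gaussian posterior and a standard normal prior; the layer sizes and the binary cross entropy choice are assumptions:

```python
import torch
import torch.nn as nn

# Minimal VAE step (a sketch; layer sizes are assumptions).
enc = nn.Linear(784, 128)
to_mu, to_logvar = nn.Linear(128, 20), nn.Linear(128, 20)   # predict log(sigma^2), so sigma = exp(0.5*logvar) > 0
dec = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())

x = torch.rand(16, 784)
h = torch.relu(enc(x))
mu, logvar = to_mu(h), to_logvar(h)

# Reparametrization trick: the randomness sits in epsilon, so gradients flow to mu and sigma.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

x_hat = dec(z)
recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
# Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior.
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl
loss.backward()
```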
Yeah, and then we can have some look at the visualization.
So on the left side, this would be a classical autoencoder.
You see if we train that on MNIST, it actually works quite nicely.
So we have here some nice distributions here all around, but it doesn't cover the whole
space.
And the distributions are all very different from each other.
So the distribution of a 2 looks completely different than the distribution of a 0.
In contrast, if you now use only the KL divergence, it's only random here – because then we only have the stochastic part – and of course that's also not very meaningful.
If we instead combine both terms, the mean squared error term and the KL divergence term, we get nice distributions which are more closely related to the actual prior distribution of our MNIST data in this case.
You can also, of course, weight the KL divergence term higher if you like.
You just introduce another variable and weight this term with it.
The standard would be one, so you can weight it higher than one and then you put more weight on the KL divergence term.
This is also often done.
So we can of course also use these variational autoencoders as generative models.
Now we have a probability distribution, so we can do that.
And it's just sampling from the distribution in the latent space and reconstructed then
by the decoder.
So if you have the latent representation and the decoder, you can just sample and you get
cool or nice looking images.
And of course because we use here a diagonal prior, it enforces independent latent variables
and in this way we can encode different factors of variation.
So you can also encode, for example, a varying degree of smile: from top to bottom the degree of smile changes, and from left to right the head pose changes.
And so we have one latent variable for the head pose and another one for the smile, for example.
This is directly encoded in the latent variables and you don't do anything special: you just feed the autoencoder with different images of rotations and smiles, and it will come up with this latent representation.
And those are really neat.
Of course, if you have more latent factors, then you need a bigger latent space, so the vector needs to be bigger.
Okay so in summary, we have now, we have probabilistic models.
They allow to generate data.
Like the RBM is a probabilistic model but uses Gibbs sampling from the Markov chain.
And here in variational autoencoders we are trying to maximize the variational lower bound.
And yeah, we have those intractable densities here.
But instead of using the Gibbs sampling trick of contrastive divergence like in the RBM, we directly use backpropagation by means of the reparametrization trick.
And this is a really nice trick.
So it's a principled approach to generative models.
We have this latent space representation which could be useful for other tasks.
So for example if you want to generate meaningful features, this is a cool thing for doing that.
Of course, if you just need any features, you don't need that.
But typically, if you introduce it, the features become more meaningful and also more representative.
On the downside, we still only maximize the variational lower bound of the likelihood.
And in practice, especially in the older variational autoencoder papers, you often get quite blurry output if you sample from it – in contrast to the generative adversarial networks, which we will cover now.
Nevertheless, it's a quite active area of research, and in very recent autoencoder papers they manage to also generate sharp images.
Okay.
Let's talk about GANs.
GANs are a quite influential research direction, simply because they are very impressive.
And generative adversarial networks have a quite nice, very intuitive way of training.
In contrast to before, where we directly tried to somehow model our likelihood, here we do this indirectly.
And this is done by a discriminator and a generator.
So assume you have a discriminator, a detective, and you have a generator who makes artworks.
The generator produces some image, and the discriminator is asked: is that a fake or a real painting by an old painter?
is that a fake or a real painting of an old painter?
So we have some real data and we have a generator.
And of course, the generator and the discriminator are actually networks.
So we have a network, we have two networks here, a discriminator which says, is an input
image real or fake?
And we have a generator on the other side which gets some random noise and tries out
of, from the random noise, generate data which looks like real data to fool the discriminator.
So we have a game here, a two-player game.
So, and we have a generator and a discriminator and they play against each other.
And in this way, we basically map our noise distribution onto the distribution of our data.
And we can of course formulate this in equations.
So here we have the loss of the discriminator, which has as parameters the parameters of the discriminator and of the generator – so both networks appear here.
We want the discriminator to be good at discriminating whether an input is real or fake.
The first term is for real data, and for fake data we also want to detect it correctly.
So here is the generator: it generates a sample from the noise.
And we want the discriminator to be good, so it has to detect that this is a fake image – that is this one-minus term here.
So the discriminator is trying to distinguish real samples from fake ones.
And the generator can be trained at the same time in the same way, but just taking the negative of it.
So the generator minimizes the log probability of the discriminator being correct: the generator wants to fool the discriminator.
And yeah.
You can optionally do this in k steps, so that one player gets updated more often than the other one, but typically you just take one step each.
And an equilibrium is reached at a saddle point of the discriminator loss.
And yeah, so again here, the discriminator and the generator.
And here in this case, the loss of the generator is directly tied to the discriminator loss.
And in summary, yeah, we can summarize the game with a value function.
And this is called the minimax game.
So we have the discriminator's payoff here, this is the payoff term.
It can be put into a value function, which is nothing else than the negative of the discriminator loss.
And we play this minimax game here: we maximize over the parameters of the discriminator and minimize over the parameters of the generator.
Just remember the term minimax game – that is how GANs are trained.
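For reference, the minimax game in the standard GAN notation (a sketch; the slide notation may differ slightly):

$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] $$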
And now comes a little bit theory slide here.
What would be the optimal discriminator?
The general assumption here is that both densities are non-zero everywhere, because otherwise some input values would never be trained and we would have undetermined behavior of the discriminator.
So we derive this from the discriminator loss here: the optimal discriminator is the data density divided by the sum of the data density and the model density, where the model here is the generator, so p_model is the distribution of the generator.
If the data distribution were exactly equal to the model distribution, then this optimal discriminator would output one half everywhere: both densities in the ratio are the same, so we end up with one half.
This is, of course, a theoretical concept, and it will never be reached.
But it's still the GAN's key approximation mechanism that we have a discriminator, which
is going to be perfect.
And yeah, and we use a supervised learning here to estimate this ratio.
And of course, since it's a neural network, we are prone to underfitting and overfitting,
of course.
So now we make a little change here: if we do it like on this slide, where the generator just takes the negative of the discriminator loss, then right at the beginning the discriminator will overpower the generator, and the generator will always fail to actually generate something useful.
Instead, you now take just the logarithm of the discriminator's output when it has seen a fake sample.
Before, G minimized the log probability of D being correct – this is the old form – and now G maximizes the log probability of D being mistaken.
So we have a small shift in the interpretation, and in this way training gets more stable.
Of course, there are many more loss functions out there; I don't know how many variations of GAN loss functions exist.
And this was one of the very first improvements: just taking the logarithm of the discriminator's output here.
Okay, yeah, so especially in the beginning of training, that makes very much sense.
Yeah, but the equilibrium now is no longer describable with a single loss.
So now we have two losses.
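As an illustration, here is a minimal training step with the non-saturating generator loss, sketched in PyTorch; the architectures, sizes, and learning rates are assumptions:

```python
import torch
import torch.nn as nn

# Minimal non-saturating GAN update (a sketch; architectures and sizes are assumptions).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(16, 784) * 2 - 1            # placeholder "real" batch in [-1, 1]
z = torch.randn(16, 100)

# Discriminator step: real -> 1, fake -> 0.
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating): maximize log D(G(z)), i.e. label the fakes as "real".
fake = G(torch.randn(16, 100))
g_loss = bce(D(fake), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```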
And there is another trick, another popular loss function.
Maybe this should come at a different point in the slides, but nevertheless.
So this would be another loss: it's called perceptual loss, and there's also a feature matching loss; they are actually very similar to each other.
The idea is we have a trained network F(x) – either a separate network or an intermediate layer of D, for example.
And then we say: okay, the features of this layer for the generated output have to be very similar to the features for the target output.
This feature extractor can be just another network – very typically VGG-16 – or you take an intermediate layer of D; you can do both.
The first variant is typically called perceptual loss, and it is typically called feature matching loss if you directly take the discriminator.
It's kind of the same idea.
What typically works better is the perceptual loss, because you take a pre-trained network like VGG-16 which is already trained quite well, and then you use that.
And in this way, we prevent overfitting of G to the current discriminator.
This perceptual loss you will also find quite a lot in supervised training, not only for GANs.
You can nearly always add this term to your normal mean squared error or cross entropy term.
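A minimal sketch of a perceptual loss with a fixed pre-trained VGG-16 feature extractor in PyTorch (using torchvision's pretrained weights; the chosen layer, image sizes, and weighting are assumptions):

```python
import torch
import torch.nn as nn
import torchvision

# Perceptual loss sketch: compare intermediate VGG-16 features of the generated
# image and a reference image (layer choice is an assumption).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)                    # the feature extractor stays fixed

def perceptual_loss(generated, target):
    return nn.functional.mse_loss(vgg(generated), vgg(target))

generated = torch.rand(2, 3, 224, 224, requires_grad=True)
target = torch.rand(2, 3, 224, 224)
loss = perceptual_loss(generated, target)      # can be added to an MSE / adversarial loss
loss.backward()
```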
There is another popular loss function, the Wasserstein loss.
And here, yeah, you have this, so maybe you can look at this graph here.
So we have the density here of our real distribution.
We have the density of our fake samples.
The normal GAN discriminator would be the red line here, and we have vanishing gradients here in the regular GAN training.
Instead, we introduce a Wasserstein GAN, where the discriminator is now called a critic.
And the critic has a linear gradient here in the Wasserstein GAN.
This is done by maximizing the discrepancy between real and fake.
And you use a very simple trick: you just clip the weights to a small fixed range.
In this way, you enforce that the critic approximately lies in the space of 1-Lipschitz functions.
It's a quite neat trick, and in this way we counter vanishing gradients in the discriminator D. It would be problematic if the weights could grow arbitrarily large, because then the gradients would be unbounded.
So to prevent this, you just do hard clipping of the weights.
That's all.
It's very simple.
So programmatically, it's very, very simple.
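A minimal sketch of a WGAN critic update with hard weight clipping in PyTorch; the architecture is an assumption, and the clip value follows the small constant used in the original WGAN paper:

```python
import torch
import torch.nn as nn

# WGAN critic update with hard weight clipping (a sketch; sizes are assumptions).
critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))  # no sigmoid
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
clip_value = 0.01          # small clip range, as in the original paper

real = torch.rand(16, 784)
fake = torch.rand(16, 784)          # would come from the generator in practice

# The critic maximizes D(real) - D(fake), i.e. minimizes the negative of it.
c_loss = critic(fake).mean() - critic(real).mean()
opt_c.zero_grad(); c_loss.backward(); opt_c.step()

# Enforce (approximately) a Lipschitz constraint by clipping the weights.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-clip_value, clip_value)
```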
So there exists, of course, many more loss functions, yeah.
For example, the KL divergence, of course.
Then there is a maximum-likelihood variant, so that GANs do maximum likelihood.
But the approximation strategy actually matters more than the loss.
I would say that's not totally true, but there was an overview paper which explored very many different loss functions and showed in the end that the loss function doesn't matter so much.
Then there were, of course, counter-papers again, which said: no, look, here the Wasserstein loss is indeed better, and so on.
But be aware that it's always complicated to train GANs to some degree.
So if you're using GANs, first try the basic networks before exploring different variants or inventing your own.
It's quite engineering-driven then.
Okay.
How do you actually evaluate GANs?
That's quite tricky, because you only have these generated images.
One trick is: okay, we say that the images we are generating should be recognizable.
And what is recognizable?
A pre-trained network, which was already trained on ImageNet, should also assign a clear class to our output.
So the score distribution should be dominated by one class: if you look at the scores, there is a high peak at one class.
We cannot say which class, because then it would be supervised.
You could also introduce supervision here, but this is not typically the case.
So the image-wise class distribution should have low entropy.
So that it means that we have these high peaks here.
So and the generated images should be diverse.
So the class distribution should be kind of uniform.
So if you draw multiple times, then the class distribution should be uniform.
And in this case, the entropy should be high.
So be careful: one is the image-wise class distribution and the other is the overall class distribution.
The image-wise one is directly the output for one single image, and the second one is computed over multiple images.
And how can you compute this?
Again, the Kullback-Leibler divergence is our friend: we compute the KL divergence between the class distribution for a given image, p(y | x), and the marginal class distribution p(y).
And the higher, the better.
So that was the inception score.
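As a sketch, here is the inception score computed from the softmax outputs of a pre-trained classifier; the array shapes and the dummy data are assumptions:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an array of softmax outputs of a pre-trained
    classifier, shape (n_images, n_classes); a sketch of the formula only."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal class distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.sum(axis=1).mean()))              # higher is better

probs = np.random.dirichlet(np.ones(1000), size=500)         # dummy predictions
print(inception_score(probs))
```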
Then comes another score, called the Fréchet Inception Distance.
Here we again use an intermediate layer; typically a pre-trained Inception-v3 is used.
I don't know why they did not use VGG or another network here, but to be comparable, everybody now has to use the same network.
So just use the Inception-v3 if you do that, and you're fine.
Then you model the feature distribution with multivariate Gaussians.
And the Fréchet Inception Distance is then the score between the real images x and the generated images g.
Here we have this term: we subtract the means, add the trace of the summed covariances, and so on.
It has been shown that this is more robust to noise than the inception score.
And you don't need the class concept, so it is not really relevant to have very good class scores.
But of course the features of the Inception-v3 network should still be reasonable, otherwise it doesn't make sense.
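A sketch of the FID computation from pre-extracted Inception-v3 features; the feature dimensions and the dummy data are assumptions:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance from two arrays of Inception-v3 features,
    each of shape (n_samples, n_features); a sketch of the formula only."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)                    # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                               # drop tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

feats_real = np.random.randn(512, 64)                        # dummy feature vectors
feats_gen = np.random.randn(512, 64)
print(fid(feats_real, feats_gen))
```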
Okay.
So how does it compare to other generative models?
GANs have the ability to generate samples in parallel.
We have very few restrictions, in contrast to the Boltzmann machines: we don't have a Markov chain here, and we also have no variational lower bound like in the variational autoencoders.
And GANs are known to be asymptotically consistent, since the model families are universal function approximators.
Since we use deep neural networks, this works really well.
We can also put conditions on the GAN.
So we can say that we only want to generate images of faces, for example; then we need to somehow introduce this class concept or these conditions.
Another example would be text-to-image generation, where the image should depend on the text.
So we can introduce this, and this is also quite nice.
So the idea is we provide an additional vector y here to the network, which encodes the conditioning.
The y vector could be, for example, a one-hot encoding of the digits zero to nine, so a vector of length 10 where each entry represents one class.
So we additionally introduce our labels; we have supervised learning of the classes here.
And this enables us to generate specific classes.
We just have to provide the vector where one specific class, for example the two, has a one and all the others have zeros.
Then it will generate an image which follows this vector, i.e. a two.
In practice, this is done in the following way: we have a generator G, which receives the latent vector z, so our noise, and a conditioning vector y, and the same conditioning vector y is also given to the discriminator.
It then boils down to just concatenating them.
In that sense, we can also say that the minimax game changes: it now just uses the conditional distributions, so D(x | y), and the same for the generator, G(z | y).
So a quite simple way of introducing conditions is to just concatenate the vectors, for the generator and for the discriminator.
And the discriminator then also has to learn the class.
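A minimal sketch of this concatenation in PyTorch; all sizes are assumptions:

```python
import torch
import torch.nn as nn

# Conditional GAN by simple concatenation (a sketch; sizes are assumptions).
n_noise, n_classes, n_pixels = 100, 10, 784
G = nn.Sequential(nn.Linear(n_noise + n_classes, 256), nn.ReLU(),
                  nn.Linear(256, n_pixels), nn.Tanh())
D = nn.Sequential(nn.Linear(n_pixels + n_classes, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

labels = torch.randint(0, n_classes, (16,))
y = nn.functional.one_hot(labels, n_classes).float()   # conditioning vector, e.g. the digit class

z = torch.randn(16, n_noise)
fake = G(torch.cat([z, y], dim=1))                      # generator sees noise + condition
score = D(torch.cat([fake, y], dim=1))                  # discriminator sees image + condition
```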
You can also use other factors like smiling, gender, or age.
The generator and the discriminator then learn to operate in these modes.
And yeah, we can use this to generate faces which are specifically doing something like
smiling or have a specific gender or specific age.
And the discriminator then learns to decide whether the face contains the attribute or
not.
We can also do image to image translation.
This is also quite cool.
For example, you have images that were only taken in sunlight and you want to simulate how they would look in the dark, or vice versa.
So you can make these transformations.
And the cool thing is that you only need some training data for one direction, and you get the other direction basically for free.
Or here we have satellite images and maps, or segmentation masks, where you put in a segmentation as input and want to generate how the original image would actually look, or facades, or black-and-white to color, or, if you just have a hand-drawn handbag, you want to generate how a real handbag would look.
And this is just a conditional GAN.
The discriminator has two inputs here: for the positive examples, it gets the real image together with the contour.
And for the negative examples, the fake pairs, the generator generates – sorry, wrong term – it generates the image given the contour image.
The discriminator then gets both the contour image, the original one, and the generated image from the generator, and it still has to decide: is it a real pair or is it a fake pair?
Now, the problem with this – well, it's a really cool technique – is that you still have to have a data set with these paired conditions.
So in this case, you need a data set with the handbags and corresponding contours of the handbags; otherwise it doesn't work.
But there is a trick which is known as cycle-consistent GANs.
Assume you have a data set of horses and a data set of zebras.
Of course, you don't have the correspondence, so you don't have each horse also as a zebra.
You could draw them by hand or make them look like zebras with Photoshop, but this costs a lot of time.
So what you want is to translate horses into zebras and zebras into horses without explicitly having any of the paired mappings.
And this can be done by a cycle consistency loss.
So we couple the GAN with a trainable inverse mapping F, another generator, such that F applied to G(x) is approximately x again, and G applied to F(y) is approximately y again.
Here x comes from the set of one domain and y from the set of the other domain, so these are the two data sets.
And here maybe it's written more clearly: we have two discriminators, one here and one here, and we have two generators, one for each direction between X and Y.
And now we enforce through this cycle consistency loss that these mappings are consistent.
This can be done by taking one loss for x, which is drawn from one domain, and one for y, which is drawn from the other domain, and we enforce that this loss gets minimized.
If you look at this graph: x is a sample drawn from the set of images of one domain, so x is an element of the full set X, and this here is y, which is drawn from the big Y, from this other data set.
And the cycle consistency means you map a sample to the other domain and back through the other generator and require that the result and the original sample have to be similar to each other.
Is that clear?
Well, it's just a trick, you know.
You have now another pair of generator and discriminator and yeah, that works really,
really nice.
So for the total loss, you have the GAN loss of the first discriminator-generator pair and the GAN loss of the other pair of discriminator and generator.
And then, additionally, you have the cycle loss with a specific weight, so you weight it differently here.
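A minimal sketch of the cycle-consistency term in PyTorch; the stand-in generators, the L1 choice, and the weighting variable are assumptions:

```python
import torch
import torch.nn as nn

# Cycle-consistency term on top of the two GAN losses (a sketch; G: X->Y, F: Y->X,
# architectures are stand-ins).
G = nn.Sequential(nn.Linear(64, 64), nn.Tanh())
F = nn.Sequential(nn.Linear(64, 64), nn.Tanh())

x = torch.rand(8, 64)   # batch from domain X (e.g. horses)
y = torch.rand(8, 64)   # batch from domain Y (e.g. zebras)

l1 = nn.L1Loss()
cycle_loss = l1(F(G(x)), x) + l1(G(F(y)), y)      # X -> Y -> back to X, and vice versa

lambda_cyc = 10.0                                  # weighting factor for the cycle term (assumed value)
# total_loss = gan_loss_G + gan_loss_F + lambda_cyc * cycle_loss
```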
I don't know what this weight typically is.
I have not used it much so far; I tried it once, but it didn't work so well for my data.
But anyway, on the data sets which the authors use, it works really nicely, and it's quite often cited and used.
For example, if you have real photos and you want to make them Monet-like, or the other way around: you have a Monet painting and you want it as a photo.
This method is very good at that.
Also for zebras versus horses, it works really nicely.
And if you look at the YouTube video, you also see that this works in real time once trained; I think there is even an app doing this in real time.
Or summer to winter – you can translate that as well.
And of course, this also works with segmentation masks, so generating realistic-looking output from segmentation masks.
This would be the ground truth, and this is what comes out of our CycleGAN.
This is the GAN in the backward direction only, and this in the forward direction only – so 'GAN forward' means just one direction and 'GAN backward' the other direction – and this is with the cycle consistency introduced.
And you see that this also matches quite nicely.
Yeah, there, oh, we have 10 minutes left.
Okay, wow.
Some more tricks of the trade.
So you have already heard about label smoothing – yes, you should have.
Label smoothing means that you don't choose exactly one or zero for the targets.
Instead, for the true samples you replace the one with some value around 0.9, for example, or just with 0.9; that's sufficient.
In this way, you fight the vanishing gradient problem for the discriminator.
One remark: don't do this for the generated samples, so don't replace the zero label, because otherwise it will reinforce incorrect behavior.
The discriminator would then say, 'I'm not really so sure about that', which is not good, because then G will produce samples that resemble the data or the samples it already makes.
The benefits are quite good here: we prevent D from giving a very large gradient signal to G, and we prevent D from extrapolating to encourage extreme samples.
So this is one trick.
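A minimal sketch of one-sided label smoothing for the discriminator targets in PyTorch; the value 0.9 follows the lecture, the placeholder outputs are assumptions:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
d_real = torch.rand(16, 1)   # discriminator outputs on real samples (placeholder)
d_fake = torch.rand(16, 1)   # discriminator outputs on generated samples (placeholder)

# One-sided label smoothing: only the real targets are softened, fake targets stay at 0.
real_targets = torch.full((16, 1), 0.9)
fake_targets = torch.zeros(16, 1)
d_loss = bce(d_real, real_targets) + bce(d_fake, fake_targets)
```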
Yeah, there are many more tricks.
Okay, now first: is it important to somehow balance G and D?
If you look at the losses, they will look quite different when you train.
But it's actually not necessary, because GANs work by estimating this ratio of data and model density, and of course this ratio is only estimated correctly when D is optimal.
So it is okay if D is better.
If D is very good at saying whether a sample is real or fake, typically G will still become a very good generator.
But be careful: G's gradient may vanish if the discriminator gets too good.
And of course, you can prevent this a little bit with the non-saturating loss or with this label smoothing.
You can also prevent that the gradients get too large.
GANs also not only work with fully connected layers; they of course also work with CNNs.
Here, instead of any pooling layers, you use strided convolutions – pooling is of course not so good here.
And in the generator, you use transposed convolutions.
So: strided convolutions, transposed convolutions, remove fully connected hidden layers for deeper architectures, use ReLU in the generator and leaky ReLU in the discriminator.
And careful: the output layer of G typically uses a tanh, because it behaves better than a sigmoid function.
Tanh is between minus one and one, so your data needs to be scaled to that range.
Okay.
And batch normalization: if you use batch normalization in G, it might happen that you get very strange output; it's a form of mode collapse – you generate very similar-looking outputs within one batch, one mini-batch.
To prevent this, you can use virtual batch normalization, or there are some more tricks.
The problem with batch normalization is that the generated samples use the statistics of the mini-batch, and this is not so good.
So what you can do is use two separate batch normalizations, but much better is to use virtual batch normalization.
And if that is too expensive, just use instance normalization: for each sample, subtract the mean and divide by the standard deviation of that sample – not of the data set.
Normalizing with the data set statistics you should do anyway, always.
And the virtual batch normalization works as follows.
You create at the very beginning a reference batch R of random samples.
Then, for each x_i of the current mini-batch, you create a new virtual batch, consisting of x_i from the current mini-batch and our reference batch.
Then you compute the mean and standard deviation of this virtual batch.
And of course, we always need to forward-propagate the reference batch in addition to the current batch.
And then we normalize x_i with these statistics.
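A simplified sketch of virtual batch normalization in PyTorch, without the learnable scale and shift parameters; the shapes are assumptions:

```python
import torch

def virtual_batch_norm(x, reference_batch, eps=1e-5):
    """Normalize each example with statistics of (reference batch + that example);
    a simplified sketch without learnable scale/shift."""
    out = torch.empty_like(x)
    for i in range(x.shape[0]):
        virtual = torch.cat([reference_batch, x[i:i + 1]], dim=0)   # reference batch plus x_i
        mean = virtual.mean(dim=0)
        std = virtual.std(dim=0, unbiased=False)
        out[i] = (x[i] - mean) / (std + eps)
    return out

reference_batch = torch.randn(32, 128)    # fixed, chosen once at the start of training
x = torch.randn(16, 128)                  # current mini-batch
x_norm = virtual_batch_norm(x, reference_batch)
```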
So to be clear: if you use the same batch normalization both times, it will fail or produce bad results.
So either use two separate batch normalizations here for the generator and the discriminator – but typically this is one network, the same architecture, if you want to use shared weights; I know these two example architectures are nonsense, but you could use them – and then you have to be careful.
Switching between batch normalizations is a trick, but it's also an overhead.
You can also use tricks which are quite natural.
You can add a penalty term that punishes weights which are rather far away from their historical average.
So you compute the historical average of the parameters and add an additional term requiring this distance to be small.
And of course, the historical average of the parameters can be updated online.
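A minimal sketch of such a historical averaging penalty in PyTorch; the decay factor and the penalty weight are assumptions:

```python
import torch

# Historical averaging penalty (a sketch): keep a running average of the parameters
# and penalize the squared distance to it.
model = torch.nn.Linear(10, 1)
hist_avg = [p.detach().clone() for p in model.parameters()]
decay = 0.999                              # assumed decay factor for the running average

def historical_penalty():
    return sum(((p - avg) ** 2).sum() for p, avg in zip(model.parameters(), hist_avg))

# inside the training loop:
loss = model(torch.randn(4, 10)).mean() + 1e-3 * historical_penalty()
loss.backward()
with torch.no_grad():                      # update the running average online
    for p, avg in zip(model.parameters(), hist_avg):
        avg.mul_(decay).add_(p, alpha=1 - decay)
```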
There are other tricks from the reinforcement learning domain such as experience replay.
So you keep a replay buffer of the past generations and occasionally show this to the discriminator.
Because it might be that the discriminator is very good at discriminating the current
state of the generator, but might fail to discriminate between old generated samples
of the generator.
Or you just keep past checkpoints and occasionally swap them in for a few iterations; this is another way.
So there are more tricks of course.
Yeah, some examples.
For example, bedrooms: if you have an image data set of bedrooms, you can generate artificial bedrooms.
A quite interesting, or quite funny, example is that you can even do vector arithmetic.
You have the latent vector z which you use to generate a specific image.
Here you take three faces of men with glasses and average their three latent vectors; then you have a latent vector that kind of represents a man with glasses.
Then you do the same, for example, with men without glasses and with women without glasses.
And if you now subtract the vector of the man without glasses and add the vector of the women without glasses, you can then generate women with glasses.
Quite fun to play around with.
So in some way, we see that GANs learn a distributed representation that disentangles the concept of gender from the concept of wearing glasses.
And there's, of course, more work.
For example, the info GAN is one very early one in this domain.
Okay, I don't know how much I can cover of the advanced GAN methods.
Okay, let's talk a little bit about this problem because that's interesting.
So we have mode collapse.
That means that G rotates through the modes of the data distribution: this would be our target here, and the generator will try out different modes of the data distribution, but never converges to a fixed distribution.
It basically fails to represent all the modes of the target distribution.
A possible reason is that the order of optimization matters here: if we train the discriminator in the inner loop, it would converge to the correct distribution.
But if G is in the inner loop, we place all the mass on the most likely point.
And in practice, of course, since we are training both simultaneously, both effects can appear.
And this is not so really what we want.
One solution to that is mini-batch discrimination; another one is unrolled GANs.
So there are two ways now to address this problem.
And one way would be mini-batch discrimination.
And the intuition is here that we allow the discriminator to look at multiple samples
in combination to help G avoid the collapsing.
So you extract features from an intermediate layer and add a mini-batch layer that computes
for each of these features a similarity to all other samples of the mini-batch.
And then you just concatenate the similarity vector to each feature vector.
Yeah, it looks a little bit expensive, but you can do that.
And then you compute these mini-batch features separately for samples from G and from the
training data.
And D normally outputs 0 or 1, but now it uses the similarity to all the samples in the mini-batch as side information.
So it does not only have the features of the single sample coming out of the network, but additionally the similarity to all the samples of the mini-batch.
So this is one way to avoid mode collapse.
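A simplified sketch of the idea in PyTorch – the real mini-batch discrimination layer uses a learned projection before computing the distances, which is omitted here; shapes are assumptions:

```python
import torch

def minibatch_similarity(features):
    """Simplified mini-batch discrimination: for each sample, sum exp(-L1 distance)
    to all other samples in the mini-batch and append it as one extra feature.
    features: tensor of shape (batch, n_features)."""
    diff = features.unsqueeze(0) - features.unsqueeze(1)        # (B, B, F) pairwise differences
    l1 = diff.abs().sum(dim=2)                                  # (B, B) pairwise L1 distances
    sim = torch.exp(-l1).sum(dim=1, keepdim=True) - 1.0         # exclude self-similarity
    return torch.cat([features, sim], dim=1)                    # (B, F + 1) side information appended

feats = torch.randn(16, 64)                                     # dummy intermediate features
augmented = minibatch_similarity(feats)
```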
Another way, and after this I would say we stop, is unrolled GANs.
And in unrolling GANs, the intuition is quite simple.
So what you do is: you update the weights of the discriminator from the very first step only.
The generator, however, you update from an unrolled sequence of discriminator updates, so you basically look several steps into the future, let's say five steps, starting from the current discriminator parameters.
You collect all the gradients through these unrolled steps and update the generator with them, but the discriminator you only update from the very first step.
And I think the concentration is going down now.
But just one last picture.
If you do this unrolled GAN, you will see after 25,000 steps here, so here the last
step here, we really have our target distribution.
And that's quite cool.
Of course, these are very toy examples, yeah.
It also works on real samples, but I don't have any real examples here.
Okay, I don't know if Andreas wants to continue from there on, or if he just skips it.
I'm not sure.
So if he skips it, there is not much left anyway.
Okay, the addition of the variational autoencoder part made the talk longer than I thought.
Or I was too slow, not quite keeping the pace.
Okay, are there any questions so far?
Otherwise, we close the session.
Okay, good.
Thank you.