I think I can start now. So today is our final lecture, and we finally get to a topic that bears directly on the key mystery of deep neural networks, namely their generalization. As I told you, the generalization puzzle is probably the most prominent open problem here. It was posed by Leo Breiman at least as early as 1995, and after roughly 30 years we still have not been able to solve it. Now, in this course we did not start by looking at the generalization problem directly. However, after all these phenomena, all this preparation, it is time to get into it, and we can set aside the existing frameworks that people usually use to talk about generalization theory and instead ask what kind of description of generalization we can build on the solid phenomena, the solid understanding, we have established for this system. Yes, it is our last lecture, and condensation is still the key phenomenon that inspires these new ways of understanding generalization.
Okay, first: once we observe the condensation phenomenon, we know there are conditions under which condensation occurs, right? Then a natural question arises. Since I know that by tuning the scale of initialization I can put the network in a regime with stronger or weaker condensation, we can ask: what is the real generalization advantage of condensation? In what kind of situation does condensation actually bring a benefit for generalization? When we think about this problem, we always need to keep the no free lunch theorem in mind. It tells you that you should not expect a theoretical argument saying condensation is always good; there is no way to arrive at a conclusion like that. Therefore, all we can hope to say is for what kind of problem, for what type of problem, condensation really helps. That is the question we should really ask.
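As a concrete illustration of what "tuning the scale of initialization" means in practice, here is a minimal sketch of my own (not code from the lecture): a two-layer ReLU network whose initial parameters are multiplied by a factor scale, together with a simple cosine-similarity diagnostic for condensation. In the phase-diagram picture discussed earlier in the course, a very small scale tends to push training toward the condensed regime, while the standard scale stays closer to the linear, kernel-like regime; the particular scale values below are only assumptions for illustration.

```python
import torch
import torch.nn as nn

def two_layer_relu(d_in=1, width=1000, scale=1.0):
    """Two-layer ReLU network with all initial parameters shrunk by `scale`."""
    net = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))
    with torch.no_grad():
        for p in net.parameters():
            p.mul_(scale)  # smaller scale -> expected to favor condensation
    return net

def orientation_cosines(net):
    """Pairwise cosine similarities of the hidden neurons' (weight, bias) vectors.

    After training, many entries close to +1 or -1 indicate that neurons have
    condensed onto a few shared directions.
    """
    W = torch.cat([net[0].weight, net[0].bias[:, None]], dim=1)
    W = W / W.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return W @ W.T

condensed_candidate = two_layer_relu(scale=1e-3)  # small initialization
linear_candidate = two_layer_relu(scale=1.0)      # standard initialization
```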
Okay, then there are some basic observations we can make. For example, we know that condensation helps us avoid a certain type of overfitting, namely fitting the training data with a highly oscillatory pattern, because such a fit would require too many groups of neurons, far more than the minimum number of groups you would ever need to interpolate the function. This kind of phenomenon is easy to observe.
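To see why the number of groups, rather than the raw width, is what matters, here is a small numpy sketch (again my own illustration, not the lecturer's code): when many hidden neurons of a two-layer ReLU network share the same input weight direction, the whole group acts as one effective neuron, so a strongly condensed wide network has far too few effective units to produce a highly oscillatory fit.

```python
import numpy as np

# If all m hidden neurons share the same input weight direction w ("condensation"),
# the network  f(x) = sum_k a_k * relu(w_k . x)  collapses to a single effective neuron.
rng = np.random.default_rng(0)
d, m = 5, 100                         # input dimension, hidden width
w = rng.normal(size=d)                # one condensed input weight direction
W = np.tile(w, (m, 1))                # all m neurons aligned with w
a = rng.normal(size=m) / m            # output-layer weights

X = rng.normal(size=(10, d))          # a few test inputs
relu = lambda z: np.maximum(z, 0.0)

f_wide = relu(X @ W.T) @ a            # the width-100 network
f_eff = a.sum() * relu(X @ w)         # one effective neuron

print(np.allclose(f_wide, f_eff))     # True: effective complexity is a single group
```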
Okay, and what more can we say about condensation? If we just think about this kind of relatively smooth interpolation of the training data, the frequency principle is already enough to explain this non-overfitting behavior. So in what sense does condensation give us additional benefits for generalization? That is the question we really want to understand. But before we go into the details, I also want to tell you what kind of generalization problems are actually relevant to large language models. There is a very important paper called Scaling Laws for Neural Language Models, by OpenAI in 2020. How many of you have ever read this paper? Anyone? Then I would say you are
lagging behind. This paper is the most important paper in AI in the sense that it really tells us there are three factors that are key to the performance of a large language model: the model size, the data size, and the computation used for training. Why is this paper so important? Because these neural networks are largely a black box, right? There is no theory that really predicts which factors matter most for their performance; we do not have that kind of theoretical argument, or really any solid understanding. Therefore, the only thing we can rely on is very careful experiments. In this work, OpenAI compared many different factors and tuned lots of different settings in order to find out which ones matter most for the performance of large language models, and what came out were these three factors. So what is the relation between these three factors and the performance? Take the training data size in particular: if I increase the training data size, putting it on the x-axis with the test loss on the y-axis, we can fit a power-law relation. In log-log scale, the test loss versus the data size looks very much like a power law.
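To make the power law concrete: the paper writes this relation in roughly the form L(D) = (D_c / D)^alpha_D, which in log-log coordinates is a straight line with slope -alpha_D. Below is a minimal Python sketch with synthetic numbers, not OpenAI's actual measurements; the particular values of alpha_D and D_c are only assumptions of the reported order of magnitude, and the point is simply that the exponent can be read off by a linear fit in log-log space.

```python
import numpy as np

# A power law  L(D) = (D_c / D) ** alpha_D  is a straight line of slope
# -alpha_D in log-log coordinates, so the exponent is recovered by a
# linear fit of log L against log D.
alpha_D, D_c = 0.095, 5.0e13          # assumed values for illustration
D = np.logspace(7, 10, 20)            # hypothetical training-set sizes
L = (D_c / D) ** alpha_D
L *= np.exp(0.01 * np.random.default_rng(0).normal(size=D.size))  # mild noise

slope, _ = np.polyfit(np.log(D), np.log(L), 1)
print(f"fitted exponent alpha_D ~ {-slope:.3f}")  # close to the assumed 0.095
```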
Access: open
Duration: 01:52:28 min
Recording date: 2025-05-08
Uploaded: 2025-05-09 03:29:05
Language: en-US
FAU MoD Course: Towards a Mathematical Foundation of Deep Learning: From Phenomena to Theory
Session 4: From Condensation to Loss Landscape Analysis
Speaker: Prof. Dr. Yaoyu Zhang
Affiliation: Institute of Natural Sciences & School of Mathematical Sciences, Shanghai Jiao Tong University
1. Mysteries of Deep Learning
2. Frequency Principle/Spectral Bias
3. Condensation Phenomenon
4. From Condensation to Loss Landscape Analysis
5. From Condensation to Generalization Theory