Yeah, first, some of these problems were only treated in a hurry, so today let's go back to this kind of dynamics.
I think, first of all, it's a very easy dynamics, because you can see it's a linear ODE; it's just written in the frequency domain. So let's look at this P here. This P refers to an empirical distribution, which in general is a summation of deltas, because we only sample at discrete points. Let's first think about the case where this P is not there, where there is no empirical distribution. Then this H is just the Fourier transform of the function we care about, the neural network function, and this F, also with no P attached, is the Fourier transform of the target function.
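(As a hedged sketch of the notation, with a normalization that is my own assumption: the empirical distribution is
$$P(x) \;=\; \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i),$$
and multiplying a function $g$ by $P$ before taking the Fourier transform keeps only its values at the samples,
$$\widehat{gP}(\xi) \;=\; \frac{1}{n}\sum_{i=1}^{n} g(x_i)\,e^{-\mathrm{i}\,\xi\, x_i}.$$
In particular, the quantities H_P and F_P that appear below depend on h and f only through their values at $x_1,\dots,x_n$.)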
neural network function. And then it tells you, okay, the evolution of the neural network function
in the Fourier space, it's kind of actually, or the dynamics of this neural network function
is decoupled in the Fourier space. You can see each of these Fourier modes, for example,
given certain case, you can see their evolution are kind of independent. That they have no
coupling between different frequencies. Then what happens is that each of these terms, for example,
H, initially probably it's a kind of zero function. Therefore, all these frequencies are zero. But
later, since there's this term, arrow term that drives this, the Fourier transform of the H to
the Fourier transform of this target function. And later, all these components in the Fourier
domain will converge to the corresponding kind of Fourier components in the target.
So and all independently. It's just, yeah, please.
H and H_P are different; that is just notation. If I put this small P here, it means the function multiplied by this P. The P tells you that only the information at the empirical samples is observed; the deltas mark that fact. So without this P, you can see it is a very simple dynamics: each component in the Fourier domain converges to the corresponding component of the target, and the rate of convergence is determined by this pre-factor, which decays in a power-law manner in frequency. That means every component converges to the corresponding Fourier component of the target; it is just that the lower frequencies converge faster and the higher frequencies converge more slowly.
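As a hedged sketch of what such a decoupled dynamics looks like (the exact pre-factor depends on the activation and the setup, so the power-law form below is only an illustrative assumption): without the P, each mode would obey something like
$$\partial_t \hat h(\xi, t) \;=\; -\,\gamma(\xi)\,\big(\hat h(\xi,t) - \hat f(\xi)\big), \qquad \gamma(\xi)\sim \frac{C}{|\xi|^{\alpha}}\ \ (\alpha > 0),$$
so that, starting from $\hat h(\xi,0) = 0$,
$$\hat h(\xi,t) \;=\; \hat f(\xi)\,\big(1 - e^{-\gamma(\xi)\,t}\big),$$
and the low-frequency components, with the larger $\gamma(\xi)$, reach the target exponentially faster than the high-frequency ones.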
However, what is really interesting is not that case. It is the case where we only have observations at finitely many samples. That means everything we can observe about the target is its values at the training samples; nowhere else do we know anything about it. And in this error term, we only care whether the neural network output, evaluated at these training samples, matches the target there. So you see, both H_P and F_P only carry the information of these functions at a finite set of points. And with this kind of driving term, what we can see is that there could be some aliasing effect.
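In the same illustrative notation as above (again an assumption about the exact form), the sampled dynamics is driven only by the residuals at the training points,
$$\partial_t \hat h(\xi, t) \;=\; -\,\gamma(\xi)\,\widehat{(h-f)P}(\xi, t) \;=\; -\,\gamma(\xi)\,\frac{1}{n}\sum_{i=1}^{n}\big(h(x_i,t) - f(x_i)\big)\,e^{-\mathrm{i}\,\xi\, x_i},$$
so any two functions that agree at the samples produce exactly the same driving term.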
For example, suppose the target f is a high-frequency oscillation, but we only observe it at these sparse points. Then aliasing can happen, in the sense that h need not equal f: even if h is this yellow line, you still have H_P equal to F_P, because at the training samples the yellow line and the target align perfectly, and therefore the error term becomes zero. So aliasing means that if your data are not dense enough, a lower-frequency Fourier mode can align exactly with a higher-frequency one at the samples. There is then no way to distinguish which of the two it is; both of them are able to explain the observations on this sparse grid.
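A minimal numerical sketch of this aliasing (the grid size, the frequencies, and the helper name sampled_fourier are illustrative choices, not taken from the slides):

```python
import numpy as np

def sampled_fourier(g_vals, xs, xis):
    """Fourier transform of g*P with P = (1/n) * sum_i delta(x - x_i):
    it uses only the values of g at the sample points xs."""
    n = len(xs)
    return np.array([np.sum(g_vals * np.exp(-1j * xi * xs)) / n for xi in xis])

n = 8                                  # number of sparse training samples
xs = np.arange(n) / n                  # uniform grid on [0, 1)
k_low, k_high = 2, 2 + n               # these two frequencies alias on this grid

f = lambda x: np.sin(2 * np.pi * k_high * x)   # high-frequency target
h = lambda x: np.sin(2 * np.pi * k_low * x)    # low-frequency "yellow line"

# The two functions are different, yet they coincide at every sample point ...
assert np.allclose(f(xs), h(xs))

# ... so their sampled Fourier transforms agree at every frequency,
# and the error term, built from h - f at the samples, is identically zero.
xis = 2 * np.pi * np.arange(-20, 21)
assert np.allclose(sampled_fourier(h(xs), xs, xis),
                   sampled_fourier(f(xs), xs, xis))
assert np.allclose(sampled_fourier(h(xs) - f(xs), xs, xis), 0.0)
print(f"frequencies {k_low} and {k_high} are indistinguishable on the {n}-point grid")
```

On an n-point uniform grid, the frequencies k and k + n take identical values at every sample, which is exactly the ambiguity the dynamics has to resolve.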
Then we ask: since we can use lower-frequency modes to explain the training data, and we can also use higher, or much higher, frequency modes to explain them, what kind of solution will the neural network learn? Now, you see, this dynamics tells you that the lower-frequency part is prioritized. That means, although the higher-frequency modes
FAU MoD Course: Towards a Mathematical Foundation of Deep Learning: From Phenomena to Theory
Session 3: Condensation Phenomenon
Speaker: Prof. Dr. Yaoyu Zhang
Affiliation: Institute of Natural Sciences & School of Mathematical Sciences, Shanghai Jiao Tong University
1. Mysteries of Deep Learning
2. Frequency Principle/Spectral Bias
3. Condensation Phenomenon
4. From Condensation to Loss Landscape Analysis
5. From Condensation to Generalization Theory