Okay, so we can start with the talk by Giovanni Fantuzzi, who is professor at the Department of
Mathematics at the University. Alright, thanks. So in the interest of time I'll try to keep it
short. I'd also like it to be informal, as informal as possible, so if you have questions
during the talk please interrupt and ask. I want to start by saying that this is work mostly by
Albert, who's a PhD student here in the chair, and also with Enrique, who has given a lot of the good
ideas behind this. Albert couldn't be here, so I'm presenting his work for him today. And the
problem we look at is this, I call it exact sequence prediction with transformers. To understand
that, let me just jump to the right slide. So in a nutshell, what is the problem? We have these
machine learning models which are transformers. I consider them to be a function. We give them some
input, like a sentence, that we want to be completed and we would like this machine to
spit out a set of possible completions for this sentence, like in this case dog, cat or mathematician,
right? We're known to be lazy. So of course maybe you also want probabilities for these outputs,
but to a first approximation let me just assume that this is the task. So how do we model this
mathematically? Well let's suppose that we take every word and we encode every word through a
process called tokenization as a vector. So now I have a bunch of vectors and I want my transformer
to spit out a different bunch of vectors that somehow can then maybe be decoded into words. So
more generally, I have an input set, or I'm going to call it an input sequence, with n tokens. Every
token is a vector in R^d, and I would like to pass it through this transformer, which takes the sequence
as input and spits out a different sequence, perhaps with a different number of elements, usually fewer
in the application that we have in mind, but in the same dimension. So capital letters in my
talk will denote sequences and lowercase letters will denote vectors in R^d.
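To make the setup concrete, here is a minimal sketch. The single-head self-attention layer below is a hypothetical stand-in for the transformer, not the exact architecture from the talk; the names `transformer`, `W_q`, `W_k`, `W_v` are mine.

```python
import numpy as np

# A sketch of the setup: a "transformer" is a map taking an input
# sequence of n tokens (each a vector in R^d, stacked as rows of X)
# to an output sequence. The single-head self-attention layer below
# is a hypothetical stand-in, not the speaker's architecture.
def transformer(X, W_q, W_k, W_v):
    """Map a sequence X of shape (n, d) to a sequence of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(X.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)   # row-wise softmax
    return X + A @ V                       # residual connection

rng = np.random.default_rng(0)
n, d = 4, 8                                # 4 tokens, each in R^8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
Y = transformer(X, W_q, W_k, W_v)
print(Y.shape)                             # (4, 8)
```

Here the output has the same length as the input; in the setting of the talk the output sequence may have fewer tokens, for instance by keeping only the last few.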
Okay so this is the task. The question that we were asking is can we have exact prediction? So if I
give you pairs of input-outputs, this is like your training data, can you build a transformer that
predicts outputs from inputs exactly for all of your data? So the transformer applied to your
inputs should give exactly the output. So that's the question we are asking. The question you should
be asking is: who cares? In practice you might not want to do exact prediction, because maybe you
overfit, so what's the point? And the point is that there are implications for practical training, and
this is a mathy slide that essentially rediscovers Tikhonov regularization. So essentially if you
have a class of transformers that depends continuously on your parameters that you use in
the training, and you can solve the exact prediction problem, so there is a transformer
in that class that precisely classifies your inputs and turns them into the outputs,
then you can of course find the transformer with minimal parameter norm that does it and it turns
out that you can use that transformer in order to put an upper bound. Hopefully you see the
pointer here. So I can consider a standard training process where I have a cost function
which is the standard loss over the training set plus maybe some regularization of the parameters
with some parameter epsilon. Hopefully epsilon is not too large, and you can estimate that the
optimal transformer here gives you a cost, so a total loss, which is bounded by the parameter
norm of this precise exact transformer times epsilon. So you know that your best solution
should do better than this constant times epsilon and in fact you know that if you crank epsilon
towards zero then these optimal solutions will actually converge to a perfect classifier. So
maybe this sounds abstract but in pictures the situation looks like the following. So suppose
you start training, so you have training iterations on the horizontal axis, and you measure your loss
on your training set, and perhaps you see that after a few iterations your loss levels off,
right? Should you stop training? Well, maybe the theory tells you that the minimum loss should be
below this dashed line, that's the constant times epsilon from the computation. If you knew this,
well, you should say: no, I need to keep training. And then maybe the loss levels off again; no, you
should keep training again, and only then do you know that you have a good transformer.
Okay, so that's why we care about this perfect classification: there are practical
implications for training. So again, can we solve this perfect prediction problem? Of course the
answer depends on what is the transformer, what is the architecture. In fact this question has
Accessible via: Open Access
Duration: 00:22:21 min
Recording date: 2025-06-24
Uploaded on: 2025-06-25 07:18:10
Language: en-US