Okay, so we can start with the talk by Giovanni Fantuzzi, who is professor at the Department of
Mathematics at the University. Alright, thanks. So in the interest of time I'll try to keep it
short. I'd also like it to be informal, as informal as possible, so if you have questions
during the talk please interrupt and ask. I want to start by saying that this is work mostly by
Albert, who's a PhD student here in the chair, and also with Enrique, who has given a lot of the good
ideas behind this. Albert couldn't be here, so I'm presenting his work for him today. And the
problem we look at is this, I call it exact sequence prediction with transformers. To understand
that, let me just jump to the right slide. So in a nutshell, what is the problem? We have these
machine learning models which are transformers. I consider them to be a function. We give them some
input, like a sentence, that we want to be completed and we would like this machine to
spit out a set of possible completions for this sentence, like in this case dog, cat or mathematician,
right? We're known to be lazy. So of course maybe you also want probabilities for these outputs,
but to a first approximation let me just assume that this is the task. So how do we model this
mathematically? Well let's suppose that we take every word and we encode every word through a
process called tokenization as a vector. So now I have a bunch of vectors and I want my transformer
to spit out a different bunch of vectors that somehow can then maybe be decoded into words. So
more generally, I have an input set, or I'm going to call it an input sequence, with n tokens. Every
token is a vector in R^d, and I would like to pass it through this transformer, which takes the sequence
as input and spits out a different sequence, perhaps with a different number of elements, usually fewer
in the application that we have in mind, but in the same dimension. So capital letters in my
talk will denote sequences and lowercase letters will denote vectors in R^d.
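To make the setup concrete, here is a minimal sketch. The single-head self-attention layer below is a hypothetical stand-in for the transformer, not the exact architecture from the talk; the names `transformer`, `W_q`, `W_k`, `W_v` are mine.

```python
import numpy as np

# A sketch of the setup: a "transformer" is a map taking an input
# sequence of n tokens (each a vector in R^d, stacked as rows of X)
# to an output sequence. The single-head self-attention layer below
# is a hypothetical stand-in, not the speaker's architecture.
def transformer(X, W_q, W_k, W_v):
    """Map a sequence X of shape (n, d) to a sequence of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(X.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)   # row-wise softmax
    return X + A @ V                       # residual connection

rng = np.random.default_rng(0)
n, d = 4, 8                                # 4 tokens, each in R^8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
Y = transformer(X, W_q, W_k, W_v)
print(Y.shape)                             # (4, 8)
```

Here the output has the same length as the input; in the setting of the talk the output sequence may have fewer tokens, for instance by keeping only the last few.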
Okay so this is the task. The question that we were asking is can we have exact prediction? So if I
give you pairs of input-outputs, this is like your training data, can you build a transformer that
predicts outputs from inputs exactly for all of your data? So the transformer applied to your
inputs should give exactly the output. So that's the question we are asking. The question you should
be asking is: who cares? In practice you might not want to do exact prediction, because maybe you
overfit, so what's the point? And the point is that there are implications for practical training, and
this is a mathy slide that essentially rediscovers Tikhonov regularization. So essentially if you
have a class of transformers that depends continuously on your parameters that you use in
the training, and you can solve the exact prediction problem, so there is a transformer
in that class that precisely classifies your inputs and turns them into the outputs,
then you can of course find the transformer with minimal parameter norm that does it and it turns
out that you can use that transformer in order to put an upper bound. Hopefully you see the
pointer here. So I can consider a standard training process where I have a cost function
which is the standard loss over the training set plus maybe some regularization of the parameters
with some parameter epsilon. Hopefully epsilon is not too large, and you can estimate that the
optimal transformer here gives you a cost, so a total loss, which is bounded by the parameter
norm of this precise exact transformer times epsilon. So you know that your best solution
should do better than this constant times epsilon and in fact you know that if you crank epsilon
towards zero then these optimal solutions will actually converge to a perfect classifier. So
maybe this sounds abstract but in pictures the situation looks like the following. So suppose
you start training, so you have training iterations on the horizontal axis, and you measure your loss
on your training set, and perhaps you see that after a few iterations your loss levels off,
right? Should you stop training? Well, maybe the theory tells you that the minimum loss should be
below this dashed line, that's the constant times epsilon from the computation. If you knew this,
well, you should say: no, I need to keep training. And then maybe the loss levels off again; no, you
should keep training again, and only then do you know that you have a good transformer.
Okay, so that's why we care about this perfect classification: there are practical
implications for training. So again, can we solve this perfect prediction problem? Of course the
answer depends on what is the transformer, what is the architecture. In fact this question has
Accessible via: Open Access
Duration: 00:22:21 min
Recording date: 2025-06-24
Uploaded on: 2025-06-25 07:18:10
Language: en-US