Hello, everyone.
My name is Sébine Labère.
I work at NVIDIA as a Senior Solution Architect.
I'm actually based in Munich.
Today I traveled to Erlangen to visit you.
Today I want to discuss distributed inference with you, and in particular disaggregated inference, in the age of reasoning models.
With the rise of reasoning models, inference has become more and more complicated.
It is actually very compute intensive, and I'm going to explain the key concepts and optimizations that make it possible.
Okay, so over the past decades AI has gone through multiple stages.
Ten or fifteen years ago we had the first stage, pre-training scaling: it was mostly computer vision with CNNs, where we have a large dataset, we train the model, and then we can just run inference on it.
Then we had large language models with post-training scaling: we would train a generic model that could be fine-tuned afterwards on specific tasks, so we had generic models that needed to be adapted later on.
Now we are entering the phase of reasoning, where you still have the LLM, but it goes through multiple steps of inference to reason, that is, through a thinking process.
This is much more compute intensive, and on top of that we are also moving to agentic AI, where behind one prompt there is not just one LLM call but multiple models that trigger actions and tools.
So just doing inference will require much more compute.
Why is that? As I just explained: we have larger models, now at hundreds of billions or even trillions of parameters; much longer thinking processes; contexts that are getting much larger, up to millions of tokens; and, with agents, one prompt will actually involve multiple model executions.
I'm going to cover the basics behind inference and LLM execution, because it's very important for the rest of the presentation. Maybe you already know the KV cache, but I want to explain it just to be sure.
When you do LLM inference, there are two phases.
The first one is the prefill phase, which covers everything up to the generation of the first token after your prompt.
So in the example here, the prefill phase runs until the first word is generated for the given prompt.
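As a rough illustration of the two phases, here is a minimal sketch using Hugging Face transformers. The model (GPT-2), the prompt, and the generation length are placeholder assumptions chosen only to keep the example small and runnable; they are not from the talk.

```python
# Minimal sketch of prefill vs. decode for a decoder-only LLM.
# GPT-2 stands in for any larger model; greedy decoding for simplicity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Distributed inference matters because"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Prefill phase: process the whole prompt in one pass.
    # This yields the logits for the first generated token and the
    # KV cache (past_key_values) for all prompt tokens.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode phase: generate one token at a time, feeding only the
    # newest token and reusing the KV cache instead of re-processing
    # the full sequence.
    for _ in range(20):
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

The point of the sketch is simply that the prefill step processes the whole prompt at once and fills the KV cache, while each decode step feeds only the newest token and reuses that cache.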
Abstract:
This presentation explores how distributed and disaggregated inference techniques enable scalable execution of large language models (LLMs), particularly in the context of reasoning and agentic AI. It highlights architectural optimizations such as KV caching, prefix reuse, KV-cache aware routing, and KV-cache offloading, which improve performance, reduce latency, and support efficient deployment of inference workloads at the cluster level.
Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/