105 - HPC Café: Inference in the Age of Reasoning Models [ID:61350]

Hello, everyone.

My name is Séverine Habert.

I work at NVIDIA as a Senior Solution Architect.

I'm actually based in Munich.

Today we traveled to Erlangen to visit you.

Today I want to discuss distributed inference with you, and particularly inference in the age of reasoning models.

So with the rise of reasoning models, inference has become more and more complicated.

It is actually very compute intensive, and I'm going to explain the key concepts and optimizations that make it possible.

Okay, so over the past decades AI has gone through multiple stages.

10 to 15 years ago there was the first stage, pre-training: it was more like computer vision with CNNs, where we had large datasets, we trained the model, and then we could just do inference on it.

Then we had large language models with post-training scaling.

We would train generic models that could be fine-tuned afterwards on specific tasks, so we had generic models that needed to be adapted later on.

Now we are entering the phase of reasoning, where you still have an LLM, but it goes through multiple steps of inference to reason, that is, through a thinking process.

So this is now much more compute intensive, and on top of that we are also going agentic, where behind one request from a prompt we have not just one LLM call but multiple, which will trigger actions and tool use; a minimal sketch of such a loop follows below.

So it's getting to the point where just doing inference requires much more compute.
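To make that concrete, here is a minimal, hypothetical sketch of an agentic loop in Python. The function names (call_llm, run_tool) and the message format are illustrative stubs, not any real framework's API; the point is only that a single user prompt fans out into several model executions plus tool calls.

```python
# Hypothetical agentic loop: call_llm and run_tool are stubs, not a real API.
# One user prompt triggers several LLM inferences plus tool executions.

def call_llm(messages):
    # Stub standing in for a chat-completion request to an LLM endpoint.
    # Here it asks for one tool call, then answers once a tool result exists.
    if not any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_call": {"name": "search", "arguments": "KV cache size"}}
    return {"role": "assistant",
            "content": "Answer composed from the tool result.",
            "tool_call": None}

def run_tool(name, arguments):
    # Stub standing in for an actual tool (web search, calculator, ...).
    return f"[result of {name}({arguments!r})]"

def answer(prompt, max_steps=5):
    messages = [{"role": "user", "content": prompt}]
    llm_calls = 0
    for _ in range(max_steps):
        reply = call_llm(messages)        # each step is one full LLM inference
        llm_calls += 1
        messages.append(reply)
        if reply["tool_call"] is None:    # no tool requested: final answer
            return reply["content"], llm_calls
        result = run_tool(**reply["tool_call"])
        messages.append({"role": "tool", "content": result})
    return messages[-1]["content"], llm_calls

print(answer("How large is the KV cache for a 1M-token context?"))
# -> the single prompt above caused two LLM calls and one tool call
```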

So why is that? Because, as I just explained, the models are larger: we are now at hundreds of billions or even trillions of parameters, with much longer thinking processes.

The contexts are also getting much larger and now go into the millions of tokens, and, as I just explained, with agents one prompt will actually involve multiple model executions.

So I'm going to cover what really happens underneath during inference and LLM execution, because it's very important for the rest of the presentation; maybe you already know about the KV cache, but I want to explain it just to be sure.

When you do LLM inference, there are two phases.

The first one is the prefill phase, which is everything up to the generation of the first token after the prompt.

So here, that would be the generation of the first word, given that this is my prompt.
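As a hedged illustration of these two phases, here is a minimal sketch using the Hugging Face transformers library with a small model ("gpt2" is chosen purely for illustration, not something from the talk). The prefill pass processes the whole prompt at once and builds the KV cache; the second phase, commonly called decode or generation, then produces one token per step while reusing that cache.

```python
# Minimal sketch of prefill vs. decode with Hugging Face transformers.
# "gpt2" is only an illustrative stand-in for a much larger model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Distributed inference matters because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt builds the KV cache
    # and yields the logits for the first generated token.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    # Decode: one token per step; only the new token is fed in, while the
    # cached keys/values stand in for everything processed so far.
    for _ in range(20):
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(prompt + tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

Reusing past_key_values is what the KV cache buys you: each decode step only attends over cached keys and values instead of re-processing the entire prompt.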

Part of a video series:
Part of a chapter:
HPC Café

Accessible via

Open Access

Duration

00:41:17 min

Recording date

2025-11-11

Uploaded on

2026-01-20 17:10:37

Language

en-US

Speaker: Dr. Séverine Habert, NVIDIA

Slides

Abstract:
This presentation explores how distributed and disaggregated inference techniques enable scalable execution of large language models (LLMs), particularly in the context of reasoning and agentic AI. It highlights architectural optimizations such as KV caching, prefix reuse, KV-cache-aware routing, and KV-cache offloading, which improve performance, reduce latency, and support efficient cluster-level deployment of inference workloads.


Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/