So the agenda is, first we'll do a short introduction, then we look into how LLMs actually work,
especially the attention mechanism.
We go into scaling laws, and then we look into the different training types of LLMs.
We look into the evaluation and after that into parallelization techniques and also how
we use LLMs on our HPC systems.
First, the introduction. Basically, a little bit more than two years ago, ChatGPT was released
and it definitely changed the world as we knew it before.
For me it took away a lot of the boring kind of work I didn't like to do, and it made me much
more productive at writing Python scripts, data processing, whatever.
I really like to use it there.
And this is the current state-of-the-art leaderboard for both commercially available and some open-source LLMs.
We see OpenAI o1, a reasoning model, which sits at the top, close to DeepSeek, the
Chinese model, which was released some weeks ago.
You will see Claude, which is also a commercial model, but also a lot of open-source models
like Grok and Llama 3, for example.
And the open-source models are the ones that matter here, because those are the models we can use on our HPC systems, right?
The closed ones we don't have access to.
So if we just look at the open source LLMs for this talk, there's also a separate leaderboard
from Hugging Face, which you can look up.
So the most popular open-source LLMs are basically the Llama and Llama 3 series and Mistral AI.
Especially for German,
I like the Mistral models.
They're quite good at German and other European languages.
And the new kid on the block, DeepSeek, which is still hard to host, but has excellent results
on those benchmarks.
So let's hop right into how these LLMs actually work.
So basically they're next word predictors.
And we start with a sequence of words. Just for clarification, I use a sequence
of words instead of tokens to make it easier to understand;
the actual LLMs produce tokens instead of words.
But if we have a sequence of, let's say, "HPC is all we", the transformer, the GPT model,
will predict probabilities for the next word, which in this case is, I think, "need", right?
"HPC is all we need."
This is basically the principle how it works.
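As a rough sketch, here is what that next-word prediction step looks like in code, assuming the Hugging Face transformers library and the small "gpt2" checkpoint purely as an illustrative stand-in, not the models from the leaderboard:

```python
# Minimal sketch of next-token prediction, assuming transformers + "gpt2" as a stand-in model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The prompt from the example; the model assigns a probability to every
# possible next token, ideally ranking something like "need" highly.
inputs = tokenizer("HPC is all we", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probabilities for the token that follows the last position of the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10s}  {prob.item():.3f}")
```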
And if we look into the details, so again, those words represent tokens.
They all have their corresponding embeddings.
For example, HPC has its meaning encoded into an embedding, which is static in this case.
Then "is" has its own embedding, "all" has its own embedding, "we" as well, and so on.
And this happens in a static way:
the embeddings are just looked up, and they contain the meaning of the single word, which
is not too helpful yet.
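A minimal sketch of that static embedding lookup, again assuming the transformers library and "gpt2" as a stand-in checkpoint:

```python
# Sketch of the static embedding lookup: every token id maps to a fixed vector,
# independent of the surrounding context at this stage.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

token_ids = tokenizer("HPC is all we", return_tensors="pt").input_ids

embedding_table = model.get_input_embeddings()   # an nn.Embedding module
static_embeddings = embedding_table(token_ids)   # simple table lookup per token

print(static_embeddings.shape)  # (1, number_of_tokens, hidden_size)
```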
So the GPT model uses tokens instead, to provide a balance between linguistic flexibility
and computational efficiency.
Words are basically split up into subword pieces, so the models can handle and understand
different languages better.
But there are still problems with numbers and also with code.
And one point here is that tokenizers matter a lot.
But for the latest models, the tokenizers have become quite good,
even for, let's say, the German language, which is globally a niche language, right?
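A small sketch of such a tokenizer at work, again with "gpt2" purely as an illustrative stand-in; the exact splits depend entirely on which tokenizer a given model uses:

```python
# Sketch of subword tokenization: long compound words (common in German) and
# numbers tend to be split into several pieces, frequent English words often stay whole.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["need", "Hochleistungsrechnen", "3.14159"]:
    pieces = tokenizer.tokenize(text)
    print(f"{text!r} -> {pieces}")
```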
Large Language Models (LLMs) are revolutionizing the way we interact with artificial intelligence, and the open-source community plays a pivotal role in driving their accessibility and innovation. This talk delves into the inner workings of LLMs, exploring their foundational mechanisms and architectures. Additionally, we examine how these models can be efficiently trained on high-performance computing (HPC) systems, leveraging state-of-the-art scaling strategies and principles derived from scaling laws. By understanding these methodologies, attendees will gain valuable insights into the challenges and opportunities of developing and deploying LLMs in diverse computational environments.