Welcome to today's HPC Cafe. It's nice to see so many people here in person; that's exceptional, and I'm really happy about it. So thanks a lot for coming. There are also a lot of people in the Zoom, so whenever you have a question, put it in the Zoom chat, or just speak up, raise your hand, whatever. Let's keep this interesting. My topic today is Computer Architecture 101 for Scientists, because we know that people who use our parallel computers, or any parallel computer, are sometimes challenged or intrigued by the complexity of the hardware they're using, and sometimes they don't even know what to do. They're just using the scripts and setups they inherited from their coworkers without actually thinking about what's going on, and that leads to some problems, some of which I will detail.
So I think it's good to tell you a little bit about how our computer works, how our supercomputer works, and what the basic concepts are that you're up against when you're running parallel code on a supercomputer, without going too much into detail. Of course, I could talk about this for a week or so without pause. Bear with me; I think I can keep it below 45 minutes. All right, so that's my plan. I'll tell you something about hardware architecture; about parallel programs and how they map to the hardware; a little bit about performance, what it is, how we can assess and judge it; about limitations of parallelism, so how parallel a program can be; and some best practices that give you a guideline for what to do and how to assess the performance and scalability of your code.

So let's start simple. This slide shows the anatomy of a CPU compute cluster at a very simple level; we'll go into more detail later. The purpose of a parallel computer, of any computer for that matter, is to execute code, to execute instructions. And for you as a scientist doing numerics, it is even more specific: the purpose of a computer is to do arithmetic, usually floating-point arithmetic. And that happens in the so-called execution units of a compute core. Each processor today has a couple of compute cores, maybe two if you have a wimpy tablet PC, or maybe 36 if you are on a Fritz compute node, but it's always several cores. Each core can execute a program, maybe two if hyperthreading is enabled, but let's not go into those details.
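As a side note (a sketch added for illustration, not part of the talk): on Linux you can check what your machine reports with the commands "nproc" or "lscpu", or ask the OS directly from C. A minimal sketch, assuming a POSIX system; note that with SMT ("hyperthreading") enabled this counts hardware threads, typically twice the number of physical cores:

/* Query the number of logical CPUs the OS currently sees.
   With SMT enabled this is hardware threads, not physical cores. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical CPUs online: %ld\n", n);
    return 0;
}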
And this is the stuff that happens to make your program work, to solve your problem; it happens in the so-called execution units. These execution units do the multiplications, additions, divisions, whatever, that solve your numerical problem. And they account for a surprisingly small number of transistors: if you look at the whole processor as a bunch of transistors, then the execution units are really tiny bits of that. The rest goes into other stuff. And the other stuff comprises, for example, the L1 and L2 caches. Caches are small, fast memories, and the CPU tries to keep data that you use very often within these caches, so that if you reuse it from time to time, you can access the data more quickly.
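To make this concrete, here is a small timing sketch (an added illustration with assumed, typical sizes): it sweeps repeatedly over a 32 KiB array that fits in the L1 cache and over a 512 MiB array that only fits in main memory. Compile with something like "gcc -O3 -ffast-math" so the sum vectorizes and the loads, not the addition latency, dominate; the cache-resident case then sustains far more loads per second:

/* Temporal locality demo: repeated sweeps over an array that fits in L1
   vs. one that only fits in main memory. Sizes are typical assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double gloads_per_s(size_t n, int reps) {
    double *a = malloc(n * sizeof *a);
    if (!a) { perror("malloc"); exit(EXIT_FAILURE); }
    for (size_t i = 0; i < n; ++i) a[i] = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double s = 0.0;
    for (int r = 0; r < reps; ++r)        /* reps sweeps over the same data */
        for (size_t i = 0; i < n; ++i)
            s += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("(checksum %.0f) ", s);        /* keeps the loops from being optimized away */
    free(a);
    return (double)n * reps / dt / 1e9;   /* billions of loads per second */
}

int main(void) {
    /* 4096 doubles = 32 KiB: L1-resident; 64 Mi doubles = 512 MiB: RAM-bound */
    printf("%.2f Gload/s small\n", gloads_per_s(4096, 1 << 18));
    printf("%.2f Gload/s large\n", gloads_per_s((size_t)64 << 20, 16));
    return 0;
}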
It turns out, if you dig deeper into this, that data transfer, the delay of data access, is the most important performance-limiting factor in a computer. So if you can somehow make your program use less data, or make more use of data that is close to the CPU, like in the L1 and L2 caches, then your program tends to be faster. So data transfer is often the number one issue, and that's why we have these caches. L1 caches are typically a couple of tens of kilobytes, like 32, 48, or 64 kilobytes; L2 caches are on the order of half a megabyte to one megabyte. And that's the stuff that makes up one core, that is attached to one core: each core has registers, execution units, and nowadays an L1 and an L2 cache.
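The same point shows up with spatial locality. The following sketch (again an added illustration) sums a large matrix twice: once row by row with unit stride, so every byte of each (typically 64-byte) cache line gets used, and once column by column with a 32 KiB stride, so each cache line delivers only one element; compiled with "gcc -O2", the second traversal is typically several times slower:

/* Spatial locality demo: row-major vs. column-major traversal of a matrix
   that is far larger than any cache (128 MiB). */
#include <stdio.h>
#include <time.h>

#define N 4096
static double a[N][N];                    /* zero-initialized, 128 MiB */

static double secs(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + 1e-9 * t.tv_nsec;
}

int main(void) {
    double s = 0.0, t;

    t = secs();
    for (int i = 0; i < N; ++i)           /* row-major: unit stride, every  */
        for (int j = 0; j < N; ++j)       /* byte of a cache line is used   */
            s += a[i][j];
    printf("rows:    %.3f s\n", secs() - t);

    t = secs();
    for (int j = 0; j < N; ++j)           /* column-major: 32 KiB stride,   */
        for (int i = 0; i < N; ++i)       /* one element per line loaded    */
            s += a[i][j];
    printf("columns: %.3f s (sum %.1f)\n", secs() - t, s);
    return 0;
}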
Now, on a chip we usually have a couple of cores; as I said, two is maybe the minimum, and nowadays you can buy chips with up to 36 cores from Intel and even more from others. A bunch of those cores are put on the chip, and they usually share a common L3 cache. How much of the cache hierarchy is shared depends on the particular architecture, but usually the L3 cache is not private to each core; it's shared among a bunch of cores. And that's important, as you will see later.
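If you want to see how cores, sockets, and shared caches are laid out on your own machine, the hwloc library (and its "lstopo" command) can show you the whole topology. A minimal sketch, assuming hwloc 2.x is installed (compile with -lhwloc):

/* Print a few topology counts via hwloc (assumes hwloc >= 2.0). */
#include <stdio.h>
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    printf("sockets (packages): %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
    printf("physical cores:     %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
    printf("L3 cache domains:   %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
    printf("hardware threads:   %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
    hwloc_topology_destroy(topo);
    return 0;
}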
So let's say up to 64-ish cores nowadays on a chip. And that's the thing that you put in a socket. So what we call a socket, that's the thing that you find in the shop. So the CPU…
Access: open access
Duration: 00:47:46
Recorded: 2023-06-13
Uploaded: 2023-06-17 00:16:03
Language: en-US
Topic: Computer Architecture 101 for Scientists, and what it means for your cluster jobs
Speaker: Dr. Georg Hager, Head of Training & Support, NHR@FAU
Abstract:
Not knowing how a supercomputer works can literally cost you the better part of your precious resource allocation. We introduce the basic concepts of parallel computer architecture and how they impact the way cluster users should think about their compute jobs. Cores, sockets, caches, memory, networking, and data storage all play a role for performance; as a user, knowing your way around the supercomputer’s architecture can help you get more “bang” for your CPU/GPU time “buck.” If you are intrigued (or even intimidated) by all this “complicated” hardware stuff but always wanted to know more, this event is for you. Turns out, it’s not that complicated at all.
Material from past events is available on: https://hpc.fau.de/teaching/hpc-cafe/