HPC Café on July 09, 2024: GROMACS 2024 usage and performance on modern CPUs and GPUs

Hi, my name is Anna, and I'm going to talk about GROMACS usage and performance on CPUs.

My background is in molecular science, which is a mixture of chemistry, biology, and pharmacy. Once I discovered that we can do chemistry on the computer, I was hooked; I joined a bioinformatics group for my PhD on protein dynamics and have been responsible for the GROMACS benchmarks for the past five years. I'm going to talk about the GROMACS benchmarks on various systems, and at the end I will show you that advanced program knowledge can lead to remarkable performance speed-ups.

Let's start with the benchmarks that are used. It is a set of six benchmarks. The first one is from materials science and has a very high output rate, so it writes a lot to disk. We cannot really use its numbers reliably, because that is far more output than GROMACS users normally need. Number three, a protein inside a membrane, served as a benchmark for one of our systems during procurement. Number four is currently the default benchmark that we use to evaluate new hardware. Number six is the STMV benchmark; anyone familiar with benchmarking these simulation programs will know that NAMD, Amber, and GROMACS all use the STMV benchmark, so it is included in my benchmark set as well.

Next we come to the CPU systems on which I ran the benchmarks. Apart from giving you the names, I'm going to quickly show you how I... yes?

Question: Just one thing. Why is the first benchmark not reliable? I mean, the output is md.log, right? And we can see all the outputs in...

Answer: Yes, but the frequency of the output is not comparable to the output frequency we normally use in GROMACS. The authors needed this output every two femtoseconds or so to calculate something for their fluids, but for typical MD simulations you don't need output at femtosecond resolution. That's why I included it, to see what effect such output has on the different hardware, but it's nothing we can really compare the other benchmark systems to.
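For context, here is a minimal sketch of the GROMACS .mdp output settings that control this; the values are illustrative placeholders, not the actual settings of benchmark one:

    ; illustrative .mdp fragment, assuming a 2 fs time step
    dt                 = 0.002   ; 2 fs per step
    nstxout            = 1       ; coordinates every step, i.e. every 2 fs (the extreme case)
    ; a typical production run writes far less often, e.g.:
    ; nstxout-compressed = 5000  ; compressed coordinates every 10 ps
    ; nstlog             = 1000  ; log output every 2 ps

With nstxout = 1, a run spends a substantial share of its time on disk I/O, so the resulting numbers say at least as much about the file system as about the CPU.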

Now I'm going to show you how I installed GROMACS and which commands I ran for benchmarking. At NHR@FAU, we have several CPU systems available: there are Ice Lake CPUs built into Woody and Fritz, and we also have Sapphire Rapids CPUs in Fritz. In our test cluster we have AMD CPUs, where we benchmarked Genoa, Genoa-X, and Bergamo, and we also have an ARM system, the Grace system from NVIDIA, which I wanted to try to see what performance we can get out of ARM. The installation on CPU is quite simple. I always use Spack, because it takes care of all the dependencies we need, the different libraries.
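As a sketch, such a Spack-based installation can look like this; the exact spec and variants are my assumption, not necessarily what was used for these benchmarks:

    # install GROMACS 2024 with MPI support; Spack resolves FFTW, MPI, etc.
    spack install gromacs@2024 +mpi
    # make the gmx binaries available in the current shell
    spack load gromacs@2024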

For the AMD CPUs, I followed the recommendations on the AMD website and built with the AMD FFT libraries and everything else that is specific to AMD. For the ARM CPUs, I tried different settings with real MPI and thread-MPI and with different libraries; the performance results I'm going to show you are the best results I got out of those combinations. We will see on the next slide, where I show you how I ran the benchmarks, that I scan over the OpenMP and MPI threads. This is the hybrid parallelization technique that GROMACS uses: I increment the OpenMP threads and the MPI ranks, and the best combination goes into the table that I'm going to show you on the next slide.
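A minimal sketch of such a scan for a thread-MPI build; the .tpr name, step count, and the rank/thread grid are placeholders, and the core count of 72 is just an example node size:

    # scan thread-MPI ranks (-ntmpi) against OpenMP threads per rank (-ntomp)
    for ntmpi in 2 4 8 12 18 36; do
      for ntomp in 1 2 4; do
        [ $((ntmpi * ntomp)) -le 72 ] || continue   # stay within the node's cores
        gmx mdrun -s benchmark.tpr -ntmpi $ntmpi -ntomp $ntomp \
                  -nsteps 10000 -resethway -noconfout \
                  -g bench_${ntmpi}x${ntomp}.log
      done
    done
    # the ns/day figure at the end of each .log file goes into the table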

We can use hyperthreading in GROMACS. It can be beneficial, but it has to be tested; it depends on the simulation system and on the hardware. For non-Intel CPUs, I disabled the rank reordering that happens when the domain decomposition is initialized, because it somehow slows things down. I also pinned GROMACS to the CPU topology, so that it doesn't have to figure out where the cores are.
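A sketch of what those two adjustments can look like; GMX_NO_CART_REORDER is a documented GROMACS environment variable for disabling rank reordering during domain-decomposition setup, but that it is exactly the mechanism used here is my assumption:

    # disable rank reordering when the domain decomposition is initialized
    # (assumption: this matches the adjustment described in the talk)
    export GMX_NO_CART_REORDER=1
    # pin threads to cores so GROMACS does not have to detect the topology itself
    gmx mdrun -s benchmark.tpr -ntmpi 8 -ntomp 9 -pin on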

The main difference between the real-MPI and the thread-MPI version is that thread-MPI is an MPI implementation internal to GROMACS; it is not a real MPI library, which is why I call the other one real MPI. There, we use mpirun to start the ranks and then run GROMACS in MPI mode.
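The difference shows up directly in how a run is launched; a sketch with placeholder rank and thread counts:

    # thread-MPI build: one gmx process spawns its ranks internally
    gmx mdrun -s benchmark.tpr -ntmpi 8 -ntomp 9

    # real-MPI build: mpirun starts the ranks, gmx_mpi runs in MPI mode
    mpirun -np 8 gmx_mpi mdrun -s benchmark.tpr -ntomp 9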

The larger the system, the lower the performance numbers. What we can see is that when we compare

these and these systems, that the performance increases with the number of cores. The Ice Lake

on Woody has 32 cores and the Ice Lake on Fritz has 72 cores. The scaling on CPUs is

frequency. CPU frequency times the CPU cores. In order to compare the two medium-sized benchmarks
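As a rough worked example of that rule of thumb (the clock frequencies here are hypothetical placeholders, not the actual specifications of the Woody and Fritz nodes):

    expected throughput ∝ clock frequency × core count
    32 cores × 2.9 GHz ≈  93 core-GHz
    72 cores × 2.4 GHz ≈ 173 core-GHz   (≈ 1.9× the first estimate)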

In order to compare the two medium-sized benchmarks with each other, I generated the following graph. Please note that the y-axes are not the same: on the left, for system three, it goes up to 200 nanoseconds per day; for system four, it's...

Part of a chapter: HPC Café

Accessible via: Open access

Duration: 00:29:00 min

Recording date: 2024-07-09

Uploaded on: 2024-07-12 14:26:16

Language: en-US

Speaker: Dr. Anna Kahler, NHR@FAU

Slides: https://hpc.fau.de/files/2024/07/2024-07-09_NHR@FAU_HPC-Cafe_Gromacs-Benchmarks.pdf

Abstract:

The first patch for GROMACS 2024 was released at the end of February, and recently, several cutting-edge CPU and GPU architectures have been integrated into the NHR@FAU test cluster. After benchmarking the usual set of six simulation systems on these new architectures, we are excited to share the results with NHR@FAU users, as they reveal some fascinating insights. Additionally, we will present three support cases that demonstrate how individual adjustments can significantly optimize performance.

Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/
