Hi, my name is Anna and I'm going to talk about GROMACS usage and performance on CPUs.
My background is in molecular science, which is a mixture of chemistry, biology and pharmacy.
And once I discovered that we can do chemistry on the computer, I was hooked and joined a
bioinformatics group for my PhD on protein dynamics. I have been responsible for the GROMACS
benchmarks for the past five years. I'm going to talk about the GROMACS benchmarks on various
systems, and at the end I will show you that advanced knowledge of the program can lead to
remarkable performance speed-ups. Let's start with the benchmarks that are used. It is a set of six
benchmarks. The first one is from materials science, and it has a very high output rate,
so it writes a lot to disk. We cannot really use its numbers reliably, because that is far
more output than GROMACS users normally need. Number three, a protein inside a membrane, was
the benchmark for one of our systems during procurement. Number four is currently the default
benchmark that we use to evaluate new hardware. Number six is the STMV benchmark; anyone familiar
with benchmarking these simulation programs will know it, since NAMD, Amber, and GROMACS all use
STMV for benchmarking, so it is also included in my benchmark set. Next we come to the CPU systems
on which I ran the benchmarks. And apart from giving
you the names, I'm going to quickly show you how I... Yes? Just one thing: why is the first
benchmark not reliable? I mean, the output is md.log, right, and we can see all the outputs in...
Yes. The frequency of the output is not comparable to the output frequency we normally use in
GROMACS. They needed output roughly every two femtoseconds to calculate something for their
fluids, but for MD simulations you normally don't need femtosecond-resolution output. That's why
I included it: to see what effect it has on the different hardware. But it's nothing we can
really compare the other benchmark systems to.
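For context, output frequency in GROMACS is controlled by .mdp parameters. A minimal sketch of an unusually write-heavy setup like the one described, contrasted with a more typical one; the values are illustrative, not the benchmark's actual settings:

```
; Illustrative .mdp fragment: very frequent output (every step at dt = 0.002 ps,
; i.e. every 2 fs). Values are examples, not the benchmark's actual settings.
dt       = 0.002   ; 2 fs time step
nstxout  = 1       ; write coordinates every step (very I/O heavy)
nstlog   = 1       ; write energies to md.log every step
; a typical production run would use something like nstxout = 5000 (every 10 ps)
```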
Yeah. I'm going to show you how I installed it and how I used it, which commands I ran for
benchmarking. At NHR@FAU, we have several CPU systems available. There are the Ice Lake
CPUs built into Woody and Fritz, and we also have Sapphire Rapids in Fritz. In our test
cluster we have AMD CPUs; we benchmarked Genoa, Genoa-X, and Bergamo. And we also have an ARM
system, the Grace system from NVIDIA, which I wanted to try and see what performance we can get out
of the ARM systems. The installation on CPU is quite simple. I always use Spack, because it
takes care of all the dependencies and the different libraries we need. For the AMD CPUs, I
followed the recommendations on the AMD website, building with the AMD FFT libraries and
everything else that is specific to AMD. For the ARM CPUs, I tried different settings with real
MPI and thread-MPI and with different libraries. The performance results I'm going to show are
the best I got out of these variations; the optimum varies, especially for the ARM CPUs.
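The Spack-based installation mentioned earlier can be sketched as follows; the version and MPI-provider specs are illustrative, since the exact specs used in the talk are not given:

```shell
# Hedged sketch of a Spack-based GROMACS install; version and MPI provider
# are illustrative placeholders, not the exact specs from the talk.
spack install gromacs@2024 +mpi ^openmpi
spack load gromacs@2024
```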
And we're going to see on the next slide, where I show how I ran the benchmarks, that I sweep
over the OpenMP and MPI threads. GROMACS uses hybrid parallelization: I increment the number of
OpenMP threads and MPI ranks, and the best combination goes into the table that I'm going to
show you on the next slide. Yes, we can use hyperthreading with GROMACS. It can be beneficial,
but it has to be tested; it depends on the system.
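The thread sweep just described can be sketched as a small shell loop. The mdrun flags (-ntmpi, -ntomp, -resethway, -maxh, -g) are real GROMACS options, but the core count, .tpr file name, and log naming are illustrative, and the commands are echoed here rather than executed:

```shell
# Sketch of the OpenMP x thread-MPI sweep; echoes the command lines instead of
# running them. 72 cores and benchMEM.tpr are illustrative placeholders.
total=72
for ntomp in 1 2 4 8; do
  ntmpi=$((total / ntomp))
  echo "gmx mdrun -s benchMEM.tpr -ntmpi $ntmpi -ntomp $ntomp -resethway -maxh 0.1 -g bench_${ntmpi}x${ntomp}.log"
done
```

The product of ranks and threads stays constant, so each run uses the whole node and only the decomposition changes.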
It depends on the hardware, too. For non-Intel CPUs, I removed the rank reordering when
initializing the domain decomposition; it somehow slows things down. I also gave GROMACS the
CPU topology so that it doesn't have to figure out where the cores are. The main difference
between real MPI and thread-MPI is that thread-MPI is an internal MPI implementation inside
GROMACS; it is not a real MPI, which is why I call the other one real MPI: there we use mpirun
to start the ranks and then run GROMACS in MPI mode. Let's come to the results. All of these
numbers depend on the simulation system.
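The two launch modes mentioned a moment ago differ only in how the ranks are created; a hedged sketch, with rank and thread counts as illustrative placeholders:

```shell
# thread-MPI build: GROMACS spawns its internal MPI ranks itself
gmx mdrun -s topol.tpr -ntmpi 8 -ntomp 4 -pin on

# "real" MPI build: mpirun starts the ranks, then GROMACS runs in MPI mode
mpirun -np 8 gmx_mpi mdrun -s topol.tpr -ntomp 4 -pin on
```

-pin on asks mdrun to pin threads to cores, which relates to the topology point above.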
The larger the system, the lower the performance numbers. What we can see, when we compare
these systems, is that the performance increases with the number of cores: the Ice Lake node
in Woody has 32 cores and the Ice Lake node in Fritz has 72 cores. To first order, performance
on CPUs scales with CPU frequency times the number of CPU cores. In order to compare the two
medium-sized benchmarks with each other, I generated the following graph. Please note that
the y-axes are not the same.
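The first-order scaling rule mentioned above (performance ≈ CPU frequency × core count) can be turned into a quick estimate. Only the core counts (32 vs. 72) come from the talk; the equal 2.4 GHz frequency is an assumed placeholder:

```shell
# Naive model: throughput ~ frequency * cores. Comparing the two Ice Lake
# nodes at an assumed equal frequency, cores alone predict the ratio.
freq_mhz=2400; woody_cores=32; fritz_cores=72
ratio_pct=$(( freq_mhz * fritz_cores * 100 / (freq_mhz * woody_cores) ))
echo "$ratio_pct"   # -> 225, i.e. a predicted 2.25x difference
```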
On the left for system three, it goes to 200 nanoseconds per day. For system four, it's
Access: open access
Duration: 00:29:00 min
Recording date: 2024-07-09
Uploaded: 2024-07-12 14:26:16
Language: en-US
Speaker: Dr. Anna Kahler, NHR@FAU
Slides: https://hpc.fau.de/files/2024/07/2024-07-09_NHR@FAU_HPC-Cafe_Gromacs-Benchmarks.pdf
Abstract:
The first patch for GROMACS 2024 was released at the end of February, and recently, several cutting-edge CPU and GPU architectures have been integrated into the NHR@FAU test cluster. After benchmarking the usual set of six simulation systems on these new architectures, we are excited to share the results with NHR@FAU users, as they reveal some fascinating insights. Additionally, we will present three support cases that demonstrate how individual adjustments can significantly optimize performance.
Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/