Hello everyone, and welcome to the HPC Cafe.
I hope you got your cup of coffee and your cake here
locally. Welcome as well to everyone in the Zoom call. Our topic today is assessing the efficiency
of your jobs with Cluster Cockpit. Of course, it's a shameless plug to advertise our Cluster
Cockpit tool.
However, we're well aware that not everybody is savvy in interpreting all the data
that Cluster Cockpit gives you.
So my goal now, in the first couple of minutes, is to show
you some of the patterns that we see when we assess the performance of parallel codes
on clusters.
What could happen, so to speak.
And then later Jan will show you how to identify these
patterns in the graphs and the data that Cluster Cockpit gives you.
Okay, so that's the goal for
today. If there are any questions, you can speak up at any time; we can keep it interactive. You
can also put questions in the chat, and we can do a question-and-answer session at the end. Now, if I
could get the next slide. Okay, so what happens sometimes, and some of you may have noticed this,
is that you get an email from our support saying:
"Dear user,
many of your jobs are not properly
utilizing the allocated GPU resources" and there are some links to the monitoring,
which is
a link to the Cluster Cockpit website.
"Please optimize your code to ensure efficient usage.
You can check it yourself when logging into our monitoring system Cluster Cockpit at
monitor-reconn.hiv.ud" and so on and so on.
So this is one of the typical patterns some of our support
staff have noticed: you're allocating a GPU, but you're not using it.
So this is why you get
this email. That's a very typical pattern. There are a couple of others, and the text is
usually a little bit different then, but you get the drift. Okay, so let me show you a list of
these patterns and then some details about what those patterns mean.
So first of all, the two main
reasons for bad performance, and that also means the stuff that is easily fixed, so to speak, because
they are what I call operational issues, are: you allocate too many nodes and you're not
using all of them, or you allocate resources that you're not going to use, or you
oversubscribe the resources. That is, you don't allocate enough resources for the stuff
you're running, so that, for example, on a compute node more processes or threads are running than
there are
hyper-threads or cores available. Although in the latter case the batch
system, or whatever you use to bind your threads, will usually warn you about that, it can still happen
that you oversubscribe.
The first case is by far the most common:
people allocate
resources that they're not using, for example allocating 10 nodes but using only one core because
you accidentally call the wrong binary; stuff like that. That's what I mean by easy to fix.
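As a rough illustration of both operational issues, a quick look at a node you have allocated already tells you a lot. This is just a sketch using standard Linux tools (nproc and /proc/loadavg), not something Cluster Cockpit requires; the thresholds are rules of thumb:

```shell
# Sanity check on a compute node: compare the 1-minute load average
# against the number of logical cores.
cores=$(nproc)                        # logical CPUs, including hyper-threads
load=$(cut -d ' ' -f 1 /proc/loadavg) # 1-minute load average
echo "cores=$cores  1-min load=$load"
# Load far above the core count hints at oversubscription;
# load far below it on an exclusively allocated node hints at idle,
# over-allocated resources.
```

The same comparison is what the monitoring does for you continuously, across all nodes of a job.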
Now, the other group of bad-performance reasons are ones that are not so easy to fix, because
they have some inherent problems, what I call workflow or code issues.
So something that is
inherent to the way you start your program, maybe utilizing too many nodes for the workload, which
Access: Open Access
Duration: 01:05:40 min
Recording date: 2025-10-14
Uploaded on: 2025-11-06 21:00:06
Language: en-US
Would it not be great if you could observe by yourself how your jobs are doing on the cluster instead of waiting for that e-mail from the admins telling you that you are not making good use of the resources? We have good news: it’s possible! You can monitor your past and running jobs in real time, assessing important things such as load imbalance, flop/s performance, communication, I/O, and more. And if you don’t know what all of this means, that is an even better reason to attend our HPC Cafe this week!