Hello everyone, and welcome to the HPC Cafe.
I hope you got your cup of coffee and your cake here
locally. Welcome as well to everyone in the Zoom call. Our topic today is assessing the efficiency
of your jobs with Cluster Cockpit. Of course, it's a shameless plug to advertise our Cluster
Cockpit tool.
However, we're well aware that not everybody is savvy in interpreting all the data
that Cluster Cockpit gives you.
So my goal now, in the first couple of minutes, is to show
you some of the patterns that we see when we assess the performance of parallel codes
on clusters.
What could happen, so to speak.
And then later Jan will show you how to identify these
patterns in the graphs and the data that Cluster Cockpit gives you.
Okay, so that's the goal for
today. If there are any questions, you can speak up at any time; we can keep it interactive. You
can also put questions in the chat, and we can do a question-and-answer session at the end. Now, if I
could get the next slide. Okay, so what happens sometimes, and some of you may have noticed this,
is that you get an email from our support saying:
"Dear user,
many of your jobs are not properly
utilizing the allocated GPU resources" and there are some links to the monitoring,
which is
a link to the Cluster Cockpit website.
"Please optimize your code to ensure efficient usage.
You can check it yourself when logging into our monitoring system Cluster Cockpit at
monitor-reconn.hiv.ud" and so on and so on.
So this is one of the typical patterns some of our support
staff have noticed: you're allocating a GPU, but you're not using it.
So this is why you get
this email. That's a very typical pattern. There are a couple of others, and the text is
usually a little bit different then, but you get the drift. Okay, so let me show you a list of
these patterns and then some details about what those patterns mean.
So first of all, the two main
reasons for bad performance, and that also means the stuff that is easily fixed, so to speak, because
they are what I call operational issues, are: you allocate too many nodes and you're not
using all of them, or you allocate resources that you're not going to use, or you
oversubscribe the resources. That is, you don't allocate enough resources for the stuff
you're running, so that, for example, on a compute node more processes or threads are running than
there are
hyper-threads or cores available. Although in the latter case the batch
system, or whatever you use to bind your threads, will usually warn you about that, it can still happen
that you oversubscribe.
The first case is by far the most common:
people allocate
resources that they're not using, for example allocating 10 nodes but using only one core because
you accidentally call the wrong binary; stuff like that. That's what I mean by easy to fix.
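As a rough illustration of both operational issues, a quick look at a node you have allocated already tells you a lot. This is just a sketch using standard Linux tools (nproc and /proc/loadavg), not something Cluster Cockpit requires; the thresholds are rules of thumb:

```shell
# Sanity check on a compute node: compare the 1-minute load average
# against the number of logical cores.
cores=$(nproc)                        # logical CPUs, including hyper-threads
load=$(cut -d ' ' -f 1 /proc/loadavg) # 1-minute load average
echo "cores=$cores  1-min load=$load"
# Load far above the core count hints at oversubscription;
# load far below it on an exclusively allocated node hints at idle,
# over-allocated resources.
```

The same comparison is what the monitoring does for you continuously, across all nodes of a job.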
Now, the other group of bad-performance reasons are ones that are not so easy to fix, because
they have some inherent problems, what I call workflow or code issues.
So something that is
inherent to the way you start your program, maybe utilizing too many nodes for the workload, which
Access: Open Access
Duration: 01:05:40 min
Recording date: 2025-10-14
Uploaded on: 2025-11-06 21:00:06
Language: en-US
Would it not be great if you could observe by yourself how your jobs are doing on the cluster instead of waiting for that e-mail from the admins telling you that you are not making good use of the resources? We have good news: it’s possible! You can monitor your past and running jobs in real time, assessing important things such as load imbalance, flop/s performance, communication, I/O, and more. And if you don’t know what all of this means, that is an even better reason to attend our HPC Cafe this week!