30 - HPC-Cafe 2022-04-12: SLURM - Basics, Best Practices, Advanced Usage [ID:41306]

Hello everyone and welcome to this month's HPC Cafe. Please excuse that I am not enabling

my video for this talk, but I might have some bandwidth problems, and I hope that you will

be able to follow along even without video.

Okay, so today's talk is about SLURM. I will give you a quick rundown of the basics and

then we'll have a look at some best practices for different usage scenarios. And in the

last part, I will give you a few points on advanced usage and also some general remarks

and tips that could be helpful for you. Okay, so let's get started with the basics. First

of all, to preface all this, I will only give a very short overview of basic SLURM

usage. If you have never before used the batch system and are not familiar with the general

concepts, then I would like to invite you to the beginner's introduction tomorrow. It's

tomorrow at four and there I will go into a bit more detail about all the basics. So

first of all, this talk will also be just to give you an idea of all the concepts and

all the things that are possible with SLURM. And I cannot go into detail for many of the

topics because this would just be too long for this talk. So if you want to know more

about specific topics, there are a number of SLURM documentations available. First of

all, of course, our own documentation. Here we have a more general documentation under

the first link, where you can get an idea of the general concepts. And we also have cluster-specific

documentation for most of our clusters. So there can always be some things that are specific

to one cluster which will not work on the other clusters, so please have a look at this

documentation. There are also a number of examples and example scripts that can be a

good starting point for your own work. Of course, there's also the official SLURM documentation.

And this is very detailed: there is a separate page for every command, listing all of its

available options. You can find an overview under the first link below.

So if you want to know what a specific option does, this should be the first source that

you look at. If you are switching to SLURM from another batch system, so perhaps from

PBS/Torque or any other batch system, the official SLURM documentation also has an extensive

overview of the SLURM commands and of their counterparts in other batch systems. So this can

also be a good starting point if you want to switch over from another batch system. There are

also some official SLURM tutorials, including YouTube videos which show a bit of the basic usage too.
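
For example, the most common counterparts look roughly like this (just an illustration; the official overview covers the full mapping and all the options):

    qsub job.sh     ->  sbatch job.sh
    qstat -u $USER  ->  squeue -u $USER
    qdel <jobid>    ->  scancel <jobid>
    pbsnodes        ->  sinfo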

Okay, so as a first point, I want to go a bit into the terminology that SLURM uses, which can be a bit different

than you might be used to. So first of all, a job. A job is an allocation of resources

that are assigned to a user for a specific amount of time. A partition. A partition is

a set of nodes which are usually grouped by a specific property. For example, a specific

hardware that is built into those nodes. It can also have constraints on things like job

size, time limit, the users that are permitted to use it, et cetera. This is equivalent to

queues in, for example, Torque. The number of tasks in SLURM

is the number of instances of your command that are executed. This normally corresponds

to the number of MPI processes you are using to run your application. A job step is a set

of tasks within your job. And a job can of course contain multiple job steps that can

either be executed sequentially or in parallel.
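
As a small illustration (this is just a sketch, not one of our documented example scripts), a batch script with two job steps that run one after the other, each using the job's four tasks, could look like this:

    #!/bin/bash
    #SBATCH --ntasks=4        # four tasks, i.e. typically four MPI processes
    #SBATCH --time=00:30:00   # requested wall time

    srun ./preprocess         # first job step, runs with the allocated tasks
    srun ./solver             # second job step, starts after the first one has finished

If you put an & at the end of the srun lines and add a wait at the end of the script, the job steps run in parallel instead, provided each step only asks for a share of the allocation (for example srun --ntasks=2).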

The next thing, which we only use in some cases, is the quality of service (QOS). This can limit things like

wall time, the number of GPUs you are allowed to use, the jobs that you're allowed to run

simultaneously on a per group basis. So for some partitions or for some types of jobs,

you might need to specify a specific quality of service and you normally have to be enabled

to do that. GRES, in SLURM speak, stands for generic resources, and on our clusters

this always means GPUs. A CPU in SLURM terminology is equivalent to a hyperthread if the node

is configured with SMT enabled. Otherwise, it's just equivalent to one core inside the node.
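
To tie the terminology together, here is a minimal sketch of a job script; the partition and QOS names are made up and will differ between our clusters, so please check the cluster-specific documentation for the real names:

    #!/bin/bash
    #SBATCH --partition=a100       # partition: a set of nodes, here a hypothetical GPU partition
    #SBATCH --gres=gpu:1           # GRES: request one GPU (generic resource)
    #SBATCH --ntasks=1             # one task, i.e. one process
    #SBATCH --cpus-per-task=16     # "CPUs" are hyperthreads if SMT is enabled, cores otherwise
    #SBATCH --time=01:00:00        # requested wall time for the job allocation
    ##SBATCH --qos=bigjob          # QOS: only needed in some cases, and you must be enabled for it

    srun ./my_application          # one job step running the single task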

Okay, so with all of this out of the way, let's look at ways to get a job allocation. There are two or three different ones. So when you issue the sbatch command from a frontend

Part of the video series: HPC Café

Presenter: Katrin Nusser

Accessible via: Open access

Duration: 01:04:21 min

Recording date: 2022-04-12

Uploaded on: 2022-04-12 21:06:04

Language: en-US
