Hello everyone, and welcome to this month's HPC Cafe. Please excuse that I am not enabling my video for this talk; I might have some bandwidth problems, and I hope you will be able to follow along even without video.
Okay, so today's talk is about SLURM. I will give you a quick rundown of the basics, then we will have a look at some best practices for different usage scenarios, and in the last part I will mention a few points on advanced usage as well as some general remarks and tips that could be helpful for you.

Okay, so let's get started with the basics. To preface all this: I will only give a very short overview of basic SLURM usage. If you have never used the batch system before and are not familiar with the general concepts, I would like to invite you to the beginner's introduction tomorrow at four, where I will go into a bit more detail about all the basics. This talk is mainly meant to give you an idea of the concepts and of the things that are possible with SLURM; I cannot go into detail on many of the topics, because that would simply be too long for this talk.
So if you want to know more about specific topics, there are a number of SLURM documentation sources available. First of all, of course, our own documentation. We have a more general documentation under the first link, where you can get an idea of the general concepts, and we also have cluster-specific documentation for most of our clusters. There can always be things that are specific to one cluster and will not work on the others, so do have a look at that documentation. It also contains a number of examples and example scripts that can be a good starting point for your own work.

Then there is, of course, the official SLURM documentation. It is very detailed, and there is a separate page for every command with all of its available options; you can find an overview under the first link below. So if you want to know what a specific option does, this should be the first source you look at.
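As a quick illustration, the same reference information is usually also available directly on the cluster via the man pages and the built-in help of each command (assuming the man pages are installed on the frontends, which is the case on most SLURM installations):

    man sbatch        # full description of all sbatch options
    man srun          # the same for srun
    sbatch --help     # short summary printed directly on the command line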
If you are switching to SLURM from another batch system, perhaps from PBS/Torque or any other batch system, the official SLURM documentation also has an extensive overview of the SLURM commands and their counterparts in other batch systems. So this can also be a good starting point if you want to switch over from another batch system.
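To give a rough idea of that mapping, a few of the most common equivalents look like this (a simplified sketch; the official overview is much more complete):

    # PBS/Torque                SLURM
    qsub job.sh                 sbatch job.sh
    qstat -u $USER              squeue -u $USER
    qdel <jobid>                scancel <jobid>
    qsub -I                     salloc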
There are also some official SLURM tutorials, including YouTube videos that show a bit of the basic usage, too.
Okay. So as a first point, I want to go a bit into the terminology that SLURM uses, which can be a bit different from what you might be used to. First of all, a job: a job is an allocation of resources that is assigned to a user for a specific amount of time. A partition is a set of nodes that are usually grouped by a specific property, for example specific hardware built into those nodes. A partition can also have constraints on things like job size, time limit, the users that are permitted to use it, et cetera. This is equivalent to queues in, for example, Torque.
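To see which partitions exist on a cluster and which limits they have, you can use, for example (the partition name in the job script line is only a placeholder):

    sinfo                        # list partitions, their state and time limits
    scontrol show partition      # more detailed information per partition

    # in a job script, a partition is requested like this:
    #SBATCH --partition=<name>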
The number of tasks in SLURM is the number of instances of your command that are executed; this normally corresponds to the number of MPI processes you use to run your application. A job step is a set of tasks within your job, and a job can of course contain multiple job steps, which can execute either sequentially or in parallel.
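As a small sketch of how that looks in practice (the executable names here are just placeholders), a job script with several job steps could be:

    #!/bin/bash
    #SBATCH --ntasks=8           # 8 tasks, typically 8 MPI processes
    #SBATCH --time=01:00:00

    srun ./preprocess            # first job step, uses all 8 tasks
    srun ./solver                # second job step, starts after the first one has finished

    # job steps can also run in parallel by splitting the tasks between them:
    srun --ntasks=4 ./part_a &
    srun --ntasks=4 ./part_b &
    wait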
The next thing, which we only use in some cases, is the quality of service, or QOS. A QOS can limit things like wall time, the number of GPUs you are allowed to use, or the number of jobs you are allowed to run simultaneously, on a per-group basis. For some partitions or some types of jobs you might need to specify a specific QOS, and you normally have to be enabled to do that.
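You can list the configured QOS definitions and their limits yourself; a minimal sketch (the QOS name in the job script line is only a placeholder, and whether you may use it depends on your account):

    sacctmgr show qos            # list the available QOS and their limits

    # in a job script, a QOS is requested like this:
    #SBATCH --qos=<name>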
GRES, in SLURM speak, stands for generic resources, and on our clusters this always means GPUs. A CPU in SLURM terminology is equivalent to a hyperthread if the node is configured with SMT enabled; otherwise it is equivalent to one core inside the node.
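A GPU job would therefore request its GPUs as a generic resource. A minimal sketch (the partition name and the executable are placeholders; whether a QOS is also needed depends on the cluster):

    #!/bin/bash
    #SBATCH --partition=gpu       # placeholder, see the cluster-specific documentation
    #SBATCH --gres=gpu:1          # request one GPU as a generic resource
    #SBATCH --cpus-per-task=4     # these "CPUs" are hyperthreads if SMT is enabled
    #SBATCH --time=00:30:00

    srun ./my_gpu_application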
Okay, so with all of this out of the way, let's look at the ways to get a job allocation. There are two to three different ones. So when you issue the sbatch command from a frontend
Presenter: Katrin Nusser
Duration: 01:04:21 min
Recording date: 2022-04-12
Language: en-US