The Slurm Cookbook
Posted on 10 June 2018 in tutorials
The Slurm Cookbook is meant to provide you with a brief overview of Slurm and a few examples of how you can interact with it and obtain information about jobs and resources.
Of course, all of the information presented here can be found in the Slurm documentation, but if you are just getting started with Slurm it may not be obvious where to look for things.
Note: The Cookbook does not really cover batch scripts, because there are many other guides online which provide plenty of examples.
A three course meal
First of all, what does Slurm actually do? Slurm is a cluster management and job scheduling system for Linux clusters (an ensemble of connected computers running Linux). As such, Slurm is mainly concerned with three things:
- Allocating resources to users so that they can perform work (by resources, we mean the computers (or nodes) which form the cluster).
- Managing a queue of pending jobs. As long as there are free resources on the cluster, Slurm can simply give users the resources they need. But at some point, the requests of the various users may be incompatible. It is Slurm's job to decide what to execute when, and to try to do so in a smart way so as to use the resources efficiently.
- Handling the launch, execution, and monitoring of jobs submitted by users.
The main ingredients
Let's first get an overview of the main user commands that Slurm provides, before looking at some examples of how we can use them.
squeue: allows you to view information about jobs. In particular, you will see which jobs are pending, which are running, which are completing, and which are completed. These are the states that a job will encounter throughout its lifetime, assuming everything goes according to plan (but a job may also fail or be cancelled for several reasons).
salloc: provides a Slurm job allocation, i.e. a set of resources. It then runs a command specified by the user and, when the command is complete, frees up the job allocation.
sbatch: submits a batch script to run jobs non-interactively on the compute nodes.
srun: runs parallel jobs on allocated resources (but can also request resources directly).
sacct: displays accounting data (i.e. data on past jobs which are not on the queue anymore).
scancel: can be used to cancel jobs on the queue, or more generally to send them signals.
sinfo: gives you information about the nodes and partitions which compose the cluster, such as the number of nodes and their state.
scontrol: allows you to view and modify Slurm configuration (but it is mostly for system administrators).
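As an example of the read-only side of scontrol, the show subcommands let you inspect a single job or node in detail (a quick sketch; the job ID and node name are placeholders):
scontrol show job <jobid>
scontrol show node <nodename>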
A handy summary of these commands together with some options can also be found here.
A note on sbatch, salloc, and srun
It may be confusing at first to distinguish between these three commands. In practice, you can think of it as follows: what salloc and sbatch do is allocate resources for jobs (i.e. they tell the system "I need this many CPUs with this much memory"), whereas srun uses those resources to launch parallel jobs. The difference between sbatch and salloc is that sbatch is used to submit batch scripts which run non-interactively on the compute nodes, whereas salloc can get you the resources and allow you to run interactive commands on the nodes.
srun is generally launched within a job allocation, meaning that either salloc or sbatch has allocated the resources, and srun inherits these resources and the relevant options (which you could also override). But srun can also fend for itself: if it is called outside a job allocation, it can claim the necessary resources directly.
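To make the distinction concrete, here is a minimal sketch (the partition name, task count, and script name are placeholders, not taken from any particular cluster):
# non-interactive: write your commands in a script and hand it to sbatch
sbatch --partition=cpu --ntasks=4 myscript.sh
# interactive: salloc grabs the resources, srun then runs work inside that allocation
salloc --partition=cpu --ntasks=4
srun hostname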
Some good recipes
This part contains a few examples of how you can use some of the main user commands.
Inspecting the queue with squeue
Let's start simple: let's look at the current queue using squeue. Sometimes there may be a lot of jobs on the queue belonging to other users, which can make it difficult to check on your own jobs. You can filter the jobs that squeue displays by passing the --user parameter.
squeue --user=<myusername>
Generally squeue cuts the name of the job to a few characters, which can be annoying. It would be nicer to have it display the full job name. The --format parameter comes to the rescue:
squeue --format="%.8i %.9P %18j %10u %.8T %.12M %9N"
By default Slurm only allocates 8 characters to the job name, so in the example above, for the field j which corresponds to the job name, we can change %.8j to %18j to make sure the full name gets displayed. Removing the dot aligns the text to the left. In the documentation for squeue you can find all the fields which you can pass to --format and format them to your liking.
Of course, we don't want to have to type this every time. What we can do instead is create an alias. Let's copy the following into the .bashrc file (which is in your home directory):
alias squeue='squeue --format="%.8i %.9P %18j %10u %.8T %.12M %9N"'
Now, we can have our nicely formatted output by simply typing squeue.
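As a small aside (standard bash behaviour, nothing Slurm-specific), if you occasionally want the original, unaliased output back, you can bypass the alias for a single call by prefixing a backslash:
\squeue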
NOTE: The .bashrc script is executed every time you start an interactive shell on the cluster (i.e. when you log in), and it can be used to initialize a bunch of useful stuff. Changes you make to it will not have any effect unless you log out and log back in, or directly source it by running source ~/.bashrc.
Note that squeue also supports a --Format parameter. Generally, --Format is more readable (for the job name you would pass name instead of j) and has more fields (--format only uses single letters, so at some point the letters run out), but --format is more flexible for formatting the output (you can't specify the number of characters using --Format, as far as I know).
Here is an example using --Format:
squeue --user=<myusername> --Format=jobid,username,account,statecompact,starttime,timelimit,numcpus
Viewing system properties and state with sinfo
Let's not forget to take a careful look at our cluster, as this may influence how we decide to design or run our applications. By simply running sinfo, we can take a look at the partitions: their availability, their state, how many nodes they have, etc.
We can also take a look at the individual nodes, using the -N option. To include some more information in the output (for example the memory of each node) we can pass the -l flag (compactly, sinfo -Nl).
A very useful additional field is the number of cores on each node, which we can get by using --format and the C field.
sinfo -N --format "%.10n %.9P %.15C %.10m %.10O %.8t"
On our lab's cluster, the output looks as follows:
HOSTNAMES PARTITION CPUS(A/I/O/T) MEMORY CPU_LOAD STATE
csn-cpu1 cpu* 4/8/0/12 31996 10.44 mix
csn-cpu2 cpu* 2/10/0/12 31996 2.02 mix
csn-cpu3 cpu* 0/12/0/12 31996 0.01 idle
csn-cpu4 cpu* 1/11/0/12 31996 1.01 mix
csn-cpu5 cpu* 0/12/0/12 31996 0.01 idle
csn-cpu6 cpu* 0/40/0/40 128736 0.01 idle
csn-cpu7 cpu* 0/40/0/40 128735 0.01 idle
csn-cpu8 cpu* 0/40/0/40 128736 0.01 idle
csn-cpu9 cpu* 12/28/0/40 128736 8.50 mix
csn-gpu1 gpu 4/36/0/40 257675 0.19 mix
Here, memory is expressed in megabytes. Note that the CPUS on each node are presented in the A/I/O/T format, which stands for allocated/idle/other (offline/down)/total. You can see that the state is idle when no CPU cores are allocated, and mix when some but not all the cores are being used. The CPU_LOAD, on the other hand, is an indication of how much work we are giving to our nodes.
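If you only care about which nodes are currently free, sinfo can also filter by node state with the --states option (a quick sketch reusing the format string from above):
sinfo -N --states=idle --format "%.10n %.9P %.15C %.10m %.10O %.8t"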
We can obtain a similar result with the --Format parameter:
sinfo -N --Format=nodehost,cpusstate,cpusload,memory,statecompact
Lastly, we may want to view information only about the nodes that we are currently using. We can achieve this by first using squeue to find out where our jobs are running, creating a variable called mynodes by parsing the output of squeue, and then using it in the call to sinfo:
mynodes=$(squeue --user=pmarche1 -o "%N" | grep csn | tr "\n" ",")
sinfo -N --nodes "$mynodes" --Format=nodehost,cpusstate,cpusload,memory,statecompact
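A small variation (just a sketch, assuming your jobs are in the running state): squeue's -h (--noheader) flag combined with a state filter removes the need for the cluster-specific grep, and the resulting variable can be used in the same sinfo call as above.
mynodes=$(squeue --user=<myusername> -h -t RUNNING -o "%N" | tr "\n" ",")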
Reviewing past jobs with sacct
If we want to find out information about jobs which are not on the queue anymore (so they don't show up with squeue), we can resort to sacct.
For instance, we may be interested in finding out on which compute nodes our jobs ended up. We can specify a list of job IDs, as well as a list of fields for the --format option.
sacct --jobs=7058,7057 --format=User,JobID,account,AllocNodes,NodeList,Timelimit,elapsed,ReqMem,MaxRss,ExitCode
Here, the NodeList field will tell us on which nodes our jobs landed. --format supports many more fields, which you can list by running sacct --helpformat.
Note that by default (and if you do not specify the job IDs with --jobs), sacct only shows you jobs from the current day (since midnight). To show older jobs, you can pass the --starttime option:
sacct --starttime=<mm.dd.yy>
sacct may also not display the full job name by default. To make sure that we see the full name, we can format it by adding % and a number of characters after the field name, for instance:
sacct --user pmarche1 --format=User,JobID,JobName%20,AllocNodes,NodeList
This allocates 20 characters to the JobName field. Again, you can use aliases to avoid having to remember the full command.
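For instance, mirroring the squeue alias from earlier (just a sketch, pick whatever fields you like), you could add to your .bashrc:
alias sacct='sacct --format=User,JobID,JobName%20,AllocNodes,NodeList'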
Running interactive jobs
On some occasions we may want to run interactive jobs on one or more nodes. To do so, you can call salloc and then start an interactive session using srun.
salloc --partition <partition> srun -N1 --pty /bin/bash
In the example, we specify the partition we want to work on and the number of nodes (but we could specify other requirements for the allocation). Importantly, we have to pass the --pty option, which tells Slurm to drop us into an interactive (pseudo-terminal) session on the node we receive. Lastly, we need to give srun the actual command to run: since we want an interactive session, we want to run some kind of shell, for instance /bin/bash.
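As an example of "other requirements" (a sketch; the partition name and numbers are placeholders), we could also ask for a specific number of CPUs and amount of memory:
salloc --partition <partition> --nodes=1 --cpus-per-task=4 --mem=8G srun --pty /bin/bash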
Earlier on, I mentioned that srun can also be called without the help of sbatch or salloc, in which case it will go and fetch resources directly. As an example, we can in fact do:
srun --partition cpu --pty /bin/bash
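Similarly (again just a sketch), we can skip the interactive shell entirely and have srun claim resources and run a quick parallel command in one go:
srun --partition cpu --ntasks=4 hostname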
Killing jobs with scancel
Lastly, let's kill some jobs. The most basic way is to pass a job ID to scancel, for instance scancel 3425. But suppose you have many jobs on the queue which need to be cancelled: copy-pasting each job ID can take forever. Luckily, there are many ways in which you can constrain the scancel operation. For instance, you can kill all your jobs with
scancel --user=<myusername>
Or only jobs with a certain name
scancel --jobname=<myjobname>
You can cancel only jobs which are in a particular state (state can be PENDING, RUNNING, or SUSPENDED).
scancel --state=<state>
And of course, you can combine all these options however you want.
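For instance (a sketch; the partition name is a placeholder), the following would cancel only your pending jobs on a given partition:
scancel --user=<myusername> --state=PENDING --partition=<partition>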