Slurm
Slurm is the software used on the cluster to launch and manage jobs. After connecting to pbil-deb (the submission node), you can use Slurm commands to submit programs, scripts or pipelines to be run on the computing nodes.
Launching a job
Suppose you have a bash script script.sh that you want to run on the cluster. You can submit it to the execution queue with sbatch:
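sbatch script.sh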
In most cases, however, you will need to add arguments specifying your job's resource requirements and how it should be run.
Partitions
The cluster computing nodes are split into several partitions.
Partition | Notes |
---|---|
normal (default) | Default partition |
interactive | Max time of 4 hours |
bigmem | For jobs with large RAM requirements |
long | Max time of 720 hours instead of 168 |
gpu | Nodes with a GPU |
The partition on which to run the job is given by the --partition argument:
# Run on default partition (normal)
sbatch script.sh
# Run on a partition with high execution time (long)
sbatch --time=240:00:00 --partition=long script.sh
# Run on a partition with large RAM
sbatch --mem=128G --partition=bigmem script.sh
Warning
If you want to run a job with an execution time of more than a week, please talk about it with the cluster admins first. This is primarily to check that no maintenance is planned that would interrupt your computations.
Memory and computing requirements
Main arguments:
- --time: maximum job running time (specified as HH:MM:SS or days-HH:MM:SS)
- --nodes: (default 1) number of nodes to use (i.e. number of machines)
- --ntasks: (default 1) total number of processes to run
- --cpus-per-task: (default 1) number of CPUs per task
- --mem: amount of required total RAM per node (the job is cancelled if it uses more)
- --mem-per-cpu: amount of required RAM per CPU
- --gpus: number of required GPUs
- --nodelist: ask for one or several specific nodes to run the job
- --constraint: ask for nodes with specific features
To run a job specifically on one or several nodes, you can use:
# Launch a job specifically on pbil-deb33
sbatch --nodelist=pbil-deb33 script.sh
# Launch a job spanning pbil-deb33, pbil-deb34 and pbil-deb38
# (--nodelist requests all listed nodes, not one of them)
sbatch --nodes=3 --nodelist=pbil-deb33,pbil-deb34,pbil-deb38 script.sh
Constraints allow you to ask for nodes with specific features, for example:
# Launch a job on a node with AMD CPUs
sbatch --constraint=amd script.sh
# Launch a job with an avx2 enabled CPU
sbatch --constraint=avx2 script.sh
# Launch a job with an A30 or A40 GPU
sbatch --constraint="a30|a40" script.sh
Tip
- For --constraint it is possible to use | (or) and & (and) to specify complex requirements.
- To see the list of available constraints, you can run sinfo -o "%n: %f".
Execution parameters
Main arguments:
- --output: send the job standard output to a specific file
- --error: send the job standard error to a specific file
- --mail-type: send an email to --mail-user for these events; can be a combination of BEGIN, END and FAIL
- --mail-user: email address to which event messages are sent
- --job-name: specify a job name to be displayed in squeue
Examples
Launch a job for a maximum of 10 days, with 8 CPUs split across 4 tasks and 8 GB of RAM:
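sbatch --time=10-00:00:00 --ntasks=4 --cpus-per-task=2 --mem=8G script.sh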
Launch a job and be notified by mail when it starts, ends, and if it fails:
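sbatch --mail-type=BEGIN,END,FAIL --mail-user=mymail@univ-lyon1.fr script.sh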
Launch a job on an A30 or A40 GPU and send standard output and error to specific files:
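sbatch --gpus=1 --constraint="a30|a40" --output=slurm_output.log --error=slurm_error.log script.sh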
Slurm script
Instead of specifying arguments on the command line, you can create a Slurm script: a standard shell script with the requirements specified at the beginning.
Here is an example Slurm script which defines some requirements and runs a Python script with uv:
#!/bin/bash
#SBATCH --job-name=wonderful_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=0-12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=mymail@univ-lyon1.fr
#SBATCH --output=slurm_output.log
#SBATCH --error=slurm_error.log
uv run script.py
If you save it in myjob.slurm, you can run it with:
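sbatch myjob.slurm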
Open an interactive session
It is possible to open an interactive session on a computing node and run commands directly on it.
This is done by running the sinter command. All arguments available for sbatch are also available for sinter.
Warning
By default, sinter sessions are opened only for an hour. It is possible to increase this time up to 4 hours with --time.
# Launch an interactive session with the default requirements (1 hour, 1 CPU...)
sinter
# Launch an interactive session for 2 hours on a node with a GPU
sinter --time=02:00:00 --gpus=1
Monitoring jobs
squeue allows you to get information about pending or running jobs.
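For example, to list all jobs in the queue, or only your own:
# List all pending and running jobs
squeue
# List only your own jobs
squeue -u $USER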
Tip
In the squeue output, the ST column gives the job status: R is "running", PD is "pending".
You can also use the --format and --sort arguments to customize squeue output. For example, a command like the following lists each job's user, priority, state, end time, time limit and partition, sorted by decreasing priority:
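squeue --format="%u %Q %T %e %l %P" --sort=-p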
To get more information about a specific job, get its job ID with squeue and run:
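scontrol show job <jobid>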
Tip
If your job is in "pending" state, the StartTime field of this scontrol output gives the estimated starting time.
There is also a web interface for the cluster jobs queue.
Cancelling jobs
Use scancel
to cancel a submitted job:
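# Cancel the job with the given job ID
scancel <jobid>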