Slurm

Slurm is the software used on the cluster to launch and manage jobs. After connecting to pbil-deb (the submission node), you can use Slurm commands to submit programs, scripts or pipelines to be run on some of the computing nodes.

Launching a job

Suppose you have a bash script script.sh that you want to run on the cluster. You can submit it to the execution queue with sbatch:

sbatch script.sh

In practice, you will usually add arguments to specify your job's resource requirements and how it is run.
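
For example, a submission requesting a one-hour time limit and 4G of RAM might look like this (the values are illustrative; the arguments are detailed in the sections below):

# Request a 1 hour time limit and 4G of RAM (illustrative values)
sbatch --time=01:00:00 --mem=4G script.sh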

Partitions

The cluster computing nodes are split into several partitions.

Partition         Notes
normal (default)  Default partition
interactive       Max time of 4 hours
bigmem            For jobs with large RAM requirements
long              Max time of 720 hours instead of 168
gpu               Nodes with a GPU

The partition on which to run the job is given by the --partition argument:

# Run on default partition (normal)
sbatch script.sh

# Run on a partition with high execution time (long)
sbatch --time=240:00:00 --partition=long script.sh

# Run on a partition with large RAM
sbatch --mem=128G --partition=bigmem script.sh

Warning

If you want to run a job with an execution time of more than a week, please talk about it with the cluster admins first. This is primarily to check that no maintenance is planned that would interrupt your computations.

Memory and computing requirements

Main arguments:

  • --time: maximum job running time, specified as HH:MM:SS or days-HH:MM:SS (see the example after this list)
  • --nodes: (default 1) number of nodes to use (i.e. number of machines)
  • --ntasks: (default 1) total number of processes to run
  • --cpus-per-task: (default 1) number of CPUs per task
  • --mem: total amount of required RAM per node (the job is cancelled if it uses more)
  • --mem-per-cpu: amount of required RAM per CPU
  • --gpus: number of required GPUs
  • --nodelist: ask for specific nodes to run the job
  • --constraint: ask for nodes with a specific feature
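
For instance, the two --time formats look like this (the durations are illustrative):

# 12 hours, in HH:MM:SS format
sbatch --time=12:00:00 script.sh

# 2 days, in days-HH:MM:SS format
sbatch --time=2-00:00:00 script.sh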

To run a job specifically on one or several nodes, you can use:

# Launch a job specifically on pbil-deb33
sbatch --nodelist=pbil-deb33 script.sh

# Launch a job on pbil-deb33, pbil-deb34 and pbil-deb38
sbatch --nodelist=pbil-deb33,pbil-deb34,pbil-deb38 script.sh

Constraints allow you to request nodes with specific features, for example:

# Launch a job on a node with AMD CPUs
sbatch --constraint=amd script.sh

# Launch a job with an avx2 enabled CPU
sbatch --constraint=avx2 script.sh

# Launch a job with an A30 or A40 GPU
sbatch --constraint="a30|a40" script.sh

Tip

  • For --constraint, it is possible to combine features with | (or) and & (and) to specify complex requirements.
  • To see the list of available constraints, you can run sinfo -o "%n: %f" (see the example output below).
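
This prints each node together with its features; the node names and feature lists below are illustrative:

sinfo -o "%n: %f"

# Example (illustrative) output:
# HOSTNAMES: AVAIL_FEATURES
# pbil-deb33: amd,avx2
# pbil-deb38: intel,avx2,a30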

Execution parameters

Main arguments:

  • --output: send job standard output to specific file
  • --error: send job standard error to specific file
  • --mail-type: send email to --mail-user for these events; can be a combination of BEGIN, END and FAIL
  • --mail-user: email address to which event messages are sent
  • --job-name: specify a job name to be displayed in squeue

Examples

Launch a job for a maximum of 10 days, with 4 tasks of 2 CPUs each (8 CPUs in total) and 8G of RAM:

sbatch --partition=long --time=10-00:00:00 --ntasks=4 --cpus-per-task=2 --mem=8G script.sh

Launch a job and be notified by mail when it starts, ends, and if it fails:

sbatch --mail-type=BEGIN,END,FAIL --mail-user=foobar@example.com script.sh

Launch a job on an A30 or A40 GPU and send standard output and error to specific files:

sbatch --gpus=1 --constraint="a30|a40" --output="out.txt" --error="err.txt" script.sh

Slurm script

Instead of specifying arguments on the command line, you can create a Slurm script: a standard shell script with the requirements specified in #SBATCH directives at the beginning.

Here is an example slurm script which defines some requirements and runs a Python script with uv:

#!/bin/bash

#SBATCH --job-name=wonderful_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=0-12:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=mymail@univ-lyon1.fr
#SBATCH --output=slurm_output.log
#SBATCH --error=slurm_error.log

uv run script.py

If you save it as myjob.slurm, you can submit it with:

sbatch myjob.slurm
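
Note that arguments passed on the sbatch command line take precedence over the #SBATCH directives in the script, so you can override a value at submission time:

# Override the script's --mem directive for this submission
sbatch --mem=16G myjob.slurm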

Open an interactive session

It is possible to open an interactive session on a computing node and run commands directly on it.

This is done by running the sinter command. All arguments available for sbatch are also available for sinter.

Warning

By default, sinter sessions are opened only for an hour. It is possible to increase this time up to 4 hours with --time.

# Launch an interactive session with the default requirements (1 hour, 1 CPU...)
sinter

# Launch an interactive session for 2 hours on a node with a GPU
sinter --time=02:00:00 --gpus=1
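
Once the session is open, you get a shell on the allocated node and commands run there directly, for example:

# Check which node the session is running on
hostname

# Close the session and release the resources
exit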

Monitoring jobs

squeue displays information about pending and running jobs.

# List all jobs
squeue
# List jobs of a specific user
squeue -u <login>

Tip

In the squeue output, the ST column gives the job status: R is "running", PD is "pending".
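
For example, a default squeue output looks like this (job IDs, names, users and nodes are illustrative):

 JOBID PARTITION      NAME  USER ST    TIME NODES NODELIST(REASON)
 12345    normal   somejob alice  R 1:23:45     1 pbil-deb33
 12346    bigmem    bigjob   bob PD    0:00     1 (Resources)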

You can also use the --format and --sort arguments to customize squeue output. For example, the following lists each job's user, priority, state, end time, time limit and partition, sorted by decreasing priority:

squeue --format "%.8u %.10Q %.8T %.10e %.10l %P" --sort "-Q"

To get more information about a specific job, get its job ID with squeue and run:

scontrol show job <id>

Tip

If your job is in "pending" state, the StartTime field of this scontrol output gives the estimated starting time.
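
For a pending job, the relevant lines of the scontrol output look like this (the values are illustrative):

JobState=PENDING Reason=Resources Dependency=(null)
...
StartTime=2025-01-15T14:30:00 EndTime=2025-01-16T02:30:00 Deadline=N/A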

There is also a web interface for the cluster job queue.

Cancelling jobs

Use scancel to cancel a submitted job:

# Cancel a specific job
scancel <job_id>

# Cancel all my jobs
scancel -u mylogin

# Cancel all my pending jobs
scancel -u mylogin -t PENDING
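
If you gave your job a name with --job-name, you can also restrict scancel to jobs with that name (wonderful_job here refers to the script example above):

# Cancel all my jobs named wonderful_job
scancel -u mylogin --name=wonderful_job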