
Submitting Batch Jobs


Elja uses SLURM as the batch scheduler and resource manager. The most common commands are summarized below.

- sbatch: submit a batch job script
- srun: run a parallel job (see the sketch after this list)
- squeue (-a, -u $USER): show queue status
- sinfo: view info about nodes and partitions
- scancel JOBID: cancel a job
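
As a quick illustration, srun can launch a command directly on the compute nodes. A minimal sketch, assuming the 48cpu_192mem partition described below; the task count and time limit are only illustrative:

[..]$ srun --partition=48cpu_192mem --ntasks=4 --time=00:10:00 hostname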

Batch jobs

The command sbatch is used to submit jobs to the SLURM queue:

[..]$ sbatch submit_script

A batch submit script usually starts like this:

#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<Your E-mail> # for example
#SBATCH --partition=48cpu_192mem # request node from a specific partition
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=48 # 48 cores per node (96 in total)
#SBATCH --mem-per-cpu=3900 # MB RAM per cpu core
#SBATCH --time=0-04:00:00 # run for 4 hours maximum (DD-HH:MM:SS)
#SBATCH --hint=nomultithread # suppress hyper-threading (see below)
#SBATCH --output=slurm_job_output.log
#SBATCH --error=slurm_job_errors.log # Logs if job crashes
. ~/.program_env_bash # load the program environment (see Program Environment)
mpirun python your_script.py # run a Python script under MPI; replace your_script.py with your own

Here two nodes from the 48cpu_192mem partition are requested, using 48 cores per node for a total of 96 cores. The memory per CPU core is set to 3900 MB of RAM. See Partitions & Hardware for details on the available partitions.

When the SLURM scheduler has allocated the resources, the subsequent lines are executed in order. First, a program environment script is sourced (see Program Environment), and then a Python script is launched with mpirun.
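
As a rough sketch of what ~/.program_env_bash might contain, assuming an environment-module setup; the module names below are placeholders, see Program Environment for the actual modules available on Elja:

# ~/.program_env_bash (hypothetical contents)
module purge                    # start from a clean environment
module load GCC OpenMPI Python  # placeholder module names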

Hyper-threading is enabled by default on the Intel-based CPUs, so it is highly recommended to suppress it in your submit script (or .bashrc), unless your software supports it and is correctly compiled with OpenMP.

The same hint can also be set in your .bashrc.
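
A minimal sketch for .bashrc, assuming the standard SLURM input environment variables SBATCH_HINT (read by sbatch) and SLURM_HINT (read by srun) as the environment equivalents of --hint:

# In ~/.bashrc: suppress hyper-threading for subsequently submitted jobs
export SBATCH_HINT=nomultithread
export SLURM_HINT=nomultithread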


After submitting a job you can view its current status and job ID like this:

[..]$ squeue -u $USER
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 11729 48cpu_192 Interact  <uname>  R   2:10      1 compute-17

You can cancel a job using its JOBID. In this example:

[..]$ scancel 11729

If your job requires a lot of input data, or if it generates a lot of output, it is advisable to make use of the /scratch/ disk available on the compute nodes. See the next section.