HPC Launching Jobs

Summary

This article introduces the use of the Slurm Workload Manager on the Matilda HPC Cluster.

Body

The Slurm scheduling software is used to launch jobs on the Matilda HPC Cluster. The srun command is designed for interactive use, with someone monitoring the output. Batch jobs are launched by passing a Slurm job submission script to the sbatch command. If you're unfamiliar with the Slurm Workload Manager, have a look at the Slurm documentation at SchedMD.

Who is Eligible?

Active Faculty or Staff

Main Slurm Commands

sbatch

sbatch - submit a job script. The sbatch command submits a batch processing job to the Slurm queue manager. These scripts typically contain one or more srun commands, each of which launches a job step once the job has been allocated resources.

Example: sbatch samplejobscript.sh submits the script below, which requests 16 cores in total: 4 tasks spread across 4 nodes, with 4 CPU cores per task.

#!/bin/bash
#
# Sample Batch Script
#
#
# Specify how many nodes (physical servers) to use
#SBATCH --nodes=4
# Use -n or --ntasks to specify how many tasks to run
#SBATCH --ntasks=4
# Specify how many CPU cores to use per task
#SBATCH --cpus-per-task=4
# Specify a time limit for the job run
#SBATCH --time=00:10:00
# Standard output and error log
#SBATCH --output=job_output_%j.log

# Clear the environment from any previously loaded modules

module purge > /dev/null 2>&1

# Load the module environment suitable for the job

module load gcc slurm

# And finally run the job

srun hostname
srun sleep 10
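
The resource request in the script above multiplies out as tasks times cores per task. As a minimal sketch, the snippet below checks that arithmetic and shows one way to capture the job ID on submission. The helper names total_cores and job_id_from_sbatch are hypothetical (they are not Slurm commands), and the parsing assumes sbatch prints its usual "Submitted batch job <id>" confirmation line.

```shell
#!/bin/bash
# Hypothetical helper: total cores = number of tasks * cores per task.
total_cores() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" confirmation line (read from stdin).
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

total_cores 4 4    # --ntasks=4 and --cpus-per-task=4, prints 16

# Submit only where sbatch is available and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f samplejobscript.sh ]; then
    jobid=$(sbatch samplejobscript.sh | job_id_from_sbatch)
    echo "Submitted job ${jobid}"
fi
```

The captured job ID can then be passed to squeue, scontrol or scancel.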

srun

srun - run a command on allocated compute node(s). The srun command is used to submit jobs for execution, or to initiate steps of jobs in real time. For the full range of options that can be passed to the srun command, see the srun manual page (man srun).
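
For interactive work, srun can be given a resource request and a command directly on the command line. The sketch below only assembles and prints such a command so it can be reviewed before use. The helper name srun_interactive_cmd is hypothetical, and the defaults (one task, ten minutes) are assumptions chosen for illustration rather than Matilda policy.

```shell
#!/bin/bash
# Hypothetical helper: print an srun command line that would open an
# interactive shell on a compute node.
#   $1 = number of tasks (default 1)
#   $2 = time limit as HH:MM:SS (default 00:10:00)
srun_interactive_cmd() {
    local ntasks="${1:-1}" limit="${2:-00:10:00}"
    echo "srun --ntasks=${ntasks} --time=${limit} --pty bash"
}

# Print the command for a 30-minute, single-task session.
srun_interactive_cmd 1 00:30:00
```

Running the printed command on a login node should start a shell on an allocated compute node once resources are available.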

scancel

scancel - delete a job. The scancel command terminates pending and running jobs or job steps. You can also use it to send a Unix signal to all processes associated with a running job or job step.

        scancel <jobid>
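
As a hedged illustration of both uses (plain cancellation and signal delivery), the sketch below builds the scancel command line after checking that the job ID is numeric. The wrapper name cancel_cmd is hypothetical. The --signal option is a standard scancel flag; the choice of USR1 here is only an example.

```shell
#!/bin/bash
# Hypothetical wrapper: validate a numeric job ID, then print the
# scancel command that would cancel the job or send it a signal.
#   $1 = job ID
#   $2 = optional signal name, e.g. USR1
cancel_cmd() {
    local jobid="$1" signal="${2:-}"
    case "$jobid" in
        ''|*[!0-9]*)
            echo "invalid job id: ${jobid}" >&2
            return 1
            ;;
    esac
    if [ -n "$signal" ]; then
        echo "scancel --signal=${signal} ${jobid}"
    else
        echo "scancel ${jobid}"
    fi
}

cancel_cmd 12345          # plain cancellation
cancel_cmd 12345 USR1     # send SIGUSR1 instead of terminating
```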

squeue

squeue - show state of jobs. The squeue command will report the state of running and pending jobs.

        squeue -u username
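
When many jobs are queued, a per-state summary can be easier to read than the full listing. The following is a minimal sketch: count_states is a hypothetical helper, and the squeue call uses the standard -h (no header) and -o (output format) options with %i for the job ID and %T for the job state.

```shell
#!/bin/bash
# Hypothetical helper: read lines of "<jobid> <state>" on stdin and
# print one "<state> <count>" line per state, sorted by state name.
count_states() {
    awk '{ n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Query the scheduler only where squeue is available.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%i %T" | count_states
fi
```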

sinfo

sinfo - show state of nodes and partitions (queues). The sinfo command reports the status of the available partitions and nodes.
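
One common question before submitting is how many nodes are currently idle in each partition. The sketch below is one way to answer it, assuming the standard sinfo format fields %P (partition), %D (node count) and %t (compact node state). The helper name idle_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read lines of "<partition> <nodes> <state>" on
# stdin and print the number of idle nodes per partition.
# Note: sinfo marks the default partition with a trailing "*".
idle_nodes() {
    awk '$3 == "idle" { n[$1] += $2 } END { for (p in n) print p, n[p] }' | sort
}

# Query the scheduler only where sinfo is available.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%P %D %t" | idle_nodes
fi
```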

smap

smap - show jobs, partitions and nodes in a graphical network topology. The smap command is similar to the sinfo command, except it displays all of the information in a pseudo-graphical, ncurses terminal.

scontrol

scontrol - modify jobs or show information about various aspects of the cluster. The scontrol command is used to view and change Slurm settings. You'll most likely use it to modify your jobs while they're in the queue, changing either the number of nodes or the number of tasks/CPUs. It can also be used to display information about jobs, partition structures, and nodes.
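
As a rough sketch of both uses, the comments below show the general shape of the show and update subcommands, and the hypothetical helper job_field pulls a single Key=Value field out of "scontrol show job" output. The job ID 12345 is a placeholder, and the helper assumes the value itself contains no spaces or "=" characters.

```shell
#!/bin/bash
# Typical invocations (job ID 12345 is a placeholder):
#   scontrol show job 12345                    # display job details
#   scontrol update JobId=12345 NumNodes=2     # change a queued job

# Hypothetical helper: print the value of one Key=Value field from
# `scontrol show job` output read on stdin.
#   $1 = field name, e.g. JobState
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2; exit }'
}

# Query the scheduler only where scontrol is available and a job ID
# was passed as the first argument to this script.
if command -v scontrol >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    scontrol show job "$1" | job_field JobState
fi
```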

Details

Article ID: 244
Created: Thu 4/3/25 3:48 PM
Modified: Mon 10/20/25 9:59 AM