The SLURM resource manager on Matilda includes several built-in tools that are useful for monitoring jobs and the status of nodes on the cluster. Some of these commands have fairly complex (but powerful) formatting options, so only a brief overview is presented below. References to the official SLURM documentation are provided throughout for users who wish to expand upon the information presented here.
The Goals of Monitoring and a Few Examples
While UTS staff monitor Matilda for functionality, compliance, and availability of resources, monitoring your own jobs can go a long way toward improving the efficiency of your work on the cluster. Some of the benefits of self-monitoring include:
- Determining how much of a resource your jobs are actually using: For example, if you guess that you might need 300GB of memory (RAM) but the job actually requires only 50GB, specifying 300GB unnecessarily confines your job to one of the 4 "bigmem" nodes. Because the "bigmem" nodes are in high demand (and few in number), your job may spend much more time in a queued state than is actually necessary.
- Evaluating job history: It is possible to list all of the jobs you've run over a given time period. This information can be used to determine the time spent on various tasks, estimate future job resource requirements, or determine the status of a job that ended unexpectedly (e.g., did it fail or complete, and at what time?).
- Assessing currently available resources: When planning your work, it may be helpful to assess what resources are currently available on the cluster. The cluster "occupancy rate" varies considerably, even over short periods of time. For example, it is not uncommon for the cluster to go from being only 5-10% "occupied" to well over 80%, and then a week or so later, back down to 5-10%.
- Evaluating job performance and correctness: Suppose you believe you've correctly specified 40 CPUs for a job - and thus expect to have the whole node to yourself - only to discover another user has a job running on the same node. Worse, that additional job is now slowing your job down. This is often caused by specifying parameters such as "ntasks" in a way that does not match the number of threads or processes actually used (e.g. ntasks=1, but the job runs mpirun -np 40). In these instances, SLURM assigns your job to the node and reserves one (1) CPU core for you, but you are actually using 40. SLURM may then assign another job to that node because, by its accounting, the node still has 39 CPU cores available. The node is now "over-utilized", and this slows down your job. By using monitoring tools (see the quick check sketched after this list), you can verify that your job is actually set up correctly and make changes if necessary before a problem like this occurs.
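As a quick sanity check along these lines, you can compare what SLURM has reserved for a job against what the job is actually consuming. The two commands below are only a sketch (the job id is a placeholder); both commands are described in more detail later in this document:
scontrol show job <jobid> | grep TRES                    # what SLURM reserved (cpu, mem, nodes)
sstat -j <jobid> --format=JobID,MaxRSS,AveCPU,NTasks     # what the job is actually consuming
If the reserved values and the observed usage differ wildly, adjust your batch script before submitting similar jobs.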
Useful SLURM Monitoring Commands
sstat
The SLURM command sstat is useful for obtaining information on your currently running jobs. Simply running "sstat <jobid>" will produce many metrics, but the output can be a bit messy. You can control the formatting of the sstat output by using specifiers with the "--format=" flag. For example:
sstat <jobid> --format=JobID,MaxRSS,AveCPU,NTasks,NodeList,TRESUsageInTot%40
...will provide the maximum memory used (MaxRSS), the average CPU time consumed by the job's tasks (AveCPU), the number of tasks, the list of nodes, and the total trackable resource usage so far (TRESUsageInTot). The "%40" appended to "TRESUsageInTot" sets the formatted field width for that column. To see all of the available format specifiers, you may run:
sstat --helpformat
AveCPU AveCPUFreq AveDiskRead AveDiskWrite
AvePages AveRSS AveVMSize ConsumedEnergy
ConsumedEnergyRaw JobID MaxDiskRead MaxDiskReadNode
MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask
MaxPages MaxPagesNode MaxPagesTask MaxRSS
MaxRSSNode MaxRSSTask MaxVMSize MaxVMSizeNode
MaxVMSizeTask MinCPU MinCPUNode MinCPUTask
Nodelist NTasks Pids ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
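If you want to keep an eye on a running job, one option is to wrap sstat in the standard watch utility. The line below is only a sketch (the job id is a placeholder, and depending on your SLURM version you may need to query the <jobid>.batch step to see statistics for a batch job):
watch -n 30 "sstat -j <jobid> --format=JobID,MaxRSS,AveCPU,NTasks,NodeList"    # refresh every 30 seconds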
For more information on sstat, refer to the SLURM documentation.
squeue
While the squeue command is well known to most SLURM users, it can provide considerably more information than its default output shows. For example, running:
squeue --format=%10i%15u%15j%5t%15M%15l%8C%30N%10D
...will produce formatted output containing the job id, username, job name, job state, elapsed time, walltime limit, the number of CPUs, the node list, and the number of nodes utilized. For instance:
JOBID USER NAME ST TIME TIME_LIMIT CPUS NODELIST NODES
76856 someuser is_lslf R 1:19:55 20:10:00 8 hpc-throughput-p07 1
76855 someuser is_rsf R 1:30:55 20:10:00 8 hpc-throughput-p06 1
76854 someuser is_rlf R 1:41:25 20:10:00 8 hpc-throughput-p05 1
76850 otheruser dfly_p18 R 2:14:03 2-10:10:00 32 hpc-bigmem-p02 1
76833 newuser mohiL-3PR R 15:03:44 6-16:00:00 1 hpc-throughput-p01 1
76832 newuser mohiL-4PR R 15:04:51 6-16:00:00 1 hpc-bigmem-p01 1
The "%#" specifiers control the field width, and the letter suffixes (e.g. "%10i") reference the format field (JobID width=10). Although the format specifiers for squeue are a bit obscure, if you find a format that is particularly useful, you can define the format to use whenever you login to Matilda by setting the value of "SQUEUE_FORMAT" in your ".bashrc" file. For example:
export SQUEUE_FORMAT="%10i%15u%15j%5t%15M%15l%8C%30N%10D"
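With SQUEUE_FORMAT set, a bare squeue call will use your format automatically. You can still narrow the listing with the standard squeue filter flags; for example (the partition name is just illustrative):
squeue -u $USER -t RUNNING    # only your running jobs
squeue -p general-long        # all jobs in a particular partition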
Refer to the SLURM squeue documentation for more information.
sacct
The sacct command is useful for reviewing the status of running or completed jobs. In its simplest form, you need only run "sacct -j <jobid>" for any running, completed, or failed job. Like squeue and sstat, the sacct command can be used with format specifiers to obtain additional (or to filter) information. For example:
sacct -j 999888 --format=JobID%12,State,User,Account%30,TimeLimit,ReqTRES%45,Partition%15
JobID State User Account Timelimit ReqTRES Partition
------------ ---------- --------- ------------------------------ ---------- --------------------------------------------- ---------------
999888 FAILED someuser+ myjobName-here 20:10:00 billing=8,cpu=8,mem=772512M,node=1 general-long
999888.batch FAILED myjobName-here
999888.extern COMPLETED myjobName-here
This shows the job's state, its time limit, the requested trackable resources (cpu, gpu, billing, etc.), and the partition. Note that for this job there is a primary job number, as well as variations with the ".batch" and ".extern" suffixes. These are "job steps" created by SLURM for every batch job (MPI jobs may have many more steps, one for each task). The ".batch" step tracks the resources used inside the batch script, while the ".extern" step accounts for resource usage external to the job's steps.
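sacct is also the natural tool for the job-history review mentioned at the beginning of this document. A minimal sketch (the dates are placeholders) that lists your jobs over a time window, including peak memory use, might look like:
sacct -u $USER -S 2024-01-01 -E now --format=JobID%12,JobName%20,State,Elapsed,MaxRSS,ReqTRES%40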
There are many possible format specifiers that can be used with the sacct command. To see a list, use:
sacct --helpformat
Account AdminComment AllocCPUS AllocNodes
AllocTRES AssocID AveCPU AveCPUFreq
AveDiskRead AveDiskWrite AvePages AveRSS
AveVMSize BlockID Cluster Comment
Constraints Container ConsumedEnergy ConsumedEnergyRaw
CPUTime CPUTimeRAW DBIndex DerivedExitCode
Elapsed ElapsedRaw Eligible End
ExitCode Flags GID Group
JobID JobIDRaw JobName Layout
MaxDiskRead MaxDiskReadNode MaxDiskReadTask MaxDiskWrite
MaxDiskWriteNode MaxDiskWriteTask MaxPages MaxPagesNode
MaxPagesTask MaxRSS MaxRSSNode MaxRSSTask
MaxVMSize MaxVMSizeNode MaxVMSizeTask McsLabel
MinCPU MinCPUNode MinCPUTask NCPUS
NNodes NodeList NTasks Priority
Partition QOS QOSRAW Reason
ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
ReqCPUS ReqMem ReqNodes ReqTRES
Reservation ReservationId Reserved ResvCPU
ResvCPURAW Start State Submit
SubmitLine Suspended SystemCPU SystemComment
Timelimit TimelimitRaw TotalCPU TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot UID
User UserCPU WCKey WCKeyID
WorkDir
For more information, refer to the SLURM documentation on sacct.
scontrol
The scontrol command is helpful for viewing detailed information about a node or a running job. To see the state of a particular node (e.g. hpc-compute-p01):
scontrol show node hpc-compute-p01
NodeName=hpc-compute-p01 Arch=x86_64 CoresPerSocket=20
CPUAlloc=32 CPUTot=40 CPULoad=32.01
AvailableFeatures=local
ActiveFeatures=local
Gres=(null)
NodeAddr=hpc-compute-p01 NodeHostName=hpc-compute-p01 Version=21.08.8-2
OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022
RealMemory=191895 AllocMem=0 FreeMem=185655 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
Partitions=general-short,general-long,rusakov
BootTime=2023-03-09T00:18:27 SlurmdStartTime=2023-03-09T00:20:33
LastBusyTime=2023-03-20T05:06:59
CfgTRES=cpu=40,mem=191895M,billing=40
AllocTRES=cpu=32
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Here, we can see how many cores the node has (40); how many cores are allocated (32); details about system memory; and the node state (in this case, MIXED means that some, but not all, of the node's resources are allocated to jobs).
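Because scontrol can report on every node, it can also help answer the "currently available resources" question raised earlier. The pipeline below is only a rough sketch: it uses scontrol's -o (one-line-per-node) option and totals the CPUTot/CPUAlloc fields shown above with awk:
scontrol -o show node | awk '
  { for (i = 1; i <= NF; i++) {
      if ($i ~ /^CPUTot=/)   { split($i, f, "="); tot   += f[2] }
      if ($i ~ /^CPUAlloc=/) { split($i, f, "="); alloc += f[2] }
  } }
  END { printf "CPUs allocated: %d of %d\n", alloc, tot }'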
We can also see job details using:
scontrol show job 999888
JobId=999888 JobName=slurm_omp.sh
UserId=someuser(123456) GroupId=faculty(1002) MCS_label=N/A
Priority=1 Nice=0 Account=rusakov-research-group QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=20:37:19 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2023-03-21T14:11:51 EligibleTime=2023-03-21T14:11:51
AccrueTime=2023-03-21T14:11:51
StartTime=2023-03-21T14:11:51 EndTime=2023-03-23T14:11:52 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T14:11:51 Scheduler=Main
Partition=general-long AllocNode:Sid=hpc-login-p01:1446834
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-compute-p[33-35],hpc-throughput-p01
BatchHost=hpc-compute-p33
NumNodes=4 NumCPUs=100 NumTasks=100 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=100,node=4,billing=100
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm_omp.sh
WorkDir=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile
StdErr=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm-999888.out
StdIn=/dev/null
StdOut=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm-999888.out
Power=
The output contains a wealth of information on cores, memory, the nodes being utilized, file paths, and other job details.
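Because the full record is long, it is often convenient to pull out just the fields of interest with grep, for example:
scontrol show job 999888 | grep -E 'JobState|RunTime|NumCPUs|NodeList'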
For more information, please refer to the SLURM documentation on scontrol.
seff
The SLURM seff command is useful for assessing job efficiency. Please note that seff results for running jobs may be incorrect or misleading. However, running seff on a job id that has already completed can be very useful for assessing the performance of the completed job. For example:
seff 999888
Job ID: 999888
Cluster: slurm
User/Group: someuser/students
State: FAILED (exit code 134)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:05:15
CPU Efficiency: 12.15% of 00:43:12 core-walltime
Job Wall-clock time: 00:05:24
Memory Utilized: 398.96 MB
Memory Efficiency: 0.00% of 0.00 MB
In the example above, the job failed. You can use seff in conjunction with the "-d" (debug) flag for additional information:
seff -d 999888
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 999888 someuser students FAILED slurm 8 1 1 0 1 315 324 408540 134
Job ID: 999888
Cluster: slurm
User/Group: someuser/students
State: FAILED (exit code 134)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:05:15
CPU Efficiency: 12.15% of 00:43:12 core-walltime
Job Wall-clock time: 00:05:24
Memory Utilized: 398.96 MB
Memory Efficiency: 0.00% of 0.00 MB
seff is a contributed script distributed with SLURM. For more information, refer to the contribs repo for the seff command.
Integrating SLURM Commands
While the SLURM monitoring commands are very useful on their own, they become even more powerful when combined with other resources and techniques. A small example is sketched below; also be sure to check out the documents on HPC Powertools and On-Node Monitoring for more information.
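As one example of combining these commands, the sketch below uses sacct flags shown earlier to list your recently completed job ids and then runs seff on each of them. The date and state filters are placeholders; adjust them to suit your needs:
for jobid in $(sacct -n -X -u $USER -S $(date -d yesterday +%F) --state=COMPLETED --format=JobID); do
    seff $jobid
done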