The SLURM resource manager on Matilda includes several built-in tools that are useful for monitoring jobs and the status of nodes on the cluster. Some of these commands have fairly complex (but powerful) formatting options, so only a brief overview is presented below. References to the official SLURM documentation are provided throughout for users who wish to expand upon the information presented here.
The Goals of Monitoring and a Few Examples
While UTS staff monitor Matilda for functionality, compliance, and availability of resources, monitoring your own jobs can go a long way toward improving the efficiency of your work on the cluster. Some of the benefits of self-monitoring include:
- Determining how much of a resource your jobs are actually using: For example, if you guess that you might need 300GB of memory (RAM) but the job actually requires only 50GB, specifying 300GB unnecessarily confines your job to one of the 4 "bigmem" nodes. Because the "bigmem" nodes are in high demand (and few in number), your job may spend much more time in a queued state than is actually necessary.
- Evaluating job history: It is possible to list all of the jobs you've run over a given time period. This information can be used to determine the time spent on various tasks, estimate future job resource requirements, or determine the status of a job that ended unexpectedly (e.g., did it fail or complete, and at what time?).
- Assessing currently available resources: When planning your work, it may be helpful to assess what resources are currently available on the cluster. The cluster "occupancy rate" varies considerably, even over short periods of time. For example, it is not uncommon for the cluster to go from being only 5-10% "occupied" to well over 80%, and then a week or so later, back down to 5-10%.
- Evaluating job performance and correctness: Suppose you believe you've correctly specified 40 CPUs for a job - and thus expect to have the whole node to yourself - only to discover another user has a job running on the same node. Worse, that additional job is now slowing your job down. This is often caused by specifying parameters such as "ntasks" in a way that does not match the number of threads or processes actually used (e.g. ntasks=1, but the job runs mpirun -np 40). In these instances, SLURM assigns your job to the node and reserves one (1) CPU core for you, but you are actually using 40. SLURM may then assign another job to that node because, by its accounting, the node still has 39 CPU cores available. The node is now "over-utilized", and this slows down your job. By using monitoring tools (see the quick check sketched after this list), you can verify that your job is actually set up correctly and make changes if necessary before a problem like this occurs.
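As a quick sanity check along these lines, you can compare what SLURM has reserved for a job against what the job is actually consuming. The two commands below are only a sketch (the job id is a placeholder); both commands are described in more detail later in this document:
scontrol show job <jobid> | grep TRES                    # what SLURM reserved (cpu, mem, nodes)
sstat -j <jobid> --format=JobID,MaxRSS,AveCPU,NTasks     # what the job is actually consuming
If the reserved values and the observed usage differ wildly, adjust your batch script before submitting similar jobs.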
Useful SLURM Monitoring Commands
sstat
The SLURM command sstat is useful for obtaining information on your currently running jobs. Simply running "sstat <jobid>" will produce many metrics, but the output can be a bit messy. You can control the formatting of the sstat output by using specifiers with the "--format=" flag. For example:
sstat <jobid> --format=JobID,MaxRSS,AveCPU,NTasks,NodeList,TRESUsageInTot%40
...will provide the maximum memory used (MaxRSS), the average CPU time consumed by the job's tasks (AveCPU), the number of tasks, the list of nodes, and the total trackable resource usage so far (TRESUsageInTot). The "%40" appended to "TRESUsageInTot" sets the formatted field width for that column. To see all of the available format specifiers, you may run:
sstat --helpformat
AveCPU AveCPUFreq AveDiskRead AveDiskWrite
AvePages AveRSS AveVMSize ConsumedEnergy
ConsumedEnergyRaw JobID MaxDiskRead MaxDiskReadNode
MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask
MaxPages MaxPagesNode MaxPagesTask MaxRSS
MaxRSSNode MaxRSSTask MaxVMSize MaxVMSizeNode
MaxVMSizeTask MinCPU MinCPUNode MinCPUTask
Nodelist NTasks Pids ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
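If you want to keep an eye on a running job, one option is to wrap sstat in the standard watch utility. The line below is only a sketch (the job id is a placeholder, and depending on your SLURM version you may need to query the <jobid>.batch step to see statistics for a batch job):
watch -n 30 "sstat -j <jobid> --format=JobID,MaxRSS,AveCPU,NTasks,NodeList"    # refresh every 30 seconds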
For more information on sstat, refer to the SLURM documentation.
squeue
While the squeue command is well known to most SLURM users, it can provide considerably more information than its default output shows. For example, running:
squeue --format=%10i%15u%15j%5t%15M%15l%8C%30N%10D
...will produce formatted output containing the job id, username, job name, job state, elapsed time, walltime limit, the number of CPUs, the node list, and the number of nodes utilized. For instance:
JOBID USER NAME ST TIME TIME_LIMIT CPUS NODELIST NODES
76856 someuser is_lslf R 1:19:55 20:10:00 8 hpc-throughput-p07 1
76855 someuser is_rsf R 1:30:55 20:10:00 8 hpc-throughput-p06 1
76854 someuser is_rlf R 1:41:25 20:10:00 8 hpc-throughput-p05 1
76850 otheruser dfly_p18 R 2:14:03 2-10:10:00 32 hpc-bigmem-p02 1
76833 newuser mohiL-3PR R 15:03:44 6-16:00:00 1 hpc-throughput-p01 1
76832 newuser mohiL-4PR R 15:04:51 6-16:00:00 1 hpc-bigmem-p01 1
The "%#" specifiers control the field width, and the letter suffixes (e.g. "%10i") reference the format field (JobID width=10). Although the format specifiers for squeue are a bit obscure, if you find a format that is particularly useful, you can define the format to use whenever you login to Matilda by setting the value of "SQUEUE_FORMAT" in your ".bashrc" file. For example:
export SQUEUE_FORMAT="%10i%15u%15j%5t%15M%15l%8C%30N%10D"
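With SQUEUE_FORMAT set, a bare squeue call will use your format automatically. You can still narrow the listing with the standard squeue filter flags; for example (the partition name is just illustrative):
squeue -u $USER -t RUNNING    # only your running jobs
squeue -p general-long        # all jobs in a particular partition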
Refer to the SLURM squeue documentation for more information.
sacct
The sacct command is useful for reviewing the status of running or completed jobs. In its simplest form, you need only run "sacct -j <jobid>" for any running, completed, or failed job. Like squeue and sstat, the sacct command can be used with format specifiers to obtain additional (or to filter) information. For example:
sacct -j 999888 --format=JobID%12,State,User,Account%30,TimeLimit,ReqTRES%45,Partition%15
JobID State User Account Timelimit ReqTRES Partition
------------ ---------- --------- ------------------------------ ---------- --------------------------------------------- ---------------
999888 FAILED someuser+ myjobName-here 20:10:00 billing=8,cpu=8,mem=772512M,node=1 general-long
999888.batch FAILED myjobName-here
999888.extern COMPLETED myjobName-here
This shows the job's state, its time limit, the requested trackable resources (cpu, gpu, billing, etc.), and the partition. Note that for this job there is a primary job number, as well as variations with the ".batch" and ".extern" suffixes. These are "job steps" created by SLURM for every batch job (MPI jobs may have many more steps, one for each task). The ".batch" step tracks the resources used inside the batch script, while the ".extern" step accounts for resource usage external to the job's steps.
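sacct is also the natural tool for the job-history review mentioned at the beginning of this document. A minimal sketch (the dates are placeholders) that lists your jobs over a time window, including peak memory use, might look like:
sacct -u $USER -S 2024-01-01 -E now --format=JobID%12,JobName%20,State,Elapsed,MaxRSS,ReqTRES%40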
There are many possible format specifiers that can be used with the sacct command. To see a list, use:
sacct --helpformat
Account AdminComment AllocCPUS AllocNodes
AllocTRES AssocID AveCPU AveCPUFreq
AveDiskRead AveDiskWrite AvePages AveRSS
AveVMSize BlockID Cluster Comment
Constraints Container ConsumedEnergy ConsumedEnergyRaw
CPUTime CPUTimeRAW DBIndex DerivedExitCode
Elapsed ElapsedRaw Eligible End
ExitCode Flags GID Group
JobID JobIDRaw JobName Layout
MaxDiskRead MaxDiskReadNode MaxDiskReadTask MaxDiskWrite
MaxDiskWriteNode MaxDiskWriteTask MaxPages MaxPagesNode
MaxPagesTask MaxRSS MaxRSSNode MaxRSSTask
MaxVMSize MaxVMSizeNode MaxVMSizeTask McsLabel
MinCPU MinCPUNode MinCPUTask NCPUS
NNodes NodeList NTasks Priority
Partition QOS QOSRAW Reason
ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
ReqCPUS ReqMem ReqNodes ReqTRES
Reservation ReservationId Reserved ResvCPU
ResvCPURAW Start State Submit
SubmitLine Suspended SystemCPU SystemComment
Timelimit TimelimitRaw TotalCPU TRESUsageInAve
TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve
TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot UID
User UserCPU WCKey WCKeyID
WorkDir
For more information, refer to the SLURM documentation on sacct.
scontrol
The scontrol command is helpful for viewing detailed information about a node or a running job. To see the state of a particular node (e.g. hpc-compute-p01):
scontrol show node hpc-compute-p01
NodeName=hpc-compute-p01 Arch=x86_64 CoresPerSocket=20
CPUAlloc=32 CPUTot=40 CPULoad=32.01
AvailableFeatures=local
ActiveFeatures=local
Gres=(null)
NodeAddr=hpc-compute-p01 NodeHostName=hpc-compute-p01 Version=21.08.8-2
OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022
RealMemory=191895 AllocMem=0 FreeMem=185655 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
Partitions=general-short,general-long,rusakov
BootTime=2023-03-09T00:18:27 SlurmdStartTime=2023-03-09T00:20:33
LastBusyTime=2023-03-20T05:06:59
CfgTRES=cpu=40,mem=191895M,billing=40
AllocTRES=cpu=32
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Here, we can see how many cores the node has (40); how many cores are allocated (32); details about system memory; and the node state (in this case, MIXED means that some, but not all, of the node's resources are allocated to jobs).
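Because scontrol can report on every node, it can also help answer the "currently available resources" question raised earlier. The pipeline below is only a rough sketch: it uses scontrol's -o (one-line-per-node) option and totals the CPUTot/CPUAlloc fields shown above with awk:
scontrol -o show node | awk '
  { for (i = 1; i <= NF; i++) {
      if ($i ~ /^CPUTot=/)   { split($i, f, "="); tot   += f[2] }
      if ($i ~ /^CPUAlloc=/) { split($i, f, "="); alloc += f[2] }
  } }
  END { printf "CPUs allocated: %d of %d\n", alloc, tot }'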
We can also see job details using:
scontrol show job 999888
JobId=999888 JobName=slurm_omp.sh
UserId=someuser(123456) GroupId=faculty(1002) MCS_label=N/A
Priority=1 Nice=0 Account=rusakov-research-group QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=20:37:19 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2023-03-21T14:11:51 EligibleTime=2023-03-21T14:11:51
AccrueTime=2023-03-21T14:11:51
StartTime=2023-03-21T14:11:51 EndTime=2023-03-23T14:11:52 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T14:11:51 Scheduler=Main
Partition=general-long AllocNode:Sid=hpc-login-p01:1446834
ReqNodeList=(null) ExcNodeList=(null)
NodeList=hpc-compute-p[33-35],hpc-throughput-p01
BatchHost=hpc-compute-p33
NumNodes=4 NumCPUs=100 NumTasks=100 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=100,node=4,billing=100
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm_omp.sh
WorkDir=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile
StdErr=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm-999888.out
StdIn=/dev/null
StdOut=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm-999888.out
Power=
The output contains a wealth of information on cores, memory, the nodes being utilized, file paths, and other job details.
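Because the full record is long, it is often convenient to pull out just the fields of interest with grep, for example:
scontrol show job 999888 | grep -E 'JobState|RunTime|NumCPUs|NodeList'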
For more information, please refer to the SLURM documentation on scontrol.
seff
The SLURM seff command is useful for assessing job efficiency. Please note that seff results for running jobs may be incorrect or misleading. However, running seff on a job id that has already completed can be very useful for assessing the performance of the completed job. For example:
seff 999888
Job ID: 999888
Cluster: slurm
User/Group: someuser/students
State: FAILED (exit code 134)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:05:15
CPU Efficiency: 12.15% of 00:43:12 core-walltime
Job Wall-clock time: 00:05:24
Memory Utilized: 398.96 MB
Memory Efficiency: 0.00% of 0.00 MB
In the example above, the job failed. You can use seff in conjunction with the "-d" (debug) flag for additional information:
seff -d 999888
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 999888 someuser students FAILED slurm 8 1 1 0 1 315 324 408540 134
Job ID: 999888
Cluster: slurm
User/Group: someuser/students
State: FAILED (exit code 134)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:05:15
CPU Efficiency: 12.15% of 00:43:12 core-walltime
Job Wall-clock time: 00:05:24
Memory Utilized: 398.96 MB
Memory Efficiency: 0.00% of 0.00 MB
seff is a contributed script distributed with SLURM. For more information, refer to the contribs repo for the seff command.
Integrating SLURM Commands
While the SLURM monitoring commands are very useful on their own, they become even more powerful when combined with other resources and techniques. A small example is sketched below; also be sure to check out the documents on HPC Powertools and On-Node Monitoring for more information.
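As one example of combining these commands, the sketch below uses sacct flags shown earlier to list your recently completed job ids and then runs seff on each of them. The date and state filters are placeholders; adjust them to suit your needs:
for jobid in $(sacct -n -X -u $USER -S $(date -d yesterday +%F) --state=COMPLETED --format=JobID); do
    seff $jobid
done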