(12/22/20) Helpful SLURM Commands
David John Gagne
SLURM is currently the scheduler on the Casper cluster, which means it manages the queueing and scheduling of jobs. You are likely very familiar with sbatch and squeue at this point, but SLURM also has a whole suite of other commands that give you an incredibly detailed view into how you and everyone else are using the cluster. This blog provides some insights to help you better manage your own jobs and keep track of how busy Casper is.
Track your job memory and CPU usage with sacct
sacct queries the SLURM scheduler database to find out how well you or any other user has utilized the requested resources on a job-by-job basis. The default output of sacct is not very useful, but with a few alterations to the command, you can get a wealth of information.
I recommend running sacct in the following format (note that ! runs a command-line program from within a notebook; do not copy the ! if you are running the command in a terminal window):
! sacct --units=G --format="User,JobName,JobID,AllocNodes,ReqCPUs,Elapsed,CPUTime,TotalCPU,ReqMem,MaxRSS,ExitCode,State" -S 2020-12-01 -E 2020-12-31 -u dgagne
     User    JobName        JobID AllocNodes  ReqCPUS    Elapsed    CPUTime   TotalCPU     ReqMem     MaxRSS ExitCode      State
--------- ---------- ------------ ---------- -------- ---------- ---------- ---------- ---------- ---------- -------- ----------
   dgagne        sfc 6249373               1       16   00:00:06   00:01:36  00:00.683      128Gn                 1:0     FAILED
               batch 6249373.bat+          1       16   00:00:06   00:01:36  00:00.682      128Gn          0      1:0     FAILED
              extern 6249373.ext+          1       16   00:00:06   00:01:36   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6249377               1       16   00:00:39   00:10:24  00:12.017      128Gn                 1:0     FAILED
               batch 6249377.bat+          1       16   00:00:39   00:10:24  00:12.016      128Gn      0.35G      1:0     FAILED
              extern 6249377.ext+          1       16   00:00:39   00:10:24  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6249380               1       16   00:00:23   00:06:08  00:11.773      128Gn                 1:0     FAILED
               batch 6249380.bat+          1       16   00:00:23   00:06:08  00:11.772      128Gn          0      1:0     FAILED
              extern 6249380.ext+          1       16   00:00:23   00:06:08  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6249390               1       16   00:08:07   02:09:52  21:33.130      128Gn                 0:0  COMPLETED
               batch 6249390.bat+          1       16   00:08:07   02:09:52  21:33.129      128Gn      1.40G      0:0  COMPLETED
              extern 6249390.ext+          1       16   00:08:08   02:10:08   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6250928               1       16   00:01:47   00:28:32  01:21.477      128Gn                 1:0     FAILED
               batch 6250928.bat+          1       16   00:01:47   00:28:32  01:21.476      128Gn      0.90G      1:0     FAILED
              extern 6250928.ext+          1       16   00:01:47   00:28:32  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6250961               1       16   00:01:21   00:21:36  01:19.700      128Gn                 1:0     FAILED
               batch 6250961.bat+          1       16   00:01:21   00:21:36  01:19.699      128Gn      0.90G      1:0     FAILED
              extern 6250961.ext+          1       16   00:01:21   00:21:36  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6251023               1       16   00:01:58   00:31:28  01:23.110      128Gn                 1:0     FAILED
               batch 6251023.bat+          1       16   00:01:58   00:31:28  01:23.109      128Gn      0.90G      1:0     FAILED
              extern 6251023.ext+          1       16   00:01:58   00:31:28  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6251026               1       16   00:04:27   01:11:12  09:42.497      128Gn                 1:0     FAILED
               batch 6251026.bat+          1       16   00:04:27   01:11:12  09:42.496      128Gn      1.16G      1:0     FAILED
              extern 6251026.ext+          1       16   00:04:27   01:11:12   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6257007               1       16   00:05:45   01:32:00  13:28.988      128Gn                 1:0     FAILED
               batch 6257007.bat+          1       16   00:05:45   01:32:00  13:28.987      128Gn      1.24G      1:0     FAILED
              extern 6257007.ext+          1       16   00:05:45   01:32:00  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6257047               1       16   00:00:03   00:00:48  00:03.086      128Gn                 1:0     FAILED
               batch 6257047.bat+          1       16   00:00:03   00:00:48  00:03.085      128Gn          0      1:0     FAILED
              extern 6257047.ext+          1       16   00:00:03   00:00:48   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne    casp_nb 6266629               1       12   06:00:00 3-00:00:00  00:22.151      256Gn                 0:0    TIMEOUT
               batch 6266629.bat+          1       12   06:00:01 3-00:00:12  00:22.150      256Gn      0.30G     0:15  CANCELLED
              extern 6266629.ext+          1       12   06:00:00 3-00:00:00  00:00.001      256Gn          0      0:0  COMPLETED
   dgagne        sfc 6295916               1       16   00:03:06   00:49:36  05:27.670      128Gn                 1:0     FAILED
               batch 6295916.bat+          1       16   00:03:06   00:49:36  05:27.669      128Gn      1.06G      1:0     FAILED
              extern 6295916.ext+          1       16   00:03:06   00:49:36   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6295929               1       16   00:11:13   02:59:28  28:03.125      128Gn                 0:0  COMPLETED
               batch 6295929.bat+          1       16   00:11:13   02:59:28  28:03.124      128Gn      1.38G      0:0  COMPLETED
              extern 6295929.ext+          1       16   00:11:13   02:59:28  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne   htrainrt 6316207               1       30   00:10:11   05:05:30  48:06.423      200Gn                 0:0  COMPLETED
               batch 6316207.bat+          1       30   00:10:11   05:05:30  48:06.422      200Gn     56.67G      0:0  COMPLETED
              extern 6316207.ext+          1       30   00:10:11   05:05:30  00:00.001      200Gn          0      0:0  COMPLETED
   dgagne   htrainrt 6316247               1       30   00:44:20   22:10:00   02:20:53      200Gn                 0:0  COMPLETED
               batch 6316247.bat+          1       30   00:44:20   22:10:00   02:20:52      200Gn    104.03G      0:0  COMPLETED
              extern 6316247.ext+          1       30   00:44:21   22:10:30  00:01.001      200Gn      0.00G      0:0  COMPLETED
   dgagne    casp_nb 6319681               1        4   00:00:33   00:02:12  00:00.286       64Gn                 0:0  COMPLETED
               batch 6319681.bat+          1        4   00:00:33   00:02:12  00:00.285       64Gn      0.05G      0:0  COMPLETED
              extern 6319681.ext+          1        4   00:00:33   00:02:12  00:00.001       64Gn          0      0:0  COMPLETED
   dgagne    casp_nb 6319684               1        8   00:00:32   00:04:16  00:00.280       64Gn                 0:0  COMPLETED
               batch 6319684.bat+          1        8   00:00:32   00:04:16  00:00.280       64Gn      0.05G      0:0  COMPLETED
              extern 6319684.ext+          1        8   00:00:32   00:04:16   00:00:00       64Gn          0      0:0  COMPLETED
The command breaks down into these parts:
--units=G
: Print all memory-related outputs in gigabytes. You can also use M or K for megabytes or kilobytes.
--format="..."
: The list of columns to output. The full list of fields can be found in the sacct documentation (man sacct).
-S 2020-12-01
: The start date for the query. Can be adjusted so only recent jobs are visible.
-E 2020-12-31
: The end date for the query.
-u dgagne
: The username. Can be a comma-separated list of users, like -u dgagne,cbecker,schreck,ggantos.
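If you want to post-process the accounting data in a script rather than read it by eye, sacct can also produce machine-readable output. A minimal sketch using the --parsable2 flag, which prints pipe-delimited fields with no padding (the dates and username are the same examples as above):
! sacct --units=G --parsable2 --format="User,JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State" -S 2020-12-01 -E 2020-12-31 -u dgagne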
What does the output mean? The most relevant comparisons relate to CPU and memory usage.
Elapsed: The total time the job ran, in days-hours:minutes:seconds format.
CPUTime: The total time the CPUs were allocated, which should be close to Elapsed * ReqCPUS.
TotalCPU: The total amount of time the CPUs were in use by the user or the system. If this is far less than CPUTime, then you are requesting too many CPUs for your job. Note that TotalCPU does not account for child processes, so if you are running multiprocessing or dask, this number may be deceptively low.
For memory usage:
ReqMem: The total amount of memory the job requested. The n suffix indicates that the request is per node.
MaxRSS: The maximum amount of memory the job used. If MaxRSS is far less than ReqMem, then decrease your memory request for future jobs. If it is the same or close to ReqMem and your job is taking longer than expected to run, the program may be swapping memory to disk; ask for more memory in that case.
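To spot-check a single job after it finishes, you can restrict the query to one job ID with the -j flag and pull only the comparison columns. A quick sketch using one of the job IDs from the table above:
! sacct -j 6316247 --units=G --format="JobID,Elapsed,CPUTime,TotalCPU,ReqMem,MaxRSS,State"
For that job, TotalCPU (02:20:53) is only about a tenth of CPUTime (22:10:00), and MaxRSS (104.03G) used roughly half of the 200 GB request, so both the CPU and memory requests could likely be trimmed (keeping in mind the child-process caveat above if the job relies on multiprocessing).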
Track current cluster usage with sinfo
sinfo prints out information about the current usage of every node in the cluster. It is helpful for seeing which nodes have which resources and how busy each node is, which is especially useful when you are about to launch a multi-GPU or large-memory job and want to make sure the memory and GPUs are available. The default sinfo call provides a very high-level summary. Just like sacct, I recommend running the following command:
! sinfo --Format="NodeHost:15,Features:50,CPUs:5,CPUsLoad:10,Gres:30,GresUsed:50,AllocMem:.15,FreeMem:.15,StateLong:15,Available:6"
HOSTNAMES AVAIL_FEATURES CPUS CPU_LOAD GRES GRES_USED ALLOCMEM FREE_MEM STATE AVAIL
casper23 casper,skylake,mlx5_0,gp100,gpu,x11 72 0.16 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 0 342060 drained up
casper20 casper,skylake,mlx5_0 72 0.56 (null) gpu:0,mps:0 0 295669 reserved up
casper25 casper,skylake,mlx5_0,4xv100,v100,gpu 72 0.04 gpu:v100:4,mps:v100:400 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 0 734630 reserved up
casper28 casper,skylake,mlx5_0,8xv100,v100,gpu 72 0.01 gpu:v100:8,mps:v100:800 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 0 1123240 reserved up
casper01 casper,skylake,mlx5_0 72 1.39 (null) gpu:0,mps:0 247808 217182 mixed up
casper02 casper,skylake,mlx5_0 72 0.62 (null) gpu:0,mps:0 380044 335151 mixed up
casper03 casper,skylake,mlx5_0 72 17.92 (null) gpu:0,mps:0 382534 325258 mixed up
casper04 casper,skylake,mlx5_0 72 2.29 (null) gpu:0,mps:0 379904 305334 mixed up
casper05 casper,skylake,mlx5_0 72 18.23 (null) gpu:0,mps:0 374784 314173 mixed up
casper06 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.21 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 355328 337188 mixed up
casper07 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.14 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 310854 341183 mixed up
casper09 casper,skylake,mlx5_0,4xv100,v100,gpu 72 3.14 gpu:v100:4,mps:v100:400 gpu:v100:4(IDX:0-3),mps:v100:0(IDX:N/A) 349452 649024 mixed up
casper10 casper,skylake,mlx5_0 72 5.44 (null) gpu:0,mps:0 370968 332526 mixed up
casper11 casper,skylake,mlx5_0 72 3.79 (null) gpu:0,mps:0 384140 336050 mixed up
casper12 casper,skylake,mlx5_0 72 3.43 (null) gpu:0,mps:0 381952 325012 mixed up
casper13 casper,skylake,mlx5_0 72 3.70 (null) gpu:0,mps:0 384582 333591 mixed up
casper14 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.27 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 358982 342191 mixed up
casper15 casper,skylake,mlx5_0,gp100,gpu,x11 72 2.29 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 307200 341551 mixed up
casper16 casper,skylake,mlx5_0,gp100,gpu,x11 72 1.20 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 204800 344260 mixed up
casper17 casper,skylake,mlx5_0,gp100,gpu,x11 72 1.28 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 364544 343480 mixed up
casper18 casper,skylake,mlx5_0 72 17.04 (null) gpu:0,mps:0 382608 248991 mixed up
casper19 casper,skylake,mlx5_0 72 31.76 (null) gpu:0,mps:0 352150 259655 mixed up
casper22 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.19 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 310854 336178 mixed up
casper24 casper,skylake,mlx5_0,8xv100,v100,gpu 72 10.09 gpu:v100:8,mps:v100:800 gpu:v100:8(IDX:0-7),mps:v100:0(IDX:N/A) 256000 1053921 mixed up
casper26 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.68 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 323142 296209 mixed up
casper27 casper,skylake,mlx5_0,8xv100,v100,gpu 72 0.04 gpu:v100:8,mps:v100:800 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 307200 1115535 mixed up
casper29 casper,cascadelake,mlx5_0,4xv100,v100,gpu 72 0.84 gpu:v100:4,mps:v100:400 gpu:v100:3(IDX:0-2),mps:v100:0(IDX:N/A) 277062 728882 mixed up
casper30 casper,cascadelake,mlx5_0,8xv100,v100,gpu 72 8.40 gpu:v100:8,mps:v100:800 gpu:v100:8(IDX:0-7),mps:v100:0(IDX:N/A) 51200 1060850 mixed up
casper31 casper,cascadelake,mlx5_0,8xv100,v100,gpu 72 8.32 gpu:v100:8,mps:v100:800 gpu:v100:8(IDX:0-7),mps:v100:0(IDX:N/A) 51200 1059902 mixed up
casper36 casper,cascadelake,mlx5_0,4xv100,v100,gpu 72 52.14 gpu:v100:4,mps:v100:400 gpu:v100:2(IDX:0,2),mps:v100:0(IDX:N/A) 671744 641858 mixed up
casper21 casper,skylake,mlx5_0 72 20.37 (null) gpu:0,mps:0 373532 161434 allocated up
casper08 casper,skylake,mlx5_0,8xv100,v100,gpu 72 0.01 gpu:v100:8,mps:v100:800 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 0 1114594 idle up
gladeslurm1 hsi 16 11.48 (null) gpu:0,mps:0 0 16338 idle up
gladeslurm2 hsi 16 14.10 (null) gpu:0,mps:0 0 15644 idle up
gladeslurm3 hsi 16 5.25 (null) gpu:0,mps:0 0 12911 idle up
gladeslurm4 hsi 16 4.20 (null) gpu:0,mps:0 0 14423 idle up
The columns contain the following information:
NodeHost: The name of each node.
Features: The CPU type (skylake or cascadelake) and the number and type of GPUs, if any.
CPUs: The number of CPUs available, which is (number of sockets) * (cores per socket) * (threads per core, when multithreading is enabled).
CPUsLoad: The load average on the node, which approximates how many CPUs are currently in use.
Gres: The number and type of GPUs.
GresUsed: How many GPUs are currently allocated on the node.
AllocMem: How much memory is allocated, in MB.
FreeMem: How much memory is free, in MB.
StateLong: The node usage state, which can be idle, mixed, allocated, reserved, or drained.
Available: Whether the node is up or down.
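As an example of putting this to use, here is one way to narrow the listing to nodes that may have a free V100 GPU before you submit a job. This is just a sketch; adjust the grep pattern to whatever GPU type you need:
! sinfo -t idle,mixed --Format="NodeHost:15,Gres:30,GresUsed:50,StateLong:15" | grep v100
Comparing the Gres and GresUsed columns in the filtered rows then shows which nodes still have unallocated GPUs.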