(12/22/20) Helpful SLURM Commands
David John Gagne
SLURM is currently the scheduler on the Casper cluster, which means it manages the queueing and scheduling of jobs. You are likely very familiar with sbatch and squeue at this point, but SLURM also has a whole suite of other commands that give you an incredibly detailed view into how you and everyone else are using the cluster. This blog provides some insights to help you better manage your own jobs and keep track of how busy Casper is.
Track your job memory and CPU usage with sacct
sacct queries the SLURM scheduler database to find out how well you or any other user has utilized the requested resources on a job-by-job basis. The default output of sacct is not very useful, but with a few alterations to the command, you can get a wealth of information.
I recommend running sacct in the following format (note that ! runs a command-line program from within a notebook; do not copy the ! if you are running the command in a terminal window):
! sacct --units=G --format="User,JobName,JobID,AllocNodes,ReqCPUs,Elapsed,CPUTime,TotalCPU,ReqMem,MaxRSS,ExitCode,State" -S 2020-12-01 -E 2020-12-31 -u dgagne
     User    JobName        JobID AllocNodes  ReqCPUS    Elapsed    CPUTime   TotalCPU     ReqMem     MaxRSS ExitCode      State
--------- ---------- ------------ ---------- -------- ---------- ---------- ---------- ---------- ---------- -------- ----------
   dgagne        sfc 6249373               1       16   00:00:06   00:01:36  00:00.683      128Gn                 1:0     FAILED
               batch 6249373.bat+          1       16   00:00:06   00:01:36  00:00.682      128Gn          0      1:0     FAILED
              extern 6249373.ext+          1       16   00:00:06   00:01:36   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6249377               1       16   00:00:39   00:10:24  00:12.017      128Gn                 1:0     FAILED
               batch 6249377.bat+          1       16   00:00:39   00:10:24  00:12.016      128Gn      0.35G      1:0     FAILED
              extern 6249377.ext+          1       16   00:00:39   00:10:24  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6249380               1       16   00:00:23   00:06:08  00:11.773      128Gn                 1:0     FAILED
               batch 6249380.bat+          1       16   00:00:23   00:06:08  00:11.772      128Gn          0      1:0     FAILED
              extern 6249380.ext+          1       16   00:00:23   00:06:08  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6249390               1       16   00:08:07   02:09:52  21:33.130      128Gn                 0:0  COMPLETED
               batch 6249390.bat+          1       16   00:08:07   02:09:52  21:33.129      128Gn      1.40G      0:0  COMPLETED
              extern 6249390.ext+          1       16   00:08:08   02:10:08   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6250928               1       16   00:01:47   00:28:32  01:21.477      128Gn                 1:0     FAILED
               batch 6250928.bat+          1       16   00:01:47   00:28:32  01:21.476      128Gn      0.90G      1:0     FAILED
              extern 6250928.ext+          1       16   00:01:47   00:28:32  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6250961               1       16   00:01:21   00:21:36  01:19.700      128Gn                 1:0     FAILED
               batch 6250961.bat+          1       16   00:01:21   00:21:36  01:19.699      128Gn      0.90G      1:0     FAILED
              extern 6250961.ext+          1       16   00:01:21   00:21:36  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6251023               1       16   00:01:58   00:31:28  01:23.110      128Gn                 1:0     FAILED
               batch 6251023.bat+          1       16   00:01:58   00:31:28  01:23.109      128Gn      0.90G      1:0     FAILED
              extern 6251023.ext+          1       16   00:01:58   00:31:28  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6251026               1       16   00:04:27   01:11:12  09:42.497      128Gn                 1:0     FAILED
               batch 6251026.bat+          1       16   00:04:27   01:11:12  09:42.496      128Gn      1.16G      1:0     FAILED
              extern 6251026.ext+          1       16   00:04:27   01:11:12   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6257007               1       16   00:05:45   01:32:00  13:28.988      128Gn                 1:0     FAILED
               batch 6257007.bat+          1       16   00:05:45   01:32:00  13:28.987      128Gn      1.24G      1:0     FAILED
              extern 6257007.ext+          1       16   00:05:45   01:32:00  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6257047               1       16   00:00:03   00:00:48  00:03.086      128Gn                 1:0     FAILED
               batch 6257047.bat+          1       16   00:00:03   00:00:48  00:03.085      128Gn          0      1:0     FAILED
              extern 6257047.ext+          1       16   00:00:03   00:00:48   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne    casp_nb 6266629               1       12   06:00:00 3-00:00:00  00:22.151      256Gn                 0:0    TIMEOUT
               batch 6266629.bat+          1       12   06:00:01 3-00:00:12  00:22.150      256Gn      0.30G     0:15  CANCELLED
              extern 6266629.ext+          1       12   06:00:00 3-00:00:00  00:00.001      256Gn          0      0:0  COMPLETED
   dgagne        sfc 6295916               1       16   00:03:06   00:49:36  05:27.670      128Gn                 1:0     FAILED
               batch 6295916.bat+          1       16   00:03:06   00:49:36  05:27.669      128Gn      1.06G      1:0     FAILED
              extern 6295916.ext+          1       16   00:03:06   00:49:36   00:00:00      128Gn          0      0:0  COMPLETED
   dgagne        sfc 6295929               1       16   00:11:13   02:59:28  28:03.125      128Gn                 0:0  COMPLETED
               batch 6295929.bat+          1       16   00:11:13   02:59:28  28:03.124      128Gn      1.38G      0:0  COMPLETED
              extern 6295929.ext+          1       16   00:11:13   02:59:28  00:00.001      128Gn          0      0:0  COMPLETED
   dgagne   htrainrt 6316207               1       30   00:10:11   05:05:30  48:06.423      200Gn                 0:0  COMPLETED
               batch 6316207.bat+          1       30   00:10:11   05:05:30  48:06.422      200Gn     56.67G      0:0  COMPLETED
              extern 6316207.ext+          1       30   00:10:11   05:05:30  00:00.001      200Gn          0      0:0  COMPLETED
   dgagne   htrainrt 6316247               1       30   00:44:20   22:10:00   02:20:53      200Gn                 0:0  COMPLETED
               batch 6316247.bat+          1       30   00:44:20   22:10:00   02:20:52      200Gn    104.03G      0:0  COMPLETED
              extern 6316247.ext+          1       30   00:44:21   22:10:30  00:01.001      200Gn      0.00G      0:0  COMPLETED
   dgagne    casp_nb 6319681               1        4   00:00:33   00:02:12  00:00.286       64Gn                 0:0  COMPLETED
               batch 6319681.bat+          1        4   00:00:33   00:02:12  00:00.285       64Gn      0.05G      0:0  COMPLETED
              extern 6319681.ext+          1        4   00:00:33   00:02:12  00:00.001       64Gn          0      0:0  COMPLETED
   dgagne    casp_nb 6319684               1        8   00:00:32   00:04:16  00:00.280       64Gn                 0:0  COMPLETED
               batch 6319684.bat+          1        8   00:00:32   00:04:16  00:00.280       64Gn      0.05G      0:0  COMPLETED
              extern 6319684.ext+          1        8   00:00:32   00:04:16   00:00:00       64Gn          0      0:0  COMPLETED
The command breaks down into these parts:
--units=G
: Print all memory-related outputs in gigabytes. You can also use M or K for megabytes or kilobytes.
--format="..."
: The list of columns to output. The full list of fields can be found in the sacct documentation (man sacct).
-S 2020-12-01
: The start date for the query. Can be adjusted so only recent jobs are visible.
-E 2020-12-31
: The end date for the query.
-u dgagne
: The username. Can be a comma-separated list of users, like -u dgagne,cbecker,schreck,ggantos.
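If you want to post-process the accounting data in a script rather than read it by eye, sacct can also produce machine-readable output. A minimal sketch using the --parsable2 flag, which prints pipe-delimited fields with no padding (the dates and username are the same examples as above):
! sacct --units=G --parsable2 --format="User,JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State" -S 2020-12-01 -E 2020-12-31 -u dgagne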
What does the output mean? The most relevant comparisons relate to CPU and memory usage.
Elapsed: The total time the job ran, in days-hours:minutes:seconds format.
CPUTime: The total time the CPUs were allocated, which should be close to Elapsed * ReqCPUS.
TotalCPU: The total amount of time the CPUs were in use by the user or the system. If this is far less than CPUTime, then you are requesting too many CPUs for your job. Note that TotalCPU does not account for child processes, so if you are running multiprocessing or dask, this number may be deceptively low.
For memory usage:
ReqMem: The total amount of memory the job requested. The n suffix indicates that the request is per node.
MaxRSS: The maximum amount of memory the job used. If MaxRSS is far less than ReqMem, then decrease your memory request for future jobs. If it is the same or close to ReqMem and your job is taking longer than expected to run, the program may be swapping memory to disk; ask for more memory in that case.
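To spot-check a single job after it finishes, you can restrict the query to one job ID with the -j flag and pull only the comparison columns. A quick sketch using one of the job IDs from the table above:
! sacct -j 6316247 --units=G --format="JobID,Elapsed,CPUTime,TotalCPU,ReqMem,MaxRSS,State"
For that job, TotalCPU (02:20:53) is only about a tenth of CPUTime (22:10:00), and MaxRSS (104.03G) used roughly half of the 200 GB request, so both the CPU and memory requests could likely be trimmed (keeping in mind the child-process caveat above if the job relies on multiprocessing).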
Track current cluster usage with sinfo
sinfo prints out information about the current usage of every node in the cluster. It is helpful for seeing which nodes have which resources and how busy each node is, which is especially useful when you are about to launch a multi-GPU or large-memory job and want to make sure the memory and GPUs are available. The default sinfo call provides a very high-level summary. Just like sacct, I recommend running the following command:
! sinfo --Format="NodeHost:15,Features:50,CPUs:5,CPUsLoad:10,Gres:30,GresUsed:50,AllocMem:.15,FreeMem:.15,StateLong:15,Available:6"
HOSTNAMES AVAIL_FEATURES CPUS CPU_LOAD GRES GRES_USED ALLOCMEM FREE_MEM STATE AVAIL
casper23 casper,skylake,mlx5_0,gp100,gpu,x11 72 0.16 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 0 342060 drained up
casper20 casper,skylake,mlx5_0 72 0.56 (null) gpu:0,mps:0 0 295669 reserved up
casper25 casper,skylake,mlx5_0,4xv100,v100,gpu 72 0.04 gpu:v100:4,mps:v100:400 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 0 734630 reserved up
casper28 casper,skylake,mlx5_0,8xv100,v100,gpu 72 0.01 gpu:v100:8,mps:v100:800 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 0 1123240 reserved up
casper01 casper,skylake,mlx5_0 72 1.39 (null) gpu:0,mps:0 247808 217182 mixed up
casper02 casper,skylake,mlx5_0 72 0.62 (null) gpu:0,mps:0 380044 335151 mixed up
casper03 casper,skylake,mlx5_0 72 17.92 (null) gpu:0,mps:0 382534 325258 mixed up
casper04 casper,skylake,mlx5_0 72 2.29 (null) gpu:0,mps:0 379904 305334 mixed up
casper05 casper,skylake,mlx5_0 72 18.23 (null) gpu:0,mps:0 374784 314173 mixed up
casper06 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.21 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 355328 337188 mixed up
casper07 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.14 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 310854 341183 mixed up
casper09 casper,skylake,mlx5_0,4xv100,v100,gpu 72 3.14 gpu:v100:4,mps:v100:400 gpu:v100:4(IDX:0-3),mps:v100:0(IDX:N/A) 349452 649024 mixed up
casper10 casper,skylake,mlx5_0 72 5.44 (null) gpu:0,mps:0 370968 332526 mixed up
casper11 casper,skylake,mlx5_0 72 3.79 (null) gpu:0,mps:0 384140 336050 mixed up
casper12 casper,skylake,mlx5_0 72 3.43 (null) gpu:0,mps:0 381952 325012 mixed up
casper13 casper,skylake,mlx5_0 72 3.70 (null) gpu:0,mps:0 384582 333591 mixed up
casper14 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.27 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 358982 342191 mixed up
casper15 casper,skylake,mlx5_0,gp100,gpu,x11 72 2.29 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 307200 341551 mixed up
casper16 casper,skylake,mlx5_0,gp100,gpu,x11 72 1.20 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 204800 344260 mixed up
casper17 casper,skylake,mlx5_0,gp100,gpu,x11 72 1.28 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 364544 343480 mixed up
casper18 casper,skylake,mlx5_0 72 17.04 (null) gpu:0,mps:0 382608 248991 mixed up
casper19 casper,skylake,mlx5_0 72 31.76 (null) gpu:0,mps:0 352150 259655 mixed up
casper22 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.19 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 310854 336178 mixed up
casper24 casper,skylake,mlx5_0,8xv100,v100,gpu 72 10.09 gpu:v100:8,mps:v100:800 gpu:v100:8(IDX:0-7),mps:v100:0(IDX:N/A) 256000 1053921 mixed up
casper26 casper,skylake,mlx5_0,gp100,gpu,x11 72 3.68 gpu:gp100:1 gpu:gp100:0(IDX:N/A),mps:0 323142 296209 mixed up
casper27 casper,skylake,mlx5_0,8xv100,v100,gpu 72 0.04 gpu:v100:8,mps:v100:800 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 307200 1115535 mixed up
casper29 casper,cascadelake,mlx5_0,4xv100,v100,gpu 72 0.84 gpu:v100:4,mps:v100:400 gpu:v100:3(IDX:0-2),mps:v100:0(IDX:N/A) 277062 728882 mixed up
casper30 casper,cascadelake,mlx5_0,8xv100,v100,gpu 72 8.40 gpu:v100:8,mps:v100:800 gpu:v100:8(IDX:0-7),mps:v100:0(IDX:N/A) 51200 1060850 mixed up
casper31 casper,cascadelake,mlx5_0,8xv100,v100,gpu 72 8.32 gpu:v100:8,mps:v100:800 gpu:v100:8(IDX:0-7),mps:v100:0(IDX:N/A) 51200 1059902 mixed up
casper36 casper,cascadelake,mlx5_0,4xv100,v100,gpu 72 52.14 gpu:v100:4,mps:v100:400 gpu:v100:2(IDX:0,2),mps:v100:0(IDX:N/A) 671744 641858 mixed up
casper21 casper,skylake,mlx5_0 72 20.37 (null) gpu:0,mps:0 373532 161434 allocated up
casper08 casper,skylake,mlx5_0,8xv100,v100,gpu 72 0.01 gpu:v100:8,mps:v100:800 gpu:v100:0(IDX:N/A),mps:v100:0(IDX:N/A) 0 1114594 idle up
gladeslurm1 hsi 16 11.48 (null) gpu:0,mps:0 0 16338 idle up
gladeslurm2 hsi 16 14.10 (null) gpu:0,mps:0 0 15644 idle up
gladeslurm3 hsi 16 5.25 (null) gpu:0,mps:0 0 12911 idle up
gladeslurm4 hsi 16 4.20 (null) gpu:0,mps:0 0 14423 idle up
The columns contain the following information:
NodeHost: The name of each node.
Features: The CPU type (skylake or cascadelake) and the number and type of GPUs, if any.
CPUs: The number of CPUs available, which is (number of sockets) * (cores per socket) * (threads per core, when multithreading is enabled).
CPUsLoad: The load average on the node, which approximates how many CPUs are currently in use.
Gres: The number and type of GPUs.
GresUsed: How many GPUs are currently allocated on the node.
AllocMem: How much memory is allocated, in MB.
FreeMem: How much memory is free, in MB.
StateLong: The node usage state, which can be idle, mixed, allocated, reserved, or drained.
Available: Whether the node is up or down.
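As an example of putting this to use, here is one way to narrow the listing to nodes that may have a free V100 GPU before you submit a job. This is just a sketch; adjust the grep pattern to whatever GPU type you need:
! sinfo -t idle,mixed --Format="NodeHost:15,Gres:30,GresUsed:50,StateLong:15" | grep v100
Comparing the Gres and GresUsed columns in the filtered rows then shows which nodes still have unallocated GPUs.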