How to work with SLURM

General commands

man sbatch
man squeue
man scancel

Submitting jobs

A job script, saved for example as myscript.scr, looks like this:

#!/bin/bash
#SBATCH -p general # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH -n 1 # number of cores
#SBATCH --mem 100 # memory pool for all cores (MB)
#SBATCH -t 0-2:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
#SBATCH --mail-type=END,FAIL # notifications for job done & fail
#SBATCH --mail-user=<email> # send-to address

for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done

sort SomeRandomNumbers.txt

Now you can submit your job with the command:

sbatch myscript.scr
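
On success, sbatch prints a line of the form "Submitted batch job <jobid>". Most of the commands in the following sections need that job ID, so it can be handy to capture it. A minimal sketch; the helper name job_id_from_sbatch is hypothetical and not part of SLURM:

```shell
# Hypothetical helper (not part of SLURM): read sbatch's output on stdin
# and print only the numeric job ID from "Submitted batch job <jobid>".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}
```

It could be used as jobid=$(sbatch myscript.scr | job_id_from_sbatch). sbatch also offers a --parsable option that prints the job ID without the surrounding text.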

Information on jobs

List all current jobs for a user:

squeue -u <username>

List all running jobs for a user:

squeue -u <username> -t RUNNING

List all pending jobs for a user:

squeue -u <username> -t PENDING
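
For a quick tally of jobs per state instead of separate listings, the output of squeue can be piped through standard tools. A sketch, assuming squeue is called with -h (no header) and -o %T (print only the job state); count_states is a hypothetical helper name:

```shell
# Hypothetical helper: read one job state per line on stdin and print
# "<STATE> <count>" for each distinct state, sorted by state name.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be used as squeue -h -u <username> -o %T | count_states.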

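List priority order of jobs for the current user (you) in a given partition (showq-slurm is a site-specific tool rather than a standard SLURM command, so it may not be installed on your cluster):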

showq-slurm -o -U -q <partition>

List all current jobs in the general partition for a user:

squeue -u <username> -p general

List detailed information for a job (useful for troubleshooting):

scontrol show jobid -dd <jobid>

List status info for a currently running job:

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.

To get statistics on completed jobs by jobID:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:

sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
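
sacct typically reports MaxRSS with a unit suffix (for example 204800K), which is awkward to compare against a memory request given in megabytes. A small sketch for converting such values; to_mb is a hypothetical helper, and it assumes the K, M and G suffixes denote powers of 1024:

```shell
# Hypothetical helper: convert a memory value such as "204800K", "300M"
# or "2G" to whole megabytes (truncated). Assumes K/M/G are powers of 1024.
to_mb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)   # last character is the unit suffix
        num = v + 0                      # awk keeps the leading numeric part
        if (unit == "K") num /= 1024
        else if (unit == "G") num *= 1024
        else if (unit != "M") { print "unknown unit: " v > "/dev/stderr"; exit 1 }
        printf "%d\n", num
    }'
}
```

For example, to_mb 204800K prints 200, meaning the job peaked at about 200 MB.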

Controlling jobs

To cancel one job:

scancel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel all the pending jobs for a user:

scancel -t PENDING -u <username>

To cancel one or more jobs by name:

scancel --name myJobName

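To hold a pending job so that it does not start:

scontrol hold <jobid>

To release a held job:

scontrol release <jobid>

To pause a running job and later resume it (suspending usually requires administrator privileges):

scontrol suspend <jobid>
scontrol resume <jobid>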

To requeue (cancel and rerun) a particular job:

scontrol requeue <jobid>

Posted on May 18, 2017
