Slurm Command-Line (Batch Jobs)

Slurm runs your job on one of your lab's GPU nodes. The partition, the queue, and every job in it belong to your lab alone — there is no shared campus pool to compete against.

If you don't need a graphical session, or you want to start something long-running and come back later, submit a batch job with sbatch. You can do this from:

  • Any desktop session (open a terminal).
  • OOD → Clusters → Research Slurm Shell Access in the portal (no full desktop needed, quick CLI access — lands you on the Slurm submit host).
  • Direct SSH (see direct-ssh.md).

One-minute intro

A batch job is a shell script with special comment lines at the top telling Slurm what resources you need:

#!/bin/bash
#SBATCH --job-name=my-training
#SBATCH --partition=<your-lab>    # e.g. lincheng or inspire
#SBATCH --time=04:00:00            # hh:mm:ss
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --output=/mnt/lab-research/logs/%x-%j.out

module load miniconda3
conda activate pytorch

cd /mnt/lab-research/my-project
python train.py --epochs 50

Save as run.sh, then:

sbatch run.sh
# → Submitted batch job 42

Check on it:

squeue -u $USER
# Or "Active Jobs" in the OOD portal

Cancel it:

scancel 42

Picking a partition

Your lab has its own partition, named after the lab and made up of the nodes your lab owns. You can only submit to your own lab's partition — Slurm rejects cross-lab submissions:

sinfo                        # lists partitions you have access to
sinfo -p <your-lab> -N -l    # the nodes that make up your lab's partition

Other labs' partitions exist on the same Slurm controller, but their nodes are theirs. You're not competing for their resources, and they aren't competing for yours.

Checking what's available right now

Before you launch a desktop or submit a long batch job, it helps to see whether nodes are idle (you'll start immediately), mixed (some capacity left), or fully allocated (you'll queue).

Are my lab's nodes free?

sinfo -p <your-lab>

The STATE column tells you:

State        What it means for you
idle         Node is empty — your job will start immediately.
mix          Node has capacity for more jobs (oversubscription — see below).
alloc        Node is fully booked at the OverSubscribe limit; new jobs queue.
drain/drng   Node is being prepared for maintenance — won't take new jobs.
down/fail    Node is broken; ignore it. Ask IT if it stays this way.

Lab partitions oversubscribe (OverSubscribe=FORCE:5), so up to ~5 jobs can land on the same physical node. mix is the normal steady state — you can still launch.
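
To turn that STATE column into a one-line summary, sinfo can emit just the state and node count per line, and a one-line awk sums them up. A small sketch (the function name is illustrative; -h suppresses the header, -o sets the output format):

```shell
# Sum node counts per state from "sinfo -h -o '%t %D'" output,
# which prints one "state count" pair per line.
summarize_states() {
  awk '{ total[$1] += $2 } END { for (s in total) print s, total[s] }'
}

# Live usage (on the submit host):
#   sinfo -h -p <your-lab> -o "%t %D" | summarize_states
```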

Resource detail per node

sinfo -p <your-lab> -N -o "%N %t %C %m %G %E"
#                          │  │  │  │  │  └─ reason if drained/down
#                          │  │  │  │  └──── GRES (e.g. gpu:4)
#                          │  │  │  └─────── memory (MB)
#                          │  │  └────────── CPUs (Allocated/Idle/Other/Total)
#                          │  └───────────── state
#                          └──────────────── node name

CPUs 64/0/0/64 means all 64 cores are allocated; 0/64/0/64 means all idle. Memory (%m) is a single total in MB, and GRES (%G) shows what the node is configured with (e.g. gpu:4), not what is currently allocated.
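
If you want to script against the CPU field, cutting out one slot is enough. A tiny helper (name is illustrative) pulls the Idle count out of an Allocated/Idle/Other/Total string:

```shell
# Pull the Idle slot out of Slurm's A/I/O/T CPU field.
idle_cpus() {
  echo "$1" | cut -d/ -f2
}

# idle_cpus 64/0/0/64   → 0    (fully allocated)
# idle_cpus 0/64/0/64   → 64   (fully idle)
```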

What's running on a node?

squeue -w <node-fqdn>          # jobs currently on a specific node
squeue -p <your-lab>           # everything in your lab's queue
squeue -p <your-lab> -t R      # only running jobs
squeue -p <your-lab> -t PD     # only pending — see why below

The USER column shows whose job it is — and since the partition is just your lab, everyone you see is a labmate. When the queue is full this is also how you find out which labmate to politely poke about that overnight job they forgot to cancel.

Why is my queued job not starting?

squeue -u $USER -t PD -o "%i %j %T %r %S"
#                         │  │  │  │  └─ estimated start time (if known)
#                         │  │  │  └──── reason
#                         │  │  └─────── state
#                         │  └────────── job name
#                         └───────────── job ID

Common reasons in the REASON column:

Reason                              What to do
Resources                           Nodes match your request but are full. Wait, or request fewer resources.
Priority                            Another job is ahead of yours; yours starts once it finishes.
ReqNodeNotAvail                     All matching nodes are drained / reserved. Ask IT.
JobHeldUser / JobHeldAdmin          You or an admin held the job. scontrol release <jobid> (admin holds need an admin to release).
AssocGrpCPURunMinutesLimit (etc.)   You're at an account limit. Talk to your PI.

What does my job actually have?

scontrol show job <jobid>      # full record: nodes, GRES, walltime, env
sstat -j <jobid> --format=JobID,AveCPU,AveRSS,MaxRSS  # live stats while running

MaxRSS is the peak memory used so far — useful for right-sizing --mem on the next submit.

Past usage — yours

sacct -u $USER -S $(date -d '7 days ago' +%F) \
      --format=JobID,JobName,State,Elapsed,ReqTRES%30,AllocTRES%30

Elapsed vs walltime tells you whether you over-requested time. MaxRSS (add MaxRSS%10) tells you whether you over-requested memory. Right-sizing future requests helps them schedule faster.
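
sacct reports MaxRSS with a unit suffix (e.g. 123456K). A small helper (illustrative, assuming K/M/G suffixes) normalizes it to megabytes so you can compare directly against your --mem request:

```shell
# Convert a MaxRSS value like 123456K, 512M, or 2G to whole megabytes.
maxrss_mb() {
  case "$1" in
    *K) awk -v v="${1%K}" 'BEGIN { printf "%d\n", v / 1024 }' ;;
    *M) echo "${1%M}" ;;
    *G) awk -v v="${1%G}" 'BEGIN { printf "%d\n", v * 1024 }' ;;
    *)  echo "$1" ;;   # already a bare number
  esac
}

# maxrss_mb 2048K   → 2
# maxrss_mb 2G      → 2048
```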

Your lab at a glance

sinfo -p <your-lab> -o "%P %a %l %D %t"   # partition / avail / max time / nodes / state

A quick glance at how many of your lab's nodes are up, drained, or fully booked, and the partition's maximum walltime.

Useful commands

sbatch <script>              # submit
squeue -u $USER              # your jobs
squeue -p <your-lab>         # your lab's queue
scancel <jobid>              # cancel
scancel -u $USER             # cancel all of yours
sacct -u $USER -S today      # history
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS,AllocTRES
scontrol show job <jobid>    # full details

Directives you'll use often

Directive                      What it does
--job-name=<x>                 Name shown in squeue
--partition=<lab>              Which lab's queue
--time=HH:MM:SS                Walltime (required; the job is killed when it expires)
--cpus-per-task=N              CPU cores
--mem=N[G]                     RAM
--gres=gpu:N                   Whole GPUs
--gres=shard:N                 Fractional GPU (if your lab has GPU sharding enabled)
--output=<path>                stdout file. %x = job name, %j = job ID
--error=<path>                 stderr file; omit to merge with stdout
--array=1-100                  Array job — run the script 100 times with $SLURM_ARRAY_TASK_ID
--dependency=afterok:<jobid>   Run only if an earlier job succeeded
--mail-type=END,FAIL           Email when the job ends/fails
--mail-user=you@umd.edu        Where to send the mail
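
As a sketch of --dependency in practice (script names are illustrative): --parsable makes sbatch print only the job ID, which you can feed straight into the next submission:

```shell
# Submit a preprocessing job, then a training job that only runs
# if preprocessing exits successfully (afterok).
submit_pipeline() {
  prep_id=$(sbatch --parsable prep.sh) || return 1
  sbatch --parsable --dependency="afterok:${prep_id}" train.sh
}

# submit_pipeline    # prints the training job's ID
```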

Interactive session without the desktop

Sometimes you just want a shell on a GPU node (e.g. to debug, run a REPL):

srun --partition=<your-lab> --time=01:00:00 \
     --cpus-per-task=4 --mem=16G --gres=gpu:1 \
     --pty bash -l

You'll land in a shell on one of your lab's compute nodes, with the GPU reserved for you. Exit the shell to release it.
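
Once inside, you can confirm what Slurm actually gave you: it exports CUDA_VISIBLE_DEVICES with the GPU indices you were allocated. A small sketch (the helper name is illustrative):

```shell
# Count the GPUs visible to this job (0 if none were allocated).
gpu_count() {
  if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo 0
  else
    echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l
  fi
}

# Inside the srun shell:
#   gpu_count        # should print 1 for the request above
#   nvidia-smi       # shows only your reserved GPU
```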

Arrays — parallel sweep

Running the same thing with a parameter sweep:

#SBATCH --job-name=sweep
#SBATCH --array=1-10
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4 --mem=8G --gres=gpu:1
#SBATCH --output=logs/sweep-%a.out

python train.py --lr 0.00$SLURM_ARRAY_TASK_ID

sbatch returns one job ID, but Slurm actually queues 10 tasks, each with its own $SLURM_ARRAY_TASK_ID (1 through 10). The scheduler runs as many of them in parallel as your lab's nodes can hold.
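
A common extension of this pattern (file name and values are illustrative): keep one parameter per line in a text file and have each task pick its own line, instead of encoding the value in the index:

```shell
# Pick line N from a parameter file (one value per line).
get_param() {
  sed -n "${1}p" "$2"
}

# In the job script:
#   lr=$(get_param "$SLURM_ARRAY_TASK_ID" lrs.txt)
#   python train.py --lr "$lr"
```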

Useful environment variables inside a job

Variable               What it holds
$SLURM_JOB_ID          Job ID
$SLURM_ARRAY_TASK_ID   Array index (array jobs only)
$SLURM_CPUS_PER_TASK   CPU cores you were allocated
$SLURM_GPUS_ON_NODE    GPUs allocated on the node
$TMPDIR                /scratch/<jobid> on the node
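
One reason $TMPDIR matters: node-local scratch is much faster than network storage for I/O-heavy steps, but it disappears when the job ends, so copy results back before the job exits. A sketch (RESULTS_DIR and the helper name are illustrative):

```shell
# Run a command in node-local scratch, then copy everything it wrote
# back to persistent storage before $TMPDIR is cleaned up.
run_in_scratch() {
  work="${TMPDIR:-/tmp}/work.$$"
  mkdir -p "$work"
  ( cd "$work" && "$@" )          # subshell: caller's directory is untouched
  status=$?
  cp -r "$work"/. "$RESULTS_DIR"/
  return $status
}

# RESULTS_DIR=/mnt/lab-research/my-project/results run_in_scratch python train.py
```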

What Slurm sees

  • Running: the job is using the allocation.
  • Pending (PD) with reason Resources or Priority: queued, waiting for a labmate's job to finish or yield priority.
  • Pending with ReqNodeNotAvail: every node in your lab's set is drained or reserved; email eit-help@umd.edu.
  • Completing (CG): finishing cleanup; usually seconds.
  • Failed (F) / Cancelled (CA): check sacct for the exit code and reason.

Checking a failed job

sacct -j <jobid> --format=JobID,State,ExitCode,Reason,NodeList
# Then look at the output log you specified with --output
cat /mnt/lab-research/logs/<jobname>-<jobid>.out

Common failure patterns:

  • Exit code 137 — killed by SIGKILL (128 + 9), usually the OOM killer. Increase --mem.
  • State TIMEOUT — hit the walltime limit. Increase --time.
  • "CUDA out of memory" in the log — GPU memory is exhausted, and --mem (host RAM) won't help: reduce the batch size, use a smaller model, or request a whole GPU instead of a shard.
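
sacct's ExitCode column is actually two numbers, exitcode:signal. For a quick read of a raw code, the 128+N convention applies. A small sketch (helper name is illustrative):

```shell
# Interpret an ExitCode value like "137:0" or "0:0".
# Codes above 128 mean the process died from a signal (code - 128).
decode_exit() {
  code=${1%%:*}
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with code $code"
  fi
}

# decode_exit 137:0   → killed by signal 9   (SIGKILL, typically the OOM killer)
```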

Running a quick job from the OOD shell

In OOD → Clusters → Research Slurm Shell Access, you can submit a quick one-liner without writing a script:

sbatch --wrap="module load miniconda3 && conda activate pytorch && python -c 'import torch; print(torch.cuda.is_available())'" \
       --partition=<your-lab> --time=00:05:00 --gres=gpu:1 --mem=4G

Then squeue -u $USER to see it run.

Output locations

Point --output at lab storage, not ~. If you forget, you'll get a log file in the job's submission directory (wherever you ran sbatch), which may or may not be on persistent storage.