Slurm Command-Line (Batch Jobs)¶
Slurm runs your job on one of your lab's GPU nodes. The partition, the queue, the nodes, and everyone else in the queue all belong to your own lab; there is no shared campus pool to compete against.
If you don't need a graphical session, or you want to run something
long-running and come back later, submit a batch job via
sbatch. You can do this from:
- Any desktop session (open a terminal).
- OOD → Clusters → Research Slurm Shell Access in the portal (no full desktop needed, quick CLI access — lands you on the Slurm submit host).
- Direct SSH (see direct-ssh.md).
One-minute intro¶
A batch job is a shell script with special comment lines at the top telling Slurm what resources you need:
```bash
#!/bin/bash
#SBATCH --job-name=my-training
#SBATCH --partition=<your-lab>   # e.g. lincheng or inspire
#SBATCH --time=04:00:00          # hh:mm:ss
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --output=/mnt/lab-research/logs/%x-%j.out

module load miniconda3
conda activate pytorch
cd /mnt/lab-research/my-project
python train.py --epochs 50
```
Save it as run.sh, then submit it with sbatch run.sh. Check on it with squeue -u $USER, and cancel it with scancel <jobid>.
Picking a partition¶
Your lab has its own partition, named after the lab, made up of the nodes your lab owns. You can only submit to your own lab's partition — Slurm rejects cross-lab submissions:
```bash
sinfo                       # lists partitions you have access to
sinfo -p <your-lab> -N -l   # the nodes that make up your lab's partition
```
Other labs' partitions exist on the same Slurm controller, but their nodes are theirs. You're not competing for their resources, and they aren't competing for yours.
Checking what's available right now¶
Before you launch a desktop or submit a long batch job, it helps to see whether nodes are idle (you'll start immediately), mixed (some capacity left), or fully allocated (you'll queue).
Are my lab's nodes free?¶
Run sinfo -p <your-lab>; the STATE column tells you:
| State | What it means for you |
|---|---|
| idle | Node is empty; your job will start immediately. |
| mix | Node has capacity for more jobs (oversubscription; see below). |
| alloc | Node is fully booked at the OverSubscribe limit; new jobs queue. |
| drain / drng | Node is being prepared for maintenance and won't take new jobs. |
| down / fail | Node is broken; ignore it. Ask IT if it stays this way. |
Lab partitions oversubscribe (OverSubscribe=FORCE:5), so up to ~5
jobs can land on the same physical node. mix is the normal steady
state — you can still launch.
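If you want to confirm the setting yourself, it is visible in the partition record. A quick sketch, assuming the partition is named after your lab as above:

```bash
# Print the oversubscription setting for your lab's partition
scontrol show partition <your-lab> | grep -o 'OverSubscribe=[^ ]*'
```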
Resource detail per node¶
```bash
sinfo -p <your-lab> -N -o "%N %t %C %m %G %E"
#                          │  │  │  │  │  └─ reason if drained/down
#                          │  │  │  │  └──── GRES (e.g. gpu:4)
#                          │  │  │  └─────── memory (MB)
#                          │  │  └────────── CPUs (Allocated/Idle/Other/Total)
#                          │  └───────────── state
#                          └──────────────── node name
```
CPUs 64/0/0/64 means all 64 cores are allocated; 0/64/0/64 means all idle. Memory is the node's total in MB, and GRES shows what the node is equipped with (e.g. gpu:4).
What's running on a node?¶
```bash
squeue -w <node-fqdn>        # jobs currently on a specific node
squeue -p <your-lab>         # everything in your lab's queue
squeue -p <your-lab> -t R    # only running jobs
squeue -p <your-lab> -t PD   # only pending — see why below
```
The USER column shows whose job it is — and since the partition is
just your lab, everyone you see is a labmate. When the queue is
full this is also how you find out which labmate to politely poke
about that overnight job they forgot to cancel.
Why is my queued job not starting?¶
```bash
squeue -u $USER -t PD -o "%i %j %T %r %S"
#                         │  │  │  │  └─ estimated start time (if known)
#                         │  │  │  └──── reason
#                         │  │  └─────── state
#                         │  └────────── job name
#                         └───────────── job ID
```
Common reasons in the REASON column:
| Reason | What to do |
|---|---|
| Resources | Nodes match your request but are full. Wait, or ask for fewer resources. |
| Priority | Another job is ahead of you in the queue; yours will start once it has been scheduled. |
| ReqNodeNotAvail | All matching nodes are drained or reserved. Ask IT. |
| JobHeldUser / JobHeldAdmin | You or an admin held the job. Release it with scontrol release <jobid>. |
| AssocGrpCPURunMinutesLimit (and similar) | You're at an account limit. Talk to your PI. |
What does my job actually have?¶
```bash
scontrol show job <jobid>                              # full record: nodes, GRES, walltime, env
sstat -j <jobid> --format=JobID,AveCPU,AveRSS,MaxRSS   # live stats while running
```
MaxRSS is the peak memory used so far — useful for right-sizing
--mem on the next submit.
Past usage — yours¶
```bash
sacct -u $USER -S $(date -d '7 days ago' +%F) \
      --format=JobID,JobName,State,Elapsed,ReqTRES%30,AllocTRES%30
```
Elapsed vs. walltime tells you whether you over-requested time. MaxRSS (add MaxRSS%10 to the format list) tells you whether you over-requested memory. Right-sizing future requests helps them schedule faster.
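As a concrete sketch, one way to line up what you asked for against what you actually used over the last week (all of these are standard sacct fields; adjust the date window to taste):

```bash
# Requested vs. used, for the last week of your jobs.
# MaxRSS is reported on each job's .batch step line.
sacct -u $USER -S $(date -d '7 days ago' +%F) \
      --format=JobID,JobName%20,Timelimit,Elapsed,ReqMem,MaxRSS%10,State
```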
Your lab at a glance¶
A useful quick glance: how many of your lab's nodes are up, drained, or fully booked, plus the partition's maximum walltime.
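There isn't a dedicated dashboard for this, but two standard commands cover it. A sketch, assuming your partition is named after your lab:

```bash
# Node counts per state, summarised as Allocated/Idle/Other/Total
sinfo -p <your-lab> -s

# Partition-wide limits, including the maximum walltime
scontrol show partition <your-lab> | grep -Eo 'MaxTime=[^ ]+'
```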
Useful commands¶
```bash
sbatch <script>             # submit
squeue -u $USER             # your jobs
squeue -p <your-lab>        # your lab's queue
scancel <jobid>             # cancel
scancel -u $USER            # cancel all of yours
sacct -u $USER -S today     # history
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS,AllocTRES
scontrol show job <jobid>   # full details
```
Directives you'll use often¶
| Directive | What it does |
|---|---|
| --job-name=<x> | Name shown in squeue |
| --partition=<lab> | Which lab's queue |
| --time=HH:MM:SS | Walltime (required; the allocation is reclaimed at expiry) |
| --cpus-per-task=N | CPU cores |
| --mem=N[G] | RAM |
| --gres=gpu:N | Whole GPUs |
| --gres=shard:N | Fractional GPU (if your lab has MPS enabled) |
| --output=<path> | stdout file. %x = job name, %j = job ID |
| --error=<path> | stderr file; omit to merge with stdout |
| --array=1-100 | Array job: run the script 100 times, each with its own $SLURM_ARRAY_TASK_ID |
| --dependency=afterok:<jobid> | Run only if an earlier job succeeded |
| --mail-type=END,FAIL | Email when the job ends or fails |
| --mail-user=you@umd.edu | Where to send the mail |
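--dependency pairs well with sbatch --parsable, which prints only the job ID so you can feed it into the next submission. A minimal sketch (preprocess.sh and train.sh are hypothetical script names):

```bash
# Submit the first job and capture its ID
prep_id=$(sbatch --parsable preprocess.sh)

# Submit the second job; it stays pending until the first one exits successfully
sbatch --dependency=afterok:$prep_id train.sh
```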
Interactive session without the desktop¶
Sometimes you just want a shell on a GPU node (e.g. to debug, run a REPL):
```bash
srun --partition=<your-lab> --time=01:00:00 \
     --cpus-per-task=4 --mem=16G --gres=gpu:1 \
     --pty bash -l
```
You'll land in a shell on one of your lab's compute nodes, with the GPU reserved for you. Exit the shell to release it.
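Once you're in, a quick sanity check confirms you got what you asked for before you start real work (nvidia-smi assumes the node's NVIDIA tooling is on your PATH, which it normally is on these GPU nodes):

```bash
hostname                                  # which compute node you landed on
echo $SLURM_JOB_ID $SLURM_CPUS_PER_TASK   # the allocation backing this shell
nvidia-smi                                # should list only the GPU(s) Slurm gave you
```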
Arrays — parallel sweep¶
Running the same thing with a parameter sweep:
```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-10
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4 --mem=8G --gres=gpu:1
#SBATCH --output=logs/sweep-%a.out

python train.py --lr 0.00$SLURM_ARRAY_TASK_ID
```
sbatch returns one job ID but Slurm actually queues 10 tasks, each
with its own $SLURM_ARRAY_TASK_ID (1 through 10). The scheduler
runs them in parallel up to the queue's capacity.
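Note that the 0.00$SLURM_ARRAY_TASK_ID trick only works for that particular grid (and task 10 produces 0.0010, numerically the same value as task 1). A more general pattern is to index into an explicit list of values; a sketch, reusing the same train.py:

```bash
# Map the array index (1-10) onto an explicit list of learning rates
lrs=(0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010)
lr=${lrs[$((SLURM_ARRAY_TASK_ID - 1))]}
python train.py --lr $lr
```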
Useful environment variables inside a job¶
| Variable | What |
|---|---|
| $SLURM_JOB_ID | Job ID |
| $SLURM_ARRAY_TASK_ID | Array index (arrays only) |
| $SLURM_CPUS_PER_TASK | Cores you got |
| $SLURM_GPUS_ON_NODE | GPUs allocated |
| $TMPDIR | /scratch/<jobid> on the node |
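A short sketch of how these typically show up inside a job script; the project paths and script flags are placeholders:

```bash
# Size worker pools from the allocation instead of hard-coding a number
python train.py --num-workers $SLURM_CPUS_PER_TASK

# Stage a dataset onto node-local scratch, which is usually faster than
# reading it repeatedly from the network share
cp /mnt/lab-research/my-project/data.tar $TMPDIR/
tar -xf $TMPDIR/data.tar -C $TMPDIR
python train.py --data-dir $TMPDIR/data
```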
What Slurm sees¶
- Running: the job is using the allocation.
- Pending (PD) with reason Resources or Priority: queued, waiting for a labmate's job to finish or yield priority.
- Pending with ReqNodeNotAvail: every node in your lab's set is drained or reserved; email eit-help@umd.edu.
- Completing (CG): finishing cleanup; usually seconds.
- Failed (F) / Cancelled (CA): you can tell which from sacct.
Checking a failed job¶
```bash
sacct -j <jobid> --format=JobID,State,ExitCode,Reason,NodeList
# Then look at the output log you specified with --output
cat /mnt/lab-research/logs/<jobname>-<jobid>.out
```
Common failure patterns:
- Exit code 137: killed for running out of memory (137 = 128 + SIGKILL). Increase --mem.
- Exit code 124: hit the walltime. Increase --time.
- Non-zero exit code with "CUDA out of memory" in the log: your model is too big for one GPU, or a labmate is sharing the same GPU (MPS); reduce the batch size or request a whole GPU.
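When the fix is simply to ask for more, you don't need to edit the script: options passed on the sbatch command line override the matching #SBATCH lines inside it. For example:

```bash
# Resubmit the same script with more memory and a longer walltime
sbatch --mem=64G --time=08:00:00 run.sh
```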
Running a quick job from the OOD shell¶
In OOD → Clusters → Research Slurm Shell Access, a one-liner:
```bash
sbatch --wrap="module load miniconda3 && conda activate pytorch && python -c 'import torch; print(torch.cuda.is_available())'" \
       --partition=<your-lab> --time=00:05:00 --gres=gpu:1 --mem=4G
```
Then squeue -u $USER to see it run.
Output locations¶
Point --output at lab storage, not ~. If you leave it out, Slurm writes slurm-<jobid>.out in the job's submission directory (wherever you ran sbatch), which may or may not be on persistent storage.
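One related gotcha: Slurm won't create a missing directory for the --output path, so if the directory doesn't exist the job can fail without leaving a log at all. Create it once before submitting, matching the path in the example script:

```bash
mkdir -p /mnt/lab-research/logs   # log directory referenced by --output
sbatch run.sh
```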