Slurm Command-Line (Batch Jobs)¶
Slurm runs your job on one of your lab's GPU nodes. The partition, the queue, the nodes, and everyone else in the queue all belong to your own lab; there is no shared campus pool to compete against.
If you don't need a graphical session, or you want to run something
long-running and come back later, submit a batch job via
sbatch. You can do this from:
- Any desktop session (open a terminal).
- OOD → Clusters → Research Slurm Shell Access in the portal (no full desktop needed, quick CLI access — lands you on the Slurm submit host).
- Direct SSH (see direct-ssh.md).
One-minute intro¶
A batch job is a shell script with special comment lines at the top telling Slurm what resources you need:
```bash
#!/bin/bash
#SBATCH --job-name=my-training
#SBATCH --partition=<your-lab>   # e.g. lincheng or inspire
#SBATCH --time=04:00:00          # hh:mm:ss
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --output=/mnt/lab-research/logs/%x-%j.out

module load miniconda3
conda activate pytorch
cd /mnt/lab-research/my-project
python train.py --epochs 50
```
Save it as run.sh, then submit it with sbatch run.sh. Check on it with squeue -u $USER, and cancel it with scancel <jobid>.
Picking a partition¶
Your lab has its own partition, named after the lab, made up of the nodes your lab owns. You can only submit to your own lab's partition — Slurm rejects cross-lab submissions:
```bash
sinfo                       # lists partitions you have access to
sinfo -p <your-lab> -N -l   # the nodes that make up your lab's partition
```
Other labs' partitions exist on the same Slurm controller, but their nodes are theirs. You're not competing for their resources, and they aren't competing for yours.
Checking what's available right now¶
Before you launch a desktop or submit a long batch job, it helps to see whether nodes are idle (you'll start immediately), mixed (some capacity left), or fully allocated (you'll queue).
Are my lab's nodes free?¶
Run sinfo -p <your-lab>; the STATE column tells you:
| State | What it means for you |
|---|---|
| idle | Node is empty; your job will start immediately. |
| mix | Node has capacity for more jobs (oversubscription; see below). |
| alloc | Node is fully booked at the OverSubscribe limit; new jobs queue. |
| drain / drng | Node is being prepared for maintenance and won't take new jobs. |
| down / fail | Node is broken; ignore it. Ask IT if it stays this way. |
Lab partitions oversubscribe (OverSubscribe=FORCE:5), so up to ~5
jobs can land on the same physical node. mix is the normal steady
state — you can still launch.
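If you want to confirm the setting yourself, it is visible in the partition record. A quick sketch, assuming the partition is named after your lab as above:

```bash
# Print the oversubscription setting for your lab's partition
scontrol show partition <your-lab> | grep -o 'OverSubscribe=[^ ]*'
```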
Resource detail per node¶
```bash
sinfo -p <your-lab> -N -o "%N %t %C %m %G %E"
#                          │  │  │  │  │  └─ reason if drained/down
#                          │  │  │  │  └──── GRES (e.g. gpu:4)
#                          │  │  │  └─────── memory (MB)
#                          │  │  └────────── CPUs (Allocated/Idle/Other/Total)
#                          │  └───────────── state
#                          └──────────────── node name
```
CPUs 64/0/0/64 means all 64 cores are allocated; 0/64/0/64 means all idle. Memory is the node's total in MB, and GRES shows what the node is equipped with (e.g. gpu:4).
What's running on a node?¶
```bash
squeue -w <node-fqdn>        # jobs currently on a specific node
squeue -p <your-lab>         # everything in your lab's queue
squeue -p <your-lab> -t R    # only running jobs
squeue -p <your-lab> -t PD   # only pending — see why below
```
The USER column shows whose job it is — and since the partition is
just your lab, everyone you see is a labmate. When the queue is
full this is also how you find out which labmate to politely poke
about that overnight job they forgot to cancel.
Why is my queued job not starting?¶
```bash
squeue -u $USER -t PD -o "%i %j %T %r %S"
#                         │  │  │  │  └─ estimated start time (if known)
#                         │  │  │  └──── reason
#                         │  │  └─────── state
#                         │  └────────── job name
#                         └───────────── job ID
```
Common reasons in the REASON column:
| Reason | What to do |
|---|---|
| Resources | Nodes match your request but are full. Wait, or ask for fewer resources. |
| Priority | Another job is ahead of you in the queue; yours will start once it has been scheduled. |
| ReqNodeNotAvail | All matching nodes are drained or reserved. Ask IT. |
| JobHeldUser / JobHeldAdmin | You or an admin held the job. Release it with scontrol release <jobid>. |
| AssocGrpCPURunMinutesLimit (and similar) | You're at an account limit. Talk to your PI. |
What does my job actually have?¶
```bash
scontrol show job <jobid>                              # full record: nodes, GRES, walltime, env
sstat -j <jobid> --format=JobID,AveCPU,AveRSS,MaxRSS   # live stats while running
```
MaxRSS is the peak memory used so far — useful for right-sizing
--mem on the next submit.
Past usage — yours¶
```bash
sacct -u $USER -S $(date -d '7 days ago' +%F) \
      --format=JobID,JobName,State,Elapsed,ReqTRES%30,AllocTRES%30
```
Elapsed vs. walltime tells you whether you over-requested time. MaxRSS (add MaxRSS%10 to the format list) tells you whether you over-requested memory. Right-sizing future requests helps them schedule faster.
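As a concrete sketch, one way to line up what you asked for against what you actually used over the last week (all of these are standard sacct fields; adjust the date window to taste):

```bash
# Requested vs. used, for the last week of your jobs.
# MaxRSS is reported on each job's .batch step line.
sacct -u $USER -S $(date -d '7 days ago' +%F) \
      --format=JobID,JobName%20,Timelimit,Elapsed,ReqMem,MaxRSS%10,State
```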
Your lab at a glance¶
A useful quick glance: how many of your lab's nodes are up, drained, or fully booked, plus the partition's maximum walltime.
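There isn't a dedicated dashboard for this, but two standard commands cover it. A sketch, assuming your partition is named after your lab:

```bash
# Node counts per state, summarised as Allocated/Idle/Other/Total
sinfo -p <your-lab> -s

# Partition-wide limits, including the maximum walltime
scontrol show partition <your-lab> | grep -Eo 'MaxTime=[^ ]+'
```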
Useful commands¶
```bash
sbatch <script>             # submit
squeue -u $USER             # your jobs
squeue -p <your-lab>        # your lab's queue
scancel <jobid>             # cancel
scancel -u $USER            # cancel all of yours
sacct -u $USER -S today     # history
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS,AllocTRES
scontrol show job <jobid>   # full details
```
Directives you'll use often¶
| Directive | What it does |
|---|---|
| --job-name=<x> | Name shown in squeue |
| --partition=<lab> | Which lab's queue |
| --time=HH:MM:SS | Walltime (required; the allocation is reclaimed at expiry) |
| --cpus-per-task=N | CPU cores |
| --mem=N[G] | RAM |
| --gres=gpu:N | Whole GPUs |
| --gres=shard:N | Fractional GPU (if your lab has MPS enabled) |
| --output=<path> | stdout file. %x = job name, %j = job ID |
| --error=<path> | stderr file; omit to merge with stdout |
| --array=1-100 | Array job: run the script 100 times, each with its own $SLURM_ARRAY_TASK_ID |
| --dependency=afterok:<jobid> | Run only if an earlier job succeeded |
| --mail-type=END,FAIL | Email when the job ends or fails |
| --mail-user=you@umd.edu | Where to send the mail |
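--dependency pairs well with sbatch --parsable, which prints only the job ID so you can feed it into the next submission. A minimal sketch (preprocess.sh and train.sh are hypothetical script names):

```bash
# Submit the first job and capture its ID
prep_id=$(sbatch --parsable preprocess.sh)

# Submit the second job; it stays pending until the first one exits successfully
sbatch --dependency=afterok:$prep_id train.sh
```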
Interactive session without the desktop¶
Sometimes you just want a shell on a GPU node (e.g. to debug, run a REPL):
```bash
srun --partition=<your-lab> --time=01:00:00 \
     --cpus-per-task=4 --mem=16G --gres=gpu:1 \
     --pty bash -l
```
You'll land in a shell on one of your lab's compute nodes, with the GPU reserved for you. Exit the shell to release it.
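Once you're in, a quick sanity check confirms you got what you asked for before you start real work (nvidia-smi assumes the node's NVIDIA tooling is on your PATH, which it normally is on these GPU nodes):

```bash
hostname                                  # which compute node you landed on
echo $SLURM_JOB_ID $SLURM_CPUS_PER_TASK   # the allocation backing this shell
nvidia-smi                                # should list only the GPU(s) Slurm gave you
```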
Arrays — parallel sweep¶
Running the same thing with a parameter sweep:
```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-10
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4 --mem=8G --gres=gpu:1
#SBATCH --output=logs/sweep-%a.out

python train.py --lr 0.00$SLURM_ARRAY_TASK_ID
```
sbatch returns one job ID but Slurm actually queues 10 tasks, each
with its own $SLURM_ARRAY_TASK_ID (1 through 10). The scheduler
runs them in parallel up to the queue's capacity.
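Note that the 0.00$SLURM_ARRAY_TASK_ID trick only works for that particular grid (and task 10 produces 0.0010, numerically the same value as task 1). A more general pattern is to index into an explicit list of values; a sketch, reusing the same train.py:

```bash
# Map the array index (1-10) onto an explicit list of learning rates
lrs=(0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010)
lr=${lrs[$((SLURM_ARRAY_TASK_ID - 1))]}
python train.py --lr $lr
```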
Useful environment variables inside a job¶
| Variable | What |
|---|---|
| $SLURM_JOB_ID | Job ID |
| $SLURM_ARRAY_TASK_ID | Array index (arrays only) |
| $SLURM_CPUS_PER_TASK | Cores you got |
| $SLURM_GPUS_ON_NODE | GPUs allocated |
| $TMPDIR | /scratch/<jobid> on the node |
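A short sketch of how these typically show up inside a job script; the project paths and script flags are placeholders:

```bash
# Size worker pools from the allocation instead of hard-coding a number
python train.py --num-workers $SLURM_CPUS_PER_TASK

# Stage a dataset onto node-local scratch, which is usually faster than
# reading it repeatedly from the network share
cp /mnt/lab-research/my-project/data.tar $TMPDIR/
tar -xf $TMPDIR/data.tar -C $TMPDIR
python train.py --data-dir $TMPDIR/data
```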
What Slurm sees¶
- Running: the job is using the allocation.
- Pending (PD) with reason Resources or Priority: queued, waiting for a labmate's job to finish or yield priority.
- Pending with ReqNodeNotAvail: every node in your lab's set is drained or reserved; email eit-help@umd.edu.
- Completing (CG): finishing cleanup; usually seconds.
- Failed (F) / Cancelled (CA): you can tell which from sacct.
Checking a failed job¶
```bash
sacct -j <jobid> --format=JobID,State,ExitCode,Reason,NodeList
# Then look at the output log you specified with --output
cat /mnt/lab-research/logs/<jobname>-<jobid>.out
```
Common failure patterns:
- Exit code 137: killed for running out of memory (137 = 128 + SIGKILL). Increase --mem.
- Exit code 124: hit the walltime. Increase --time.
- Non-zero exit code with "CUDA out of memory" in the log: your model is too big for one GPU, or a labmate is sharing the same GPU (MPS); reduce the batch size or request a whole GPU.
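When the fix is simply to ask for more, you don't need to edit the script: options passed on the sbatch command line override the matching #SBATCH lines inside it. For example:

```bash
# Resubmit the same script with more memory and a longer walltime
sbatch --mem=64G --time=08:00:00 run.sh
```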
Running a quick job from the OOD shell¶
In OOD → Clusters → Research Slurm Shell Access, a one-liner:
```bash
sbatch --wrap="module load miniconda3 && conda activate pytorch && python -c 'import torch; print(torch.cuda.is_available())'" \
       --partition=<your-lab> --time=00:05:00 --gres=gpu:1 --mem=4G
```
Then squeue -u $USER to see it run.
Output locations¶
Point --output at lab storage, not ~. If you leave it out, Slurm writes slurm-<jobid>.out in the job's submission directory (wherever you ran sbatch), which may or may not be on persistent storage.
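One related gotcha: Slurm won't create a missing directory for the --output path, so if the directory doesn't exist the job can fail without leaving a log at all. Create it once before submitting, matching the path in the example script:

```bash
mkdir -p /mnt/lab-research/logs   # log directory referenced by --output
sbatch run.sh
```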