Python, Conda, and GPU ML Frameworks¶
For research Python and GPU ML workflows, start with the curated
jupyter-gpu module. It's a maintained stack — PyTorch +
TensorFlow + CuPy + scientific Python — built against the exact CUDA
runtime/driver pair the nodes ship with. Building your own conda env
from scratch is supported but you'll spend more time getting CUDA
versions to line up than doing actual research.
The fast path: module load jupyter-gpu¶
module load jupyter-gpu # default version
module avail jupyter-gpu # list other versions (pin one for reproducibility)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Use this from any terminal — desktop session, OOD shell, SSH. Inside a notebook, OOD's Jupyter Interactive App preselects this stack for you.
If you need a package the module doesn't include, you can
pip install --user <pkg> on top of it (the module leaves
PYTHONNOUSERSITE unset so user-site installs are honoured); just be
aware your additions live in ~/.local/lib/... and disappear with
the rest of your home at session end. For anything you want to keep,
build a custom conda env on lab storage (below).
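A quick sketch (einops here is just a stand-in for whatever package you're missing):
module load jupyter-gpu
pip install --user einops    # lands in ~/.local/lib/pythonX.Y/site-packages
python -c "import einops; print(einops.__file__)"    # confirm it resolves from ~/.local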
When to roll your own conda env¶
Stick to a custom env only when:
- You need a specific framework version the curated module doesn't ship.
- You're pinning a paper-replication environment.
- You're using a niche tool that wants its own dependency stack.
Otherwise, every hour spent fighting pytorch-cuda versions is an
hour not spent on your research. Talk to IT (eit-help@umd.edu) if
you think the curated stack is missing something common — adding it
once helps everyone.
Building your own env on lab storage¶
Conda envs are large (hundreds of MB to a few GB). Put them on your
lab share, not in ~ (local to whichever node you land on — not
guaranteed to follow you) or /scratch (wiped per job).
Set up conda once¶
module load miniconda3
# Tell conda where to put envs and package cache (one-time, lives in
# ~/.condarc — but ~/.condarc lives on the node-local home, so if
# Slurm gives you a different node next time, re-run this or use the
# symlinked-condarc pattern below)
conda config --add envs_dirs ~/lab-research/conda/envs
conda config --add pkgs_dirs ~/lab-research/conda/pkgs
# Verify
conda config --show envs_dirs pkgs_dirs
(Substitute ~/lab-research with your lab's symlink — see
your-lab-storage.md.)
A more robust pattern is to put the config on lab storage and symlink it in:
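(a sketch, again assuming ~/lab-research is your lab's symlink; the symlink itself still has to be recreated in a fresh home, e.g. from your shell startup)
mkdir -p ~/lab-research/conda
mv ~/.condarc ~/lab-research/conda/condarc    # move the existing config to lab storage (skip if none yet)
ln -sf ~/lab-research/conda/condarc ~/.condarc    # re-run this line whenever you land on a fresh home
conda config --show envs_dirs pkgs_dirs           # verify the settings are picked up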
PyTorch + CUDA¶
Matching CUDA versions across PyTorch, TensorFlow, and the node
driver is the part that bites people. The curated jupyter-gpu
module sidesteps this; if you must build your own, use the cu128
wheel index to align with the same CUDA stack the nodes run:
module load miniconda3
conda create -n pytorch python=3.11 -y
conda activate pytorch
pip install --index-url https://download.pytorch.org/whl/cu128 \
torch torchvision torchaudio
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
Earlier CUDA wheel indexes (cu124, cu126) install successfully
but fail to use the GPU on the current node driver, which is easy to mistake for
a code bug. If torch.cuda.is_available() returns False, check that
you used cu128.
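To confirm which CUDA build you actually have (exact version strings will differ):
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# a version ending in +cu128 (CUDA 12.8) means the right wheel index was used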
JAX + CUDA¶
module load miniconda3
conda create -n jax python=3.11 -y
conda activate jax
pip install "jax[cuda12]"
python -c "import jax; print(jax.devices())"
Reusing the env in later sessions¶
If conda activate can't find the env, your ~/.condarc is back to
defaults — re-run the envs_dirs setup or use the symlinked-condarc
pattern above.
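A quick way to check which situation you're in:
conda config --show envs_dirs    # should list your lab-storage path
conda env list                   # your env should appear under the lab-storage envs directory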
Jupyter notebooks¶
The simplest path is OOD → Interactive Apps → Jupyter, which
launches jupyter-gpu on a GPU node and hands you the browser URL.
Inside the notebook the curated stack is already loaded; you don't
need to module load anything.
If you need a custom env as a Jupyter kernel:
module load miniconda3
conda activate <your-env>
pip install ipykernel
python -m ipykernel install --user --name <your-env> --display-name "<your-env>"
It'll appear in the New Notebook menu next session.
For a quick notebook from inside a desktop session:
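(a sketch, assuming the module exposes the standard Jupyter launcher)
module load jupyter-gpu
jupyter lab --no-browser    # prints a localhost URL with a token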
Open the printed URL in the in-desktop Firefox.
GPU memory & multiple jobs¶
Each Slurm job gets its own GPU allocation by default, but lab
partitions oversubscribe (OverSubscribe=FORCE:5), so several jobs can
land on the same physical GPU concurrently; CUDA MPS handles the
isolation. If your lab's GPUs have MPS enabled you can also request
a fraction of a GPU with --gres=shard:1 in a batch job.
Inside a job, only the allocated GPUs are visible via
CUDA_VISIBLE_DEVICES. Don't try to override it — Slurm handles the
isolation.
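A minimal batch-script sketch (the partition name, time limit, and train.py are placeholders; the shard request only applies where your lab's GPUs are set up for sharing):
#!/bin/bash
#SBATCH --partition=<your-lab-partition>
#SBATCH --gres=gpu:1                 # whole GPU; use --gres=shard:1 for a fraction where supported
#SBATCH --time=01:00:00
module load jupyter-gpu
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"   # set by Slurm; don't override it
python train.py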
Common gotchas¶
| Problem | Cause / fix |
|---|---|
| torch.cuda.is_available() returns False after a custom build | Wrong CUDA wheel index. Use cu128 (see PyTorch section above), or just module load jupyter-gpu. |
| conda create eats 10 GB and fills /home | envs_dirs not set; envs are landing in ~. Re-run the one-time config and conda env remove -n <name> to clean up. |
| CUDA out of memory | GPU is shared (MPS). nvidia-smi shows who's on it; reduce batch size, request a whole GPU, or wait. |
| ModuleNotFoundError after a fresh session | You forgot conda activate. Put it in your shell startup if you always want the same env. |
| pip install fails with SSL errors | Transient PyPI hiccup; retry. Persistent: email eit-help@umd.edu. |
Mamba¶
mamba is a faster drop-in for conda install, included in the
miniconda3 module:
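module load miniconda3
mamba create -n pytorch python=3.11 -y    # same syntax as conda create, just a faster solver
conda activate pytorch                    # activation still goes through conda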
Sharing an env with a labmate¶
Export:
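conda env export --from-history > environment.yml    # environment.yml is just a conventional filename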
They recreate:
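conda env create -n <your-env> -f environment.yml    # then conda activate <your-env>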
--from-history produces a cleaner file than conda env export
alone; the latter captures every transitive package and is often not
cross-platform.
Don't do this¶
- Don't put a conda env in ~ or /scratch; it's gone at session end.
- Don't run sudo pip install; you don't have root, and you shouldn't anyway.
- Don't install conda itself; the shared miniconda3 module is the supported path.
- Don't re-implement the jupyter-gpu stack from scratch unless you have a real reason to. Load it, tell IT what you wish it had, and build a custom env only if there's something the curated stack genuinely can't do.