Python, Conda, and GPU ML Frameworks¶
For research Python and GPU ML workflows, start with the curated
jupyter-gpu module. It's a maintained stack — PyTorch +
TensorFlow + CuPy + scientific Python — built against the exact CUDA
runtime/driver pair the nodes ship with. Building your own conda env
from scratch is supported but you'll spend more time getting CUDA
versions to line up than doing actual research.
The fast path: module load jupyter-gpu¶
module load jupyter-gpu # default version
module avail jupyter-gpu # list other versions (pin one for reproducibility)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Use this from any terminal — desktop session, OOD shell, SSH. Inside a notebook, OOD's Jupyter Interactive App preselects this stack for you.
If you need a package the module doesn't include, you can
pip install --user <pkg> on top of it (the module leaves
PYTHONNOUSERSITE unset so user-site installs are honoured); just be
aware your additions live in ~/.local/lib/... and disappear with
the rest of your home at session end. For anything you want to keep,
build a custom conda env on lab storage (below).
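A quick sketch (einops here is just a stand-in for whatever package you're missing):
module load jupyter-gpu
pip install --user einops    # lands in ~/.local/lib/pythonX.Y/site-packages
python -c "import einops; print(einops.__file__)"    # confirm it resolves from ~/.local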
When to roll your own conda env¶
Stick to a custom env only when:
- You need a specific framework version the curated module doesn't ship.
- You're pinning a paper-replication environment.
- You're using a niche tool that wants its own dependency stack.
Otherwise, every hour spent fighting pytorch-cuda versions is an
hour not spent on your research. Talk to IT (eit-help@umd.edu) if
you think the curated stack is missing something common — adding it
once helps everyone.
Building your own env on lab storage¶
Conda envs are large (hundreds of MB to a few GB). Put them on your
lab share, not in ~ (local to whichever node you land on — not
guaranteed to follow you) or /scratch (wiped per job).
Set up conda once¶
module load miniconda3
# Tell conda where to put envs and package cache (one-time, lives in
# ~/.condarc — but ~/.condarc lives on the node-local home, so if
# Slurm gives you a different node next time, re-run this or use the
# symlinked-condarc pattern below)
conda config --add envs_dirs ~/lab-research/conda/envs
conda config --add pkgs_dirs ~/lab-research/conda/pkgs
# Verify
conda config --show envs_dirs pkgs_dirs
(Substitute ~/lab-research with your lab's symlink — see
your-lab-storage.md.)
A more robust pattern is to put the config on lab storage and symlink it in:
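(a sketch, again assuming ~/lab-research is your lab's symlink; the symlink itself still has to be recreated in a fresh home, e.g. from your shell startup)
mkdir -p ~/lab-research/conda
mv ~/.condarc ~/lab-research/conda/condarc    # move the existing config to lab storage (skip if none yet)
ln -sf ~/lab-research/conda/condarc ~/.condarc    # re-run this line whenever you land on a fresh home
conda config --show envs_dirs pkgs_dirs           # verify the settings are picked up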
PyTorch + CUDA¶
Matching CUDA versions across PyTorch, TensorFlow, and the node
driver is the part that bites people. The curated jupyter-gpu
module sidesteps this; if you must build your own, use the cu128
wheel index to align with the same CUDA stack the nodes run:
module load miniconda3
conda create -n pytorch python=3.11 -y
conda activate pytorch
pip install --index-url https://download.pytorch.org/whl/cu128 \
torch torchvision torchaudio
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
Earlier CUDA wheel indexes (cu124, cu126) install successfully
but fail to use the GPU on the current node driver, which is easy to mistake for
a code bug. If torch.cuda.is_available() returns False, check that
you used cu128.
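To confirm which CUDA build you actually have (exact version strings will differ):
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# a version ending in +cu128 (CUDA 12.8) means the right wheel index was used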
JAX + CUDA¶
module load miniconda3
conda create -n jax python=3.11 -y
conda activate jax
pip install "jax[cuda12]"
python -c "import jax; print(jax.devices())"
Reusing the env in later sessions¶
If conda activate can't find the env, your ~/.condarc is back to
defaults — re-run the envs_dirs setup or use the symlinked-condarc
pattern above.
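A quick way to check which situation you're in:
conda config --show envs_dirs    # should list your lab-storage path
conda env list                   # your env should appear under the lab-storage envs directory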
Jupyter notebooks¶
The simplest path is OOD → Interactive Apps → Jupyter, which
launches jupyter-gpu on a GPU node and hands you the browser URL.
Inside the notebook the curated stack is already loaded; you don't
need to module load anything.
If you need a custom env as a Jupyter kernel:
module load miniconda3
conda activate <your-env>
pip install ipykernel
python -m ipykernel install --user --name <your-env> --display-name "<your-env>"
It'll appear in the New Notebook menu next session.
For a quick notebook from inside a desktop session:
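(a sketch, assuming the module exposes the standard Jupyter launcher)
module load jupyter-gpu
jupyter lab --no-browser    # prints a localhost URL with a token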
Open the printed URL in the in-desktop Firefox.
GPU memory & multiple jobs¶
Each Slurm job gets its own GPU allocation by default, but lab
partitions oversubscribe (OverSubscribe=FORCE:5), so several jobs can
land on the same physical GPU concurrently; CUDA MPS handles the
isolation. If your lab's GPUs have MPS enabled you can also request
a fraction of a GPU with --gres=shard:1 in a batch job.
Inside a job, only the allocated GPUs are visible via
CUDA_VISIBLE_DEVICES. Don't try to override it — Slurm handles the
isolation.
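A minimal batch-script sketch (the partition name, time limit, and train.py are placeholders; the shard request only applies where your lab's GPUs are set up for sharing):
#!/bin/bash
#SBATCH --partition=<your-lab-partition>
#SBATCH --gres=gpu:1                 # whole GPU; use --gres=shard:1 for a fraction where supported
#SBATCH --time=01:00:00
module load jupyter-gpu
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"   # set by Slurm; don't override it
python train.py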
Common gotchas¶
| Problem | Cause / fix |
|---|---|
| torch.cuda.is_available() returns False after a custom build | Wrong CUDA wheel index. Use cu128 (see PyTorch section above), or just module load jupyter-gpu. |
| conda create eats 10 GB and fills /home | envs_dirs not set; envs are landing in ~. Re-run the one-time config and conda env remove -n <name> to clean up. |
| CUDA out of memory | GPU is shared (MPS). nvidia-smi shows who's on it; reduce batch size, request a whole GPU, or wait. |
| ModuleNotFoundError after a fresh session | You forgot conda activate. Put it in your shell startup if you always want the same env. |
| pip install fails with SSL errors | Transient PyPI hiccup; retry. Persistent: email eit-help@umd.edu. |
Mamba¶
mamba is a faster drop-in for conda install, included in the
miniconda3 module:
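module load miniconda3
mamba create -n pytorch python=3.11 -y    # same syntax as conda create, just a faster solver
conda activate pytorch                    # activation still goes through conda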
Sharing an env with a labmate¶
Export:
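conda env export --from-history > environment.yml    # environment.yml is just a conventional filename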
They recreate:
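conda env create -n <your-env> -f environment.yml    # then conda activate <your-env>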
--from-history produces a cleaner file than conda env export
alone; the latter captures every transitive package and is often not
cross-platform.
Don't do this¶
- Don't put a conda env in ~ or /scratch; it's gone at session end.
- Don't run sudo pip install; you don't have root, and you shouldn't anyway.
- Don't install conda itself; the shared miniconda3 module is the supported path.
- Don't re-implement the jupyter-gpu stack from scratch unless you have a real reason to. Load it, tell IT what you wish it had, and build a custom env only if there's something the curated stack genuinely can't do.