Optimizing nnU-Net Training: Batch Size vs. GPU Utilization
Hey team,
We've been facing issues with CUDA memory allocation and SM utilization on our 8x NVIDIA L4 (16GB) GPUs while training nnU-Net (3D Full-Res). Some key challenges include:
1️⃣ "Not enough SMs to use max_autotune_gemm mode" errors.
2️⃣ CUDA Out of Memory (OOM) when using 8 GPUs with large batch sizes.
3️⃣ Triton Kernel issues with torch.compile() optimizations.
While optimizing our nnU-Net training across multiple GPUs, we encountered a Distributed Data Parallel (DDP) assertion error:
assert global_batch_size >= world_size, 'Cannot run DDP if the batch size is smaller than the number of GPUs... Duh.'
🔍 Key Questions for Discussion:
1️⃣ Why is this assertion necessary?
DDP shards each global batch across the ranks, so the global batch size must be at least the number of GPUs (world_size).
Otherwise at least one GPU would receive zero samples and training would crash, which is exactly what the assertion guards against (a minimal sketch of the split follows this question list).
2️⃣ What’s the best batch size per GPU?
Should we reduce per-GPU batch size to ensure stability?
Does mixed precision training (--use_mixed_precision) help?
3️⃣ Should we reduce the number of GPUs (8 → 4)?
Could training on fewer GPUs (4 instead of 8) improve stability?
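To make question 1️⃣ concrete, here is a minimal sketch of how a global batch gets divided across DDP ranks. The function name and the even-split-with-remainder policy are illustrative assumptions, not nnU-Net's exact code, but the assertion it reproduces is the one quoted above:

```python
def per_rank_batch_size(global_batch_size: int, world_size: int, rank: int) -> int:
    """Illustrative split of a global batch across DDP ranks (not nnU-Net's exact code)."""
    assert global_batch_size >= world_size, \
        'Cannot run DDP if the batch size is smaller than the number of GPUs... Duh.'
    # Even split, with any remainder spread over the first few ranks.
    base, remainder = divmod(global_batch_size, world_size)
    return base + (1 if rank < remainder else 0)

# Global batch 32 on 8 GPUs -> 4 patches per GPU:
print([per_rank_batch_size(32, 8, r) for r in range(8)])  # [4, 4, 4, 4, 4, 4, 4, 4]

# A global batch of 4 on 8 GPUs would leave ranks 4-7 with nothing,
# which is what the assertion exists to prevent.
```

So going from 8 to 4 GPUs only relaxes the constraint to global_batch_size >= 4; per-GPU memory pressure is still determined by the per-rank batch size, not by the assertion.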
💡 Proposed Fixes (So Far)
✅ Set export TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=0 to fix Triton kernel errors.
✅ Disable torch.compile() using export PYTORCH_NO_TORCH_COMPILE=1.
✅ Try batch size = 4 per GPU (total 32 on 8 GPUs) for stable training.
✅ Use --use_mixed_precision to cut activation memory (see the AMP sketch below).
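On the mixed-precision point: nnU-Net handles this internally, but here is a minimal standalone sketch of what automatic mixed precision does (the Conv3d stand-in model, tensor shapes, and train_step helper are made up for illustration). autocast runs the forward pass largely in fp16, and GradScaler scales the loss so fp16 gradients don't underflow:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Conv3d(1, 32, kernel_size=3, padding=1).cuda()  # stand-in for the nnU-Net network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = GradScaler()  # rescales the loss so fp16 gradients don't underflow

def train_step(volume: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with autocast():  # forward pass mostly in fp16 -> roughly half the activation memory
        loss = torch.nn.functional.mse_loss(model(volume), target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips the step if they overflowed
    scaler.update()
    return loss.item()

# Per-GPU batch of 4 patches, e.g. 4 x 1 x 64 x 64 x 64 volumes:
x = torch.randn(4, 1, 64, 64, 64, device="cuda")
y = torch.randn(4, 32, 64, 64, 64, device="cuda")
print(train_step(x, y))
```

With fp16 activations the memory footprint of the forward pass drops roughly in half, which is often the headroom needed to avoid the OOMs described above at batch size 4 per GPU.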
Would love to hear your thoughts! What batch size/GPU setup has worked best for you? 🚀🔥
Looking forward to your input!
- Amal Shehu