Optimizing nnU-Net Training: Batch Size vs. GPU Utilization
Hey team,
We've been facing issues with CUDA memory allocation and SM utilization on our 8x NVIDIA L4 (16GB) GPUs while training nnU-Net (3D Full-Res). Some key challenges include:
1️⃣ "Not enough SMs to use max_autotune_gemm mode" errors.
2️⃣ CUDA Out of Memory (OOM) when using 8 GPUs with large batch sizes.
3️⃣ Triton Kernel issues with torch.compile() optimizations.
While optimizing our nnU-Net training across multiple GPUs, we encountered a Distributed Data Parallel (DDP) assertion error:
assert global_batch_size >= world_size, 'Cannot run DDP if the batch size is smaller than the number of GPUs... Duh.'
🔍 Key Questions for Discussion:
1️⃣ Why is this assertion necessary?
DDP shards each global batch across the ranks, so the global batch size must be at least the number of GPUs (world_size).
Otherwise at least one GPU would receive zero samples and training would crash, which is exactly what the assertion guards against (a minimal sketch of the split follows this question list).
2️⃣ What’s the best batch size per GPU?
Should we reduce per-GPU batch size to ensure stability?
Does mixed precision training (--use_mixed_precision) help?
3️⃣ Should we reduce the number of GPUs (8 → 4)?
Could training on fewer GPUs (4 instead of 8) improve stability?
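To make question 1️⃣ concrete, here is a minimal sketch of how a global batch gets divided across DDP ranks. The function name and the even-split-with-remainder policy are illustrative assumptions, not nnU-Net's exact code, but the assertion it reproduces is the one quoted above:

```python
def per_rank_batch_size(global_batch_size: int, world_size: int, rank: int) -> int:
    """Illustrative split of a global batch across DDP ranks (not nnU-Net's exact code)."""
    assert global_batch_size >= world_size, \
        'Cannot run DDP if the batch size is smaller than the number of GPUs... Duh.'
    # Even split, with any remainder spread over the first few ranks.
    base, remainder = divmod(global_batch_size, world_size)
    return base + (1 if rank < remainder else 0)

# Global batch 32 on 8 GPUs -> 4 patches per GPU:
print([per_rank_batch_size(32, 8, r) for r in range(8)])  # [4, 4, 4, 4, 4, 4, 4, 4]

# A global batch of 4 on 8 GPUs would leave ranks 4-7 with nothing,
# which is what the assertion exists to prevent.
```

So going from 8 to 4 GPUs only relaxes the constraint to global_batch_size >= 4; per-GPU memory pressure is still determined by the per-rank batch size, not by the assertion.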
💡 Proposed Fixes (So Far)
✅ Set export TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=0 to fix Triton kernel errors.
✅ Disable torch.compile() using export PYTORCH_NO_TORCH_COMPILE=1.
✅ Try batch size = 4 per GPU (total 32 on 8 GPUs) for stable training.
✅ Use --use_mixed_precision to cut activation memory (see the AMP sketch below).
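On the mixed-precision point: nnU-Net handles this internally, but here is a minimal standalone sketch of what automatic mixed precision does (the Conv3d stand-in model, tensor shapes, and train_step helper are made up for illustration). autocast runs the forward pass largely in fp16, and GradScaler scales the loss so fp16 gradients don't underflow:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Conv3d(1, 32, kernel_size=3, padding=1).cuda()  # stand-in for the nnU-Net network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = GradScaler()  # rescales the loss so fp16 gradients don't underflow

def train_step(volume: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with autocast():  # forward pass mostly in fp16 -> roughly half the activation memory
        loss = torch.nn.functional.mse_loss(model(volume), target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips the step if they overflowed
    scaler.update()
    return loss.item()

# Per-GPU batch of 4 patches, e.g. 4 x 1 x 64 x 64 x 64 volumes:
x = torch.randn(4, 1, 64, 64, 64, device="cuda")
y = torch.randn(4, 32, 64, 64, 64, device="cuda")
print(train_step(x, y))
```

With fp16 activations the memory footprint of the forward pass drops roughly in half, which is often the headroom needed to avoid the OOMs described above at batch size 4 per GPU.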
Would love to hear your thoughts! What batch size/GPU setup has worked best for you? 🚀🔥
Looking forward to your input!
- Amal Shehu