This timeout happens at the last step of training (I confirmed that the last checkpoint was not created). When I sampled only 1K records (out of 70K) and trained on them, the error disappeared and the training finished without any errors (which is very strange to me...).
0%| | 0/101 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
99%|█████████▉| 100/101 [3:04:16<01:50, 110.64s/it][rank2]:[E122 07:47:07.196390673 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100368, OpType=_ALLGATHER_BASE, NumelIn=65667072, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
10.0.29.212: [rank2]:[E122 07:47:07.196986688 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 100368, last enqueued NCCL work: 100370, last completed NCCL work: 100367.
10.0.29.212: [rank2]:[E122 07:47:07.197021650 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 100368, last enqueued NCCL work: 100370, last completed NCCL work: 100367.
10.0.29.212: [rank2]:[E122 07:47:07.197029051 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
10.0.29.212: [rank2]:[E122 07:47:07.197032661 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
10.0.29.212: [rank3]:[E122 07:47:07.197858759 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100368, OpType=_ALLGATHER_BASE, NumelIn=65667072, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
10.0.29.212: [rank2]:[E122 07:47:07.198108174 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100368, OpType=_ALLGATHER_BASE, NumelIn=65667072, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
10.0.29.212: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
10.0.29.212: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ccd838b9446 in /home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/lib/libc10.so)
10.0.29.212: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ccd38bcc772 in /home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
10.0.29.212: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ccd38bd3bb3 in /home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
10.0.29.212: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ccd38bd561d in /home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
10.0.29.212: frame #4: <unknown function> + 0x145c0 (0x7ccd842f95c0 in /home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/lib/libtorch.so)
10.0.29.212: frame #5: <unknown function> + 0x94ac3 (0x7ccda4694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
10.0.29.212: frame #6: <unknown function> + 0x126850 (0x7ccda4726850 in /lib/x86_64-linux-gnu/libc.so.6)
10.0.29.212:
Also, I found a weird part in the same log file. I passed my dataset and am using packing=True, so I expected only one "Generating train split: <NUM_PACKED_SAMPLES> examples" message, with the packed dataset shared across all ranks (8 GPUs * 2 nodes = 16 ranks) in DeepSpeed ZeRO-3 training.
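For context, the Timeout(ms)=1800000 in the log above matches the default ddp_timeout of 1800 seconds in transformers' TrainingArguments. A minimal sketch of raising it for debugging purposes, assuming the script builds its arguments through TrainingArguments (or a subclass such as trl's SFTConfig); the value 7200 is arbitrary, and this would only hide a slow final all-gather rather than explain it:

```python
# Debugging sketch only: raise the NCCL watchdog timeout so a slow final
# all-gather is not killed at the default 30-minute mark.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",  # placeholder
    ddp_timeout=7200,      # seconds; the default 1800 matches Timeout(ms)=1800000 in the log
)
```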
Reproduction
Here is the deepspeed launcher for 2 nodes of p5.48xlarge (I am using Slurm, but I do not think that matters).
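A minimal sketch of what such a launcher could look like, using a hostfile-based deepspeed launch (hostnames, slot counts, and file names are placeholders, not the actual setup):

```bash
#!/bin/bash
# Sketch only: launch train.py on 2 nodes x 8 GPUs (16 ranks) with the deepspeed CLI.
# Hostnames and file names below are placeholders.
cat > hostfile <<EOF
node-0 slots=8
node-1 slots=8
EOF

deepspeed --hostfile hostfile \
          --num_nodes 2 \
          --num_gpus 8 \
          train.py
```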
Here is a simplified train.py file.
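A minimal sketch of the general shape implied by the log (trl's SFTTrainer with packing=True and gradient checkpointing under a DeepSpeed ZeRO-3 config); the model id, data file, and hyperparameters are placeholders, not the actual values:

```python
# Sketch only: rough shape of an SFT run with packing under DeepSpeed ZeRO-3.
# Model id, file names, and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer


def main():
    # ~70K-record training set loaded from a local file (placeholder path)
    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    args = SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=1,
        gradient_checkpointing=True,   # source of the `use_cache=False` warning in the log
        packing=True,                  # packing is what emits the "Generating train split" messages mentioned above
        dataset_text_field="text",     # placeholder column name
        num_train_epochs=1,
        save_strategy="steps",
        save_steps=50,
        deepspeed="ds_zero3.json",     # ZeRO-3 config used by all 16 ranks
    )

    trainer = SFTTrainer(
        model="meta-llama/Meta-Llama-3-8B",  # placeholder model id
        args=args,
        train_dataset=dataset,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```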
output:
errors: (see the NCCL watchdog timeout log at the top of this report)
System Info
Checklist