code will get stuck when using ddp #1922
Replies: 7 comments 1 reply
-
The GPU memory still has free space, but GPU utilization is 0.
-
Hi @cunangjiang, did you check the RAM?
-
Could you please share the command you are using to run the training? And are you using the latest version? Without that information it's not easy to reproduce and diagnose the issue.
-
Yes. Because I want to build my task on top of maisi, I rewrote diff_model_train.py in the scripts after adding larger datasets. I have now found that if I use the `python train_iffunet.py` command to launch diff_model_train.py through the run_torchrun function, the code gets stuck at a certain epoch. If I instead launch diff_model_train.py directly with `python -m torch.distributed.launch --nproc_per_node 4 --use_env diff_model_train.py`, the code runs normally. May I ask why this is?
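As an aside (not a fix for the hang, just a note on the launcher): `torch.distributed.launch` is deprecated in recent PyTorch releases in favor of `torchrun`, which sets the environment variables itself, so `--use_env` is no longer needed. The equivalent command would be roughly:

```shell
# torchrun replaces "python -m torch.distributed.launch --use_env ..."
torchrun --nproc_per_node 4 diff_model_train.py
```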
-
When using DDP for diff_model_train.py, the code gets stuck at a certain epoch. When I reduce the dataset size, the model runs for more epochs before hanging. How can I solve this problem? As shown in the figure, the code stays stuck here without any errors.
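A common cause of this kind of silent, error-free hang is that the ranks fall out of sync: if one rank runs fewer training iterations than the others (an uneven data split, or a data-dependent `continue`/`break` in a rewritten loop), the ranks that finish early leave the loop while the remaining ranks block forever inside DDP's gradient all-reduce. A minimal pure-Python sketch (no torch required; the numbers are only illustrative, not from the actual run) of how a naive per-rank split produces different batch counts:

```python
import math

def batches_per_rank(n_samples, world_size, batch_size, naive_split=True):
    """Number of batches each rank iterates over for a dataset of n_samples."""
    counts = []
    for rank in range(world_size):
        if naive_split:
            # Plain strided slicing: ranks can receive different sample counts.
            n = len(range(rank, n_samples, world_size))
        else:
            # DistributedSampler's default: pad so every rank gets the same count.
            n = math.ceil(n_samples / world_size)
        counts.append(math.ceil(n / batch_size))
    return counts

# 10 samples, 4 GPUs, batch size 1: ranks 0-1 run 3 iterations, ranks 2-3 only 2.
# Ranks 2-3 exit their loop; ranks 0-1 block in all-reduce -> silent hang.
print(batches_per_rank(10, 4, 1))                      # [3, 3, 2, 2]
print(batches_per_rank(10, 4, 1, naive_split=False))   # [3, 3, 3, 3]
```

If your dataset changes or rewritten loop can make ranks see different iteration counts, using `torch.utils.data.DistributedSampler` (and calling its `set_epoch` each epoch), or wrapping the loop in `torch.distributed.algorithms.join.Join`, keeps the collectives aligned across ranks.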