code will get stuck when using ddp #1922
Replies: 7 comments 1 reply
-
The GPU memory still has free space, but GPU utilization is 0.
-
Hi @cunangjiang, did you check the RAM?
-
Could you please share the command you are using to run the training? And are you using the latest version? Without that information it's not easy to reproduce and diagnose the issue.
-
Yes. Because I want to build my task on top of maisi, I rewrote diff_model_train.py in the scripts after adding larger datasets. I have now found that if I use the `python train_iffunet.py` command to launch diff_model_train.py through the run_torchrun function, the code gets stuck at a certain epoch. If I instead launch diff_model_train.py directly with `python -m torch.distributed.launch --nproc_per_node 4 --use_env diff_model_train.py`, the code runs normally. May I ask why this is?
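As an aside (not a fix for the hang, just a note on the launcher): `torch.distributed.launch` is deprecated in recent PyTorch releases in favor of `torchrun`, which sets the environment variables itself, so `--use_env` is no longer needed. The equivalent command would be roughly:

```shell
# torchrun replaces "python -m torch.distributed.launch --use_env ..."
torchrun --nproc_per_node 4 diff_model_train.py
```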
-
When using DDP for diff_model_train.py, the code gets stuck at a certain epoch. When I reduce the dataset size, the model runs for more epochs before hanging. How can I solve this problem? As shown in the figure, the code stays stuck here without any errors.
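A common cause of this kind of silent, error-free hang is that the ranks fall out of sync: if one rank runs fewer training iterations than the others (an uneven data split, or a data-dependent `continue`/`break` in a rewritten loop), the ranks that finish early leave the loop while the remaining ranks block forever inside DDP's gradient all-reduce. A minimal pure-Python sketch (no torch required; the numbers are only illustrative, not from the actual run) of how a naive per-rank split produces different batch counts:

```python
import math

def batches_per_rank(n_samples, world_size, batch_size, naive_split=True):
    """Number of batches each rank iterates over for a dataset of n_samples."""
    counts = []
    for rank in range(world_size):
        if naive_split:
            # Plain strided slicing: ranks can receive different sample counts.
            n = len(range(rank, n_samples, world_size))
        else:
            # DistributedSampler's default: pad so every rank gets the same count.
            n = math.ceil(n_samples / world_size)
        counts.append(math.ceil(n / batch_size))
    return counts

# 10 samples, 4 GPUs, batch size 1: ranks 0-1 run 3 iterations, ranks 2-3 only 2.
# Ranks 2-3 exit their loop; ranks 0-1 block in all-reduce -> silent hang.
print(batches_per_rank(10, 4, 1))                      # [3, 3, 2, 2]
print(batches_per_rank(10, 4, 1, naive_split=False))   # [3, 3, 3, 3]
```

If your dataset changes or rewritten loop can make ranks see different iteration counts, using `torch.utils.data.DistributedSampler` (and calling its `set_epoch` each epoch), or wrapping the loop in `torch.distributed.algorithms.join.Join`, keeps the collectives aligned across ranks.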