
Distributed training on multiple GPUs gets stuck. #529

Open
NotoCJ opened this issue Dec 30, 2024 · 3 comments

NotoCJ commented Dec 30, 2024

Hello,
I was training RT-DETRv2 on multiple GPUs. Training hangs when the batch on one GPU consists entirely of background images (i.e., no ground-truth/target boxes): the volatile GPU-util goes to 100%, no errors or warnings are reported, and training simply stops making progress. In that case the VFL loss is very small (0.0528), and the L1 and GIoU box losses are both 0. The hang happens in scaler.backward or scaler.step (or possibly scaler.update). When the batch contains at least some target boxes, training runs fine. Incidentally, when training with the gloo backend it hangs under the same condition, and the following error is reported:
[Screenshot: error message reported by the gloo backend run]

Thanks for your attention!
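
For reference, a minimal way to spot the batches that trigger this (a sketch only; the `boxes`/`labels` keys and the COCO-style target layout are assumptions about the dataloader, not taken from the repo):

```python
import torch

def batch_is_all_background(targets) -> bool:
    """True when no image in the batch carries any ground-truth boxes."""
    return all(t["boxes"].numel() == 0 for t in targets)

# Synthetic check: one empty target and one with a single box.
empty = {"boxes": torch.zeros((0, 4)), "labels": torch.zeros((0,), dtype=torch.int64)}
one_box = {"boxes": torch.tensor([[0.5, 0.5, 0.2, 0.2]]), "labels": torch.tensor([3])}
print(batch_is_all_background([empty, empty]))    # True  -> this batch can hang a rank
print(batch_is_all_background([empty, one_box]))  # False -> trains normally
```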

@lyuwenyu
Owner

  1. You can filter out the all-background samples before training,
  2. or compute a dummy box loss from the predictions in that case, e.g. `loss_box = (pred_box * 0).sum()` (see the sketch below).
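
A minimal sketch of that dummy-loss idea, assuming `pred_boxes` and `matched_targets` are already matched pairs; the names and shapes are placeholders, not the repo's actual identifiers:

```python
import torch
import torch.nn.functional as F

def safe_box_loss(pred_boxes: torch.Tensor, matched_targets: torch.Tensor) -> torch.Tensor:
    """L1 box loss that never detaches a rank from the DDP graph.

    Returning a plain Python 0.0 (or skipping the term) on an all-background batch means
    backward() visits fewer parameters on that rank than on the others, and the gradient
    all-reduce blocks. Multiplying the predictions by zero yields a loss with value 0
    that still flows gradients through the same parameters.
    Shapes are placeholders: pred_boxes (N, 4), matched_targets (N, 4) or empty.
    """
    if matched_targets.numel() == 0:
        return (pred_boxes * 0).sum()  # zero-valued but still grad-connected
    return F.l1_loss(pred_boxes, matched_targets)

# Synthetic usage:
pred = torch.rand(8, 4, requires_grad=True)
print(safe_box_loss(pred, torch.empty(0, 4)))  # tensor(0., grad_fn=<SumBackward0>)
print(safe_box_loss(pred, torch.rand(8, 4)))   # ordinary L1 loss
```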


NotoCJ commented Jan 20, 2025

Thanks. But loss_bbox is already 0 in that case (or maybe I misunderstand?), and I also want to train on these background images to prevent false detections.
I suspect this is a data mismatch between GPUs, judging by the error reported with the gloo backend. I found that when a batch has no target boxes, "dn_aux_outputs" is missing from the model output, so the dn-related loss items are never defined. I added those loss items as zero values with no grad, and set find_unused_parameters=True when wrapping the model in DDP so DDP does not wait for gradients of the unused parameters. These changes fixed the hang during training in my case.
I'm not sure how common this problem is for datasets with many background images, and the solution above may not be an accurate or good one, since it slightly increases training time.
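
In case it helps, a minimal sketch of what that workaround looks like; the key names in `DN_LOSS_KEYS` are placeholders, since the actual keys produced by the RT-DETRv2 criterion may differ:

```python
import torch

# Placeholder names for the denoising (dn) loss terms that are skipped when the model
# output has no "dn_aux_outputs"; the real keys in the RT-DETRv2 criterion may differ.
DN_LOSS_KEYS = ["loss_vfl_dn", "loss_bbox_dn", "loss_giou_dn"]

def pad_missing_dn_losses(loss_dict: dict, device: torch.device) -> dict:
    """Add zero-valued, gradient-free placeholders for any dn loss that was not computed,
    so every rank sums and reduces the same set of loss keys."""
    for key in DN_LOSS_KEYS:
        if key not in loss_dict:
            loss_dict[key] = torch.zeros((), device=device)  # value 0, no grad
    return loss_dict

# Synthetic usage on a batch with no targets (only the VFL term was computed):
losses = {"loss_vfl": torch.tensor(0.05)}
losses = pad_missing_dn_losses(losses, torch.device("cpu"))
print(sorted(losses.keys()))

# Because the placeholders carry no gradient, the dn branch's parameters go unused on
# that rank, so DDP needs find_unused_parameters=True:
#   model = torch.nn.parallel.DistributedDataParallel(
#       model, device_ids=[local_rank], find_unused_parameters=True)
```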

@bryan-pakulski

@NotoCJ any chance you could post the code changes you made?

I believe I'm hitting a similar issue: a large dataset with many background images.

I've modified the loss values and set find_unused_parameters to True, but I haven't had as much luck as you in actually getting multi-GPU training to work on this dataset.
