AssertionError: Default process group is not initialized #184
Comments
Bro, thanks for your help! I got the same error, "Default process group is not initialized", and my torch version is also 1.6.
In the same boat here, however, I have a bit more information for anyone that wants to do more digging. I've been using PyTorch 1.6 with detectron2 (9eb4831 as recommended) in order to train the However, as of last week, when my cloud computing provider forced a mandatory kernel update (see below), I am now getting the exact same error:
Reverting to PyTorch 1.5 of course fixes the issue (for the same reason described in the original question) since it simply changes the For a minimum working example, simply install AdelaiDet as described (you can even use the pre-built docker container provided -- just make sure to upgrade PyTorch to 1.6 and rebuild detectron2 + adet), download the coco 2017 dataset to the
NOTE: I tried using the Dockerfile to create the minimum working example; however, it doesn't work due to its use of the latest detectron2 version (ignoring the recommended version hash
NOTE2: The same error persists with PyTorch 1.7 as well.
Thanks in advance.
UPDATE: This might be solved in light of this thread: facebookresearch/detectron2#2174 ... I too was only using num_gpus==1 for training originally. Still not sure why it was working for so long, though.
As @ashariati mentioned, if gpu_num == 1 and num_machines == 1, then there is no point using
When training BlendMask with gpu_num=1 and torch.version=1.6:
In adet/layers/conv_with_kaiming_uniform.py, line 44 calls get_norm(norm, out_channels).
In this function, when env.TORCH_VERSION > (1, 5), as is the case with my torch 1.6, nn.SyncBatchNorm is used.
However, when gpu_num == 1 and num_machines == 1, world_size == 1 at line 41 of detectron2/engine/launch.py.
So mp.spawn() at line 55 is not called, _distributed_worker() is never executed, and neither is dist.init_process_group() at line 71.
Looking back at nn.SyncBatchNorm: when it is used, it runs _check_default_pg() to check whether the default ProcessGroup has been initialized, and without dist.init_process_group() the check fails.
Together, these cause the error:
AssertionError: Default process group is not initialized
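To make the chain above easier to follow, here is a toy model of it in plain Python. It does not use detectron2 or PyTorch at all: every function and variable below is a hypothetical stand-in that I named after the real calls mentioned above, and the only thing it tries to show is why the single-GPU path reaches the check without the process group ever being set up.

```python
# Toy model of the call chain described above. All names are hypothetical
# stand-ins, not the real detectron2 or PyTorch code.

# Stands in for the state of the default process group.
_process_group_initialized = False


def init_process_group():
    """Stand-in for dist.init_process_group()."""
    global _process_group_initialized
    _process_group_initialized = True


def sync_batch_norm_forward():
    """Stand-in for the check that SyncBatchNorm performs when it is used."""
    assert _process_group_initialized, "Default process group is not initialized"
    return "ok"


def launch(main_func, num_gpus_per_machine, num_machines=1):
    """Stand-in for launch(): only the multi-process path sets up the group."""
    world_size = num_machines * num_gpus_per_machine
    if world_size > 1:
        # The real code spawns workers here and each worker initializes the group.
        init_process_group()
    # With world_size == 1 the function is called directly, group untouched.
    return main_func()


def train_with_one_gpu():
    """Run the single-GPU path and report the failure as a string."""
    try:
        return launch(sync_batch_norm_forward, num_gpus_per_machine=1)
    except AssertionError as err:
        return f"AssertionError: {err}"
```

Calling train_with_one_gpu() in this model hits the assertion, while launch(sync_batch_norm_forward, num_gpus_per_machine=2) passes because the multi-process branch initializes the group first.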
Simply changing (1, 5) to (1, 6) in detectron2/layers/batch_norm.py line 143 solves the problem temporarily, but it is not a good fix.
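A small sketch of why bumping the threshold only postpones the problem. This is plain Python with hypothetical helper names (pick_sync_bn, pick_norm, and the "fallback" and "BN" labels are mine, not detectron2's); it only mirrors the version-tuple comparison described above. The second helper is an assumption about what a sturdier check might look like, based on the observation in the comments that synchronization across a single process is pointless.

```python
def pick_sync_bn(torch_version, threshold=(1, 5)):
    """Mirror the version check: versions above the threshold get nn.SyncBatchNorm."""
    if torch_version > threshold:
        return "nn.SyncBatchNorm"
    return "fallback"  # hypothetical label for whatever older versions use


def pick_norm(torch_version, world_size, threshold=(1, 5)):
    """Hypothetical alternative: skip synchronized BN when there is one process."""
    if world_size == 1:
        return "BN"  # hypothetical label for an ordinary, non-synchronized norm
    return pick_sync_bn(torch_version, threshold)


# Original threshold: torch 1.6 selects nn.SyncBatchNorm.
choice_default = pick_sync_bn((1, 6))
# Patched threshold: torch 1.6 is spared, but torch 1.7 is affected again.
choice_patched_16 = pick_sync_bn((1, 6), threshold=(1, 6))
choice_patched_17 = pick_sync_bn((1, 7), threshold=(1, 6))
# World-size guard: independent of the torch version.
choice_guarded = pick_norm((1, 7), world_size=1)
```

With the patched threshold the choice flips for 1.6 only, so the next PyTorch release would bring the error back. A guard on world size does not depend on the version at all, though whether that is the right fix for detectron2 is for the maintainers to decide.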
I am not sure whether to report this problem to AdelaiDet or to Detectron2. Since I hit it while training BlendMask, I decided to report it here.