AssertionError: Default process group is not initialized #184

Open
imdoublecats opened this issue Aug 13, 2020 · 3 comments

Comments

@imdoublecats

When training BlendMask with gpu_num=1 and torch.version=1.6:
In adet/layers/conv_with_kaiming_uniform.py, line 44 calls get_norm(norm, out_channels).
Inside get_norm, when env.TORCH_VERSION > (1, 5), as in my case with torch 1.6, nn.SyncBatchNorm is used.
However, when gpu_num == 1 and num_machines == 1, world_size == 1 at line 41 of detectron2/engine/launch.py.
As a result, mp.spawn() at line 55 is never called, so _distributed_worker() does not run, and neither does dist.init_process_group() at line 71.
Looking back at nn.SyncBatchNorm: when it is used, it runs _check_default_pg() to check whether the default ProcessGroup has been initialized, and without dist.init_process_group() that check fails.
Together these cause the error:
AssertionError: Default process group is not initialized

Simply changing (1, 5) to (1, 6) at line 143 of detectron2/layers/batch_norm.py works around the problem temporarily, but it is not a good fix.
I was not sure whether to report this problem to AdelaiDet or to Detectron2. Since I hit it while training BlendMask, I decided to report it here.
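
The chain of conditions described above can be modeled in a few lines of plain Python. This is only a simplified sketch of the reasoning, not the actual detectron2 or PyTorch code: the function names are hypothetical, the layer classes are represented by strings, and it assumes (as the thread reports) that NaiveSyncBatchNorm does not need a process group in a single-process run.

```python
# Simplified model of the failure; needs neither torch nor detectron2.
# All function names are illustrative stand-ins, not real library APIs.

def select_sync_bn(torch_version):
    """Mimic the version switch: native SyncBatchNorm for torch newer than 1.5."""
    return "nn.SyncBatchNorm" if torch_version > (1, 5) else "NaiveSyncBatchNorm"


def process_group_initialized(num_gpus_per_machine, num_machines):
    """Mimic launch(): workers (and init_process_group) only run if world_size > 1."""
    world_size = num_gpus_per_machine * num_machines
    return world_size > 1


def forward_fails(torch_version, num_gpus_per_machine, num_machines):
    """The assertion fires only when the native layer meets a missing process group."""
    norm = select_sync_bn(torch_version)
    has_group = process_group_initialized(num_gpus_per_machine, num_machines)
    return norm == "nn.SyncBatchNorm" and not has_group
```

Under this model, `forward_fails((1, 6), 1, 1)` is true, while torch 1.5 on one GPU or torch 1.6 on two GPUs both avoid the assertion, which matches the behavior reported here.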

@Wei-i

Wei-i commented Nov 5, 2020

Bro, thanks for your help! I got the same error, "Default process group is not initialized", and my torch.version is also 1.6.

@ashariati

ashariati commented Dec 8, 2020

In the same boat here. However, I have a bit more information for anyone who wants to do more digging.

I've been using PyTorch 1.6 with detectron2 (9eb4831 as recommended) in order to train the MS_DLA_34_4x_syncbn_shared_towers_bn_head.yaml model for months now without issue.

However, since last week, when my cloud computing provider forced a kernel update (see below), I have been getting the exact same error:

[12/08 20:47:06 adet.trainer]: Starting training from iteration 0
Traceback (most recent call last):
  File "tools/train_net.py", line 237, in <module>
    launch(
  File "/root/code/detectron2/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "tools/train_net.py", line 231, in main
    return trainer.train()
  File "tools/train_net.py", line 113, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "tools/train_net.py", line 102, in train_loop
    self.run_step()
  File "/root/code/detectron2/detectron2/engine/train_loop.py", line 216, in run_step
    loss_dict = self.model(data)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/code/adet/adet/modeling/one_stage_detector.py", line 46, in forward
    return super().forward(batched_inputs)
  File "/root/code/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 274, in forward
    features = self.backbone(images.tensor)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/code/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/code/adet/adet/modeling/backbone/dla.py", line 302, in forward
    x = self.base_layer(x)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized

cat /var/log/dpkg.log returns:

ubuntu@host:~$ cat /var/log/dpkg.log
2020-12-02 06:47:25 startup archives unpack
2020-12-02 06:47:25 install linux-modules-5.4.0-1030-aws:amd64 <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:25 status half-installed linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 install linux-image-5.4.0-1030-aws:amd64 <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status half-installed linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 upgrade linux-aws:amd64 5.4.0.1029.14 5.4.0.1030.15
2020-12-02 06:47:27 status half-configured linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 status unpacked linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 upgrade linux-image-aws:amd64 5.4.0.1029.14 5.4.0.1030.15
2020-12-02 06:47:27 status half-configured linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 status unpacked linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 install linux-aws-5.4-headers-5.4.0-1030:all <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status half-installed linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:29 status unpacked linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:29 status unpacked linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 install linux-headers-5.4.0-1030-aws:amd64 <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 status half-installed linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 status unpacked linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 status unpacked linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 upgrade linux-headers-aws:amd64 5.4.0.1029.14 5.4.0.1030.15
2020-12-02 06:47:30 status half-configured linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status unpacked linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status half-installed linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status half-installed linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status unpacked linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:30 status unpacked linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:31 startup packages configure
2020-12-02 06:47:31 configure linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:47:31 status unpacked linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status half-configured linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status installed linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 configure linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:47:31 status unpacked linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status half-configured linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status installed linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 configure linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:47:31 status unpacked linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status half-configured linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status installed linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 configure linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:48:24 status unpacked linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status half-configured linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status installed linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status triggers-pending linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 configure linux-headers-aws:amd64 5.4.0.1030.15 <none>
2020-12-02 06:48:24 status unpacked linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status half-configured linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status installed linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 configure linux-image-aws:amd64 5.4.0.1030.15 <none>
2020-12-02 06:48:24 status unpacked linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status half-configured linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status installed linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 configure linux-aws:amd64 5.4.0.1030.15 <none>
2020-12-02 06:48:24 status unpacked linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status half-configured linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status installed linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 trigproc linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:48:24 status half-configured linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:36 status installed linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-03 06:25:37 startup packages remove
2020-12-03 06:25:37 status installed linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:38 remove linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1 <none>
2020-12-03 06:25:38 status half-configured linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:40 status half-installed linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status installed linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 remove linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1 <none>
2020-12-03 06:25:42 status half-configured linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status half-installed linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 startup packages configure
2020-12-03 06:25:45 startup packages remove
2020-12-03 06:25:45 status installed linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:45 remove linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1 <none>
2020-12-03 06:25:45 status half-configured linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:45 status half-installed linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status config-files linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status config-files linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status config-files linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status not-installed linux-headers-5.4.0-1028-aws:amd64 <none>
2020-12-03 06:25:47 startup packages configure

Reverting to PyTorch 1.5 of course fixes the issue, for the same reason described in the original question: it simply switches the implementation from nn.SyncBatchNorm to NaiveSyncBatchNorm. You can also select this explicitly in the configuration file by setting NORM: naiveSyncBN. However, it would be great to get to the root cause.

For a minimal working example, simply install AdelaiDet as described (you can even use the pre-built docker container provided; just make sure to upgrade PyTorch to 1.6 and rebuild detectron2 + adet), download the COCO 2017 dataset to the datasets/ directory, and run the demo training example:

python3 tools/train_net.py --config-file configs/FCOS-Detection/FCOS_RT/MS_DLA_34_4x_syncbn_shared_towers_bn_head.yaml OUTPUT_DIR /tmp

NOTE: I tried using the Dockerfile to create the minimal working example, but it doesn't work because it uses the latest detectron2 version (ignoring the recommended version hash 9eb4831). Out of the box, you will first get a cv2 not installed error, followed by a libGL.so import error. Fixing these takes you back to the original issue.

NOTE2: The same error persists with PyTorch 1.7 as well.

Thanks in advance.

UPDATE: This might be solved in light of this thread facebookresearch/detectron2#2174 ... I too was only using num_gpus==1 for training originally. Still not sure why it was working for so long though.

@mvdelt

mvdelt commented Feb 10, 2021

As @ashariati mentioned, if gpu_num == 1 and num_machines == 1, there is no point in using SyncBatchNorm.
I'm not using AdelaiDet, so I'm not certain, but I guess you need to set the cfg accordingly:
for example, set cfg.MODEL.CONDINST.MASK_BRANCH.NORM and cfg.MODEL.BASIS_MODULE.NORM
to "BN", which makes get_norm return BatchNorm2d instead of nn.SyncBatchNorm.
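
A minimal sketch of that suggestion, assuming the training script accepts trailing KEY VALUE override pairs (the way OUTPUT_DIR /tmp is passed in the command earlier in this thread). The helper name norm_overrides is hypothetical, and the two keys are simply the ones named above; other configs may set SyncBN under different keys, so check your own YAML file.

```python
# Hypothetical helper: build command-line config overrides that swap
# SyncBN for plain BN when only a single process will run.
# The key names come from the comment above; adjust them for your config.

NORM_KEYS = (
    "MODEL.CONDINST.MASK_BRANCH.NORM",
    "MODEL.BASIS_MODULE.NORM",
)


def norm_overrides(num_gpus, num_machines=1, keys=NORM_KEYS):
    """Return a flat [KEY, VALUE, ...] list; empty when training is distributed."""
    world_size = num_gpus * num_machines
    if world_size > 1:
        return []  # a process group will be initialized, so SyncBN can stay
    overrides = []
    for key in keys:
        overrides += [key, "BN"]
    return overrides
```

The returned list can be joined with spaces and appended to the training command, so that a single-GPU run uses plain batch norm and a multi-GPU run leaves the config untouched.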
