RuntimeError: Address already in use #149

Open
Willy0919 opened this issue Jul 11, 2020 · 4 comments


@Willy0919

When I run a single training job on multiple GPUs, it trains correctly. But when I start a second, similar job with only a few parameters changed, I get the RuntimeError below. Even when I pass a new --dist-url, the printed output suggests that the dist-url was not changed:

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001

Command Line Args: Namespace(config_file='configs/FCOS-Detection/R_50_1x.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['OUTPUT_DIR', 'training_dir/fcos_R_50_1x_3d_ctr_real', '--dist-url', 'tcp://127.0.0.1:50001'], resume=False)
Process group URL: tcp://127.0.0.1:50152
Traceback (most recent call last):
  File "/home/wl/code/AdelaiDet/tools/train_net.py", line 243, in
    args=(args,),
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 54, in launch
    daemon=False,
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 72, in _distributed_worker
    raise e
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 67, in _distributed_worker
    backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

@tianzhi0549
Member

@Willy0919 Please manually specify --dist-url (with a different port) in the training command line.
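
To choose a port that is not already taken by another training job, one option is to ask the operating system for a free one. The following is a minimal sketch using only the Python standard library; the helper names find_free_port and is_port_free are hypothetical and are not part of detectron2 or AdelaiDet. Note that a port reported as free can still be claimed by another process before training starts, so this narrows the problem rather than eliminating it.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is currently unused on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) succeeds right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": another process holds the port.
            return False
        return True


if __name__ == "__main__":
    port = find_free_port()
    # Print a value that can be pasted into the training command line.
    print(f"--dist-url tcp://127.0.0.1:{port}")
```

The printed value can then be passed as the --dist-url argument of the second job.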

@Willy0919
Author

@tianzhi0549 I have done this as described:

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001

but it did not work.

@tianzhi0549
Member

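Please place --dist-url tcp://127.0.0.1:50001 before the config options (OUTPUT_DIR ...). Anything that follows them is collected into opts instead of being parsed as a flag, as the Namespace output above shows. For example: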

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--dist-url tcp://127.0.0.1:50001 \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real
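
Why the order matters can be reproduced with a small argparse sketch. It assumes the training script collects trailing config overrides with a catch-all positional (nargs=argparse.REMAINDER), which is consistent with the opts list in the Namespace output above; the parser below is a simplified stand-in, not the actual detectron2 parser, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Simplified stand-in for the training script's argument parser (an assumption)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default="")
    parser.add_argument("--num-gpus", type=int, default=1)
    parser.add_argument("--dist-url", default="tcp://127.0.0.1:50152")
    # Catch-all positional: from the first positional onward, every remaining
    # token (including ones that look like flags) is stored in `opts`.
    parser.add_argument("opts", nargs=argparse.REMAINDER)
    return parser


parser = build_parser()

# --dist-url placed AFTER the config options: it is swallowed by `opts`,
# so dist_url keeps its default value.
after = parser.parse_args(
    ["--num-gpus", "4", "OUTPUT_DIR", "training_dir/run",
     "--dist-url", "tcp://127.0.0.1:50001"]
)

# --dist-url placed BEFORE the config options: it is parsed as a flag.
before = parser.parse_args(
    ["--dist-url", "tcp://127.0.0.1:50001",
     "--num-gpus", "4", "OUTPUT_DIR", "training_dir/run"]
)

print("after: ", after.dist_url, after.opts)
print("before:", before.dist_url, before.opts)
```

Comparing the two printed lines shows whether the port given on the command line actually reached dist_url.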

@Ziyan0829

Hi, I'm running into the same problem. Have you solved it?
