RuntimeError: Address already in use #149

Willy0919 · 2020-07-11T13:30:28Z

When I run a program on multiple GPUs, training works correctly. But when I start a second, similar program with only a few parameters changed, I get the RuntimeError below. Even when I pass a new --dist-url, the printed arguments show that the dist-url has not changed:

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001

Command Line Args: Namespace(config_file='configs/FCOS-Detection/R_50_1x.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['OUTPUT_DIR', 'training_dir/fcos_R_50_1x_3d_ctr_real', '--dist-url', 'tcp://127.0.0.1:50001'], resume=False)
Process group URL: tcp://127.0.0.1:50152
Traceback (most recent call last):
File "/home/wl/code/AdelaiDet/tools/train_net.py", line 243, in <module>
args=(args,),
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 54, in launch
daemon=False,
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 72, in _distributed_worker
raise e
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 67, in _distributed_worker
backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use


tianzhi0549 · 2020-07-11T13:37:09Z

@Willy0919 Please specify --dist-url manually, with a different port, in the training command line.
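One way to pick "a different port" without guessing is to ask the operating system for a free one. The snippet below is only a sketch: `find_free_port` and `make_dist_url` are hypothetical helper names, not part of the training script, and another process could in principle grab the port between the check and the start of training.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on `host` (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def make_dist_url(host="127.0.0.1"):
    """Build a tcp:// URL suitable for passing as --dist-url (hypothetical helper)."""
    return "tcp://{}:{}".format(host, find_free_port(host))


if __name__ == "__main__":
    # Print a URL that can be pasted after --dist-url on the command line.
    print(make_dist_url())
```

The printed value can then be passed as the --dist-url argument of the second training run.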

Willy0919 · 2020-07-14T06:02:02Z

@tianzhi0549 I have done this as described:

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001

but it did not work.

tianzhi0549 · 2020-07-14T15:53:06Z

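Please place --dist-url tcp://127.0.0.1:50001 before the config options (OUTPUT_DIR ...). Otherwise it is parsed as part of opts, as the Namespace output above shows, and the default dist-url is used. For example: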

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--dist-url tcp://127.0.0.1:50001 \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real
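Why the ordering matters can be illustrated with a small argparse sketch. It assumes the training script collects trailing config overrides in a positional `opts` argument declared with `argparse.REMAINDER`, which is consistent with the Namespace printed in the first post but is not copied from the actual parser; the default URL is simply the one from that log.

```python
import argparse


def build_parser():
    """A stand-in parser; an assumption, not the real train_net.py parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default="")
    parser.add_argument("--num-gpus", type=int, default=1)
    # Default taken from the log in the first post, for illustration only.
    parser.add_argument("--dist-url", default="tcp://127.0.0.1:50152")
    # REMAINDER swallows everything from the first positional onward,
    # including anything that looks like a flag.
    parser.add_argument("opts", nargs=argparse.REMAINDER, default=None)
    return parser


parser = build_parser()

# --dist-url after OUTPUT_DIR: it lands in opts and the default URL is kept.
late = parser.parse_args(
    ["--num-gpus", "4", "OUTPUT_DIR", "out", "--dist-url", "tcp://127.0.0.1:50001"]
)

# --dist-url before OUTPUT_DIR: it is parsed as a real flag.
early = parser.parse_args(
    ["--dist-url", "tcp://127.0.0.1:50001", "--num-gpus", "4", "OUTPUT_DIR", "out"]
)

print(late.dist_url, late.opts)
print(early.dist_url, early.opts)
```

In the first call `dist_url` keeps its default and the flag shows up inside `opts`, which matches the symptom reported above; in the second call the new port takes effect.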

Ziyan0829 · 2021-12-21T01:06:56Z

Hi, I have run into the same problem. Have you solved it?
