When I run a program on multiple GPUs, it trains correctly. But when I launch a second, similar program that only changes a few parameters, I hit a RuntimeError. Even after I assigned a new dist-url, the printed information showed that the dist-url had not changed:
Command Line Args: Namespace(config_file='configs/FCOS-Detection/R_50_1x.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['OUTPUT_DIR', 'training_dir/fcos_R_50_1x_3d_ctr_real', '--dist-url', 'tcp://127.0.0.1:50001'], resume=False)
Process group URL: tcp://127.0.0.1:50152
Traceback (most recent call last):
File "/home/wl/code/AdelaiDet/tools/train_net.py", line 243, in
args=(args,),
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 54, in launch
daemon=False,
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 72, in _distributed_worker
raise e
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 67, in _distributed_worker
backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
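The Namespace output already hints at why the new port never took effect: `--dist-url tcp://127.0.0.1:50001` ended up inside `opts`, so the default `tcp://127.0.0.1:50152` (still held by the first training run) was reused, hence "Address already in use". Below is a minimal, hypothetical argparse sketch (not detectron2's actual parser, only the same `nargs=argparse.REMAINDER` pattern its default argument parser appears to use) showing how a flag placed after the positional `opts` tokens gets swallowed into `opts` instead of overriding the default:

import argparse

# Hypothetical parser mimicking the relevant parts of detectron2's
# default_argument_parser: flags first, then a catch-all `opts` positional.
parser = argparse.ArgumentParser()
parser.add_argument("--num-gpus", type=int, default=1)
parser.add_argument("--dist-url", default="tcp://127.0.0.1:50152")
parser.add_argument("opts", nargs=argparse.REMAINDER)

# --dist-url given AFTER the opts tokens: REMAINDER consumes it as plain data,
# so dist_url keeps its default (the port already in use by the first run).
late = parser.parse_args(
    ["--num-gpus", "4",
     "OUTPUT_DIR", "training_dir/fcos_R_50_1x_3d_ctr_real",
     "--dist-url", "tcp://127.0.0.1:50001"]
)
print(late.dist_url)  # tcp://127.0.0.1:50152  (default, unchanged)
print(late.opts)      # [..., '--dist-url', 'tcp://127.0.0.1:50001']

# --dist-url given BEFORE the opts tokens: parsed as intended.
early = parser.parse_args(
    ["--num-gpus", "4",
     "--dist-url", "tcp://127.0.0.1:50001",
     "OUTPUT_DIR", "training_dir/fcos_R_50_1x_3d_ctr_real"]
)
print(early.dist_url)  # tcp://127.0.0.1:50001

So, as a workaround, passing `--dist-url tcp://127.0.0.1:50001` before the `OUTPUT_DIR ...` opts should make the second run bind a different port; if your detectron2 version supports it, `--dist-url auto` lets the launcher pick a free port itself.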