
Why is the process loading the data killed? #157

Open
fuyuchenIfyw opened this issue Dec 12, 2022 · 3 comments

Comments

@fuyuchenIfyw

Describe the bug
Hello, I ran into a bug with CacheDataset while following the training procedure in research-contributions/DiNTS/train_multi-gpu.py. I used the MSD Task03_Liver dataset, and while CacheDataset was loading the data, the following error occurred partway through loading:
```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239172 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239173 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 239174) of binary: /home/fyc/anaconda3/envs/cv/bin/python
Traceback (most recent call last):
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_multi-gpu.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2022-12-12_16:06:40
  host      : amax
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 239174)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 239174
=======================================================
```
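
Exit code -9 means the worker was sent SIGKILL, which on Linux most often comes from the kernel OOM killer rather than from the script itself: with cache_rate=1.0 each rank tries to hold its preprocessed volumes in host RAM, and with three training processes the copies add up. As a rough sanity check, something like the sketch below (the dataset path and the float32 assumption are illustrative, not taken from the DiNTS code) can estimate how much memory full caching would need:

```python
# Rough, hypothetical estimate of the RAM needed to cache every training volume.
# The dataset path and the float32 assumption are illustrative placeholders.
import glob
import os

import nibabel as nib
import numpy as np

data_dir = "Task03_Liver"  # hypothetical path to the MSD liver dataset
total_bytes = 0
for path in glob.glob(os.path.join(data_dir, "imagesTr", "*.nii.gz")):
    shape = nib.load(path).shape            # reads the header only, not the voxel data
    total_bytes += int(np.prod(shape)) * 4  # 4 bytes per voxel if cached as float32

print(f"~{total_bytes / 1e9:.1f} GB per rank before any resampling or augmentation")
```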

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'research-contributions/DiNTS'
  2. Install dependencies
  3. Run the command 'bash run_train_multi-gpu.sh'
    (I didn't follow the README.md and use Docker; I used a conda environment instead.)

Expected behavior
When I set the cache_rate of CacheDataset to 0.0, training runs normally, but when I set cache_rate to 1.0, the error above occurs.
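
If the full cache does not fit in memory, a partial in-memory cache is a common middle ground between 0.0 and 1.0. A minimal sketch follows; the file list, transform chain, and the 0.3 rate are illustrative placeholders, not values from train_multi-gpu.py:

```python
# Minimal sketch: cache only a fraction of the items so the cache fits in RAM.
# File names, transforms, and cache_rate=0.3 are placeholders, not DiNTS defaults.
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

files = [{"image": "liver_0.nii.gz", "label": "liver_0_seg.nii.gz"}]  # placeholder list
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

train_ds = CacheDataset(
    data=files,
    transform=transforms,
    cache_rate=0.3,   # cache roughly 30% of the items instead of all of them
    num_workers=4,
)
loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)
```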

Screenshots
(screenshot of the error attached in the original issue)

Environment (please complete the following information):

  • OS: Ubuntu 18.04.5 LTS
  • Python version: 3.8
  • MONAI version: 1.0.1
  • CUDA/cuDNN version: 11.7
  • GPU models and configuration: 3 × RTX 3090

Additional context
Add any other context about the problem here.

@Jamshidhsp

Hi @fuyuchenIfyw,
I have the same problem. Did you manage to figure it out?

@fuyuchenIfyw (Author) commented Apr 1, 2023 via email

@ancia290 commented Apr 4, 2024

I have the same problem.

While loading about 300 3D images I got the following error:

```
monai.transforms.croppad.dictionary CropForegroundd.__init__:allow_smaller: Current default value of argument allow_smaller=True has been deprecated since version 1.2. It will be changed to allow_smaller=False in version 1.5.
Loading dataset:  80%|████████████████████▋ | 207/260 [27:33<09:36, 10.87s/it] Killed
```

Any idea?
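
A `Killed` message at 80% of the caching progress also points at the OOM killer. One workaround that avoids keeping everything in RAM is MONAI's PersistentDataset, which caches the results of the deterministic transforms on disk; the sketch below uses placeholder file names and cache directory:

```python
# Sketch: cache deterministic transform outputs on disk instead of in RAM.
# The file list and cache directory are placeholders.
from monai.data import DataLoader, PersistentDataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

files = [{"image": "img_0.nii.gz", "label": "seg_0.nii.gz"}]  # placeholder list
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# Results of the deterministic part of the transform chain are written to
# cache_dir on first use and reloaded from disk on later epochs.
train_ds = PersistentDataset(data=files, transform=transforms, cache_dir="./persistent_cache")
loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)
```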
