
Why is the process loading the data killed? #157

Open
fuyuchenIfyw opened this issue Dec 12, 2022 · 3 comments

Comments

@fuyuchenIfyw

Describe the bug
Hello, I ran into a bug with CacheDataset while following the training procedure in research-contributions/DiNTS/train_multi-gpu.py. I used the MSD Task03_Liver dataset, and while CacheDataset was loading the data, the following error occurred partway through loading:
```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239172 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239173 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 239174) of binary: /home/fyc/anaconda3/envs/cv/bin/python
Traceback (most recent call last):
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_multi-gpu.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2022-12-12_16:06:40
  host      : amax
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 239174)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 239174
=======================================================
```
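
Exit code -9 means the worker was sent SIGKILL, which on Linux most often comes from the kernel OOM killer rather than from the script itself: with cache_rate=1.0 each rank tries to hold its preprocessed volumes in host RAM, and with three training processes the copies add up. As a rough sanity check, something like the sketch below (the dataset path and the float32 assumption are illustrative, not taken from the DiNTS code) can estimate how much memory full caching would need:

```python
# Rough, hypothetical estimate of the RAM needed to cache every training volume.
# The dataset path and the float32 assumption are illustrative placeholders.
import glob
import os

import nibabel as nib
import numpy as np

data_dir = "Task03_Liver"  # hypothetical path to the MSD liver dataset
total_bytes = 0
for path in glob.glob(os.path.join(data_dir, "imagesTr", "*.nii.gz")):
    shape = nib.load(path).shape            # reads the header only, not the voxel data
    total_bytes += int(np.prod(shape)) * 4  # 4 bytes per voxel if cached as float32

print(f"~{total_bytes / 1e9:.1f} GB per rank before any resampling or augmentation")
```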

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'research-contributions/DiNTS'
  2. Install dependencies
  3. Run the command 'bash run_train_multi-gpu.sh'
    (I didn't follow the README.md and use Docker; I used a conda environment instead.)

Expected behavior
When I set the cache_rate of CacheDataset to 0.0, training runs normally, but when I set cache_rate to 1.0, the error above occurs.
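
If the full cache does not fit in memory, a partial in-memory cache is a common middle ground between 0.0 and 1.0. A minimal sketch follows; the file list, transform chain, and the 0.3 rate are illustrative placeholders, not values from train_multi-gpu.py:

```python
# Minimal sketch: cache only a fraction of the items so the cache fits in RAM.
# File names, transforms, and cache_rate=0.3 are placeholders, not DiNTS defaults.
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

files = [{"image": "liver_0.nii.gz", "label": "liver_0_seg.nii.gz"}]  # placeholder list
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

train_ds = CacheDataset(
    data=files,
    transform=transforms,
    cache_rate=0.3,   # cache roughly 30% of the items instead of all of them
    num_workers=4,
)
loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)
```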

Screenshots
(screenshot of the error attached in the original issue)

Environment (please complete the following information):

  • OS: Ubuntu 18.04.5 LTS
  • Python version: 3.8
  • MONAI version: 1.0.1
  • CUDA/cuDNN version: 11.7
  • GPU models and configuration: 3 × RTX 3090

Additional context
Add any other context about the problem here.

@Jamshidhsp

Hi @fuyuchenIfyw,
I have the same problem. Did you manage to figure it out?

@fuyuchenIfyw (Author) commented Apr 1, 2023 via email

@ancia290 commented Apr 4, 2024

I have the same problem.

While loading about 300 3D images I got the following error:

```
monai.transforms.croppad.dictionary CropForegroundd.__init__:allow_smaller: Current default value of argument allow_smaller=True has been deprecated since version 1.2. It will be changed to allow_smaller=False in version 1.5.
Loading dataset:  80%|████████████████████▋ | 207/260 [27:33<09:36, 10.87s/it] Killed
```

Any idea?
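
A `Killed` message at 80% of the caching progress also points at the OOM killer. One workaround that avoids keeping everything in RAM is MONAI's PersistentDataset, which caches the results of the deterministic transforms on disk; the sketch below uses placeholder file names and cache directory:

```python
# Sketch: cache deterministic transform outputs on disk instead of in RAM.
# The file list and cache directory are placeholders.
from monai.data import DataLoader, PersistentDataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

files = [{"image": "img_0.nii.gz", "label": "seg_0.nii.gz"}]  # placeholder list
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# Results of the deterministic part of the transform chain are written to
# cache_dir on first use and reloaded from disk on later epochs.
train_ds = PersistentDataset(data=files, transform=transforms, cache_dir="./persistent_cache")
loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)
```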
