Why is the process loading the data killed? #157
Comments
Hi @fuyuchenIfyw,
I'm sorry that I ultimately didn't solve the problem. I believe it was due to insufficient hardware resources. In the end, I gave up on using MONAI and instead opted to develop with the MIC-DKFZ/nnUNet framework on GitHub.
Echo
------------------ Original message ------------------
Hi @fuyuchenIfyw,
I have the same problem. Did you manage to figure it out?
I have the same problem. When loading about 300 3D images I got the following warning from monai.transforms.croppad.dictionary: CropForegroundd.__init__: allow_smaller: Current default value of argument … Any idea?
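Not from the original thread, but that warning is usually about relying on the default value of `allow_smaller`, so passing the argument explicitly should silence it. A minimal sketch, assuming a recent MONAI release and hypothetical dictionary keys:

```python
# Sketch only: set allow_smaller explicitly so CropForegroundd does not rely on
# its default value (which is what triggers the warning).
from monai.transforms import CropForegroundd

crop = CropForegroundd(
    keys=["image", "label"],  # hypothetical keys; use the keys in your data dicts
    source_key="image",       # channel used to compute the foreground bounding box
    allow_smaller=True,       # explicit value instead of the deprecated default
)
```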
Describe the bug
Hello, I ran into a bug with CacheDataset while following the training procedure provided by research-contributions/DiNTS/train_multi-gpu.py. I used the MSD Task03 Liver dataset; when loading the data with CacheDataset, the process was killed partway through loading:
```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239172 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239173 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 239174) of binary: /home/fyc/anaconda3/envs/cv/bin/python
Traceback (most recent call last):
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in
main()
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_multi-gpu.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-12_16:06:40
host : amax
rank : 2 (local_rank: 2)
exitcode : -9 (pid: 239174)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 239174
=======================================================
```
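For what it's worth (not part of the original report), exit code -9 means the worker received SIGKILL, which on Linux is commonly the kernel's out-of-memory killer; with cache_rate=1.0 the transformed volumes are held in RAM, and the multiple ranks started by torch.distributed.launch can multiply that footprint, so memory can run out during loading. A rough sketch for comparing the raw cache size against available memory, assuming nibabel and psutil are installed and using a hypothetical dataset path:

```python
# Rough estimate only: compare the size of the raw training volumes against the
# RAM that is currently available.
import glob

import nibabel as nib
import numpy as np
import psutil

# Hypothetical dataset location; point this at your MSD Task03_Liver images.
image_paths = sorted(glob.glob("Task03_Liver/imagesTr/*.nii.gz"))

# nibabel reads only the header here, so no image data is loaded.
# float32 voxels (4 bytes) are a lower bound for what CacheDataset keeps in RAM;
# transforms that resample or add channels make the real footprint larger.
est_bytes = sum(np.prod(nib.load(p).shape) * 4 for p in image_paths)
avail_bytes = psutil.virtual_memory().available

print(f"estimated cache size: {est_bytes / 1e9:.1f} GB")
print(f"available RAM:        {avail_bytes / 1e9:.1f} GB")
```

If the estimate is close to or above the available RAM (times the number of ranks that cache independently), a SIGKILL during loading is consistent with the OOM killer, and dmesg will usually confirm it.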
To Reproduce
Steps to reproduce the behavior:
I didn't follow the README.md and use Docker; I used a conda environment instead.
Expected behavior
When I set the cache_rate of CacheDataset to 0.0, training runs normally, but when I set cache_rate to 1.0, the error occurs.
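Not from the original thread, but two common ways to avoid holding the whole dataset in RAM are lowering cache_rate or caching the pre-processed items on disk with PersistentDataset. A minimal sketch with hypothetical file lists and transforms standing in for those built in train_multi-gpu.py:

```python
# Sketch only: alternatives to cache_rate=1.0 when RAM is the bottleneck.
from monai.data import CacheDataset, PersistentDataset
from monai.transforms import Compose, LoadImaged

# Hypothetical stand-ins for the data list and transform chain in the script.
train_files = [{"image": "imagesTr/liver_0.nii.gz", "label": "labelsTr/liver_0.nii.gz"}]
train_transforms = Compose([LoadImaged(keys=["image", "label"])])

# Option 1: cache only a fraction of the items in memory.
train_ds = CacheDataset(
    data=train_files,
    transform=train_transforms,
    cache_rate=0.3,   # 0.0 = no in-memory caching, 1.0 = cache every item
    num_workers=4,
)

# Option 2: cache the pre-processed items on disk instead of in RAM.
train_ds = PersistentDataset(
    data=train_files,
    transform=train_transforms,
    cache_dir="./persistent_cache",  # hypothetical cache directory
)
```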
Screenshots
Environment (please complete the following information):
Additional context