Auto3dseg error using multiple GPUs #7238
Has anyone seen this? I can add the configs I used if needed.
Hi @pwrightkcl, did you try MONAI with the latest version?
I tried making sure my Docker image has MONAI 1.3 and reran, but got the same error. I can post debug info and the error log again if you like. The first time, I didn't realise I needed
Hi @pwrightkcl, I couldn't reproduce the issue.
Hi @KumoLiu, I have tried the code (minus the matplotlib parts, because I'm running inside a Docker container) and get the same error. The Docker image I'm using is based on projectmonai/monai:latest. I have to use torch==1.13 to match our cluster's CUDA version, so this may be a CUDA version issue; I added a line to show that (11.7). Our cluster is being upgraded shortly, so if you think it's CUDA I'll try again after the upgrade (up to a week from now). Here's the output. I notice that the first part is repeated twice, possibly something to do with the parallelisation. I have not set the OMP_NUM_THREADS environment variable for this run, but setting it in the past didn't make any difference.
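For context, a version check along the lines mentioned above can be a few print statements; this is an illustrative snippet, not the exact line from the original script or log:

```python
# Illustrative version check: report the torch / CUDA pairing and visible GPUs.
import torch

print(torch.__version__)          # e.g. 1.13.x in the setup described above
print(torch.version.cuda)         # CUDA version torch was built against (11.7 here)
print(torch.cuda.device_count())  # how many GPUs this process can see
```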
Our cluster has been updated to the latest NVIDIA drivers, so I tried training with multiple GPUs again, using the latest MONAI Docker image, but got the same errors. Attaching to save space.
@KumoLiu I had another look at the test code you gave and saw it sets the CUDA_VISIBLE_DEVICES environment variable. When I set that, it works: it says "Found 1 GPUs for data analyzing!" as expected and runs to completion. When I omit the environment variable, the multi-GPU version gets to "Found 3 GPUs for data analyzing!", then repeats the config info before crashing the second time it reaches "Found 3 GPUs" (I'm attaching the logs so you can see this). So it looks like the script itself is running twice. This is similar to the training script above, which repeats the first log lines but doesn't have the config line like the test script. Does this help diagnose what is going wrong?
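As an illustration of the workaround described above, a minimal sketch might look like the following; the work_dir and input paths are placeholders, not taken from the original report, and the environment variable is set before torch/MONAI can initialise CUDA:

```python
# Minimal sketch of the single-GPU workaround (placeholder paths).
# CUDA_VISIBLE_DEVICES must be set before CUDA is initialised, so it goes
# at the very top, before importing torch or MONAI.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only one GPU to the data analysis

from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(
    work_dir="./work_dir",        # placeholder output directory
    input="./task_input.yaml",    # placeholder task configuration
)
runner.run()
```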
Hi @pwrightkcl, perhaps the issue is due to the
Thank you for the suggestion. I'm new to Auto3DSeg, so can you clarify what you want me to try? I have seen the tutorial breaking down the components of AutoRunner, so I could run DataAnalyzer first on one GPU and then run the other steps. Is that what you mean, or is there an input to AutoRunner to tell it to skip the DataAnalyzer step? I'd be interested to know if anyone can replicate this problem, since I was able to elicit it just using the Hello World demo.
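If splitting the steps is what was meant, a sketch along those lines could look like the code below. This assumes AutoRunner's analyze switch and that it picks up datastats.yaml from the work_dir; all paths are placeholders rather than the reporter's actual files:

```python
# Sketch: run DataAnalyzer on its own, then skip the analysis step in AutoRunner.
# Run step 1 with a single visible GPU (e.g. CUDA_VISIBLE_DEVICES=0) to avoid
# the multi-process DataAnalyzer path; step 2 can then use all GPUs.
from monai.apps.auto3dseg import AutoRunner, DataAnalyzer

if __name__ == "__main__":
    # Step 1: data analysis, writing the stats into the work dir.
    analyzer = DataAnalyzer(
        datalist="./task_datalist.json",         # placeholder datalist
        dataroot="./data",                       # placeholder data root
        output_path="./work_dir/datastats.yaml",
    )
    analyzer.get_all_case_stats()

    # Step 2: skip the analysis inside AutoRunner and go straight to
    # algorithm generation and training.
    runner = AutoRunner(
        work_dir="./work_dir",
        input="./task_input.yaml",               # placeholder task configuration
        analyze=False,                           # skip the DataAnalyzer step
    )
    runner.run()
```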
@ericspod advised me to put my script inside an if __name__ == "__main__": block. Here's the log: autoseg3d-test-multi-75dd8bd6d2db.log. This log brings up two related questions about multi-GPU use:
I realise that these questions, although related, are outside the specific issue, so I can move them to Discussions if you prefer.
@KumoLiu I tried Auto3DSeg with multiple GPUs and it indeed fails with the same stack trace in DataAnalyzer.
@kretes Sorry for the slow response (just got back from leave). Just to confirm, the solution for me was to put my code in an if __name__ == "__main__": block.
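For anyone hitting the same bootstrapping error, here is a minimal sketch of that fix; the paths are placeholders, not the original script. Everything that launches AutoRunner sits under the main guard so that the worker processes spawned during data analysis can safely re-import the module without starting the run again:

```python
# Minimal sketch of the if __name__ == "__main__": fix (placeholder paths).
# When the data analysis spawns one process per GPU, each child re-imports this
# module; the guard stops the children from launching AutoRunner a second time.
from monai.apps.auto3dseg import AutoRunner


def main():
    runner = AutoRunner(
        work_dir="./work_dir",        # placeholder output directory
        input="./task_input.yaml",    # placeholder task configuration
    )
    runner.run()


if __name__ == "__main__":
    main()
```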
Describe the bug
When I run auto3dseg with multiple GPUs, it gives an error relating to one process being started before another has finished bootstrapping. It runs with a single GPU.
To Reproduce
Steps to reproduce the behavior:
I am using a Docker image built on the latest MONAI image. I am submitting the image to RunAI with --gpu 4.
Expected behavior
The runner does data analysis and begins to train.
Here is a log where training starts as expected when requesting only a single GPU.
Screenshots
Here is the log:
Environment
Ensuring you use the relevant python executable, please paste the output of:
Note that the command above didn't work, so I had to make a little .py script with each command on one line.
The output is for whatever node the debug job was assigned to on our cluster. I requested the same resources
as for the training job that failed, but it may not be the same machine as the one that ran my training script,
as there are three different kinds of DGXs on our cluster.
Additional context
CC: @marksgraham