RuntimeError: CUDA error: invalid device ordinal #3

Closed · Bailey-24 opened this issue May 22, 2023 · 12 comments
@Bailey-24

Bailey-24 commented May 22, 2023

I ran the command python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center and got the error below.

Why did this happen, and how can I solve it?

Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 266, in inference_worker
    agent = agent_class(**agent_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/src/models/agent_fbe_owl.py", line 74, in __init__
    center_only=center_only)
  File "/home/pi/Desktop/RL_learning/cow/src/models/localization/clip_owl.py", line 104, in __init__
    self.model = MyOwlViTForObjectDetection.from_pretrained(owl_from_pretrained).eval().to(device)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
pi@pi:~$ nvidia-smi
Mon May 22 14:44:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 41%   50C    P0    50W / 150W |   2613MiB /  8192MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 41%   46C    P8    13W / 150W |     16MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                 65MiB |
|    0   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                501MiB |
|    0   N/A  N/A      3381      G   /usr/bin/gnome-shell               73MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       45MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess      150MiB |
|    0   N/A  N/A    711196      G   ...d-files --enable-crashpad       21MiB |
|    0   N/A  N/A    931804      G   ...mviewer/tv_bin/TeamViewer        4MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess       76MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144      146MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072       87MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144      116MiB |
|    0   N/A  N/A   3544121      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                  3MiB |
+-----------------------------------------------------------------------------+
(cow) pi@pi:~/Desktop/RL_learning/cow$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(cow) pi@pi:~/Desktop/RL_learning/cow$ python scripts/test_torch_download.py
torch.cuda.is_available(): True
torch.tensor([1]).to(0): tensor([1], device='cuda:0')
Looks good.

I have tried the solutions suggested on StackOverflow and GitHub, but the problem persists.
Could the CUDA version be the issue?
I would like to run on both GPUs.
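For reference, here is a minimal diagnostic sketch (the file name check_devices.py is just an illustration) that lists the device ordinals PyTorch can actually see; "invalid device ordinal" usually means something requested an index at or above torch.cuda.device_count():

```python
# check_devices.py -- list the CUDA devices visible to this process.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```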

@sagadre
Collaborator

sagadre commented May 22, 2023

Hi! Thanks for the question and the interest in the work. When developing this code, I was using a machine with 8 GPUs. I just pushed a change to make the code compatible with more machines. See here: 833f421

Note: for a 2 GPU machine, you may also want to try running with -n 2 or -n 4 if you find -n 8 is running into CPU or memory bottlenecks.

Let me know if you are still running into problems and thanks for the issue!
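For anyone else hitting this on a smaller machine, the general pattern is to clamp each worker's device index to the GPUs that actually exist. A minimal sketch of that idea (not necessarily identical to what 833f421 does; assign_device is a hypothetical helper):

```python
import torch

def assign_device(worker_idx: int) -> torch.device:
    """Map a worker index onto an existing GPU, falling back to CPU."""
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    # Wrap around so worker 7 on a 2-GPU box lands on cuda:1, not cuda:7.
    return torch.device(f"cuda:{worker_idx % n_gpus}")
```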

@Bailey-24
Author

I now have the same problem as issue #4.
I ran python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center and, after hitting the same hang as in issue #4, I changed the timeout from 1000 to 10000, but the result is the same.
[screenshot]

Here is the log after I press Ctrl+C:

Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll                                                                                                                 
Process Process-1:                                                                                                                                                                                                 
Process Process-3:                                                                                                                                                                                                 
Process Process-7:                                                                                                                                                                                                 
Process Process-5:                                                                                                                                                                                                 
Process Process-2:                                                                                                                                                                                                 
Process Process-6:                                                                                                                                                                                                 
Process Process-4:                                                                                                                                                                                                 
Process Process-8:                                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run                                                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                                                      
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run                                                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                                                      
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
KeyboardInterrupt
KeyboardInterrupt
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

My machine is not out of memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 42%   47C    P5    27W / 150W |   6109MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 39%   43C    P8    12W / 150W |   4849MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                 65MiB |
|    0   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                530MiB |
|    0   N/A  N/A      3381      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       21MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess       10MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess      167MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144       30MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072      126MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144       25MiB |
|    0   N/A  N/A   3798582      G   ...626843.log --shared-files      120MiB |
|    0   N/A  N/A   3823746      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823847      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823947      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3824047      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A   3823797      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823895      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823997      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3824097      C   ...onda3/envs/cow/bin/python     1207MiB |
+-----------------------------------------------------------------------------+

@Bailey-24
Author

Bailey-24 commented May 24, 2023

I found that these two lines are very slow to run:
[screenshot]

@sagadre
Collaborator

sagadre commented May 24, 2023

Are the processes running at all or are the threads locking?

@Bailey-24
Author

Yes, the processes are running; I used -n 1 to debug.

Regarding thread locking, I asked GPT and got:

In your code, you have separate processes that are interacting with the send_queue and receive_queue. Each process accesses these queues independently, and the Queue implementation handles the necessary synchronization to ensure safe access.
Therefore, you don't need to manually handle locks or synchronization between the processes in this particular code snippet. The Queue object takes care of these aspects for you, allowing concurrent access from multiple processes without causing conflicts.

So I did not add any manual locking.
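For context, a minimal sketch of the pattern GPT is describing, where multiprocessing.Queue handles inter-process synchronization without explicit locks (the queue names mirror the ones above; the worker function is just an illustration):

```python
import multiprocessing as mp

def worker(send_queue, receive_queue):
    # Queue handles locking internally, so no manual synchronization is needed.
    task = send_queue.get()        # blocks until a task is available
    receive_queue.put(task * 2)    # report the result back

if __name__ == "__main__":
    send_queue, receive_queue = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(send_queue, receive_queue))
    p.start()
    send_queue.put(21)
    print(receive_queue.get())     # -> 42
    p.join()
```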

@Bailey-24
Author

After running for a whole night:
[screenshot]

Would you please provide a Docker image?

@tyz1030

tyz1030 commented May 27, 2023

I have the same issue as @Bailey-24.

@Bailey-24
Author

Bailey-24 commented May 29, 2023

[screenshot]

I think the experiment ran, probably because I am now using an 8-GPU machine.
But I have another question: how can I visualize the runs? Is there a GUI?

@Southyang

Can it only run with 8 GPUs? I would also like to know about the GUI.

@sagadre
Collaborator

sagadre commented May 30, 2023

@Bailey-24 was your only change to switch to an 8 GPU machine?
Re: the GUI script, I will work on one and push it later today.

@Bailey-24
Author

Yes, my only change was to switch to an 8-GPU machine.

@sagadre
Collaborator

sagadre commented Jun 11, 2023

Interesting. I will close this issue, but will open a new issue for <8 GPU testing.

@sagadre sagadre closed this as completed Jun 11, 2023