RuntimeError: CUDA error: invalid device ordinal #3

Closed · Bailey-24 opened this issue May 22, 2023 · 12 comments
@Bailey-24

Bailey-24 commented May 22, 2023

I ran the command python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center and got the error below.

Why did this happen, and how can I solve it?

Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 266, in inference_worker
    agent = agent_class(**agent_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/src/models/agent_fbe_owl.py", line 74, in __init__
    center_only=center_only)
  File "/home/pi/Desktop/RL_learning/cow/src/models/localization/clip_owl.py", line 104, in __init__
    self.model = MyOwlViTForObjectDetection.from_pretrained(owl_from_pretrained).eval().to(device)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
pi@pi:~$ nvidia-smi
Mon May 22 14:44:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 41%   50C    P0    50W / 150W |   2613MiB /  8192MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 41%   46C    P8    13W / 150W |     16MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                 65MiB |
|    0   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                501MiB |
|    0   N/A  N/A      3381      G   /usr/bin/gnome-shell               73MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       45MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess      150MiB |
|    0   N/A  N/A    711196      G   ...d-files --enable-crashpad       21MiB |
|    0   N/A  N/A    931804      G   ...mviewer/tv_bin/TeamViewer        4MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess       76MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144      146MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072       87MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144      116MiB |
|    0   N/A  N/A   3544121      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                  3MiB |
+-----------------------------------------------------------------------------+
(cow) pi@pi:~/Desktop/RL_learning/cow$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(cow) pi@pi:~/Desktop/RL_learning/cow$ python scripts/test_torch_download.py
torch.cuda.is_available(): True
torch.tensor([1]).to(0): tensor([1], device='cuda:0')
Looks good.

I have tried the solutions suggested on StackOverflow and GitHub, but the problem persists.
Could the CUDA version be the issue?
I would like to run on both GPUs.
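For reference, here is a minimal diagnostic sketch (the file name check_devices.py is just an illustration) that lists the device ordinals PyTorch can actually see; "invalid device ordinal" usually means something requested an index at or above torch.cuda.device_count():

```python
# check_devices.py -- list the CUDA devices visible to this process.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```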

@sagadre
Collaborator

sagadre commented May 22, 2023

Hi! Thanks for the question and the interest in the work. When developing this code, I was using a machine with 8 GPUs. I just pushed a change to make the code compatible with more machines. See here: 833f421

Note: for a 2 GPU machine, you may also want to try running with -n 2 or -n 4 if you find -n 8 is running into CPU or memory bottlenecks.

Let me know if you are still running into problems and thanks for the issue!
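For anyone else hitting this on a smaller machine, the general pattern is to clamp each worker's device index to the GPUs that actually exist. A minimal sketch of that idea (not necessarily identical to what 833f421 does; assign_device is a hypothetical helper):

```python
import torch

def assign_device(worker_idx: int) -> torch.device:
    """Map a worker index onto an existing GPU, falling back to CPU."""
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    # Wrap around so worker 7 on a 2-GPU box lands on cuda:1, not cuda:7.
    return torch.device(f"cuda:{worker_idx % n_gpus}")
```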

@Bailey-24
Author

I now have the same problem as issue #4.
I ran python pasture_runner.py -a src.models.agent_fbe_owl -n 8 --arch B32 --center and, after hitting the same hang as in issue #4, I changed the timeout from 1000 to 10000, but the result is the same.
[screenshot]

Here is the log after I press Ctrl+C:

Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll                                                                                                                 
Process Process-1:                                                                                                                                                                                                 
Process Process-3:                                                                                                                                                                                                 
Process Process-7:                                                                                                                                                                                                 
Process Process-5:                                                                                                                                                                                                 
Process Process-2:                                                                                                                                                                                                 
Process Process-6:                                                                                                                                                                                                 
Process Process-4:                                                                                                                                                                                                 
Process Process-8:                                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run                                                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                                                      
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run                                                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                                                      
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                                                                                                             
    self.run()                                                                                                                                                                                                     
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
File "/home/pi/anaconda3/envs/cow/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
KeyboardInterrupt
  File "/home/pi/Desktop/RL_learning/cow/robothor_challenge.py", line 267, in inference_worker
    controller = ai2thor.controller.Controller(**controller_kwargs)
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
KeyboardInterrupt
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 498, in __init__
    host=host,
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/pi/anaconda3/envs/cow/lib/python3.7/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
KeyboardInterrupt
KeyboardInterrupt
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

My machine is not out of memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 42%   47C    P5    27W / 150W |   6109MiB /  8192MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:A1:00.0 Off |                  Off |
| 39%   43C    P8    12W / 150W |   4849MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                 65MiB |
|    0   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                530MiB |
|    0   N/A  N/A      3381      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      5141      G   ...2gtk-4.0/WebKitWebProcess       52MiB |
|    0   N/A  N/A     28551      G   ...RendererForSitePerProcess       21MiB |
|    0   N/A  N/A    235930      G   ...RendererForSitePerProcess       10MiB |
|    0   N/A  N/A   3011973      G   ...RendererForSitePerProcess      167MiB |
|    0   N/A  N/A   3258465      G   ...300715944505616879,262144       30MiB |
|    0   N/A  N/A   3268641      G   ...155906284107188537,131072      126MiB |
|    0   N/A  N/A   3489731      G   ...093122278100996567,262144       25MiB |
|    0   N/A  N/A   3798582      G   ...626843.log --shared-files      120MiB |
|    0   N/A  N/A   3823746      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823847      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3823947      C   ...onda3/envs/cow/bin/python     1207MiB |
|    0   N/A  N/A   3824047      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A      1842      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A      3181      G   /usr/lib/xorg/Xorg                  3MiB |
|    1   N/A  N/A   3823797      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823895      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3823997      C   ...onda3/envs/cow/bin/python     1207MiB |
|    1   N/A  N/A   3824097      C   ...onda3/envs/cow/bin/python     1207MiB |
+-----------------------------------------------------------------------------+

@Bailey-24
Author

Bailey-24 commented May 24, 2023

I found that these two lines are very slow to run:
[screenshot]

@sagadre
Collaborator

sagadre commented May 24, 2023

Are the processes running at all or are the threads locking?

@Bailey-24
Author

Yes, the processes are running; I used -n 1 to debug.

Regarding thread locking, I asked GPT and got:

In your code, you have separate processes that are interacting with the send_queue and receive_queue. Each process accesses these queues independently, and the Queue implementation handles the necessary synchronization to ensure safe access.
Therefore, you don't need to manually handle locks or synchronization between the processes in this particular code snippet. The Queue object takes care of these aspects for you, allowing concurrent access from multiple processes without causing conflicts.

So I did not add any manual locking.
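For context, a minimal sketch of the pattern GPT is describing, where multiprocessing.Queue handles inter-process synchronization without explicit locks (the queue names mirror the ones above; the worker function is just an illustration):

```python
import multiprocessing as mp

def worker(send_queue, receive_queue):
    # Queue handles locking internally, so no manual synchronization is needed.
    task = send_queue.get()        # blocks until a task is available
    receive_queue.put(task * 2)    # report the result back

if __name__ == "__main__":
    send_queue, receive_queue = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(send_queue, receive_queue))
    p.start()
    send_queue.put(21)
    print(receive_queue.get())     # -> 42
    p.join()
```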

@Bailey-24
Author

After running for a whole night:
[screenshot]

Would you please provide a Docker image?

@tyz1030

tyz1030 commented May 27, 2023

I have the same issue as @Bailey-24.

@Bailey-24
Author

Bailey-24 commented May 29, 2023

[screenshot]

I think the experiment ran, probably because I am now using an 8-GPU machine.
But I have another question: how can I visualize the runs? Is there a GUI?

@Southyang

Can it only run with 8 GPUs? I would also like to know about the GUI.

@sagadre
Collaborator

sagadre commented May 30, 2023

@Bailey-24 was your only change to switch to an 8 GPU machine?
Re: the GUI script, I will work on one and push it later today.

@Bailey-24
Author

Yes, my only change was to switch to an 8-GPU machine.

@sagadre
Collaborator

sagadre commented Jun 11, 2023

Interesting. I will close this issue, but will open a new issue for <8 GPU testing.

@sagadre sagadre closed this as completed Jun 11, 2023