Device plugin can't detect the vgpus #78

esposem · 2023-08-30T09:09:10Z

I currently have OpenShift 4.13 with OpenShift Virtualization (CNV) installed.
I installed the NVIDIA drivers through https://github.com/vladikr/ocp-nvidia-vgpu-installer, and they work as expected.

I added the following to the HyperConverged YAML:

spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: 
    - nvidia-258
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID RTX6000-3Q"
      resourceName: "nvidia.com/GRID_RTX6000-3Q"
      externalResourceProvider: true

I also checked that nvidia-258 exists:

$ cd /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types
$ cat nvidia-258/available_instances 
8

Then I created 2 mdev devices

$ UUID=$(uuidgen)
$ echo "${UUID}" > nvidia-258/create
$ mdevctl define --auto --uuid "${UUID}"
$ mdevctl list
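
As a cross-check on what `mdevctl list` reports, here is a minimal Python sketch that groups the mdev devices found in sysfs by their type name. It assumes the standard kernel layout `/sys/bus/mdev/devices/<uuid>/mdev_type/name`. The helper name `list_mdevs` is hypothetical. Replacing spaces with underscores only mirrors the form shown in the plugin log (`GRID_RTX6000-3Q`); it is not taken from the plugin source.

```python
from pathlib import Path


def list_mdevs(root="/sys/bus/mdev/devices"):
    """Return {type_name_with_underscores: [uuid, ...]} for mdevs under root.

    Assumption: each device directory exposes mdev_type/name, as in the
    kernel's VFIO mediated device sysfs layout.
    """
    by_type = {}
    base = Path(root)
    if not base.is_dir():
        # No mdev bus on this host (or wrong path): nothing to report.
        return by_type
    for dev in sorted(base.iterdir()):
        name_file = dev / "mdev_type" / "name"
        if not name_file.is_file():
            continue
        # "GRID RTX6000-3Q" -> "GRID_RTX6000-3Q", the form seen in the log.
        type_name = name_file.read_text().strip().replace(" ", "_")
        by_type.setdefault(type_name, []).append(dev.name)
    return by_type


if __name__ == "__main__":
    for type_name, uuids in list_mdevs().items():
        print(type_name, uuids)
```

If the key printed here differs from the suffix of the `resourceName` in the HyperConverged configuration, that mismatch would be worth ruling out first.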

Then I installed the kubevirt-gpu-device-plugin, but when I inspect the node's log I see:

2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Nvidia device 0000:05:00.0
2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Iommu Map map[]
2023/08/29 09:36:53 Device Map map[]
2023/08/29 09:36:53 vGPU Map map[GRID_RTX6000-3Q:[{21ad712a-f454-498c-84d5-4116f3723c01} {43922f20-6573-4d6b-9223-a2ca02f83b29}]]
2023/08/29 09:36:53 GPU vGPU Map map[0000:05:00.0:[21ad712a-f454-498c-84d5-4116f3723c01 43922f20-6573-4d6b-9223-a2ca02f83b29]]
2023/08/29 09:36:53 Could not find NVIDIA device with id: GRID_RTX6000-3Q
2023/08/29 09:36:53 DP Name GRID_RTX6000-3Q
2023/08/29 09:36:53 Devicename GRID_RTX6000-3Q
2023/08/29 09:36:58 [GRID_RTX6000-3Q] Error registering with device plugin manager: context deadline exceeded
2023/08/29 09:36:58 Error starting GRID_RTX6000-3Q device plugin: context deadline exceeded

And I can't run any VMI/VM: once I create one, it is never scheduled, because no vGPU is found to be available when I put the following in its YAML file:

spec:
      gpus:
      - deviceName: nvidia.com/GRID_RTX6000-3Q
        name: vgpu1

What did I do wrong?

esposem · 2023-08-30T14:00:58Z

@cdesiniotis could you please take a look at this?

rthallisey · 2024-01-24T15:24:35Z

@esposem what version of the device-plugin were you using?
This error usually appears when the pci.ids file in the device plugin is out of date. It should go away in newer versions.
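
One way to test the out-of-date pci.ids hypothesis is to look up the GPU's vendor and device IDs in a pci.ids file. The sketch below is a minimal parser for the public pci.ids text format: vendor lines start at column 0, device lines are indented by one tab, and subsystem lines by two tabs. The function name `lookup_pci_id` is hypothetical. Where the plugin image keeps its pci.ids file is not stated in this thread, so the file contents are passed in as a string.

```python
def _norm(hex_id):
    """Lower-case a hex ID and drop an optional 0x prefix (as sysfs prints it)."""
    hex_id = hex_id.strip().lower()
    return hex_id[2:] if hex_id.startswith("0x") else hex_id


def lookup_pci_id(pci_ids_text, vendor, device):
    """Return the device name for vendor:device, or None if it is not listed."""
    vendor, device = _norm(vendor), _norm(device)
    in_vendor = False
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("\t\t"):
            continue  # subsystem entry, not needed for this lookup
        if line.startswith("\t"):
            if in_vendor:
                parts = line.split(None, 1)
                if parts[0].lower() == device:
                    return parts[1].strip() if len(parts) > 1 else ""
            continue
        # Top-level line: a vendor (or a class entry such as "C 03 ...").
        if in_vendor:
            return None  # left the vendor's block without a match
        in_vendor = line.split(None, 1)[0].lower() == vendor
    return None
```

On the host, the two IDs for the card in this issue can be read from `/sys/bus/pci/devices/0000:05:00.0/vendor` and `/sys/bus/pci/devices/0000:05:00.0/device`. A `None` result for a file taken from the plugin image would be consistent with the explanation above, though it would not prove it.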
