Device plugin can't detect the vgpus #78

Open
esposem opened this issue Aug 30, 2023 · 2 comments

esposem commented Aug 30, 2023

I currently have OpenShift 4.13 with OpenShift Virtualization (CNV) installed.
I installed the NVIDIA drivers through https://github.com/vladikr/ocp-nvidia-vgpu-installer, and they work as expected.

I added the following to the HyperConverged YAML file:

spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: 
    - nvidia-258
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID RTX6000-3Q"
      resourceName: "nvidia.com/GRID_RTX6000-3Q"
      externalResourceProvider: true

I checked that nvidia-258 exists and has available instances:

$ cd /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types
$ cat nvidia-258/available_instances 
8
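
To see how a type ID such as nvidia-258 relates to a name such as "GRID RTX6000-3Q", it can help to print every supported type next to its name and remaining instance count. The following is a minimal sketch, assuming the standard mdev sysfs layout in which each type directory contains `name` and `available_instances` files; the function name `list_mdev_types` is hypothetical.

```shell
# Print "<type id> <TAB> <type name> <TAB> <available instances>" for every
# mdev type found under the given mdev_supported_types directory.
# list_mdev_types is a hypothetical helper name, not part of any tool.
list_mdev_types() {
    types_dir="$1"
    for t in "$types_dir"/*; do
        # Skip anything that is not a type directory with a readable name
        # (this also covers the case where the glob matches nothing).
        [ -r "$t/name" ] || continue
        printf '%s\t%s\t%s\n' \
            "$(basename "$t")" \
            "$(cat "$t/name")" \
            "$(cat "$t/available_instances")"
    done
}
```

On the node this would be called as `list_mdev_types /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types`.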

Then I created two mdev devices:

$ UUID=$(uuidgen);
$ echo "${UUID}" > nvidia-258/create;
$ mdevctl define --auto --uuid $UUID;
$ mdevctl list

Then I installed the kubevirt-gpu-device-plugin, but when I inspect its log on the node I see:

2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Nvidia device 0000:05:00.0
2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Iommu Map map[]
2023/08/29 09:36:53 Device Map map[]
2023/08/29 09:36:53 vGPU Map map[GRID_RTX6000-3Q:[{21ad712a-f454-498c-84d5-4116f3723c01} {43922f20-6573-4d6b-9223-a2ca02f83b29}]]
2023/08/29 09:36:53 GPU vGPU Map map[0000:05:00.0:[21ad712a-f454-498c-84d5-4116f3723c01 43922f20-6573-4d6b-9223-a2ca02f83b29]]
2023/08/29 09:36:53 Could not find NVIDIA device with id: GRID_RTX6000-3Q
2023/08/29 09:36:53 DP Name GRID_RTX6000-3Q
2023/08/29 09:36:53 Devicename GRID_RTX6000-3Q
2023/08/29 09:36:58 [GRID_RTX6000-3Q] Error registering with device plugin manager: context deadline exceeded
2023/08/29 09:36:58 Error starting GRID_RTX6000-3Q device plugin: context deadline exceeded
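
The vGPU map in this log can be cross-checked against sysfs on the node. The sketch below assumes the standard mdev layout, where each device under /sys/bus/mdev/devices has an `mdev_type/name` file. It also assumes, judging only from the log output and the resourceName above, that the plugin replaces spaces in the type name with underscores. The function name `list_mdevs` is hypothetical.

```shell
# Print "<mdev UUID> <type name with spaces replaced by underscores>" for
# every mediated device under the given root.
# The root defaults to the standard sysfs location; pass another path to test.
# list_mdevs is a hypothetical helper name, not part of any tool.
list_mdevs() {
    root="${1:-/sys/bus/mdev/devices}"
    for dev in "$root"/*; do
        # Skip entries without a readable type name
        # (this also covers the case where the glob matches nothing).
        [ -r "$dev/mdev_type/name" ] || continue
        name=$(tr ' ' '_' < "$dev/mdev_type/name")
        printf '%s %s\n' "$(basename "$dev")" "$name"
    done
}
```

If the two UUIDs show up here with the name GRID_RTX6000-3Q, the mdev devices themselves were found correctly, and the problem is more likely in the later lookup or registration steps.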

As a result, I can't run any VM/VMI: once I create one, it is never scheduled, because no vGPU is found to be available when I put the following in the YAML file:

spec:
      gpus:
      - deviceName: nvidia.com/GRID_RTX6000-3Q
        name: vgpu1

What did I do wrong?

esposem commented Aug 30, 2023

@cdesiniotis could you please take a look at this?

rthallisey (Collaborator) commented

@esposem what version of the device plugin were you using?
This error usually appears when the pci.ids file in the device plugin is out of date. It should go away in newer versions.
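
One way to check this is to look up the GPU's vendor and device IDs in the pci.ids file that the plugin image ships. The following is a minimal sketch that parses the standard pci.ids layout (vendor lines at column 0, device lines indented by one tab). The function name `pci_ids_lookup` is hypothetical, and the location of pci.ids inside the plugin container is an assumption that should be verified against the image.

```shell
# Look up a vendor:device pair in a pci.ids-format file.
# Usage: pci_ids_lookup <path to pci.ids> <vendor hex> <device hex>
# Prints the device name and returns 0 if found; returns non-zero otherwise.
# pci_ids_lookup is a hypothetical helper name, not part of any tool.
pci_ids_lookup() {
    awk -v vendor="$2" -v device="$3" '
        # Comment lines carry no IDs.
        /^#/ { next }
        # An unindented line starts a new vendor (or class) section.
        /^[^\t#]/ { in_vendor = ($1 == vendor); next }
        # A line with exactly one leading tab is a device of the current vendor.
        in_vendor && /^\t[^\t]/ && $1 == device {
            found = 1
            sub(/^\t[0-9a-f]+ +/, "")
            print
            exit
        }
        END { exit !found }
    ' "$1"
}
```

On the node, the IDs can be read from sysfs and stripped of their `0x` prefix:

```shell
ven=$(cat /sys/bus/pci/devices/0000:05:00.0/vendor)
dev=$(cat /sys/bus/pci/devices/0000:05:00.0/device)
# /usr/pci.ids is an assumed path; use wherever the plugin image keeps the file.
pci_ids_lookup /usr/pci.ids "${ven#0x}" "${dev#0x}" || echo "device not listed"
```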
