
Tesla M40 Problems & Memory Allocation Limit with Tesla M40 24GB -> Tesla M60 remapping #62

Open
BlaringIce opened this issue Jul 23, 2021 · 64 comments


@BlaringIce

First and primary:
I'm coming from a setup where I was using a GTX 1060 with vgpu_unlock just fine, but figured I'd step it up so that I could support more VMs. So, I'm currently trying to use a Tesla M40. Being a Tesla card, you might expect it not to need vgpu_unlock, but this is one of the few Tesla cards that doesn't support vGPU natively. So, I'm trying to use nvidia-18 types from the M60 profiles with my VMs. I'm aware that I should be using a slightly older guest driver to match my host driver. However, I'm still getting a code 43 when I load my guest. I would provide some logs here, but I'm not sure what I can include, since the entries for the two vgpu services both seem to be fine, with no errors other than nvidia-vgpu-mgr[2588]: notice: vmiop_log: display_init inst: 0 successful at the end of initializing the mdev device when the VM starts up. Please let me know any other information that I can provide to help debug/troubleshoot.
Second:
This is probably one of the few instances where this is a problem, since most GeForce/Quadro cards have less memory than their vGPU-capable counterparts. However, I have a Tesla M40 GPU that has 24 GB of vRAM (in two separate memory regions, I would guess, although this SKU isn't listed on the Nvidia graphics processing units Wikipedia page, so I'm not 100% sure). This is in comparison to the Tesla M60's 2x8GB configuration, of which only 8 GB is available for allocation in vGPU.
I'm not sure whether the max_instance quantity, as seen in mdevctl types, is defined on the Nvidia driver side, in the vgpu_unlock side, or if it's a mix and the vgpu_unlock side might be able to do something about it.
What I'm asking here, though, is whether this value can be redefined so that I can utilize all 24 GB of my available vRAM or, if not that, then at least the 12 GB that I presume is available in the GPU's primary memory.
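For reference, the limits in question can be inspected directly. A sketch, assuming an mdev-capable host; the PCI address (0000:21:00.0) and type name (nvidia-18, i.e. M60-2Q) are example values:

```shell
# List all vGPU types and their instance counts (requires mdevctl):
mdevctl types
# Or read the same information straight from sysfs:
DEV=/sys/bus/pci/devices/0000:21:00.0          # example PCI address
cat "$DEV/mdev_supported_types/nvidia-18/name"                # profile name
cat "$DEV/mdev_supported_types/nvidia-18/description"         # framebuffer, max instances
cat "$DEV/mdev_supported_types/nvidia-18/available_instances" # how many more fit
```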

@DualCoder
Owner

However, I'm still getting a code 43 when I load my guest.

Code 43 is sort of Nvidia's catch-all error, it doesn't really provide any useful information. I think you have two options:

  1. Create a new VM with a clean config/bios/disk and reinstall Windows and the matching drivers from scratch.
  2. Test with a Linux guest. The Linux drivers tend to provide human-readable error messages. If needed, you can create the file /etc/modprobe.d/nvidia.conf with the line options nvidia NVreg_ResmanDebugLevel=0 to enable verbose output from the driver (this works on both Linux hosts and guests).
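A minimal sketch of that logging setup (0 is the most verbose level; whether you also need an initramfs rebuild depends on the distro):

```shell
# Persist the verbose-logging option for the nvidia module:
echo 'options nvidia NVreg_ResmanDebugLevel=0' | sudo tee /etc/modprobe.d/nvidia.conf
# Reload the module (or simply reboot) so the option takes effect:
sudo modprobe -r nvidia && sudo modprobe nvidia
# Then watch the kernel log while reproducing the failure:
dmesg --follow | grep -iE 'nvrm|nvidia'
```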

What I'm asking here, though, is whether this value can be redefined so that I can utilize all 24 GB of my available vRAM or, if not that, then at least the 12 GB that I presume is available in the GPU's primary memory.

It seems that the M60 is quite special. If you compare the specs here:
https://www.pny.eu/en/consumer/explore-all-products/legacy-products/602-tesla-m60-r2l
https://www.pny.eu/en/consumer/explore-all-products/legacy-products/696-tesla-m40-24gb
You can see that the M60 explicitly states "16 GB GDDR5 (8 GB per board)", so I would expect your M40 to be technically capable of using all 24 GB. However, the profiles available are determined by the driver, and the current version of vgpu_unlock does not attempt to alter them in any way. If you do get the card working I will see if I can create a workaround; it would also be useful for utilizing the full 11 GB of a 1080 Ti.

@ualdayan

ualdayan commented Jul 24, 2021

I'm having issues with an M40 too. dmesg wasn't returning anything, but eventually I figured out I needed to go into hooks.c and turn on logging. Oddly, though, I still don't see any of the syslog output from the main script file anywhere in the logs, but now I do at least see vGPU unlock patch applied. Remap called.

I also saw 'nvidia-vgpu-mgr[4819]: op_type: 0xa0810115 failed.'

Still error 43 in windows with 443.18 drivers, in Linux it says 'probe of 0000:01:00.0 failed with error -1'.

Also just tried passing it right through without modifying IDs, then installing drivers that were bundled with the Linux VGPU drivers, it recognized it as an Nvidia GRID M60-2Q, but still failed with code 43.

@BlaringIce
Author

BlaringIce commented Jul 24, 2021

Well, I've made the decision to go ahead and return the card while I'm still inside the return window. I'll likely still have the card for a day or two if there's anything specific I can try. As for what I found since the initial post: I made a Linux guest, which I'm admittedly not as familiar with running Nvidia drivers on, as I've only used Linux with Nvidia-accelerated graphics on an older machine with a GTX 650.
I first tried to install on Xubuntu 20.04, but I couldn't figure out how their built-in store's drivers worked, and I got warnings about using the store instead when I tried to install manually.
So, after that I tried switching over to Rocky Linux 8 (closer to the environment I'm familiar with from the GTX 650). The newer driver's output wasn't very verbose; it pretty much just said that it couldn't load the 'nvidia-drm' kernel module during the install process, then quit.
The older driver that I tried gave a little more info in its dkms make log:
/var/lib/dkms/nvidia/450.66/build/nvidia/nv-pci.c: In function 'nv_pci_probe':
/var/lib/dkms/nvidia/450.66/build/nvidia/nv-pci.c:427:5: error: implicit declaration of function 'vga_tryget'; did you mean 'vga_get'? [-Werror=implicit-function-declaration]
vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
^~~~~~~~~~
vga_get
cc1: some warnings being treated as errors
But that doesn't tell me much either, other than that the (probably older) kernel and/or gcc version in Rocky Linux may not be happy compiling the driver code.
I can try any ideas that anyone else has during the time that I still have the card.

@ualdayan

When you start a guest does nvidia-smi report anything under processes for you? For me on the M40 it always returns 'No running processes found'.

@DualCoder
Owner

Still error 43 in windows with 443.18 drivers, in Linux it says 'probe of 0000:01:00.0 failed with error -1'.

Have you tried enabling verbose logging using options nvidia NVreg_ResmanDebugLevel=0 (see my previous comment) and did that provide any more information?

When you start a guest does nvidia-smi report anything under processes for you? For me on the M40 it always returns 'No running processes found'.

It is supposed to list a vgpu process for each running VM, but if the driver fails to load in the VM then it is probably not listed, so you should focus on getting the guest driver to work.

I also saw 'nvidia-vgpu-mgr[4819]: op_type: 0xa0810115 failed.'

These op_type: 0xNNNNN failed. messages can be ignored unless they are immediately followed by a more serious looking error.

@ualdayan

ualdayan commented Jul 24, 2021

I enabled verbose logging and here's the log entries:

This on repeat:
Jul 24 14:47:05 pop-os kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Jul 24 14:47:05 pop-os kernel:
Jul 24 14:47:05 pop-os kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Jul 24 14:47:05 pop-os kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:17f0)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 465.31 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: in this release's README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.
Jul 24 14:47:05 pop-os kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Jul 24 14:47:05 pop-os kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Jul 24 14:47:05 pop-os kernel: NVRM: None of the NVIDIA devices were initialized.
Jul 24 14:47:05 pop-os kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 236
Jul 24 14:47:05 pop-os systemd-udevd[562]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:47:06 pop-os kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236

And then this:
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-drm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd-udevd[10966]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jul 24 14:53:51 pop-os systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jul 24 14:53:51 pop-os systemd[1]: Failed to start NVIDIA Persistence Daemon.

This is with passing through the devid of a Quadro M6000, and using drivers that claim to support Quadro M6000s.

Random thought: if you pass the M40 directly through to a virtual machine, at first it doesn't work because it's in some kind of compute-only mode, but after you change the driver mode (nvidia-smi -g 0 -dm 0) it starts to function more like a regular GPU. It doesn't seem to be a persistent thing; e.g., it's saved somewhere in the registry of the Windows VM rather than somewhere on the card itself. In Linux, nvidia-smi tells you the mode can't be changed. What if it's stuck in some kind of compute mode in Linux (but not in Windows), and that's why it isn't enabling vGPU, since compute mode has to be off for the other Tesla cards before vGPU can be enabled?

@DualCoder
Owner

Jul 24 14:47:05 pop-os kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:17f0)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 465.31 driver release.

Two problems here:

  1. The PCI ID 10DE:17F0 is for the Quadro M6000. This should be one of the M60-<digit><letter> vGPU profiles.
  2. The driver 465.31 is not listed as a vGPU driver here: https://docs.nvidia.com/grid/index.html

So please try again without any PCI spoofing tricks in the qemu configuration and use an officially supported driver version.

Random thought: if you pass the M40 directly through to a virtual machine, at first it doesn't work because it's in some kind of compute-only mode, but after you change the driver mode (nvidia-smi -g 0 -dm 0) it starts to function more like a regular GPU. It doesn't seem to be a persistent thing; e.g., it's saved somewhere in the registry of the Windows VM rather than somewhere on the card itself. In Linux, nvidia-smi tells you the mode can't be changed. What if it's stuck in some kind of compute mode in Linux (but not in Windows), and that's why it isn't enabling vGPU, since compute mode has to be off for the other Tesla cards before vGPU can be enabled?

There might be something to this, Nvidia provides the gpumodeswitch tool to change the Tesla M60 and M6 cards between compute and graphics mode:
https://docs.nvidia.com/grid/12.0/grid-gpumodeswitch-user-guide/index.html
As far as I can tell this is a persistent change and the card will store the active mode in on-board non-volatile memory. Maybe nvidia-smi -a -i 0 can be used to read out the mode?
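For example (the --listgpumodes flag is from the gpumodeswitch user guide; nvidia-smi field names vary by driver version, hence the loose filter):

```shell
# Look for mode-related fields in the full property dump:
nvidia-smi -a -i 0 | grep -i 'mode'
# gpumodeswitch can also report the current mode without changing it:
sudo ./gpumodeswitch --listgpumodes
```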

@BlaringIce
Author

Well, I wasn't able to use nvidia-smi to tell, but I did try the gpumodeswitch tools. Sure enough, the card is in compute mode:
Tesla M40 (10DE,17FD,10DE,1173) H:--:NRM S:00,B:21,PCI,D:00,F:00
Adapter: Tesla M40 (10DE,17FD,10DE,1173) H:--:NRM S:00,B:21,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
InfoROM Version : G600.0200.02.02

Tesla M40 (10DE,17FD,10DE,1173) --:NRM 84.00.56.00.03
InfoROM Version : G600.0200.02.02
GPU Mode : Compute

From there I was able to use ./gpumodeswitch --gpumode graphics --auto, and a quick reboot later I was in Graphics mode (identical output to the above, but with "Compute" replaced by "Graphics"). Note: I had to use dkms to temporarily uninstall my host driver during this process, since gpumodeswitch did not like it running at the same time.
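That sequence, as a sketch; the dkms module name and version string here are assumptions (use whatever dkms status reports for your host driver):

```shell
sudo modprobe -r nvidia_vgpu_vfio nvidia     # driver must not be loaded
sudo dkms remove nvidia/460.73.01 --all      # hypothetical version string
sudo ./gpumodeswitch --gpumode graphics --auto
sudo reboot
# After the reboot, reinstall the host driver and confirm the mode:
sudo dkms install nvidia/460.73.01
sudo ./gpumodeswitch --listgpumodes
```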

Unfortunately, after this I reinstalled the driver, but I'm still getting a code 43 in Windows, and in Linux I'm still having trouble even installing the driver. I did finally realize that I need to blacklist nouveau, but I'm still getting errors. Just running the installer normally, I get an error about the DRM-KMS module not being built correctly. I'm not sure if excluding that from the compilation would be a problem, but I gave it a shot with the --no-drm flag. That didn't do much better though:
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA GPU(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

I have a couple ideas on some more things I can try. I'll report back if I have any more positive changes.

@BlaringIce
Author

Well, I've tried my couple of other ideas. Unfortunately, I had no luck with any of them either. I upgraded the host driver to version 460.73.02 and tried the Windows guest from there with 462.31, with no luck. Moved on to Linux from there. I did finally get the driver to install, technically (version 460.73.01 this time), but I did still need to use the --no-drm flag to do it.
Now that it's installed, nvidia-smi still gives me the error where it can't communicate with the driver. And... I'm not really sure if the output is that helpful since it doesn't look significantly different from @ualdayan 's output, but here's the results from running dmesg | grep -i nvidia

[ 5.737475] nvidia: loading out-of-tree module taints kernel.
[ 5.737487] nvidia: module license 'NVIDIA' taints kernel.
[ 5.751686] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.763114] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 5.764336] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 5.764496] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e)
NVRM: NVIDIA 460.73.01 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: specific graphics driver download page at www.nvidia.com.
[ 5.765856] nvidia: probe of 0000:01:00.0 failed with error -1
[ 5.765880] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 5.765881] NVRM: None of the NVIDIA devices were initialized.
[ 5.766848] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[ 29.386194] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 29.387996] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 29.388154] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e)
NVRM: NVIDIA 460.73.01 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: specific graphics driver download page at www.nvidia.com.
[ 29.388841] nvidia: probe of 0000:01:00.0 failed with error -1
[ 29.388881] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 29.388881] NVRM: None of the NVIDIA devices were initialized.
[ 29.389445] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[ 30.584459] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 30.586215] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 30.586376] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e)
NVRM: NVIDIA 460.73.01 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: specific graphics driver download page at www.nvidia.com.
[ 30.586991] nvidia: probe of 0000:01:00.0 failed with error -1
[ 30.587041] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 30.587042] NVRM: None of the NVIDIA devices were initialized.
[ 30.587214] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[ 33.497634] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 33.504624] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 33.504782] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e)
NVRM: NVIDIA 460.73.01 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: specific graphics driver download page at www.nvidia.com.
[ 33.505502] nvidia: probe of 0000:01:00.0 failed with error -1
[ 33.505535] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 33.505536] NVRM: None of the NVIDIA devices were initialized.
[ 33.509144] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241

I'm not really sure how assigning the PCI device ID works when you're normally passing a device through with vGPU, but I tried looking up the GRID M60-2Q profile that I'm using and found a result that said it should be 114e, so that's what I tried. Hopefully that's right.

Anyways, please let me know if there's anything else I can try out.

@BlaringIce
Author

Oh, forgot to mention that on the host side I keep getting messages that say:
[nvidia-vgpu-vfio] [[DEVICE UUID HERE]]: vGPU migration disabled
I'm not sure if that really means anything important in this situation though. I'm assuming it means migrating to another GPU, and since I don't have one, I guess that makes sense.

@ualdayan

My Tesla M40 doesn't seem to be compatible with gpumodeswitch like yours. For me it says:
Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
NOTE: Preserving straps from original image.
Command id:1000000E Command: NV_UCODE_CMD_COMMAND_VV failed
Command Status:NV_UCODE_CMD_STS_NEW
Error: NV_UCODE_ERR_CODE_CMD_VBIOS_VERIFY_BIOS_SIG_FAIL

Command id:000E Command: NV_UCODE_CMD_COMMAND_VV failed
Command Status:NV_UCODE_CMD_STS_NONE
Error: NV_UCODE_ERR_CODE_CMD_VBIOS_VERIFY_BIOS_SIG_FAIL

BCRT Error: Certificate 2.0 verification failed

ERROR: BIOS Cert 2.0 Verifications Error, Update aborted.

@DualCoder
Owner

[nvidia-vgpu-vfio] [[DEVICE UUID HERE]]: vGPU migration disabled

This is expected since Qemu/KVM does not support the migration feature of the vGPU drivers.

[ 5.764496] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:114e)
NVRM: NVIDIA 460.73.01 driver release.

Ok, this is an improvement, the driver 460.73.01 is supported, but the PCI ID 10de:114e is weird. There is no NVIDIA device with that ID. Can you provide the output of lspci -vvnn for both the host and guest?

I'm not really sure how assigning the PCI device ID works when you're normally passing a device through with vGPU, but I tried looking up the GRID M60-2Q profile that I'm using and found a result that said it should be 114e, so that's what I tried. Hopefully that right.

Are you assigning it manually? Why? If you insist on setting it yourself, it should be:

Vendor ID: 0x10DE (NVIDIA)
Device ID: 0x13F2 (Tesla M60)
Subsystem Vendor ID: 0x10DE (NVIDIA)
Subsystem Device ID: 0x114E (M60-2Q)
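For reference, these IDs correspond to QEMU's experimental vfio-pci override properties. A sketch, assuming your VM config names the device hostdev0:

```shell
# Fragment of a qemu command line overriding the IDs the guest sees:
qemu-system-x86_64 \
  ... \
  -set device.hostdev0.x-pci-vendor-id=0x10de \
  -set device.hostdev0.x-pci-device-id=0x13f2 \
  -set device.hostdev0.x-pci-sub-vendor-id=0x10de \
  -set device.hostdev0.x-pci-sub-device-id=0x114e
```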

@BlaringIce
Author

Are you assigning it manually? Why? If you insist on setting it yourself, it should be

I am assigning it manually, but only really out of ignorance of the 'normal' way you would do it. I'll change the parameters to match what you have there, at least.

Can you provide the output of lspci -vvnn for both the host and guest?

Sure, see the attached files.
hostlspcivvnn.txt
guestlspcivvnn.txt

@BlaringIce
Author

I'm wondering if there's a possibility here that the GM200 chip is just built, up and down the stack, not to support vGPU at a hardware level. Since it's marketed as incompatible on the only Tesla card that uses that silicon (the M40), and the other cards (980 Ti, Maxwell Titan X, and Quadro M6000) wouldn't be expected to have it work anyway, maybe it's just totally locked out? I'm not sure if Nvidia would really go to such lengths to design it that way, since it would deviate from their lower-tier designs. Do you know of any confirmed cases of someone getting vGPU working on a 980 Ti, Titan X, or M6000?

@DualCoder
Owner

Can you provide the output of lspci -vvnn for both the host and guest?

Sure, see the attached files.

These look correct: the device shows up as a VGA controller with a 256 MB BAR1, so it is in the correct graphics mode. And the device in the guest shows up with the correct PCI IDs.

I am assigning it manually, but only really out of ignorance of the 'normal' way you would do it.

I'm guessing that you are setting it either using the qemu command line with an argument like -set device.hostdev0.x-pci-vendor-id=NNNN or in a libvirt xml file with something like:

<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev0.x-pci-vendor-id=NNNN'/>

so the "normal" way is to not pass those arguments/xml elements (i.e. remove them).

I'm wondering if there's a possibility here that the GM200 chip is just built, up and down the stack, not to support vGPU at a hardware level. Since it's marketed as incompatible on the only Tesla card that uses that silicon (the M40), and the other cards (980 Ti, Maxwell Titan X, and Quadro M6000) wouldn't be expected to have it work anyway, maybe it's just totally locked out? I'm not sure if Nvidia would really go to such lengths to design it that way, since it would deviate from their lower-tier designs.

There is a possibility that there exists some technical limitation that prevents this from working, yes. But vGPU is a software solution and doesn't rely on the existence of some special hardware feature to function, however if the hardware is special in its design (like the GTX 970's 3.5+0.5 memory layout) it might be incompatible.

Do you know of any confirmed cases of someone getting vGPU working on a 980 Ti, Titan X, or M6K?

I do not.

I'll change the parameters to match what you have there, at least.

Now it looks correct, does the driver still complain about the device being unsupported?

@BlaringIce
Author

BlaringIce commented Jul 28, 2021

Well, specifically nvidia-smi says:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I'm not sure how to check the driver itself. I'm not sure what the service name is for the normal Linux guest drivers; a quick Google didn't really give me anything, nor did tab completion with systemctl status nvi[tab here]. Otherwise I would try to check the driver itself instead of just SMI. I tried reinstalling the driver, too, once I reset the IDs to what you'd said, and it still complained about DRM-KMS, so I did have to use the --no-drm flag to install.

@DualCoder
Owner

It should install without the --no-drm flag now that the IDs are correct. For nvidia-smi not working I would check for errors in dmesg on both host and guest, and journalctl -u nvidia-vgpu-mgr on the host. Otherwise /var/log/Xorg.0.log might give some info if X fails to start, but I don't think it will even try if you installed with --no-drm.
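Concretely, the places to look (log-reading commands only; nothing here modifies state):

```shell
# Guest and host kernel messages from the driver:
dmesg | grep -iE 'nvrm|nvidia'
# Host-side vGPU manager log for the current boot:
journalctl -u nvidia-vgpu-mgr -b --no-pager
# Guest X server errors/warnings, if X was attempted:
grep -E '\(EE\)|\(WW\)' /var/log/Xorg.0.log
```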

What error does the installer give that prevents it from installing without --no-drm?

@DualCoder
Owner

When looking around I noticed that there are two 460.73.01 drivers, you want the -grid version, check the checksum.

sha256sum NVIDIA-Linux-x86_64-460.73.01*
d10eda9780538f9c7a222aa221405f51cb31e2b7d696b2c98b751cc0fd6e037d  NVIDIA-Linux-x86_64-460.73.01-grid.run
11b1c918de26799e9ee3dc5db13d8630922b6aa602b9af3fbbd11a9a8aab1e88  NVIDIA-Linux-x86_64-460.73.01.run

I also found that Google publishes the files here https://cloud.google.com/compute/docs/gpus/grid-drivers-table

The non-grid version explicitly lists the 24GB M40 as supported, so I do not understand why it refuses to work.

@BlaringIce
Author

Well, I was able to load the GRID version of the driver without any errors during install. Having done so, nvidia-smi now gives the very uninteresting output of No devices were found.

With the GRID driver installed I get an output for dmesg | grep -i nvidia of:
[ 2.743559] nvidia: loading out-of-tree module taints kernel.
[ 2.743569] nvidia: module license 'NVIDIA' taints kernel.
[ 2.756062] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.766318] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 2.767463] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 2.768253] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.73.01 Thu Apr 1 21:40:36 UTC 2021
[ 2.813590] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.73.01 Thu Apr 1 21:32:31 UTC 2021
[ 2.817926] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 2.817930] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[ 6.833690] NVRM: nvidia_open...
[ 6.833696] NVRM: nvidia_ctl_open
[ 6.835930] NVRM: nvidia_open...
[ 6.900405] NVRM: nvidia_open...
[ 6.949965] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 6.949966] NVRM: nvidia_ctl_close
[ 6.950939] NVRM: nvidia_open...
[ 6.950942] NVRM: nvidia_ctl_open
[ 6.951226] NVRM: nvidia_open...
[ 7.006033] NVRM: nvidia_open...
[ 7.059159] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 7.059160] NVRM: nvidia_ctl_close
[ 9.538961] NVRM: nvidia_open...
[ 9.538965] NVRM: nvidia_ctl_open
[ 9.542569] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 9.542571] NVRM: nvidia_ctl_close
[ 10.190619] NVRM: nvidia_open...
[ 10.190622] NVRM: nvidia_ctl_open
[ 10.346947] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 10.346949] NVRM: nvidia_ctl_close
[ 10.817518] NVRM: nvidia_open...
[ 10.817523] NVRM: nvidia_ctl_open
[ 11.028236] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 11.028239] NVRM: nvidia_ctl_close
[ 11.327625] NVRM: nvidia_open...
[ 11.327629] NVRM: nvidia_ctl_open
[ 11.390885] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 11.390887] NVRM: nvidia_ctl_close
[ 11.932444] NVRM: nvidia_open...
[ 11.932448] NVRM: nvidia_ctl_open
[ 12.004938] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 12.004939] NVRM: nvidia_ctl_close
[ 12.394800] NVRM: nvidia_open...
[ 12.394807] NVRM: nvidia_ctl_open
[ 12.504682] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 12.504684] NVRM: nvidia_ctl_close
[ 15.071319] NVRM: nvidia_open...
[ 15.071326] NVRM: nvidia_ctl_open
[ 15.104593] NVRM: nvidia_open...
[ 15.142268] NVRM: nvidia_open...
[ 61.436038] NVRM: nvidia_open...
[ 61.436042] NVRM: nvidia_ctl_open
[ 61.436364] NVRM: nvidia_open...
[ 61.476481] NVRM: nvidia_open...
[ 61.515963] NVRM: GPU 0000:00:00.0: nvidia_close on GPU with minor number 255
[ 61.515964] NVRM: nvidia_ctl_close

@BlaringIce
Author

The same query on the host gives... this.
hostdmesg.log

@DualCoder
Owner

It looks like it tries to load now (then fails, then tries again, ...). But I can't see any error being printed, can you provide the log without the grep -i nvidia filter? Also, make sure that the nouveau driver is properly blacklisted in the guest.
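One common way to blacklist nouveau in the guest; the file name is arbitrary, and the initramfs rebuild command depends on the distro (update-initramfs on Debian/Ubuntu, dracut on the RHEL family):

```shell
# Prevent nouveau from binding to the GPU on the next boot:
sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u   # Debian/Ubuntu; RHEL-family: sudo dracut --force
sudo reboot
```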

@BlaringIce
Author

Unfortunately this will probably be my last post regarding the M40 - maybe someone else can pick this up in the future, but I've got to return mine now and I've got an M60 in that I can try out instead. Thank you for all the help with this though!

Here are the full guest and host dmesg logs in case they reveal something useful:
guestdmesgfull.log

@BlaringIce
Author

hostdmesgfull.log

@haywoodspartan

haywoodspartan commented Sep 27, 2021

I was able to get vGPU splitting working on the Tesla M40 24GB with a Proxmox host and the vgpu_unlock script. I had to use a hacky way of doing it, with a spoof on the vgpu itself. However, I am limited to only one vgpu on this card at any given time for some reason. I wonder if there is a way to make more available, since I have the VRAM to do it and I have ECC disabled on the card. For testing purposes I have this on my home server behind a load balancer. If the dev wants to mess around with it, he can, if he needs a working debug environment.

@DualCoder
Owner

If needed I had to use a hacky way of doing it with a spoof on the vgpu itself.

That's interesting, did you see the same issues as reported previously in this issue? Do you mind sharing details on this "hacky way"?

However I am limited to only doing 1 vgpu on this card at any given time for some reason. I wonder if there was a way to give it more availability since I have the VRAM to do it and I have ECC disabled on the card.

I assume this means you were able to create a single vGPU instance, assign it to a VM, and then load the drivers inside the VM to get hardware acceleration. This would be good news for the Tesla M40. In order to use multiple instances at the same time you should check the following:

  • Each instance needs to be of the same type. For example 2x M60-2Q is OK, but 1x M60-2Q + 1x M60-4Q is not.
  • Each instance's UUID needs to be unique.
  • The total VRAM should not exceed 8 GB (this is the limit for the M60, and I think the software will enforce it).

If you can provide error messages or log files that would be helpful too.

@FallingSnow

FallingSnow commented Jan 25, 2022

@BlaringIce How did you get your M40 into graphics mode? Mine won't seem to switch. I've even restarted a few times.

cl1# ./gpumodeswitch --gpumode graphics --auto

NVIDIA GPU Mode Switch Utility Version 1.23.0
Copyright (C) 2015, NVIDIA Corporation. All Rights Reserved.

Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00
Adapter: Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page


Programming UPR setting for requested mode..
License image updated successfully.

Programming ECC setting for requested mode..
The display may go *BLANK* on and off for up to 10 seconds or more during the update process depending on your display adapter and output device.

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
NOTE: Preserving straps from original image.
Clearing original firmware image...
Storing updated firmware image...
.................
Verifying update...
Update successful.

Firmware image has been updated from version 84.00.48.00.01 to 84.00.48.00.01.

A reboot is required for the update to take effect.

InfoROM image updated successfully.

cl1# ./gpumodeswitch --version                

NVIDIA GPU Mode Switch Utility Version 1.23.0
Copyright (C) 2015, NVIDIA Corporation. All Rights Reserved.

Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00
Adapter: Tesla M40            (10DE,17FD,10DE,1171) H:--:NRM S:00,B:03,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
InfoROM Version : G600.0202.02.01

Tesla M40        (10DE,17FD,10DE,1171) --:NRM 84.00.48.00.01
InfoROM Version  : G600.0202.02.01
GPU Mode         : Compute

@haywoodspartan

haywoodspartan commented Jan 25, 2022

I will have to get back to this project, as my workload of late has required me to deploy OpenStack Xena on my homelab setup for work purposes. However, OpenStack does allow mdev devices and NVIDIA vGPU virtual machines on a KVM-type system.

@FallingSnow

I figured it out. My VBIOS was out of date. lspci now shows a 256MB BAR.

$ lspci -v
03:00.0 VGA compatible controller: NVIDIA Corporation GM200GL [Tesla M40] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GM200GL [Tesla M40]
	Flags: bus master, fast devsel, latency 0, IRQ 69, IOMMU group 0
	Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 7fe0000000 (64-bit, prefetchable) [size=256M]
	Memory at 7ff0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at f000 [size=128]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

@haywoodspartan I'm using the 12GB variant. I was able to split the GPU and spoof a M6000 instance to my VM. However I'm battling the dreaded code 43 right now.
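The 256M prefetchable BAR in the lspci output above is the indicator people in this thread use for graphics mode. A rough sketch of checking it from a script, using that BAR-size heuristic (the heuristic and the helper name are assumptions from this thread, not NVIDIA documentation):

```python
import re

# Heuristic from this thread: in graphics mode the card exposes a 256M
# prefetchable BAR, while in compute mode the BAR spans gigabytes.
def bar1_size_mib(lspci_text):
    """Return the largest prefetchable BAR size in MiB, or None."""
    sizes = []
    for m in re.finditer(r", prefetchable\) \[size=(\d+)([MG])\]", lspci_text):
        n, unit = int(m.group(1)), m.group(2)
        sizes.append(n * 1024 if unit == "G" else n)
    return max(sizes) if sizes else None

sample = """\
Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
Memory at 7fe0000000 (64-bit, prefetchable) [size=256M]
Memory at 7ff0000000 (64-bit, prefetchable) [size=32M]
"""
print(bar1_size_mib(sample))  # 256
```

On a card still in compute mode you would expect a multi-gigabyte value here instead of 256.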

@FallingSnow

I ended up giving up on the M40; it kept unloading the guest driver. I put in a 1070 Ti and it worked perfectly.

@TymanLS

TymanLS commented Feb 11, 2022

In case this helps at all, I'm observing the same code 43 behavior when using a Tesla M40 12GB; however, I am using the Merged-Rust-Drivers which uses the Rust-based vGPU unlock. I'm not sure how much of this information applies specifically to this codebase, but hopefully it can provide some insight into the process of unlocking vGPU in general.

I am testing on a Proxmox 7.1 OS with the 5.11 kernel manually installed, since the kernel patches for the merged driver didn't work for the 5.13 kernel (and I was running into some other unrelated Matrox video card bugs with the 5.15 kernel). With a few tweaks here and there, I was able to get to where the text "vGPU unlock patch applied" shows up in the output of dmesg, mdevctl types showed the list of available vGPU types for the Tesla M60, and I was able to create a GRID M60-4Q vGPU instance and assign it to a VM. However, I am running into issues seemingly with the guest driver in a Windows VM, where it will either return a code 43 error or BSOD when trying to install/load the driver; if I recall correctly, the BSOD errors were pretty much always SYSTEM_SERVICE_EXCEPTION (nvlddmkm.sys).

The GRID guest driver (list of them mentioned here) gave a code 43 error when I tried it. Since the merged driver was based on the 460.73.01 Linux driver, I chose the 462.31 Windows GRID guest driver, which corresponds to the same GRID version (12.2) according to NVIDIA's website. I also tried spoofing the vGPU's PCI ID within the VM by specifying the x-pci-vendor-id and x-pci-device-id parameters in the QEMU configuration file. I spoofed a Quadro M6000 like @FallingSnow, but the normal Quadro M6000 drivers would also give a code 43. I tried multiple versions of the Quadro drivers, including multiple 46x versions, a 47x version, and the latest version; none worked, and they all either gave a code 43 or a BSOD. Additionally, I tried to spoof a GTX 980, since I thought that card would be the closest to the GRID M60-4Q vGPU I was using; the GTX 980 used a GM204 GPU like the Tesla M60, and it came standard with 4GB of VRAM. Once again I got a code 43 error when trying to use the standard GeForce drivers for the GTX 980.

Another thing to note is that I have not made any changes to my VBIOS since getting the card. I did get it off eBay, though, so I suppose anything is possible. I also did NOT attempt to set the GPU into graphics mode; my output of lspci -v shows a 16GB BAR instead of a 256MB one.

I am very interested in the configuration that @haywoodspartan described that allowed him to get vGPU working. From what I've researched so far (not much), I've only ever heard of two instances of a Tesla M40 being successfully used with vGPU: haywoodspartan's post in this issue thread, and Jeff from CraftComputing in this clip (though he also mentioned the 8GB VRAM limit). Notably, both of these instances were using the 24GB variant of the Tesla M40.

Let me know if there is any testing I can do to help assist with the project, I would absolutely love to get this Tesla M40 working in some remote gaming desktops!

@FallingSnow

@republicus You have an M40 24GB right? It's the M40 12GB that doesn't work.

@dulasau

dulasau commented Aug 20, 2022

I have the same problem with the M40 12GB. I was able to pass it through to my guest Windows 11 (Proxmox 7.2), and with the Quadro M6000 guest drivers I was able to make it work and get an OK score in the Heaven benchmark. But every time I try to use it as a vGPU I get a BSOD.
I'm going to try the 24GB version, probably next week.

@dulasau

dulasau commented Aug 20, 2022

BTW, in order to get a video output, even through Parsec or TightVNC, I have to use my GTX 950 as an additional GPU. Any workarounds for this?

@haywoodspartan

Have you set the GPU from Compute to Graphics mode? Apparently it may or may not persist after reboots, according to some people.

https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf
This would need to be done on the host and on individual virtual machines.

There is also the fact that you need to have a virtual display adapter installed in Windows. The Parsec one can work fine in most cases.

@TymanLS

TymanLS commented Aug 20, 2022

@dulasau This guide may be helpful. The person in this video seems to be using the M40 in a physical machine instead of a VM, which is why they have to install their iGPU drivers. If you're planning to only connect remotely with Parsec, you shouldn't need to install any iGPU drivers.

@dulasau

dulasau commented Aug 20, 2022

@TymanLS I'm using a Ryzen 5900X, so unfortunately no iGPU.

@dulasau

dulasau commented Aug 20, 2022

@haywoodspartan Yeah, I did switch it to Graphics mode (it persists after reboots), although I've only done this on the guest machine; I don't even load the host NVIDIA drivers, since I wasn't able to make vGPU work. Or do you mean I need to enable Graphics mode on the host to make vGPU work? (I think I tried that and it didn't help.)
Parsec provides an optional virtual display for "headless" machines, but it didn't help either.

@TymanLS

TymanLS commented Aug 20, 2022

@dulasau If you're passing the M40 straight through to a VM (not using vGPU), then I don't think the host drivers matter since the host system shouldn't be able to access the card. When you say you have to use the GTX 950, are you also passing that through to the VM or are you leaving that connected to the host system? I remember successfully setting up a Windows 10 VM with Parsec connectivity only passing through the M40 and no other GPUs, so I'm curious why it wouldn't work for you.

@dulasau

dulasau commented Aug 20, 2022

@TymanLS I'm passing my GTX 950 directly through to the VM.
It's interesting that even though I'm using spoofing (from QEMU) to claim the card is a Quadro M6000, the NVIDIA driver doesn't believe me: despite Windows Device Manager saying the GPU is an M6000, the NVIDIA driver (and games) say it's an M40. Maybe the NVIDIA driver, knowing that the M40 doesn't have video outputs, blocks it somehow (the video output for Parsec, TightVNC, etc.)?
One time the NVIDIA driver (and RTX Experience) agreed that the card is an M6000, but that was probably an older driver (I should try this again), and I didn't test Parsec at that time.
One good thing about using the GTX 950 is that I can use GeForce Experience. I haven't had luck yet with the Parsec client on a Raspberry Pi 4 :)

@republicus

@FallingSnow Yes, you're right. I have a 12GB version that the seller said was last flashed with a TITAN X VBIOS. I'll see what, if anything, the VBIOS might do and report back any lessons learned.

@dulasau I am seeing the same behavior on Linux guests. The driver seems to recognize that it is a vGPU even when spoofed.

@dulasau

dulasau commented Aug 25, 2022

Hmm.... ok the same BSOD with 24GB version, something is wrong .....

@FallingSnow

Check your dmesg.

@dulasau

dulasau commented Aug 25, 2022

Check your dmesg.

Am I looking for something specific?

@FallingSnow

Any errors really about why vgpu might be failing.

@dulasau

dulasau commented Aug 26, 2022

I don't see any errors related to vgpu

@angst911

angst911 commented Sep 3, 2022

I have a Tesla M40 working well, using both this repo and vgpu_unlock_rs, with the 510.47.03 driver on a Proxmox host.

Like many people have done, my working config passes an NVIDIA Quadro M6000 through to Windows guests. It works great. I do not experience any error 43 issues or problems with performance or drivers on Windows 10 or 11.

What brought me here was my attempt to get Linux guests to enjoy the same benefits.

After some tweaks, the only way I can get Linux working at all is to pass through a specific GRID device. An unchanged device with no PCI ID changes passes through as an M60, which would not work with any proprietary NVIDIA drivers.

After changing the PCI IDs, the Linux guest works great until the official driver goes into limp mode (triggered at 20 minutes of uptime; it lowers the clocks and sets a 15 FPS cap). I observe the same behavior with the Windows driver going into limp mode when using the unlicensed official vGPU driver for Windows.

The PCI ID that works in Linux guests:

# PCI ID for GRID M60 0B
#pci_id = 0x13F2114E
#pci_device_id = 0x13F2

It would appear, unlike the Windows drivers, that the Linux proprietary drivers for Quadro and Tesla/Compute cards do not share the same instructions for vGPU capabilities. I have tried a series of different PCI IDs and drivers with no joy.

I'd love to know what steps/process you followed. I've been beating my head against the wall for 2 days now on this project. I've got two M40s that I'm trying to use as vGPUs (this mod plus -rs). Things "look" right, but I always get error 43. I'm using the same driver version, and Proxmox 7.2.

Can you share your VM config also?

@angst911

angst911 commented Sep 3, 2022

Where did you get the patches for the kernel versions?

@dulasau

dulasau commented Sep 4, 2022

I'd love to know what steps/process you followed. I've been beating my head against the wall for 2 days now on this project. I've got two M40's that I'm trying to use as as vGPU (this mod plus -RS). Thinks "look" right, but I always get Error 43. I'm using the same driver version, and Proxmox 7.2

Can you share you VM config also?

+1

@angst911

angst911 commented Sep 4, 2022

Make sure Secure Boot is disabled in the UEFI BIOS.
The story that got me to this....

I originally followed this guide https://wvthoog.nl/proxmox-7-vgpu-v2/ using the pre-patched driver. Everything worked except error 43.
Then I swapped over to using the video guide from Craft Computing (https://www.youtube.com/watch?v=jTXPMcBqoi8&t=1626s).

I had all sorts of fun manually patching the 510 driver set for the 5.15 kernel, which maybe I didn't need to do...

I just about gave up and decided to do a Debian VM. I disabled the custom profiles (by renaming the toml file at /etc/vgpu_profiles), stopped spoofing a Quadro M6000, and installed the GRID driver in Debian, which got me errors about not being able to load the DRM module, which led me to disabling Secure Boot... I did the same in Windows (after having to expand my partition)... and magic: working with the GRID driver. I turned my custom profiles back on, uninstalled the GRID driver, and reinstalled the Quadro desktop drivers... now I'm at error 31. So, progress?

@angst911

angst911 commented Sep 4, 2022

OK, now back to error 43 with the Quadro drivers, but this is still progress. I was getting error 43 with the GRID drivers previously also.

@republicus

@dulasau @angst911

I just want to point out again that I have the 24GB version of the Tesla M40. Earlier others indicated the problem may be related to the 12GB version only.

I can give more details if this isn't enough to get you going. Let me know how it goes.

  • First, I installed the vgpu_unlock script onto my Proxmox host.
  • Second, I like how vgpu_unlock-rs complements this repo, so I set up vgpu_unlock-rs on the Proxmox host as well.

Beyond that there are very few specific configurations needed for the VM.

Configuration changes to vm config:
Add line args: -uuid 00000000-0000-0000-0000-000000000XXX where XXX = VMID
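The zero-padded -uuid value above can be generated from the VMID. A quick sketch (the helper name is mine, not part of vgpu_unlock):

```python
# Hypothetical helper: build the -uuid args value from a Proxmox VMID
# using the zero-padded convention shown above.
def vgpu_uuid(vmid):
    return "00000000-0000-0000-0000-%012d" % vmid

print(vgpu_uuid(104))  # 00000000-0000-0000-0000-000000000104
```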

Add your hardware to the VM in the GUI. I used mdev type nvidia-12, or use whichever type mdevctl types reports as having available instances.

I then made changes to the mdev type by creating/editing /etc/vgpu_unlock/profile_override.toml:

[profile.nvidia-12]
num_displays = 1
display_width = 3840
display_height = 2160
max_pixels = 8294400
cuda_enabled = 1
frl_enabled = 144
framebuffer = 5905580032
pci_id = 0x17F011A0
pci_device_id = 0x17F0

This was enough to get my Tesla M40 vgpu profile working in Windows 10/11.
The device is spoofed as a Quadro M6000, and I increased most of the mdev profile limits to test its capabilities (I currently game in 4K daily with this working profile).
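As a quick sanity check, the framebuffer values in these overrides are raw byte counts; converting them shows what each one allocates (the byte values are taken from overrides posted in this thread):

```python
# framebuffer values from the profile overrides in this thread are
# plain byte counts; convert them to GiB for readability.
GIB = 1024 ** 3

for fb in (5905580032, 11811160064):
    print(fb, "bytes =", fb / GIB, "GiB")
# 5905580032 is exactly 5.5 GiB; 11811160064 is exactly 11 GiB
```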

image

@angst911

angst911 commented Sep 4, 2022

> I just want to point out again that I have the 24GB version of the Tesla M40. [...] This was enough to get my Tesla M40 vgpu profile working in Windows 10/11.

@republicus What version of Proxmox, kernel, and NVIDIA driver are you on (both host and guest)? Note: I can see 512.78 in the screenshot for the guest. Can you provide a link to that download? I wasn't able to find it on NVIDIA's site.

Which VM machine type and BIOS/UEFI did you use?

Did you 100% follow the vgpu_unlock instructions, or did you follow the modified instructions for using it with vgpu_unlock-rs?

I'm at the point where the GRID driver works, but I get error 43 if I use the Quadro driver and spoof the device ID.
Proxmox 7.2, kernel 5.15
vgpu_unlock + vgpu_unlock-rs (driver patched to include the src and kbuild config line prior to running the NVIDIA installer)
Host driver: NVIDIA-Linux-x86_64-510.47.03-vgpu-kvm.run, manually integrating the kernel-related driver patches
Guest driver: 511.65_grid_win10_win11_server2016_server2019_server2022_64bit_international

Working GRID profile_override.toml:

[profile.nvidia-18]
num_displays = 1
display_width = 1920
display_height = 1080
max_pixels = 2073600
cuda_enabled = 1
frl_enabled = 60
framebuffer = 5905580032

and the profile that doesn't work when spoofing an M6000:

[profile.nvidia-18]
num_displays = 1
display_width = 1920
display_height = 1080
max_pixels = 2073600
cuda_enabled = 1
frl_enabled = 60
framebuffer = 5905580032
pci_id = 0x17F011A0
pci_device_id = 0x17F0

@dulasau

dulasau commented Sep 4, 2022

I have both the 12GB and 24GB versions, and the problem seems to be consistent across both of them.

@republicus

republicus commented Sep 5, 2022

I first installed and had it working on my PVE 7.1 node, but I recently had a failure of my boot drive. I swapped in my backup drive, which is currently running PVE 6.4 (kernel Linux 5.4.195-1-pve).

I'll work on updating the node back to PVE 7.2+

Host grid driver: 510.47.03

You can DM me on Discord if you wish: Republicus#2744

@angst911 The NVIDIA Advanced Driver Search seems to be less "advanced" than the ordinary search - I'm seeing only old drivers listed (latest 473.81) using it.

Here is a direct link to that driver: NVIDIA RTX / QUADRO DESKTOP AND NOTEBOOK DRIVER RELEASE 510

@dulasau

dulasau commented Sep 6, 2022

It's working!!!!!
Although not 100% sure exactly why :-D

I see hours of testing ahead, but here is what I have so far:

  1. It works on my "new" server (two E5-2698 v3 and a Supermicro X10DRi-T4+)
  2. It didn't work (one of the things I'll test later) on my "old" "server" (Ryzen 5900X + ASRock X570D4U)
  3. Tesla M40 24GB. Going to try the 12GB version tonight.
  4. Host OS: Proxmox 7.2-7 (kernel 5.15.39)
  5. Host driver: 510.85.03
  6. Guest OS: two VMs with Win11
  7. Guest driver: 512.78 (from the post above). I was getting code 43 with 471.41.

I was following this setup/config instruction https://gitlab.com/polloloco/vgpu-proxmox and profile config override from here https://drive.google.com/drive/folders/1KHf-vxzUCGqsWZWOW0bXCvMhXh5EJxQl (Jeff from Craft Computing).

@dulasau

dulasau commented Sep 6, 2022

Just in case here is profile override:

[profile.nvidia-18]
num_displays = 1
display_width = 1920
display_height = 1080
max_pixels = 2073600
cuda_enabled = 1
frl_enabled = 60
framebuffer = 11811160064
pci_id = 0x17F011A0
pci_device_id = 0x17F0


VM config:

args: -uuid 00000000-0000-0000-0000-000000000104
balloon: 0
bios: ovmf
boot: order=ide0;ide2;net0
cores: 8
cpu: host
efidisk0: local-lvm:vm-104-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:81:00.0,mdev=nvidia-18,pcie=1
ide0: local-lvm:vm-104-disk-1,size=64G
ide2: NetworkBackup:iso/Win11_English_x64v1.iso,media=cdrom,size=5434622K
machine: pc-q35-7.0
memory: 12288
meta: creation-qemu=7.0.0,ctime=1662489026
name: Win11-3
net0: e1000=16:AB:A7:2D:FB:4B,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=b560b92f-f856-487e-bb00-a2e495665b59
sockets: 1
tpmstate0: local-lvm:vm-104-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 1fa5368d-a7d0-403b-ac65-e033af2de62a

@republicus

That's great! Hope to hear good news about the Tesla M40 12GB.

@dulasau

dulasau commented Sep 6, 2022

The Tesla M40 12GB works as well. I changed the profile override to ~6GB and was able to start two VMs.
Screenshot from 2022-09-06 16-46-47

@dulasau

dulasau commented Sep 7, 2022

Alright, I tested the Tesla M40 12GB on my Ryzen-based "server" and now it's working!
The only changes from my previous unsuccessful attempts are a fresh install of Proxmox on it (the same 7.2 version; I was rebuilding my homelab) and probably the guest NVIDIA driver, 512.78 (I don't remember which driver version I was using before).

10 participants