Tesla M40 Problems & Memory Allocation Limit with Tesla M40 24GB -> Tesla M60 remapping #62
Comments
Code 43 is sort of Nvidia's catch-all error; it doesn't really provide any useful information. I think you have two options:
It seems that the M60 is quite special. If you compare the specs here: |
I'm having issues with an M40 too. Dmesg wasn't returning anything, but eventually I figured out I needed to go into hooks.c and turn on logging. Oddly, though, I still don't see any of the syslog stuff from the main script file anywhere in the logs, but now I do at least see "vGPU unlock patch applied. Remap called." I also saw 'nvidia-vgpu-mgr[4819]: op_type: 0xa0810115 failed.' Still error 43 in Windows with 443.18 drivers; in Linux it says 'probe of 0000:01:00.0 failed with error -1'. I also just tried passing it right through without modifying IDs, then installing the drivers that were bundled with the Linux vGPU drivers; it recognized it as an Nvidia GRID M60-2Q, but still failed with code 43. |
Well, I've made the decision to go ahead and return the card while I'm still inside the return window. I'll likely still have the card for a day or two if there's anything specific I can try. As for what I found since the initial post: I made a Linux guest, which I'm admittedly not as familiar with running Nvidia drivers on, as I've only used Linux with Nvidia-accelerated graphics on an older machine with a GTX 650. |
When you start a guest does nvidia-smi report anything under processes for you? For me on the M40 it always returns 'No running processes found'. |
Have you tried enabling verbose logging using
It is supposed to list a
These |
I enabled verbose logging and here are the log entries: This on repeat: And then this: This is with passing through the devid of a Quadro M6000, and using drivers that claim to support the Quadro M6000. Random thought: if you pass the M40 directly through to a virtual machine, at first it doesn't work because it's in some kind of compute-only mode, but after you change the driver mode (nvidia-smi -g 0 -dm 0) it starts to function more like a regular GPU. It doesn't seem to be a persistent thing - e.g. it's saved somewhere in the registry of the Windows VM rather than somewhere on the card itself. In Linux nvidia-smi tells you the mode can't be changed. What if it's stuck in some kind of compute mode in Linux (but not in Windows), and that's why it isn't enabling vGPU, since compute mode has to be off for the other Tesla cards before vGPU can be enabled? |
Two problems here:
So please try again without any PCI spoofing tricks in the qemu configuration and use an officially supported driver version.
There might be something to this, Nvidia provides the |
Well, I wasn't able to use nvidia-smi to tell, but I did try the gpumodeswitch tools. Sure enough, the card is in compute mode:
From there I was able to use Unfortunately, after this I reinstalled the driver but I'm still getting a code 43 in Windows and in Linux I'm still having trouble even installing the driver. I did finally realize that I need to blacklist nouveau but I'm still getting errors. Just running the installer normally, I get an error about the DRM-KMS module not being built correctly. I'm not sure if excluding that from the compilation would be a problem, but I gave it a shot with the I have a couple ideas on some more things I can try. I'll report back if I have any more positive changes. |
Well, I've tried my couple of other ideas. Unfortunately, I had no luck with any of them either. I upgraded to host driver version 460.73.02 and tried the Windows guest from there with 462.31, with no luck. Moved on to Linux from there. I did finally get the driver to install, technically (version 460.73.01 this time), but I did still need to use the
[ 5.737475] nvidia: loading out-of-tree module taints kernel.
I'm not really sure how assigning the PCI device ID works when you're normally passing a device through with vGPU, but I tried looking up the GRID M60-2Q profile that I'm using and found a result that said it should be 114e, so that's what I tried. Hopefully that's right. Anyway, please let me know if there's anything else I can try out. |
Oh, forgot to mention that on the host side I keep getting messages that say: |
My Tesla M40 doesn't seem to be compatible with gpumodeswitch like yours. For me it says:
Command id:000E Command: NV_UCODE_CMD_COMMAND_VV failed
BCRT Error: Certificate 2.0 verification failed
ERROR: BIOS Cert 2.0 Verifications Error, Update aborted. |
This is expected since Qemu/KVM does not support the migration feature of the vGPU drivers.
Ok, this is an improvement, the driver 460.73.01 is supported, but the PCI ID 10de:114e is weird. There is no NVIDIA device with that ID. Can you provide the output of
Are you assigning it manually? Why? If you insist on setting it yourself, it should be:
|
I am assigning it manually, but only really out of ignorance of the 'normal' way you would do it. I'll change the parameters to match what you have there, at least.
Sure, see the attached files. |
I'm wondering if there's a possibility here that the GM200 chipset is just built, up and down the stack, not to support vGPU at a hardware level. Since it's marketed as incompatible on the only Tesla card that uses that silicon (the M40), and the other cards (980 Ti, Maxwell Titan X, and Quadro M6000) wouldn't be expected to have it work anyway, maybe it's just totally locked out? I'm not sure Nvidia would really go to such lengths to design it that way, since it would deviate from their lower-tier designs. Do you know of any confirmed cases of someone getting vGPU working on a 980 Ti, Titan X, or M6000? |
These look correct; the device shows up as a VGA controller with a 256 MB BAR1, so it is in the correct graphics mode. And the device in the guest shows up with the correct PCI IDs.
I'm guessing that you are setting it either using the qemu command line with an argument like
so the "normal" way is to not pass those arguments/xml elements (i.e remove them).
There is a possibility that there exists some technical limitation that prevents this from working, yes. But vGPU is a software solution and doesn't rely on the existence of some special hardware feature to function. However, if the hardware is special in its design (like the GTX 970's 3.5+0.5 GB memory layout) it might be incompatible.
I do not.
Now it looks correct, does the driver still complain about the device being unsupported? |
Well, specifically nvidia-smi says: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." I'm not sure how to check the driver itself. I'm not sure what the service name is for the normal Linux guest drivers; a quick Google didn't really give me anything, nor did tab completion with systemctl status nvi[tab here]. Otherwise I would try to check the driver itself instead of just SMI. I tried reinstalling the driver, too, once I reset the IDs to what you'd said, and it still complained about DRM-KMS, so I did have to use the |
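As a generic way to check the kernel side of the guest driver independently of nvidia-smi (standard commands, nothing specific to this project):

```bash
# Inside the Linux guest: check whether the NVIDIA kernel modules actually loaded.
lsmod | grep -i nvidia            # expect nvidia, nvidia_uvm, nvidia_drm, ...
dmesg | grep -iE 'nvidia|nvrm'    # driver init messages and RM error codes
# If the persistence daemon is installed, its unit can be checked as well:
systemctl status nvidia-persistenced
```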
It should install without the
What error does the installer give that prevents it from installing without |
When looking around I noticed that there are two 460.73.01 drivers; you want the -grid version. Check the checksum.
I also found that Google publishes the files here: https://cloud.google.com/compute/docs/gpus/grid-drivers-table
The non-grid version explicitly lists the 24GB M40 as supported, so I do not understand why it refuses to work. |
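A quick way to confirm which of the two 460.73.01 packages you have is to compare checksums against whatever your download source publishes; file names below are examples only.

```bash
# File names are examples only -- compare the sums against the ones your
# download source publishes to tell the GRID and non-GRID builds apart.
sha256sum NVIDIA-Linux-x86_64-460.73.01-grid.run
sha256sum NVIDIA-Linux-x86_64-460.73.01.run
```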
Well, I was able to load the GRID version of the driver without any errors during install. Having done so, nvidia-smi now gives the very uninteresting output of
With the GRID driver installed I get an output for |
The same query on the host gives... this. |
It looks like it tries to load now (then fails, then tries again, ...). But I can't see any error being printed; can you provide the log without the |
Unfortunately this will probably be my last post regarding the M40 - maybe someone else can pick this up in the future, but I've got to return mine now and I've got an M60 in that I can try out instead. Thank you for all the help with this though! Here are the full guest and host dmesg logs in case they reveal something useful: |
I was able to get the vGPU to split with the Tesla M40 24GB with a Proxmox host and the vgpu_unlock script. I did have to use a hacky way of doing it, with a spoof on the vGPU itself. However, I am limited to only doing one vGPU on this card at any given time for some reason. I wonder if there is a way to give it more availability, since I have the VRAM to do it and I have ECC disabled on the card. For testing purposes I have this on my home server behind a load balancer. If the dev wants to mess around with it he can, if he needs a working debug environment. |
That's interesting, did you see the same issues as reported previously in this issue? Do you mind sharing details on this "hacky way"?
I assume this means you were able to create a single vGPU instance, assign it to a VM, and then load the drivers inside the VM to get hardware acceleration. This would be good news for the Tesla M40. In order to use multiple instances at the same time you should check the following:
If you can provide error messages or log files that would be helpful too. |
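As a general pointer for the "only one vGPU at a time" symptom: the mdev sysfs tree on the host shows how many instances of each type the driver will still allow. A sketch (the PCI address and type name are placeholders):

```bash
# On the host: see which vGPU types exist and how many more instances each allows.
# 0000:01:00.0 and nvidia-18 are placeholders -- substitute your card and type.
mdevctl types
cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-18/available_instances
cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-18/description
# NVIDIA's description string includes the framebuffer size and max_instance for the type.
```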
@BlaringIce How did you get your M40 into graphics mode? Mine won't seem to switch. I've even restarted a few times.
|
I will have to get back to this project, as my workload of late has required me to deploy OpenStack Xena on my homelab setup for work purposes. However, OpenStack does allow for mdev devices and NVIDIA vGPU virtual machines on a KVM-type system.
I figured it out. My VBIOS was out of date. lspci now shows a 256MB BAR.
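For anyone checking the same thing, the BAR sizes show up in verbose lspci output (the PCI address below is a placeholder):

```bash
# Placeholder PCI address -- substitute your card's.
lspci -v -s 01:00.0 | grep -i 'memory at'
# Earlier in this thread a 256M prefetchable BAR1 is reported as the
# graphics-mode signature; in compute mode the BAR layout differs.
```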
@haywoodspartan I'm using the 12GB variant. I was able to split the GPU and spoof an M6000 instance to my VM. However, I'm battling the dreaded code 43 right now. |
I ended up giving up on the M40; it kept unloading the guest driver. I put in a 1070 Ti and it worked perfectly. |
In case this helps at all, I'm observing the same code 43 behavior when using a Tesla M40 12GB; however, I am using the Merged-Rust-Drivers, which use the Rust-based vGPU unlock. I'm not sure how much of this information applies specifically to this codebase, but hopefully it can provide some insight into the process of unlocking vGPU in general.

I am testing on a Proxmox 7.1 OS with the 5.11 kernel manually installed, since the kernel patches for the merged driver didn't work for the 5.13 kernel (and I was running into some other unrelated Matrox video card bugs with the 5.15 kernel). With a few tweaks here and there, I was able to get to where the text "vGPU unlock patch applied" shows up in the output of

The GRID guest driver (list of them mentioned here) gave a code 43 error when I tried it. Since the merged driver was based on the 460.73.01 Linux driver, I chose the 462.31 Windows GRID guest driver, which corresponds to the same GRID version (12.2) according to NVIDIA's website. I also tried spoofing the vGPU's PCI ID within the VM by specifying the

Another thing to note is that I have not made any changes to my VBIOS since getting the card. I did get it off eBay though, so I suppose anything is possible. I also did NOT attempt to set the GPU into graphics mode; my output of

I am very interested in the configuration that @haywoodspartan described that allowed him to get vGPU working. From what I've researched so far (not much), I've only ever heard of two instances of a Tesla M40 being successfully used with vGPU: haywoodspartan's post in this issue thread, and Jeff from CraftComputing in this clip (though he also mentioned the 8GB VRAM limit). Notably, both of these instances were using the 24GB variant of the Tesla M40. Let me know if there is any testing I can do to help with the project; I would absolutely love to get this Tesla M40 working in some remote gaming desktops! |
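As an aside on where to look for that "vGPU unlock patch applied" text: on the host it normally lands in the journal of the NVIDIA vGPU services (unit names as shipped by the vGPU host driver packages; adjust if yours differ). A minimal sketch:

```bash
# On the Proxmox host: look for the unlock hook messages and vGPU manager errors.
# Unit names are the ones shipped with the NVIDIA vGPU host driver; adjust if needed.
journalctl -b -u nvidia-vgpud.service -u nvidia-vgpu-mgr.service | \
    grep -iE 'vgpu unlock|patch applied|error|failed'
dmesg | grep -i vgpu
```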
@republicus You have an M40 24GB right? It's the M40 12GB that doesn't work. |
I have the same problem with the M40 12GB. I was able to pass it through to my guest Windows 11 (Proxmox 7.2), and with Quadro M6000 guest drivers I was able to make it work and get an OK score in the Heaven benchmark. But every time I try to use it as a vGPU I'm getting a BSOD. |
BTW, in order to get video output even through Parsec or TightVNC I have to use my GTX 950 as an additional GPU. Any workarounds for this? |
Have you set the GPU from compute to graphics mode? Apparently it may or may not persist after reboots, according to some people. https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf There is also the fact that you need to have a virtual display adapter installed in Windows. The Parsec one can work fine in most cases. |
@dulasau This guide may be helpful. The person in this video seems to be using the M40 in a physical machine instead of a VM, which is why they have to install their iGPU drivers. If you're planning to only connect remotely with Parsec, you shouldn't need to install any iGPU drivers. |
@TymanLS I'm using Ryzen 5900x so unfortunately no iGPU |
@haywoodspartan Yeah, I did switch it to graphics mode (it persists after reboots), although I've only done this on the guest machine; I don't even load the host Nvidia drivers since I wasn't able to make vGPU work. Or do you mean I need to enable graphics mode on the host to make vGPU work? (I think I tried that and it didn't help.) |
@dulasau If you're passing the M40 straight through to a VM (not using vGPU), then I don't think the host drivers matter since the host system shouldn't be able to access the card. When you say you have to use the GTX 950, are you also passing that through to the VM or are you leaving that connected to the host system? I remember successfully setting up a Windows 10 VM with Parsec connectivity only passing through the M40 and no other GPUs, so I'm curious why it wouldn't work for you. |
@TymanLS I'm passing my GTX 950 directly through to the VM. |
@FallingSnow Yes, you're right. I have a 12 GB version that the seller said was last flashed with a TITAN X VBIOS. I'll see what, if anything, the VBIOS might do and report back any lessons learned. @dulasau I am seeing the same behavior on Linux guests. The driver seems to recognize that it is a vGPU even when spoofed. |
Hmm... OK, the same BSOD with the 24GB version; something is wrong... |
Check your |
Am I looking for something specific? |
Any errors really about why vgpu might be failing. |
I don't see any errors related to vgpu |
I'd love to know what steps/process you followed. I've been beating my head against the wall for two days now on this project. I've got two M40s that I'm trying to use as vGPUs (this mod plus -RS). Things "look" right, but I always get Error 43. I'm using the same driver version and Proxmox 7.2. Can you share your VM config also? |
Where did you get the patches for the kernel versions? |
+1 |
Make sure Secure Boot is disabled in the UEFI BIOS. I originally followed this guide https://wvthoog.nl/proxmox-7-vgpu-v2/ using the pre-patched
Everything worked except Error 43. I had all sorts of fun manually patching the 510 driver set for the 5.15 kernel, which maybe I didn't need to. I just about gave up and decided to do a Debian VM: I disabled the custom profiles (by renaming the toml file at /etc/vgpu_profiles), stopped spoofing to a Quadro M6000, and installed the GRID driver in Debian, which got me errors about not being able to load the DRM module, which led me to disabling Secure Boot. I did the same in Windows (after having to expand my partition)... and magic, it's working with the GRID driver. Turned my custom profiles back on, uninstalled the GRID driver, reinstalled the Quadro desktop drivers... now I'm at Error 31. So, progress? |
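Since Secure Boot turned out to be the blocker here, it's worth confirming its state before digging into DRM module errors; a quick check (standard tools, nothing specific to vGPU):

```bash
# Inside a Linux guest: confirm Secure Boot is actually off before blaming the driver.
mokutil --sb-state      # should report "SecureBoot disabled"
# In a Windows guest: run msinfo32 and check that "Secure Boot State" reads "Off".
```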
OK, now back to Error 43 with the Quadro drivers, but this is still progress. I was getting Error 43 with the GRID drivers previously as well.
I just want to point out again that I have the 24GB version of the Tesla M40. Earlier others indicated the problem may be related to the 12GB version only. I can give more details if this isn't enough to get you going. Let me know how it goes.
Beyond that there are very few specific configurations needed for the VM. Configuration changes to the VM config: add your hardware to the VM in the GUI. I used MDev type nvidia-12, or whichever you wish as reported by
I then made changes to the MDev type by creating/editing
This was enough to get my Tesla M40 vgpu profile working in Windows 10/11. |
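For readers not familiar with where the GUI change lands: in Proxmox the mdev assignment ends up as a hostpci entry in the VM's config file. A minimal illustrative excerpt (PCI address, mdev type, and UUID are placeholders, not the exact values used above):

```
# /etc/pve/qemu-server/<vmid>.conf -- illustrative excerpt only; PCI address,
# mdev type and UUID are placeholders.
hostpci0: 0000:01:00.0,mdev=nvidia-12
args: -uuid 00000000-0000-0000-0000-000000000100
cpu: host,hidden=1
```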
@republicus What version of Proxmox, kernel, and NVIDIA driver are you on (both host and guest)? -- Note: I can see 512.78 in the screenshot for the guest -- Can you provide a link to that download? I wasn't able to find it on NVIDIA's site. Which machine type and BIOS/UEFI did you use? Did you 100% follow the vgpu_unlock instructions, or did you follow the modified instructions for using it with vgpu_unlock? I'm at the point where the GRID driver works, but I get Error 43 if I use the Quadro driver and spoof the device ID. Working GRID vgpu_profile.toml and the profile that doesn't work when spoofing to an M6000: |
I have both the 12GB and 24GB versions and the problems seem to be consistent across both of them. |
I first installed and had it working on my PVE 7.1 node but had a failure with my boot drive recently. I swapped in my backup drive, which is currently running PVE 6.4, kernel version Linux 5.4.195-1-pve. I'll work on updating the node back to PVE 7.2+. Host GRID driver: 510.47.03. You can DM me on Discord if you wish: ShowRepublicus#2744.
@angst911 The NVIDIA Advanced Driver Search seems to be less "advanced" than the ordinary search; I'm seeing only old drivers listed (latest 473.81) using it. Here is a direct link to that driver: NVIDIA RTX / QUADRO DESKTOP AND NOTEBOOK DRIVER RELEASE 510 |
It's working!!!!! I see hours of testing ahead, but here is what I have so far:
I was following the setup/config instructions at https://gitlab.com/polloloco/vgpu-proxmox and the profile config override from here https://drive.google.com/drive/folders/1KHf-vxzUCGqsWZWOW0bXCvMhXh5EJxQl (Jeff from Craft Computing). |
Just in case, here is the profile override:
[profile.nvidia-18]
VM config:
args: -uuid 00000000-0000-0000-0000-000000000104 |
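For context, the override file referenced here comes from vgpu_unlock-rs (the polloloco guide linked above) and usually lives at /etc/vgpu_unlock/profile_override.toml. The field names below are from that project, but the values are purely illustrative and not the exact settings used above:

```toml
# Illustrative vgpu_unlock-rs override -- field names from that project,
# values are examples rather than the poster's actual config.
[profile.nvidia-18]
num_displays   = 1
display_width  = 1920
display_height = 1080
max_pixels     = 2073600    # 1920 * 1080
cuda_enabled   = 1
frl_enabled    = 0          # disable the frame rate limiter
framebuffer    = 0x74000000 # per-instance VRAM in bytes (driver reserves some on top)
```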
That's great! Hope to hear good news about the Tesla M40 12GB.
Alrighty, I tested the Tesla M40 12GB on my Ryzen-based "server" and now it's working!
First and primary:
I'm coming from a setup where I was using a GTX 1060 with vgpu_unlock just fine, but figured I'd step it up so that I could support more VMs. So, I'm currently trying to use a Tesla M40. Being a Tesla card, you might expect not to need vgpu_unlock, but this is one of the few Teslas that doesn't support vGPU natively. So, I'm trying to use nvidia-18 types from the M60 profiles with my VMs. I'm aware that I should be using a slightly older driver to match my host driver. However, I'm still getting a code 43 when I load my guest. I would provide some logs here, but I'm not sure what I can include, since the entries for the two vgpu services both seem to be fine with no errors other than
nvidia-vgpu-mgr[2588]: notice: vmiop_log: display_init inst: 0 successful
at the end of trying to initialize the mdev device when the VM starts up. Please let me know any other information that I can provide to help debug/troubleshoot.

Second:
This is probably one of the few instances where this is a problem, since most GeForce/Quadro cards have less memory than their vGPU-capable counterparts. However, I have a Tesla M40 GPU that has 24 GB of VRAM (in two separate memory regions, I would guess, although this SKU isn't listed on the Nvidia graphics processing units Wikipedia page, so I'm not 100% sure). This is in comparison to the Tesla M60's 2x8GB configuration, of which only 8GB is available for allocation in vGPU.
I'm not sure whether the max_instance quantity, as seen in mdevctl types, is defined on the Nvidia driver side, on the vgpu_unlock side, or if it's a mix and the vgpu_unlock side might be able to do something about it.
What I'm asking here, though, is whether this value can be redefined so that I can utilize all 24 GB of my available vRAM or, if not that, then at least the 12 GB that I presume is available in the GPU's primary memory.