[BUG] Failed to run mig.sh on MIG dataproc-2.1-ubuntu20 #11675
Comments
Thanks for the investigation! @sameerz, this is the reason why mig-on-dataproc-2.1-ubuntu20 has been failing to initialize recently.
Hello @yinqingh, I think you're using a different version of /gpu/mig.sh. I'll inform the repository maintainers about this inconsistency. Edit: Created issue GoogleCloudDataproc/initialization-actions#1259
Hi @SurajAralihalli, I tried with spark-rapids/mig.sh, but it still failed while installing the NVIDIA driver (535.104.05) with the same error. The Dataproc runtime version is "2.1.73-ubuntu20".
More context at: GoogleCloudDataproc/initialization-actions#1259
I've drafted GoogleCloudDataproc/initialization-actions#1269 to fix the MIG scripts. I can see MIG is enabled by querying
Hello @yinqingh - I've been working on the mig script recently. Here's some news:
So this is to ask: do you need to use A100 GPUs? If I provided documentation on how to use H100 GPUs and got the MIG script working with those instead, would that meet your needs? If you want a preview of the new script, I have a PR open here: It's a complete rewrite. With this revision I am generating the script from templates, so that the code shared between the action scripts in the initialization-actions repository stays in sync. You can fetch the current state of the script, which I've recently exercised here: Please keep in touch. I would like to have more contact with the users of these scripts.
I'm about to run some tests to exercise the MIG script on H100s.
Okay, A100 it is. Which image are you using? 2.2-rocky9? Either of the rocky8 images? I think it will be easiest on 2.1-rocky8 or 2.0-rocky8 but very briefly before upgrading to 2.2-rocky9 please. I will use the script in that PR as a guide, then. Thank you!
We use |
Okay, thank you. I want to talk to NVIDIA to find out how to reset the A100 without a reboot. None of the other init actions reboot, and, last I checked, there is not really support for the process. Secondary nodes, especially, may never come online if the init action reboots them: state may be lost, the init action may start from the beginning every time a reboot is initiated, and the customer is billed regardless. I will put the reboot logic back in for this release to meet your use case, conditional on the mig script finding an A100 in /proc/bus/pci. This will put the system into an indeterminate state, which will make the product more difficult to support. Nota bene: requiring a reboot may eliminate support for secondary worker autoscaling.
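For illustration, the "conditional on finding an A100 in /proc/bus/pci" check could look roughly like the sketch below. This is not the code from the mig script or from any PR; the function name `has_a100` and the list of PCI device IDs are assumptions made for this example, and the IDs should be verified against the PCI ID database before anyone relies on them.

```shell
# Assumed A100 PCI IDs (vendor 10de + device ID), as they appear in field 2
# of /proc/bus/pci/devices. Illustrative only; verify before use.
A100_IDS="10de20b0 10de20b2 10de20b5 10de20f1"

# has_a100 [devices_file]
# Hypothetical helper: returns 0 if the devices file lists an assumed A100 ID.
has_a100() {
  local devices_file="${1:-/proc/bus/pci/devices}" id
  [ -r "${devices_file}" ] || return 1
  for id in ${A100_IDS}; do
    # Field 2 of each line is the 8-hex-digit vendor+device ID.
    if awk -v id="${id}" '$2 == id { found = 1 } END { exit !found }' \
        "${devices_file}"; then
      return 0
    fi
  done
  return 1
}
```

A caller would then gate the reboot path on it, e.g. `if has_a100; then echo "A100 found, reboot path applies"; fi`, so that H100 hosts never reach the reboot logic.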
Suraj, do you know if there is any way to reset the A100 successfully without a reboot? I see that nvidia-smi has a mechanism to list processes known to be utilizing the kernel driver. And repeatedly
The documentation on the Internet says "have you tried rebooting," like the front line of an IT support desk. The answer is that the system I'm working with cannot be relied on to keep state across a reboot: the hypervisor system on which the guest executes may change, and the guest would then have a new device attached. If there is any reliable way to reset the hardware without a reboot (power cycling the PCIe card, or whatever else is necessary), I would rather perform those steps than rely on a reboot, which extends the already lengthy time necessary to complete the initialization of the host.
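One way to frame the question is as a "try a reset, then check" step. The sketch below is only an assumption about how that might be scripted, not something confirmed to work on the A100 in this environment: it compares the current and pending MIG modes reported by `nvidia-smi --query-gpu=mig.mode.current,mig.mode.pending`, attempts `nvidia-smi --gpu-reset`, and reports whether the pending mode took effect. The function names and the `NVIDIA_SMI` override are hypothetical.

```shell
# Overridable so the logic can be exercised on a host without a GPU.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Prints "current,pending" MIG mode for GPU index $1, e.g. "Disabled,Enabled".
mig_modes() {
  "${NVIDIA_SMI}" -i "$1" --query-gpu=mig.mode.current,mig.mode.pending \
    --format=csv,noheader | tr -d ' '
}

# Returns 0 if the current MIG mode matches the pending one after at most one
# reset attempt, i.e. no reboot is needed; returns non-zero otherwise.
try_reset_without_reboot() {
  local gpu="$1" modes
  modes="$(mig_modes "${gpu}")"
  if [ "${modes%,*}" = "${modes#*,}" ]; then
    return 0  # nothing pending
  fi
  # The reset can fail while any process still holds the device.
  "${NVIDIA_SMI}" --gpu-reset -i "${gpu}" >/dev/null 2>&1 || true
  modes="$(mig_modes "${gpu}")"
  [ "${modes%,*}" = "${modes#*,}" ]
}
```

If `try_reset_without_reboot 0` fails, the script would still have to fall back to the reboot path, so this only shortens the common case rather than removing the reboot logic.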
Describe the bug
Observed the following error while running mig.sh on dataproc-2.1-ubuntu20 with runtime version "2.1.72-ubuntu20" and kernel version "5.15.0-1067-gcp".
I also tried some older Dataproc runtime versions. It works with runtime version "2.1.40-ubuntu20" and kernel version "5.15.0-1049-gcp".
Steps/Code to reproduce bug
Expected behavior
mig.sh runs successfully.
Environment details (please complete the following information)
2.1.72-ubuntu20