[BUG] Failed to run mig.sh on MIG dataproc-2.1-ubuntu20 #11675

Open
yinqingh opened this issue Oct 30, 2024 · 12 comments

@yinqingh
Collaborator

yinqingh commented Oct 30, 2024

Describe the bug
Observed the following error while running mig.sh on dataproc-2.1-ubuntu20 with runtime version "2.1.72-ubuntu20" and kernel version "5.15.0-1067-gcp".

 make -f ./scripts/Makefile.modpost
   sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/495.29.05/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/495.29.05/build/Module.symvers -e -i Module.symvers   -T -
 ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
 make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/495.29.05/build/Module.symvers] Error 1

Tried with some old dataproc runtime versions. It works with runtime version "2.1.40-ubuntu20" and kernel version "5.15.0-1049-gcp".
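
The two data points above (failure on 5.15.0-1067-gcp, success on 5.15.0-1049-gcp) suggest a simple pre-flight check before attempting the driver build. The following is a minimal sketch, assuming kernel release strings of the form X.Y.Z-N-gcp. The function names and the LAST_KNOWN_GOOD variable are hypothetical, and 1049 is only the last build reported working in this issue, not a confirmed boundary.

```shell
#!/usr/bin/env bash
# Extract the GCP build number from a kernel release such as "5.15.0-1067-gcp".
# Prints nothing if the string does not match the assumed pattern.
gcp_kernel_build() {
  printf '%s\n' "$1" | sed -n 's/^[0-9][0-9.]*-\([0-9][0-9]*\)-gcp$/\1/p'
}

# Succeeds when the release is newer than the last build reported working here.
# LAST_KNOWN_GOOD is a hypothetical override; 1049 comes from this issue only.
newer_than_known_good() {
  build="$(gcp_kernel_build "$1")"
  if [ -z "$build" ]; then
    return 1
  fi
  [ "$build" -gt "${LAST_KNOWN_GOOD:-1049}" ]
}

# Usage on a node:
#   newer_than_known_good "$(uname -r)" && echo "NVIDIA DKMS build may fail on this kernel"
```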

Steps/Code to reproduce bug

  1. Create dataproc cluster using MIG with nvidia-tesla-a100 gpu and runtime version "2.1.72-ubuntu20"
  2. ssh to gpu node
  3. download mig.sh
  4. sudo bash mig.sh

Expected behavior
mig.sh runs successfully.

Environment details (please complete the following information)

  • Environment location: Dataproc, version 2.1.72-ubuntu20
@yinqingh added the "? - Needs Triage" and "bug" labels on Oct 30, 2024
@pxLi
Collaborator

pxLi commented Oct 30, 2024

thanks for the investigation!

@sameerz This is the reason why mig-on-dataproc-2.1-ubuntu20 has been failing to initialize recently.

@yinqingh changed the title from "[BUG] Failed to run mig.sh on dataproc-2.1-ubuntu20" to "[BUG] Failed to run mig.sh on MIG dataproc-2.1-ubuntu20" on Oct 30, 2024
@sameerz removed the "? - Needs Triage" label on Nov 5, 2024
@SurajAralihalli
Collaborator

SurajAralihalli commented Nov 7, 2024

Hello @yinqingh, I think you're using a different version of /gpu/mig.sh
Can you try with /spark-rapids/mig.sh?

I’ll inform the repository maintainers about this inconsistency.

Edit: Created issue GoogleCloudDataproc/initialization-actions#1259

@yinqingh
Collaborator Author

yinqingh commented Nov 8, 2024

Hi @SurajAralihalli, I tried spark-rapids/mig.sh, but it still failed while installing the NVIDIA driver (535.104.05) with the same error. The Dataproc runtime version is "2.1.73-ubuntu20".

 make -f ./scripts/Makefile.modpost
   sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/535.104.05/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/535.104.05/build/Module.symvers -e -i Module.symvers   -T -
 ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
 make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.104.05/build/Module.symvers] Error 1
 make[2]: *** Deleting file '/var/lib/dkms/nvidia/535.104.05/build/Module.symvers'
 make[1]: *** [Makefile:1829: modules] Error 2
 make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-1070-gcp'
 make: *** [Makefile:82: modules] Error 2
DKMSKernelVersion: 5.15.0-1070-gcp
Date: Fri Nov  8 09:07:43 2024
Package: nvidia-dkms-535 535.104.05-0ubuntu1
PackageVersion: 535.104.05-0ubuntu1
SourcePackage: nvidia-graphics-drivers-535
Title: nvidia-dkms-535 535.104.05-0ubuntu1: nvidia kernel module failed to build
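
Both build logs fail on the same modpost line, so a small filter can confirm whether a given DKMS log shows this specific failure. This is a rough sketch only: the function name is hypothetical, it reads the log on stdin, and the make.log path in the usage comment is an assumption based on the build directory shown in the output above.

```shell
#!/usr/bin/env bash
# Print each distinct GPL-only symbol that modpost rejected, one per line.
# Reads build output on stdin; prints nothing if the error is absent.
gpl_only_symbols() {
  sed -n "s/.*uses GPL-only symbol '\([^']*\)'.*/\1/p" | sort -u
}

# Usage (log path is an assumption, adjust to the driver version in use):
#   gpl_only_symbols < /var/lib/dkms/nvidia/535.104.05/build/make.log
```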

@pxLi
Collaborator

pxLi commented Dec 3, 2024

more context at: GoogleCloudDataproc/initialization-actions#1259

@SurajAralihalli
Collaborator

I've drafted GoogleCloudDataproc/initialization-actions#1269 to fix the MIG scripts. I can see that MIG is enabled by running nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | grep Enabled on the cluster. Can you please use https://github.com/SurajAralihalli/initialization-actions/blob/fix_mig/spark-rapids/mig.sh and confirm whether it works for your use case?
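
The query in the comment above could be wrapped in a stricter check. Piping to grep Enabled succeeds if any one GPU reports Enabled, whereas the sketch below succeeds only if every listed GPU does. The function name is hypothetical, and the sketch assumes the query prints one mode per line.

```shell
#!/usr/bin/env bash
# Reads one MIG mode per line on stdin and succeeds only if at least one GPU
# is listed and every GPU reports "Enabled".
all_gpus_mig_enabled() {
  awk '
    { gsub(/^[ \t\r]+|[ \t\r]+$/, "") }   # trim stray whitespace
    $0 == "" { next }                      # skip blank lines
    { total++; if ($0 != "Enabled") other++ }
    END { if (total > 0 && other == 0) exit 0; exit 1 }
  '
}

# Usage on a GPU node (requires nvidia-smi):
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | all_gpus_mig_enabled \
#     && echo "MIG enabled on all GPUs"
```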

@cjac

cjac commented Jan 7, 2025

Hello @yinqingh - I've been working on the mig script recently. Here's some news:

  • Open kernel drivers do not support A100
  • Proprietary kernel drivers do not compile on new kernels
  • I have been exercising the mig script with H100 devices, and these work the same way on GCP as the A100s as far as my tests are concerned.

So this is to ask: do you need to use A100 GPUs? If I provided documentation on how to use H100 GPUs and got the mig script working with those instead, would that meet your needs?

If you want a preview of the new script, I have a PR open here:
https://github.com/GoogleCloudDataproc/initialization-actions/pull/1284/files#diff-dbc9224bc43a19c64fa878694edbef41698d2bf1133af921d7b2a2e0da417f83

It's a complete re-write. I am generating the script from templates with this revision so that the code shared between the action scripts in the initialization-actions repository stays in sync. You can fetch the current state of the script, which I've recently exercised here:

https://github.com/GoogleCloudDataproc/initialization-actions/raw/883e6a2664c077edab176614a14d774c23009497/spark-rapids/mig.sh

Please keep in touch. I would like to have more contact with the users of these scripts.

@cjac

cjac commented Jan 10, 2025

I'm about to run some tests to exercise the MIG script on H100s

@yinqingh
Collaborator Author

Hi @cjac , thanks for the new script! But we actually need to use A100. We are currently using the mig.sh in this PR in our test jobs and it works well so far.

@cjac

cjac commented Jan 10, 2025

Okay. A100, aye. Tell me your image? 2.2-rocky9? either of the rocky8 images?

I think it will be easiest on 2.1-rocky8 or 2.0-rocky8 but very briefly before upgrading to 2.2-rocky9 please.

I will use the script in that PR as a guide, then. Thank you!

@yinqingh
Collaborator Author

We use 2.1-debian11 and 2.1-ubuntu20 in our job

@cjac

cjac commented Jan 11, 2025

> We use 2.1-debian11 and 2.1-ubuntu20 in our job

Okay, thank you. I want to talk to NV to find out how to reset the A100 without a reboot. None of the other init actions reboot, and there is not really support for the process, last I checked. Secondary nodes, especially, may not perform their work if the init action reboots them: state may be lost, and the init action may start from the beginning every time a reboot is initiated. The customer will be billed, but the secondary worker may never come online.

I will put the reboot logic back in for this release to meet your use case, conditional on the mig script finding an a100 in /proc/bus/pci. This will put the system into an indeterminate state which will make the product more difficult to support. Nota Bene: requiring a reboot may eliminate support for secondary worker auto scaling.
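
One way the A100 detection described above might look is sketched below. The thread says only that the script would look for an A100 in /proc/bus/pci; reading /proc/bus/pci/devices, the function name, and the list of device IDs are all assumptions here. The IDs in particular should be verified against pci.ids or NVIDIA's supported-chips list before use.

```shell
#!/usr/bin/env bash
# Assumed A100 PCI device IDs (NVIDIA vendor ID is 10de). Verify before relying on them.
A100_DEVICE_IDS="20b0 20b2 20b5 20f1"

# Succeeds if the devices file lists any of the assumed A100 IDs.
# The second field of each line is <vendor><device> as eight hex digits.
has_a100() {
  devices_file="${1:-/proc/bus/pci/devices}"
  for id in $A100_DEVICE_IDS; do
    if awk -v want="10de${id}" 'tolower($2) == want { found = 1 } END { exit !found }' "$devices_file"; then
      return 0
    fi
  done
  return 1
}

# Usage: gate the reboot path on the presence of an A100.
#   if has_a100; then echo "A100 found: reboot path required"; fi
```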

@cjac
Copy link

cjac commented Jan 11, 2025

> I've drafted GoogleCloudDataproc/initialization-actions#1269 to fix the MIG scripts. I can see mig is enabled by querying nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | grep Enabled on the cluster. Can you please use https://github.com/SurajAralihalli/initialization-actions/blob/fix_mig/spark-rapids/mig.sh and confirm if it works for your use case.

Suraj,

Do you know if there is any way to reset the A100 successfully without a reboot? I see that nvidia-smi has a mechanism to list processes known to be utilizing the kernel driver. And repeatedly rmmod'ing each module in sequence about five times usually results in successful removal of the symbols from the kernel. I would expect that running nvidia-smi --gpu-reset from a system without the module loaded would:

  • Load the module
  • Initialize the device
  • Return $?==0

The documentation on the Internet says "have you tried rebooting" like from the IT support desk front line. The answer is that the system I'm working with cannot be relied on to keep state on reboot. The hypervisor system on which the guest executes may change, and thus would have a new device attached. If there is any reliable way, without a reboot, to reset the hardware, power cycle the pci-e card, whatever is necessary, I would like to perform those steps rather than relying on a reboot. It extends the already lengthy time necessary to complete the initialization of the host.
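
The repeated rmmod approach described in this comment can be sketched as a generic retry helper. This is only an illustration: the helper name is hypothetical, and the module names and unload order in the usage comment are assumptions rather than something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (module names and order are assumptions; requires root):
#   for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
#     retry 5 1 rmmod "$mod" || echo "could not unload $mod"
#   done
```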
