[BUG] Failed to run mig.sh on MIG dataproc-2.1-ubuntu20 #11675

yinqingh · 2024-10-30T10:01:29Z

Describe the bug
I observed the following error while running mig.sh on dataproc-2.1-ubuntu20 with runtime version "2.1.72-ubuntu20" and kernel version "5.15.0-1067-gcp".

 make -f ./scripts/Makefile.modpost
   sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/495.29.05/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/495.29.05/build/Module.symvers -e -i Module.symvers   -T -
 ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
 make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/495.29.05/build/Module.symvers] Error 1

I also tried some older Dataproc runtime versions. mig.sh works with runtime version "2.1.40-ubuntu20" and kernel version "5.15.0-1049-gcp".

Steps/Code to reproduce bug

Create dataproc cluster using MIG with nvidia-tesla-a100 gpu and runtime version "2.1.72-ubuntu20"
ssh to gpu node
download mig.sh
sudo bash mig.sh

Expected behavior
mig.sh runs successfully.

Environment details (please complete the following information)

Environment location: Dataproc, version 2.1.72-ubuntu20

pxLi · 2024-10-30T10:07:26Z

thanks for the investigation!

@sameerz This is the reason why mig-on-dataproc-2.1-ubuntu20 has been failing to initialize recently.

SurajAralihalli · 2024-11-07T23:15:55Z

Hello @yinqingh, I think you're using a different version of the script, /gpu/mig.sh.
Can you try /spark-rapids/mig.sh instead?

I’ll inform the repository maintainers about this inconsistency.

Edit: Created issue GoogleCloudDataproc/initialization-actions#1259

yinqingh · 2024-11-08T09:23:06Z

Hi @SurajAralihalli, I tried spark-rapids/mig.sh, but it still failed to install the NVIDIA driver (535.104.05) with the same error. The Dataproc runtime version is "2.1.73-ubuntu20".

 make -f ./scripts/Makefile.modpost
   sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/535.104.05/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/535.104.05/build/Module.symvers -e -i Module.symvers   -T -
 ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
 make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.104.05/build/Module.symvers] Error 1
 make[2]: *** Deleting file '/var/lib/dkms/nvidia/535.104.05/build/Module.symvers'
 make[1]: *** [Makefile:1829: modules] Error 2
 make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-1070-gcp'
 make: *** [Makefile:82: modules] Error 2
DKMSKernelVersion: 5.15.0-1070-gcp
Date: Fri Nov  8 09:07:43 2024
Package: nvidia-dkms-535 535.104.05-0ubuntu1
PackageVersion: 535.104.05-0ubuntu1
SourcePackage: nvidia-graphics-drivers-535
Title: nvidia-dkms-535 535.104.05-0ubuntu1: nvidia kernel module failed to build
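
To spot this failure quickly in a long build log, the modpost error can be filtered out with a small helper. This is only a sketch: gpl_symbol_errors is a hypothetical name, and the make.log location shown in the usage note is the usual DKMS layout, assumed here rather than taken from the report.

```shell
# Sketch: print "<module> <symbol>" for every modpost error of the form
#   ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'NAME'
# found in a build log read from stdin.
# gpl_symbol_errors is a hypothetical helper, not part of mig.sh or DKMS.
gpl_symbol_errors() {
  sed -n "s/.*GPL-incompatible module \([^ ]*\) uses GPL-only symbol '\([^']*\)'.*/\1 \2/p"
}
```

Possible usage, assuming the log sits at the usual DKMS path: gpl_symbol_errors < /var/lib/dkms/nvidia/535.104.05/build/make.log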

pxLi · 2024-12-03T02:16:07Z

more context at: GoogleCloudDataproc/initialization-actions#1259

SurajAralihalli · 2024-12-04T11:55:28Z

I've drafted GoogleCloudDataproc/initialization-actions#1269 to fix the MIG scripts. I can see mig is enabled by querying nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | grep Enabled on the cluster. Can you please use https://github.com/SurajAralihalli/initialization-actions/blob/fix_mig/spark-rapids/mig.sh and confirm if it works for your use case.
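
The nvidia-smi query above can be wrapped so that it checks every GPU on the node instead of matching any single line. A minimal sketch, assuming the query prints one value per GPU and that anything other than "Enabled" should count as not enabled; all_gpus_mig_enabled is a hypothetical helper name, not something defined in mig.sh.

```shell
# Sketch: succeed only if every GPU reports MIG mode "Enabled".
# Reads the output of
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
# on stdin (one line per GPU), so the parsing can be tried without a GPU.
# all_gpus_mig_enabled is a hypothetical helper, not part of mig.sh.
all_gpus_mig_enabled() {
  seen=0
  while IFS= read -r line || [ -n "$line" ]; do
    # Strip padding and stray carriage returns before comparing.
    line=$(printf '%s' "$line" | tr -d '[:space:]')
    [ -z "$line" ] && continue
    seen=1
    # Any value other than "Enabled" fails the check.
    [ "$line" = "Enabled" ] || return 1
  done
  # Empty input (no GPUs reported) also counts as failure.
  [ "$seen" -eq 1 ]
}
```

Possible usage: nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | all_gpus_mig_enabled && echo "MIG enabled on all GPUs"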

cjac · 2025-01-07T22:24:34Z

Hello @yinqingh - I've been working on the mig script recently. Here's some news:

Open kernel drivers do not support A100
Proprietary kernel drivers do not compile on new kernels
I have been exercising the mig script with H100 devices, and these work the same way on GCP as the A100s as far as my tests are concerned.

So this is to ask: do you need to use A100 GPUs? If I provided documentation on how to use H100 GPUs and got the mig script working with those instead, would that meet your needs?

If you want a preview of the new script, I have a PR open here:
https://github.com/GoogleCloudDataproc/initialization-actions/pull/1284/files#diff-dbc9224bc43a19c64fa878694edbef41698d2bf1133af921d7b2a2e0da417f83

It's a complete re-write. I am generating the script from templates with this revision so that the code shared between the action scripts in the initialization-actions repository stays in sync. You can fetch the current state of the script, which I've recently exercised here:

https://github.com/GoogleCloudDataproc/initialization-actions/raw/883e6a2664c077edab176614a14d774c23009497/spark-rapids/mig.sh

Please keep in touch. I would like to have more contact with the users of these scripts.

cjac · 2025-01-10T01:27:54Z

I'm about to run some tests to exercise the MIG script on H100s

yinqingh · 2025-01-10T04:02:41Z

Hi @cjac, thanks for the new script! But we actually need to use A100. We are currently using the mig.sh in this PR in our test jobs and it has worked well so far.

cjac · 2025-01-10T05:34:03Z

Okay. A100, aye. Which image are you using? 2.2-rocky9? Either of the rocky8 images?

I think it will be easiest on 2.1-rocky8 or 2.0-rocky8, but only briefly, before upgrading to 2.2-rocky9, please.

I will use the script in that PR as a guide, then. Thank you!

yinqingh · 2025-01-10T06:06:39Z

We use 2.1-debian11 and 2.1-ubuntu20 in our job

cjac · 2025-01-11T07:26:12Z

> We use 2.1-debian11 and 2.1-ubuntu20 in our job

Okay, thank you. I want to talk to NVIDIA to find out how to reset the A100 without a reboot. None of the other init actions reboot, and there is not really support for the process, last I checked. Secondary workers, especially, may not perform their work if the init action reboots them: state may be lost, and the init action may start from the beginning every time a reboot is initiated. The customer will be billed, but the secondary worker may never come online.

I will put the reboot logic back in for this release to meet your use case, conditional on the mig script finding an A100 in /proc/bus/pci. This will put the system into an indeterminate state, which will make the product more difficult to support. Nota bene: requiring a reboot may eliminate support for secondary worker autoscaling.
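
One way such a conditional check could look is sketched below. It is not the logic from the actual mig script. It assumes the /proc/bus/pci/devices table format, where the second field is the vendor ID followed by the device ID as eight hex digits. The A100 device IDs listed are assumptions that should be verified with lspci -nn or the PCI ID database, and pci_table_has_device and A100_PCI_IDS are hypothetical names.

```shell
# Sketch: succeed if a PCI device table lists one of the given IDs.
# The table is read from stdin in the format of /proc/bus/pci/devices,
# whose second whitespace-separated field is vendor ID + device ID
# as eight hex digits (for example 10de20b0).
# pci_table_has_device is a hypothetical helper, not taken from mig.sh.
pci_table_has_device() {
  # $1: space-separated list of 8-digit IDs, compared case-insensitively.
  wanted=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  awk -v wanted="$wanted" '
    BEGIN { n = split(wanted, ids, " "); for (i = 1; i <= n; i++) want[ids[i]] = 1 }
    (tolower($2) in want) { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Assumed IDs for the A100 SXM4 40GB and 80GB parts (NVIDIA vendor ID 10de).
# Verify these for your hardware before relying on them.
A100_PCI_IDS="10de20b0 10de20b2"
```

Possible usage: pci_table_has_device "$A100_PCI_IDS" < /proc/bus/pci/devices && echo "A100 found, reboot path applies"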

cjac · 2025-01-11T07:39:39Z

> I've drafted GoogleCloudDataproc/initialization-actions#1269 to fix the MIG scripts. I can see mig is enabled by querying nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | grep Enabled on the cluster. Can you please use https://github.com/SurajAralihalli/initialization-actions/blob/fix_mig/spark-rapids/mig.sh and confirm if it works for your use case.

Suraj,

Do you know if there is any way to reset the A100 successfully without a reboot? I see that nvidia-smi has a mechanism to list processes known to be using the kernel driver, and repeatedly rmmod'ing each module in sequence, about five times, usually results in successful removal of the symbols from the kernel. I would expect that running nvidia-smi --gpu-reset on a system without the module loaded would:

Load the module
Initialize the device
Return $?==0

The documentation on the Internet says "have you tried rebooting", like the front line of an IT support desk. The answer is that the system I'm working with cannot be relied on to keep state across a reboot. The hypervisor host on which the guest executes may change, and the guest would then have a new device attached. If there is any reliable way to reset the hardware without a reboot, whether that means power cycling the PCIe card or whatever else is necessary, I would like to perform those steps rather than relying on a reboot. A reboot also extends the already lengthy time needed to complete the initialization of the host.
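
The repeated rmmod approach described above can be sketched as a bounded retry loop. This is an illustration, not code from the mig script. It assumes every listed module is currently loaded (rmmod fails on a module that is not loaded), and the helper name and the RMMOD and RETRY_DELAY variables are hypothetical.

```shell
# Sketch: try to unload kernel modules, retrying the ones that fail.
# Usage: unload_modules_with_retry ATTEMPTS MODULE...
# RMMOD overrides the unload command (default: rmmod) and RETRY_DELAY sets
# the pause in seconds between rounds (default: 1). Both are hypothetical
# knobs added so the loop can be exercised without touching real modules.
unload_modules_with_retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    remaining=""
    for mod in "$@"; do
      # Keep track of modules that could not be removed in this round.
      "${RMMOD:-rmmod}" "$mod" 2>/dev/null || remaining="$remaining $mod"
    done
    # Everything unloaded: report success.
    [ -z "$remaining" ] && return 0
    # Retry only the modules that are still loaded.
    set -- $remaining
    i=$((i + 1))
    sleep "${RETRY_DELAY:-1}"
  done
  return 1
}
```

Possible usage as root, with module names assumed from a typical NVIDIA driver stack and ordered so that dependents go first: unload_modules_with_retry 5 nvidia_uvm nvidia_drm nvidia_modeset nvidia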

yinqingh added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Oct 30, 2024

yinqingh changed the title ~~[BUG] Failed to run mig.sh on dataproc-2.1-ubuntu20~~ [BUG] Failed to run mig.sh on MIG dataproc-2.1-ubuntu20 Oct 30, 2024

viadea assigned SurajAralihalli Nov 5, 2024

sameerz removed the "? - Needs Triage" (Need team to review and classify) label Nov 5, 2024

SurajAralihalli mentioned this issue Dec 6, 2024 in: [gpu][spark-rapids] Fix MIG script GoogleCloudDataproc/initialization-actions#1269 (Open)
