Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gpu][spark-rapids] Fix MIG script #1269

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

SurajAralihalli
Copy link
Contributor

Resolves #1259

This PR updates the MIG script to align the driver installation steps with those in spark-rapids.sh.

Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli
Copy link
Contributor Author

Hey @cjac, I can't add you as a reviewer for this PR. Could you please take a look? Thanks!
This PR was tested for NVIDIA/spark-rapids#11675.

Copy link
Contributor

@cjac cjac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there! Take a look at the latest version of the utility scripts here:

https://github.com/LLC-Technologies-Collier/initialization-actions/blob/db33fce27205535f440fa5cc0545cabc591e6730/gpu/install_gpu_driver.sh#L19

function os_id() ( set +x ; grep '^ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_version() ( set +x ; grep '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_codename() ( set +x ; grep '^VERSION_CODENAME=' /etc/os-release | cut -d= -f2 | xargs ; )

function version_ge() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | tail -n1)" ] ; )
function version_gt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_ge $1 $2 ; )
function version_le() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ] ; )
function version_lt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_le $1 $2 ; )

readonly -A supported_os=(
['debian']="10 11 12"
['rocky']="8 9"
['ubuntu']="18.04 20.04 22.04"
)

dynamically define OS version test utility functions

if [[ "$(os_id)" == "rocky" ]];
then _os_version=$(os_version | sed -e 's/[^0-9].*$//g')
else os_version="$(os_version)"; fi
for os_id_val in 'rocky' 'ubuntu' 'debian' ; do
eval "function is
${os_id_val}() ( set +x ; [[ "$(os_id)" == '${os_id_val}' ]] ; )"

for osver in $(echo "${supported_os["${os_id_val}"]}") ; do
eval "function is_${os_id_val}${osver%%.}() ( set +x ; is_${os_id_val} && [[ "${os_version}" == "${osver}" ]] ; )"
eval "function ge
${os_id_val}${osver%%.
}() ( set +x ; is_${os_id_val} && version_ge "${os_version}" "${osver}" ; )"
eval "function le
${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_le "${_os_version}" "${osver}" ; )"
done
done

function is_debuntu() ( set +x ; is_debian || is_ubuntu ; )

function os_vercat() ( set +x
if is_ubuntu ; then os_version | sed -e 's/[^0-9]//g'
elif is_rocky ; then os_version | sed -e 's/[^0-9].*$//g'
else os_version ; fi ; )

Copy link
Contributor

@cjac cjac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to create a template system so that we don't have to keep these functions synchronized manually.

A job for another day.

Copy link
Contributor

@cjac cjac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these changes should wait on the next PR. There are a lot of things happening in the GPU driver that should be included, IMHO.

CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.2.2') #12.2.2
NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '535.104.05') #535.104.05
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1') #12.2.2
NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '550.54.15') #535.104.05
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}" #12.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cuda and driver version selection code is presently getting a revamp, too. You might want to wait for that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for letting me know. Can we include these improvements in a different PR to unblock the customers (NVIDIA/spark-rapids#11675)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the revamp is in #1275 and is ready for review. If you can take a look and leave a comment, it will give the final reviewers solace to know that our users have exercised the changes.

WORKDIR=/opt/install-nvidia-driver
mkdir -p "${WORKDIR}"
pushd $_
# Fetch open souce kernel module with corresponding tag
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*source

execute_with_retries "apt-get update"

execute_with_retries "apt-get install -y -q --no-install-recommends cuda-drivers-${NVIDIA_DRIVER_VERSION_PREFIX}"
execute_with_retries "apt-get install -y -q --no-install-recommends cuda-toolkit-${CUDA_VERSION_MAJOR//./-}"

# enable a systemd service that updates kernel headers after reboot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, bad idea. The kernel should not change. To get a new kernel, upgrade the image. The kernel version should be pinned, in my opinion. Otherwise, there are too many failure states.

Copy link
Contributor Author

@SurajAralihalli SurajAralihalli Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Currently, reboots can trigger kernel upgrades, potentially breaking the drivers. We should avoid kernel upgrades.

Update: I'm concerned that pinning the kernel might prevent security-related fixes from being applied to existing clusters. Please correct me if my concern is unwarranted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I can support customers running outdated images. I guess we've been doing it to date by applying main force. But it's not really sustainable. Can you tell me whether your users will be able to move to a new dataproc image to get new kernels?

The customer can only apply new kernel fixes if they reboot into a new kernel. When they do that, they will lose the drivers that I build for them just for that kernel version. I do not enable DKMS because we do not recommend changing kernels. Since DKMS is not installed, the system will not even attempt to build a new version of the kernel driver. The assumption is that the service account is only granted access to the driver signing private key material during the installation, or more realistically, during custom image generation.

On systems with secure boot enabled, the user will not be able to fetch the signing material required for driver re-build. It's just a lot less hassle if customers decommission old images and move to new images regularly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We transitioned away from installing drivers using run files because this approach caused issues with drivers during kernel upgrades and subsequent reboots. This was likely due to the kernel version not being pinned, leading to kernel upgrades.

Are you suggesting we pin the kernel version for each Dataproc image version (e.g., v2.0, v2.1, v2.2)?
If a user has a long-running GPU cluster with v2.2 and needs to upgrade the kernel for security patches, what would be the recommended process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a lot less hassle if customers decommission old images and move to new images regularly

  • Does Dataproc (v2.0, v2.1, v2.2) support minor versions?
  • How can one ensure the latest kernel is installed during cluster creation, specifically at the time of installing the run file?

@cjac
Copy link
Contributor

cjac commented Dec 20, 2024

I want to create a template system so that we don't have to keep these functions synchronized manually.

A job for another day.

That day is today. Please take a look at #1282

@SurajAralihalli
Copy link
Contributor Author

That day is today. Please take a look at #1282

Thanks for sharing! I'm currently traveling, I'll get to it as soon as I can.

@cjac
Copy link
Contributor

cjac commented Jan 10, 2025

I've moved the mig-specific changes into PR #1284

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[gpu][spark-rapids] Consolidate mig.sh Scripts and Sync Driver Installation Steps Across Copies
2 participants