-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[gpu][spark-rapids] Fix MIG script #1269
base: master
Are you sure you want to change the base?
[gpu][spark-rapids] Fix MIG script #1269
Conversation
Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
Hey @cjac, I can't add you as a reviewer for this PR. Could you please take a look? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi there! Take a look at the latest version of the utility scripts here:
function os_id() ( set +x ; grep '^ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_version() ( set +x ; grep '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; )
function os_codename() ( set +x ; grep '^VERSION_CODENAME=' /etc/os-release | cut -d= -f2 | xargs ; )
function version_ge() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | tail -n1)" ] ; )
function version_gt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_ge $1 $2 ; )
function version_le() ( set +x ; [ "$1" = "$(echo -e "$1\n$2" | sort -V | head -n1)" ] ; )
function version_lt() ( set +x ; [ "$1" = "$2" ] && return 1 || version_le $1 $2 ; )
readonly -A supported_os=(
['debian']="10 11 12"
['rocky']="8 9"
['ubuntu']="18.04 20.04 22.04"
)
dynamically define OS version test utility functions
if [[ "$(os_id)" == "rocky" ]];
then _os_version=$(os_version | sed -e 's/[^0-9].*$//g')
else os_version="$(os_version)"; fi
for os_id_val in 'rocky' 'ubuntu' 'debian' ; do
eval "function is
for osver in
eval "function is_${os_id_val}${osver%%.}() ( set +x ; is_${os_id_val} && [[ "${os_version}" == "${osver}" ]] ; )"
eval "function ge${os_id_val}${osver%%.}() ( set +x ; is_${os_id_val} && version_ge "${os_version}" "${osver}" ; )"
eval "function le${os_id_val}${osver%%.*}() ( set +x ; is_${os_id_val} && version_le "${_os_version}" "${osver}" ; )"
done
done
function is_debuntu() ( set +x ; is_debian || is_ubuntu ; )
function os_vercat() ( set +x
if is_ubuntu ; then os_version | sed -e 's/[^0-9]//g'
elif is_rocky ; then os_version | sed -e 's/[^0-9].*$//g'
else os_version ; fi ; )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to create a template system so that we don't have to keep these functions synchronized manually.
A job for another day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these changes should wait on the next PR. There are a lot of things happening in the GPU driver that should be included, IMHO.
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.2.2') #12.2.2 | ||
NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '535.104.05') #535.104.05 | ||
CUDA_VERSION=$(get_metadata_attribute 'cuda-version' '12.4.1') #12.2.2 | ||
NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' '550.54.15') #535.104.05 | ||
CUDA_VERSION_MAJOR="${CUDA_VERSION%.*}" #12.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cuda and driver version selection code is presently getting a revamp, too. You might want to wait for that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for letting me know. Can we include these improvements in a different PR to unblock the customers (NVIDIA/spark-rapids#11675)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the revamp is in #1275 and is ready for review. If you can take a look and leave a comment, it will give the final reviewers solace to know that our users have exercised the changes.
WORKDIR=/opt/install-nvidia-driver | ||
mkdir -p "${WORKDIR}" | ||
pushd $_ | ||
# Fetch open souce kernel module with corresponding tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*source
execute_with_retries "apt-get update" | ||
|
||
execute_with_retries "apt-get install -y -q --no-install-recommends cuda-drivers-${NVIDIA_DRIVER_VERSION_PREFIX}" | ||
execute_with_retries "apt-get install -y -q --no-install-recommends cuda-toolkit-${CUDA_VERSION_MAJOR//./-}" | ||
|
||
# enable a systemd service that updates kernel headers after reboot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, bad idea. The kernel should not change. To get a new kernel, upgrade the image. The kernel version should be pinned, in my opinion. Otherwise, there are too many failure states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Currently, reboots can trigger kernel upgrades, potentially breaking the drivers. We should avoid kernel upgrades.
Update: I'm concerned that pinning the kernel might prevent security-related fixes from being applied to existing clusters. Please correct me if my concern is unwarranted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if I can support customers running outdated images. I guess we've been doing it to date by applying main force. But it's not really sustainable. Can you tell me whether your users will be able to move to a new dataproc image to get new kernels?
The customer can only apply new kernel fixes if they reboot into a new kernel. When they do that, they will lose the drivers that I build for them just for that kernel version. I do not enable DKMS because we do not recommend changing kernels. Since DKMS is not installed, the system will not even attempt to build a new version of the kernel driver. The assumption is that the service account is only granted access to the driver signing private key material during the installation, or more realistically, during custom image generation.
On systems with secure boot enabled, the user will not be able to fetch the signing material required for driver re-build. It's just a lot less hassle if customers decommission old images and move to new images regularly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We transitioned away from installing drivers using run files because this approach caused issues with drivers during kernel upgrades and subsequent reboots. This was likely due to the kernel version not being pinned, leading to kernel upgrades.
Are you suggesting we pin the kernel version for each Dataproc image version (e.g., v2.0, v2.1, v2.2)?
If a user has a long-running GPU cluster with v2.2 and needs to upgrade the kernel for security patches, what would be the recommended process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a lot less hassle if customers decommission old images and move to new images regularly
- Does Dataproc (v2.0, v2.1, v2.2) support minor versions?
- How can one ensure the latest kernel is installed during cluster creation, specifically at the time of installing the run file?
That day is today. Please take a look at #1282 |
Thanks for sharing! I'm currently traveling, I'll get to it as soon as I can. |
I've moved the mig-specific changes into PR #1284 |
Resolves #1259
This PR updates the MIG script to align the driver installation steps with those in
spark-rapids.sh
.