Update the dockerfile base image to cuda-dl-base #1248

Open
wants to merge 16 commits into
base: main

@Steboss Steboss commented Jan 14, 2025

Update the base Docker image so that we use cuda-dl-base from nvcr.io.

@Steboss Steboss requested review from yhtang and olupton January 14, 2025 11:42
Author

Steboss commented Jan 14, 2025

I ran a check in the cuda-dl-base image and found that:

  • for NCCL we need to create symlinks for the include and lib directories, so that they are mapped into /opt/nvidia/nccl
  • the same applies to cuDNN
  • we can safely remove install-ofed.sh
  • and the Amazon EFA support

For the symlink, we'd just need this part of the install-nccl.sh script (and the counterpart in install-cudnn.sh script):

arch=$(uname -m)-linux-gnu
for nccl_file in $(dpkg -L libnccl2 libnccl-dev | sort -u); do
  # Real files and symlinks are linked into $prefix
  if [[ -f "${nccl_file}" || -h "${nccl_file}" ]]; then
    # Replace /usr with $prefix and remove arch-specific lib directories
    nosysprefix="${nccl_file#"/usr/"}"
    noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
    link_name="${prefix}/${noarchlib}"
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${nccl_file}" "${link_name}"
  else
    echo "Skipping ${nccl_file}"
  fi
done

@DwarKapex does it sound right to you?
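
The path rewriting done inside the loop above can be checked in isolation. This is a minimal sketch, assuming a hypothetical helper `link_name_for` that is not part of install-nccl.sh; it takes `prefix` and `arch` as arguments instead of reading them from the surrounding script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in install-nccl.sh): compute where a packaged file
# would be linked under a given prefix, using the same two expansions as the loop.
link_name_for() {
  local file="$1" prefix="$2" arch="$3"
  # Strip a leading /usr/ (paths outside /usr are left unchanged by this step)
  local nosysprefix="${file#"/usr/"}"
  # Collapse a leading lib/<arch> into plain lib
  local noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
  printf '%s\n' "${prefix}/${noarchlib}"
}
```

For example, `link_name_for /usr/include/nccl.h /opt/nvidia/nccl "$(uname -m)-linux-gnu"` prints `/opt/nvidia/nccl/include/nccl.h`.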

@Steboss Steboss changed the title from "Update the dockerfile base image so that we can support NCCL" to "Update the dockerfile base image to cuda-dl-base" Jan 15, 2025
Author

Steboss commented Jan 15, 2025

@DwarKapex

  • Updated the base Dockerfile to use the cuda-dl-base image
  • To avoid conflicts and a re-install of nccl and cudnn, I've modified install-cudnn.sh and install-nccl.sh
  • In both scripts I added a symlink step so that the resources are present in /opt/nvidia/{package}
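
One way to sanity-check the result of such a symlink step is to look for links whose targets are missing. This is a minimal sketch under stated assumptions: `find_dangling_links` is a hypothetical helper, not something this PR adds, and it only inspects whatever directory it is given.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the PR): list symlinks under a prefix whose
# targets do not exist, and return non-zero if any are found.
find_dangling_links() {
  local prefix="$1" status=0 link
  while IFS= read -r -d '' link; do
    # -e follows the symlink, so it fails when the target is missing
    if [[ ! -e "${link}" ]]; then
      echo "Dangling: ${link}"
      status=1
    fi
  done < <(find "${prefix}" -type l -print0)
  return "${status}"
}
```

It could be called as `find_dangling_links /opt/nvidia/nccl` after the install scripts have run, so that a broken link fails the build early.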

Collaborator

olupton commented Jan 16, 2025

The nsys-jax test failures are because the 24.12 cuda-dl-base includes Nsight Systems 2024.7, whereas we currently install 2024.6 because of some issues with 2024.7 (#1176 is the - pending - attempt to move to 2024.7). A possible workaround would be to use the 24.11 cuda-dl-base for the moment.
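
A version guard can make this kind of mismatch visible before the tests run. The following is a minimal sketch, not something the PR contains: `nsys_release` and `check_nsys_release` are hypothetical helpers, and the only assumption about the version string is that it contains a release number of the form YYYY.N.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the first YYYY.N release number out of a version string.
nsys_release() {
  grep -oE '[0-9]{4}\.[0-9]+' <<<"$1" | head -n 1
}

# Hypothetical guard: succeed only if the reported release equals the expected one.
check_nsys_release() {
  local reported expected="$2"
  reported=$(nsys_release "$1")
  [[ "${reported}" == "${expected}" ]]
}
```

A possible use is `check_nsys_release "$(nsys --version)" 2024.6 || echo "unexpected Nsight Systems release"`.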

@olupton olupton requested a review from DwarKapex January 16, 2025 14:14
nouiz previously approved these changes Jan 20, 2025
olupton previously approved these changes Jan 21, 2025
Collaborator

@olupton olupton left a comment

LGTM, only one minor nit left.
@yhtang and/or @chaserileyroberts, please review the parts relevant to GCP networking.

olupton previously approved these changes Jan 21, 2025
Collaborator

@yhtang yhtang left a comment

What are the symlink-xyz scripts modeled after? How do other DLFW containers accommodate the DL core container?

Author

Steboss commented Jan 23, 2025

@yhtang
The symlink-xyz scripts create symlinks for nccl and cudnn. These packages are already installed in the cuda-dl-base image, but they are not linked into the /opt/nvidia/ folder, as they were in our previous setup.
This is specific to JAX/XLA, which is why the other DLFW containers might not need it.

@@ -1,27 +1,10 @@
 # syntax=docker/dockerfile:1-labs
-ARG BASE_IMAGE=nvidia/cuda:12.6.3-devel-ubuntu24.04
+ARG BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:24.11-cuda12.6-devel-ubuntu24.04
Contributor

The latest is 24.12-cuda12.6-devel-ubuntu24.04.

Author

Thanks for the comment, I'll update the image now :)

Author

Steboss commented Jan 24, 2025

@olupton @yhtang
I think it would be wise to add a CI step that checks for the latest cuda-dl-base image, so that the base image is updated automatically rather than by hand.
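
The tag-selection part of such a CI step could look roughly like the sketch below. It is only an illustration: `latest_tag` is a hypothetical helper, how the tag list is fetched from nvcr.io is left open, and it assumes tags keep the `YY.MM-...` shape seen in this PR so that version-aware sorting orders them correctly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read image tags on stdin (one per line) and print the
# newest one that starts with a YY.MM- release and ends with the given suffix.
latest_tag() {
  local suffix="$1" tag
  local re='^[0-9]+\.[0-9]+-'
  while IFS= read -r tag; do
    # Keep only release-style tags with the requested flavour suffix
    if [[ "${tag}" =~ ${re} && "${tag}" == *"${suffix}" ]]; then
      printf '%s\n' "${tag}"
    fi
  done | sort -V | tail -n 1
}
```

Given the two tags mentioned in this thread, `printf '%s\n' 24.11-cuda12.6-devel-ubuntu24.04 24.12-cuda12.6-devel-ubuntu24.04 | latest_tag devel-ubuntu24.04` selects the 24.12 tag. The result could then be compared with the current `BASE_IMAGE` default in Dockerfile.base.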

6 participants