[K8s] Custom Image Support for Kubernetes Instances #2729

Merged

landscapepainter
Collaborator

@landscapepainter landscapepainter commented Oct 23, 2023

This PR resolves #2599

When users provide a custom image for the k8s instance instead of relying on SkyPilot's default image built with Dockerfile_k8s, this PR makes the following assumptions and adjustments to keep things working smoothly:

Base Distribution: We expect custom images to be based on a Debian-based distribution.
User Privilege: The default user in the custom image must either have root privileges or have sudo installed.
Dependency Installation & SSH Setup: Instead of baking these into the image, we've shifted the responsibility for installing dependencies and setting up SSH to the ray manifest. These steps run after the pods are active and are managed in node_provider.py, in the create_node() method.
SSH user update: kubernetes-ray.yml.j2 sets a default ssh_user of sky. Because a custom image may use a different user name, we need to update ssh_user both in the cluster YAML and in the SSHCommandRunner used within NodeProvider. These updates are made in create_node() and get_command_runner() in node_provider.py, respectively.
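
As a rough illustration of the privilege and SSH-user points above, here is a minimal sketch. It is not the code in node_provider.py: the function names `build_setup_command` and `with_ssh_user`, the package list, and the assumption that the cluster YAML keeps the user under `auth.ssh_user` are all hypothetical choices made for this example.

```python
import copy
import shlex

# Hypothetical package list; the PR text does not enumerate the real one.
DEFAULT_PACKAGES = ['openssh-server', 'curl']


def build_setup_command(packages=DEFAULT_PACKAGES):
    """Return a shell command that installs packages on a Debian-based image.

    Runs apt-get directly when the default user is root, falls back to sudo
    when it is installed, and exits with an error otherwise.
    """
    pkgs = ' '.join(shlex.quote(p) for p in packages)
    install = f'apt-get update && apt-get install -y {pkgs}'
    return (
        'if [ "$(id -u)" -eq 0 ]; then PREFIX=""; '
        'elif command -v sudo >/dev/null 2>&1; then PREFIX="sudo "; '
        'else echo "Need root or sudo in the image" >&2; exit 1; fi; '
        f'${{PREFIX}}bash -c {shlex.quote(install)}'
    )


def with_ssh_user(cluster_config, ssh_user):
    """Return a copy of the cluster config with auth.ssh_user replaced."""
    updated = copy.deepcopy(cluster_config)
    updated.setdefault('auth', {})['ssh_user'] = ssh_user
    return updated
```

In this sketch the user name would come from the pod itself (for example, the output of `whoami`), and the same value would also need to be passed to whatever command runner connects to the pod over SSH.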

By adopting these measures, we aim to offer greater flexibility for users bringing their own images while ensuring consistent system behavior.

Update: This PR supports public registries only; private registries may be supported in the near future. Until then, there will be a doc describing how users can set up access to a private registry.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Tested Base Images:

  • continuumio/miniconda3:22.11.1
  • nvidia/cuda:12.2.0-devel-ubuntu20.04
  • pytorch/pytorch:latest
  • pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
  • tensorflow/tensorflow:latest
  • tensorflow/tensorflow:nightly-gpu-jupyter
  • ubuntu:latest
  • gpuci/miniforge-cuda:11.4-devel-ubuntu18.04

Tested Registries:

  • AWS ECR public
  • GCP Artifact Registry public
  • Docker Registry public

Early failovers:

  • AWS ECR private: handled with an error message
  • Docker private registry: handled with an error message and early failure
  • GCP Registry: failover when the user has not configured a GCP account
  • GCP Registry: failover when the user's configured GCP account does not have IAM authorization
  • Attempting to pull from a non-existent repository
  • Attempting to pull a non-existent tag from an existing repository
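
One way to detect the pull failures listed above early is to look at the container's waiting reason in the pod status. The sketch below assumes the status is available as a plain dict shaped like the `status` field of `kubectl get pod -o json`; the function name `find_image_pull_error` is hypothetical and this is not the check implemented in the PR.

```python
from typing import Optional

# Waiting reasons Kubernetes reports when an image cannot be pulled.
IMAGE_PULL_FAILURE_REASONS = {'ErrImagePull', 'ImagePullBackOff'}


def find_image_pull_error(pod_status: dict) -> Optional[str]:
    """Return a readable error if any container is stuck pulling its image.

    Returns None when no container reports an image pull failure.
    """
    for container in pod_status.get('containerStatuses', []):
        # 'waiting' may be absent or null when the container is running.
        waiting = container.get('state', {}).get('waiting') or {}
        reason = waiting.get('reason')
        if reason in IMAGE_PULL_FAILURE_REASONS:
            message = waiting.get('message', 'no details reported')
            return (f'Failed to pull image {container.get("image")!r} '
                    f'({reason}): {message}')
    return None
```

Failing as soon as one of these reasons appears avoids waiting for a provisioning timeout when the repository, tag, or credentials are wrong.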

@landscapepainter landscapepainter marked this pull request as draft October 23, 2023 01:30
@landscapepainter landscapepainter marked this pull request as ready for review November 6, 2023 06:30
@landscapepainter
Collaborator Author

Added curl to the packages installed by setup_commands in kubernetes-ray.yml.j2 to address #2799.

@romilbhardwaj
Collaborator

Made a few fixes and added a pod_override field to ~/.sky/config.yaml. This lets users specify custom pod spec overrides, such as custom labels, imagePullSecrets, and imagePullPolicy, through the sky config YAML, which makes it much easier to adapt pods to their environments.

Quick guide for anyone interested in trying this out:
To use a custom image, set the image_id: docker:<your-image-tag> field under the resources field in your YAML. Here's an example:

resources:
  cloud: kubernetes
  accelerators: T4:1
  image_id: docker:638892464027.dkr.ecr.us-west-2.amazonaws.com/romilpvtrepo:latest # Or nvidia/cuda:11.8.0-devel-ubuntu18.04
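
To make the `docker:` prefix convention concrete, here is a small sketch of how an `image_id` value could be split into the prefix and the image reference. The helper name `extract_docker_image` is hypothetical and is not claimed to match SkyPilot's own parsing.

```python
from typing import Optional

DOCKER_PREFIX = 'docker:'


def extract_docker_image(image_id: Optional[str]) -> Optional[str]:
    """Return the image reference if image_id uses the docker: prefix.

    Returns None when image_id is unset or does not use the prefix.
    """
    if image_id is None or not image_id.startswith(DOCKER_PREFIX):
        return None
    image = image_id[len(DOCKER_PREFIX):]
    if not image:
        raise ValueError('image_id has the docker: prefix but no image name')
    return image
```

Only the leading `docker:` is stripped, so a tag separator later in the string (as in `nvidia/cuda:11.8.0-devel-ubuntu18.04`) is left untouched.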

If your custom image is hosted on a private registry, you'll need to set up a Kubernetes secret to access it. Once the secret is set up, add the following to your ~/.sky/config.yaml:

kubernetes:
  pod_override:
    spec:
      imagePullSecrets:
        - name: <your-secret-name>

This will automatically add the secret to the pod spec when SkyPilot creates the pod.
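
Conceptually, the override acts as a patch layered on top of the pod spec that SkyPilot generates. The sketch below shows one plausible merge strategy, a recursive dict merge in which lists and scalars from the override replace the original values. The function name `apply_pod_override` and the merge semantics are assumptions for illustration; the actual behavior may differ.

```python
import copy


def apply_pod_override(pod_spec: dict, override: dict) -> dict:
    """Return a copy of pod_spec with override merged in recursively.

    Nested dicts are merged key by key; any other value (including lists)
    in the override replaces the original value.
    """
    merged = copy.deepcopy(pod_spec)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_pod_override(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical generated pod spec and the override from the config above.
pod = {'metadata': {'labels': {'app': 'sky'}},
       'spec': {'containers': [{'name': 'ray-node', 'image': 'my/image'}]}}
override = {'spec': {'imagePullSecrets': [{'name': 'my-secret'}]}}
patched = apply_pod_override(pod, override)
```

Here `patched['spec']` keeps the original `containers` list and gains `imagePullSecrets`, while `pod` itself is left unmodified.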

(pip3 list | grep skypilot && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[remote]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1);
sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf';
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config;
pip3 install ray==2.4.0 aiohttp opencensus prometheus_client aiohttp_cors kubernetes==28.1.0;
Collaborator

Ray installation should go before the SkyPilot installation, and it needs a patch to work. Reference:

(pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}});

python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1;

f'Setting up SSH in pod.')
self._setup_ssh_in_pods(new_nodes)
logger.info(config.log_prefix +
f'Setting up environment variables in pod.')
self._set_env_vars_in_pods(new_nodes)
Collaborator

Do we need to set up env vars for the jump pod?

Collaborator Author

@landscapepainter landscapepainter Jan 26, 2024

The env vars are needed for pods to interact with GPUs, so they are required only for the instance pods, not for the jump pod.

@Michaelvll
Collaborator

Thanks @landscapepainter for submitting this PR! Since the changes have been adopted into #3019 and merged into master, we should probably close this PR. : )

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding support for k8s custom images, @landscapepainter! Most of this PR has been adopted in #3019, but we should have merged this PR before that one to avoid losing the commit history. Let's merge this PR to keep the commit history.

Note for people looking at this PR: the node_provider is deprecated for k8s support, and all the awesome changes/commits from @landscapepainter have been applied in #3019. We merged this PR to track the commit history. : )

@landscapepainter landscapepainter merged commit b6ccf1c into skypilot-org:master Jan 31, 2024
19 checks passed