[K8s] Custom Image Support for Kubernetes Instances #2729
Adding |
…o k8s_custom_image # Conflicts: # Dockerfile_k8s # sky/backends/backend_utils.py # sky/clouds/kubernetes.py
… k8s_custom_image
Made a few fixes and added a quick guide for anyone interested in trying it out:
If your custom image is hosted on a private registry, you'll need to set up a Kubernetes secret to access it. Once the secret is set up, add the following to your
This will automatically add the secret to the pod spec when SkyPilot creates the pod.
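As a rough illustration of what "adding the secret to the pod spec" could look like, the sketch below adds an entry to the standard Kubernetes `imagePullSecrets` field of a pod manifest held as a Python dict. The function name, the secret name `my-registry-secret`, and the image URL are hypothetical, and this is not necessarily how SkyPilot implements it.

```python
import copy


def add_image_pull_secret(pod_manifest: dict, secret_name: str) -> dict:
    """Return a copy of the manifest that lists secret_name in imagePullSecrets."""
    updated = copy.deepcopy(pod_manifest)
    secrets = updated.setdefault('spec', {}).setdefault('imagePullSecrets', [])
    # Avoid duplicate entries if the secret is already referenced.
    if not any(entry.get('name') == secret_name for entry in secrets):
        secrets.append({'name': secret_name})
    return updated


# Hypothetical pod manifest using an image from a private registry.
pod = {
    'apiVersion': 'v1',
    'kind': 'Pod',
    'spec': {
        'containers': [{
            'name': 'ray-node',
            'image': 'registry.example.com/team/custom-image:latest',
        }],
    },
}
patched = add_image_pull_secret(pod, 'my-registry-secret')
```

The helper copies the manifest rather than mutating it, so the original template stays reusable across pods.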
sky/templates/kubernetes-ray.yml.j2
Outdated
(pip3 list | grep skypilot && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[remote]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1);
sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf';
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config;
pip3 install ray==2.4.0 aiohttp opencensus prometheus_client aiohttp_cors kubernetes==28.1.0;
The Ray installation should go before the SkyPilot installation, and it needs a patch to work. Reference:
skypilot/sky/templates/aws-ray.yml.j2
Line 163 in efdd07d
(pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}});
skypilot/sky/templates/aws-ray.yml.j2
Line 170 in efdd07d
python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1;
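The ordering requested here can be sketched as a small helper that assembles the setup commands in sequence. This is only an illustration: the function name and the wheel path are hypothetical, and placing the patch step after the SkyPilot install is an assumption based on the patch being imported from the `sky` package.

```python
from typing import List


def build_setup_commands(ray_version: str, sky_wheel: str) -> List[str]:
    """Return setup commands ordered as: Ray, then SkyPilot, then Ray patches."""
    # Step 1: install the pinned Ray version unless it is already present.
    install_ray = (
        f'(pip3 list | grep "ray " | grep {ray_version} 2>&1 > /dev/null '
        f'|| pip3 install --exists-action w -U "ray[default]=={ray_version}")'
    )
    # Step 2: install SkyPilot from a wheel (path is a placeholder here).
    install_skypilot = f'pip3 install "{sky_wheel}[remote]"'
    # Step 3: apply the patches; this imports from the installed sky package.
    apply_patches = (
        'python3 -c "from sky.skylet.ray_patches import patch; patch()" '
        '|| exit 1'
    )
    return [install_ray, install_skypilot, apply_patches]


commands = build_setup_commands('2.4.0', '/tmp/skypilot-example.whl')
setup_script = '; '.join(commands)
```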
f'Setting up SSH in pod.')
self._setup_ssh_in_pods(new_nodes)
logger.info(config.log_prefix +
f'Setting up environment variables in pod.')
self._set_env_vars_in_pods(new_nodes)
Do we need to set up envs for the jump pod?
The envs are necessary for pods to interact with GPUs, so they're only needed for the instance pods, not for the jump pod.
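For readers unfamiliar with this step, here is a minimal sketch of one way environment variables could be rendered into a shell snippet that a pod's login sessions can source. It is not the code from this PR: the function name is hypothetical, and `NVIDIA_VISIBLE_DEVICES` is used only as an assumed example of a GPU-related variable.

```python
import re
import shlex
from typing import Dict

# Only names that are valid shell identifiers can be exported.
_VALID_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def render_env_exports(env: Dict[str, str]) -> str:
    """Render env vars as 'export NAME=value' lines, quoting values safely."""
    lines = []
    for name, value in sorted(env.items()):
        if not _VALID_NAME.match(name):
            continue  # Skip names the shell would reject.
        lines.append(f'export {name}={shlex.quote(value)}')
    return '\n'.join(lines) + '\n'


# Assumed example input; real pods may expose different variables.
script = render_env_exports({'NVIDIA_VISIBLE_DEVICES': 'all'})
```

The rendered text would only need to be written to instance pods; the jump pod can be skipped entirely.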
Thanks @landscapepainter for submitting this PR! Since the changes have been adopted into #3019 and merged into master, we should probably close this PR. : )
Thanks for adding support for k8s custom images, @landscapepainter! Most of this PR has been adopted in #3019, but we should have merged this PR before that one to avoid losing the commit history. Let's get this PR in to keep the commit history.
Note for people looking at this PR: the node_provider is deprecated for k8s support, and all the awesome changes/commits from @landscapepainter have been applied to #3019. We merged this PR to track the commit history. : )
This PR resolves #2599
When users provide a custom image for setting up the k8s instance instead of relying on SkyPilot's default image built with `Dockerfile_k8s`, certain assumptions and adjustments are made to ensure smooth operation:

- Base Distribution: We expect users to provide images that are based on Debian-based distributions.
- User Privilege: Users must configure the default user's privileges correctly within their custom image, either by giving the default user root privileges or by installing `sudo`.
- Dependency Installation & SSH Setup: Instead of baking these directly into the image, we've shifted the responsibility for installing dependencies and setting up SSH to the `ray` manifest. These steps are triggered after the pods are active and are managed within `node_provider.py` under the `create_node()` method.
- SSH user update: `kubernetes-ray.yml.j2` sets a default `ssh_user` of `sky`. However, since users' images may have a different user name, we need to update the `ssh_user` in the cluster yaml and in the `SSHCommandRunner` used within NodeProvider. These are done from `node_provider.py/create_node()` and `node_provider.py/get_command_runner()`, respectively.

By adopting these measures, we aim to offer greater flexibility for users bringing their own images while ensuring consistent system behavior.
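The privilege and SSH user points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `node_provider.py`: both function names are hypothetical, and the `auth.ssh_user` layout follows the usual Ray cluster yaml convention.

```python
import copy


def command_prefix(uid: int, has_sudo: bool) -> str:
    """Pick a prefix for privileged setup commands inside the pod."""
    if uid == 0:
        return ''  # Already root; no prefix needed.
    if has_sudo:
        return 'sudo '
    raise RuntimeError(
        'Custom image must either run as root or have sudo installed.')


def with_ssh_user(cluster_config: dict, detected_user: str) -> dict:
    """Return a copy of the cluster config whose auth.ssh_user is updated."""
    updated = copy.deepcopy(cluster_config)
    updated.setdefault('auth', {})['ssh_user'] = detected_user
    return updated


# The template default is 'sky'; a custom image may use another user.
config = {'auth': {'ssh_user': 'sky'}}
new_config = with_ssh_user(config, 'ubuntu')
prefix = command_prefix(uid=1000, has_sudo=True)
```

Failing early when neither root nor `sudo` is available gives a clearer error than letting a later setup command fail inside the pod.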
Update: This PR supports public registries only; private registries may be supported in the near future. Until then, there will be a doc describing how users can set up access to a private registry.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh
Tested Base Images:
Tested Registries:
Early failovers: