Preliminary Vast AI support #4365

Open
wants to merge 106 commits into
base: master


@kristopolous kristopolous commented Nov 15, 2024

This is preliminary support for Vast. It currently works on an unreleased version of the SDK, which we will soon publish to PyPI.

The document https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?pli=1&tab=t.0 was followed and all the testing passed

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

I'm pretty sure there will need to be edits, and I'm fine with that. This is attempt 1. The outstanding work:

We need to

  • tidy up our Docker Hub and get a better image to launch.
  • release the updates to the SDK and come up with a pip name for it.
  • get our catalog to update in the git hook flow as described (my goal is every 6 hours).

@Michaelvll Michaelvll requested a review from cblmemo November 16, 2024 02:46
Collaborator

@cblmemo cblmemo left a comment

Thanks for contributing to this, @kristopolous! This is really exciting. Left some discussions. One main point of confusion for me: is Vast AI like RunPod, a cloud providing pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code, and just want to confirm :)

sky/adaptors/vast.py
sky/clouds/vast.py
sky/clouds/vast.py (outdated)
sky/clouds/vast.py
sky/clouds/vast.py (outdated)
sky/provision/vast/utils.py
sky/provision/vast/utils.py
sky/provision/vast/instance.py (outdated)
sky/provision/vast/instance.py
sky/provision/vast/instance.py
@kristopolous
Author

Thanks for contributing to this, @kristopolous! This is really exciting. Left some discussions. One main point of confusion for me: is Vast AI like RunPod, a cloud providing pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code, and just want to confirm :)

Historically, RunPod was a clone of Vast. We currently offer Docker-style containers and will be providing VMs soonish (probably before the end of the year).

@kristopolous kristopolous force-pushed the vast.ai-support branch 3 times, most recently from e9e922a to 4c9aff9 Compare November 21, 2024 22:28
@kristopolous
Author

Getting these tests to pass is blocked by https://github.com/skypilot-org/skypilot-catalog/pull/100/commits

@kristopolous
Author

We only offer GPU instances ... you are free to use the CPU if you'd like but we're a GPU shop

Got it. Feel free to ignore this one.

So most of these are OK now ... I've done a number of fixes to the tests in general ... I can send you my logs but that's just "works on my machine" stuff ... so let me communicate how I'm running these:

10:25 /home/chris/code/skypilot$ source env/bin/activate
(env) 10:25 /home/chris/code/skypilot$ pip3 install -e .
(env) 10:25 /home/chris/code/skypilot$ pytest -v -n 1 tests/test_smoke.py --vast

if we can agree this is sensible, all the tests you asked for should be passing

@cblmemo
Collaborator

cblmemo commented Jan 21, 2025

Seems like there are still some CI tests failing (tests/test_optimizer_random_dag.py). Could you help fix them?

Collaborator

@cblmemo cblmemo left a comment

Thanks for fixing this, @kristopolous! It mostly looks good to me. Left some final nits ;)

tests/smoke_tests/smoke_tests_utils.py
tests/smoke_tests/test_basic.py (outdated)
sky/provision/vast/utils.py (outdated)
sky/provision/vast/utils.py (outdated)
@cblmemo
Collaborator

cblmemo commented Jan 21, 2025

/smoke-test aws

@cblmemo
Collaborator

cblmemo commented Jan 21, 2025

Seems like the smoke test on AWS passed as well. After resolving those final nits it should be ready to go!

@kristopolous
Author

test_optimizer_random_dag

pytest -v -n 1 tests/test_optimizer_random_dag.py --vast

This test passes. Is there anything else we need?

Collaborator

@cblmemo cblmemo left a comment

Thanks for fixing, @kristopolous! Left one final nit. Could you also help resolve the merge conflicts?

Seems like the CI is failing because the catalog is missing. I just merged the catalog PR and I'm running the tests one last time. If they pass, this should be ready to go!

tests/smoke_tests/test_basic.py (outdated)
@cblmemo
Collaborator

cblmemo commented Jan 24, 2025

/smoke-test aws

@cblmemo
Collaborator

cblmemo commented Jan 26, 2025

Seems like tests/test_optimizer_dryruns.py is failing. Could you help fix that?

https://github.com/skypilot-org/skypilot/actions/runs/12971661591/job/36178313189?pr=4365
