Preliminary Vast AI support #4365

Open. Wants to merge 113 commits into base: master.

Commits
a41d633
Preliminary Vast AI support
kristopolous Nov 22, 2024
a3e770f
Update sky/provision/vast/instance.py
kristopolous Nov 19, 2024
d2c4dca
Update sky/clouds/vast.py
kristopolous Nov 19, 2024
5deac26
Update sky/clouds/vast.py
kristopolous Nov 19, 2024
372b860
Update sky/clouds/vast.py
kristopolous Nov 19, 2024
e5f5f3f
Updating the vast dependencies in the setup.py
kristopolous Nov 20, 2024
15b5f5b
Vast: Copy update on object stores
kristopolous Nov 20, 2024
fbeaa14
Vast: update base image dockerhub link
kristopolous Nov 20, 2024
60abaec
Vast: removing errant comment
kristopolous Nov 20, 2024
0f8a035
Vast: provision/utils cleanup of a shallow copy
kristopolous Nov 20, 2024
bb8a6eb
Vast: Simplifying the credential files mount
kristopolous Nov 20, 2024
6107b82
Vast: Linter cleanup
kristopolous Nov 21, 2024
0cba961
Vast: Internal api cleanup
kristopolous Nov 21, 2024
9287709
Vast: Adding the catalog_fetcher
kristopolous Nov 21, 2024
b299ea4
Vast: Linting fixes
kristopolous Nov 21, 2024
b8e3752
Vast: Linting fixes
kristopolous Nov 21, 2024
1bbbbcc
Vast: ordering the ports
kristopolous Nov 22, 2024
7face5e
Vast: Updating a function signature
kristopolous Nov 22, 2024
3eb8823
Vast: comment cleanup
kristopolous Dec 12, 2024
6a6c215
Vast: Adding a comment for disk_size calculation
kristopolous Dec 12, 2024
a9d2ff4
Vast: Comment on the geolocation string processing
kristopolous Dec 12, 2024
fc69f73
Vast: Comment on the rammifications of searching for instances
kristopolous Dec 13, 2024
5f9cc3b
Vast: Leaving a comment for a pylint exception
kristopolous Dec 13, 2024
2f8aa92
Update sky/provision/vast/instance.py
kristopolous Dec 23, 2024
fcc5cef
Update sky/clouds/service_catalog/data_fetchers/fetch_vast.py
kristopolous Dec 23, 2024
94f1155
Update sky/clouds/service_catalog/data_fetchers/fetch_vast.py
kristopolous Dec 23, 2024
d5bd08a
Update sky/clouds/service_catalog/data_fetchers/fetch_vast.py
kristopolous Dec 23, 2024
fc73399
Update sky/clouds/service_catalog/data_fetchers/fetch_vast.py
kristopolous Dec 23, 2024
63dcb90
Update sky/clouds/vast.py
kristopolous Dec 23, 2024
0464809
Vast: updating the catalog fetcher
kristopolous Dec 24, 2024
0e6338e
Fixing a white-space error
kristopolous Dec 24, 2024
f2d8381
Vast: Adding a comment for the instance type
kristopolous Dec 24, 2024
bddc425
Vast: rephrase a docstring
kristopolous Dec 24, 2024
a996037
Vast: Reverting the setup.py
kristopolous Dec 24, 2024
eba7a3a
Vast: file reversion
kristopolous Dec 24, 2024
b6aae10
Vast: Updating the MemoryGiB to reflect the correct value
kristopolous Dec 27, 2024
5f50e84
Vast: Adding authorship to TODO
kristopolous Dec 27, 2024
f80b013
Vast: Filtering instances waiting on startup request
kristopolous Dec 27, 2024
1acbe0c
Vast: Updating the install docs
kristopolous Dec 31, 2024
c5f75bc
Vast: catalog updated to adaptor
kristopolous Dec 31, 2024
20f623d
Vast: fixing a filter_instance typo
kristopolous Jan 1, 2025
40ff2fd
Vast: stating open ports not supported
kristopolous Jan 1, 2025
eef2029
Vast: Adding a requested comment in the fetcher
kristopolous Jan 1, 2025
39479c1
Vast: Comment for the maximum cluster name limit
kristopolous Jan 1, 2025
ee4ab72
Vast: comment about disk space limits
kristopolous Jan 1, 2025
868e3bc
Update sky/clouds/vast.py
kristopolous Jan 3, 2025
41ce708
Update sky/provision/vast/instance.py
kristopolous Jan 3, 2025
0d3f0de
Merge branch 'master' into vast.ai-support
kristopolous Jan 3, 2025
6354ab9
Vast: Fixing a few inconsistencies
kristopolous Jan 4, 2025
2663231
Vast: adding smoke tests and disabling some for vast
kristopolous Jan 7, 2025
5286660
Vast: Updating the ssh key adding to avoid duplicates
kristopolous Jan 7, 2025
5847eaa
Vast: formatter update
kristopolous Jan 7, 2025
ca71190
Vast: Updating catalog emitter with flat pricing
kristopolous Jan 9, 2025
d380ad4
Vast: Using a longer polling loop
kristopolous Jan 9, 2025
eabe4a7
Vast: Updating the spot price to be real
kristopolous Jan 9, 2025
0042829
Vast: Adding spot instance
kristopolous Jan 9, 2025
ed8baa1
Vast: updating test coverage
kristopolous Jan 10, 2025
81d3bdb
Vast: Forcing memory into the instance type
kristopolous Jan 10, 2025
7061d13
Vast: test coverage updates
kristopolous Jan 10, 2025
0ba96e6
Vast: updating test coverage
kristopolous Jan 11, 2025
db81ffb
Vast: formatter cleanup
kristopolous Jan 11, 2025
d9d9029
Merge branch 'master' into vast.ai-support
kristopolous Jan 11, 2025
0132603
trying to resolve weird git issue
kristopolous Jan 11, 2025
4bd5a67
Vast: adding back the code to the controller
kristopolous Jan 11, 2025
03cf1c8
Vast: updating to the new way of doing the controller
kristopolous Jan 11, 2025
7a62207
Vast: updating to the new way of doing the controller
kristopolous Jan 11, 2025
0f5764e
Vast: excluding two more tests
kristopolous Jan 14, 2025
87b1f5f
Merge branch 'master' into vast.ai-support
kristopolous Jan 14, 2025
452b2df
Vast: tests are still not running properly
kristopolous Jan 14, 2025
d15845a
Vast: removing a dynamic port test
kristopolous Jan 14, 2025
c69e902
Vast: removing a dynamic port test
kristopolous Jan 14, 2025
45b22a2
Update sky/clouds/service_catalog/data_fetchers/fetch_vast.py
kristopolous Jan 14, 2025
9bcc68d
Update sky/clouds/service_catalog/data_fetchers/fetch_vast.py
kristopolous Jan 14, 2025
9c2712f
Vast.ai: forcing utils to not be a broken merge again
kristopolous Jan 14, 2025
4a12e81
trying to fix a github issue
kristopolous Jan 14, 2025
6edaaf3
Vast: trying to fix a github bug
kristopolous Jan 14, 2025
ed1d920
Vast: trying to fix a github pr issue
kristopolous Jan 14, 2025
b16d673
Merge branch 'master' into vast.ai-support
kristopolous Jan 15, 2025
77b0d95
Vast: redoing the default image
kristopolous Jan 15, 2025
1736013
Vast: reintroducing dependency
kristopolous Jan 16, 2025
b65fc2d
Vast: launch template cleanup
kristopolous Jan 16, 2025
df361e2
Vast: code formatting
kristopolous Jan 16, 2025
46e1875
Vast: Using the access key and not the master key
kristopolous Jan 16, 2025
f2745d4
Merge branch 'master' into vast.ai-support
kristopolous Jan 16, 2025
7183fb3
Vast: template cleanup
kristopolous Jan 16, 2025
78fdcf6
autostoped => autostopped
kristopolous Jan 16, 2025
652a136
Removing an extra echo from the smoke tests
kristopolous Jan 16, 2025
1f587db
Expose the return code of a failing test to the logs
kristopolous Jan 16, 2025
97292b6
Vast: template cleanup
kristopolous Jan 17, 2025
f391378
Vast: increasing the api version dependency
kristopolous Jan 17, 2025
0971c6c
smoke test Log line: less -> less -r to maintain colors
kristopolous Jan 17, 2025
105e714
Vast: test skipping
kristopolous Jan 17, 2025
51f68f6
Vast: test skipping
kristopolous Jan 17, 2025
9c68da3
Moving "less" suggestion to "less -r" to preserve the ANSI
kristopolous Jan 17, 2025
4b10d91
Vast: test skipping
kristopolous Jan 17, 2025
a1bda5d
Some tests didn't have the right cloud differentiator
kristopolous Jan 17, 2025
2661a4b
Vast: test skipping
kristopolous Jan 18, 2025
51b3a24
Vast: skipping tests
kristopolous Jan 18, 2025
890b11a
linter cleanup
kristopolous Jan 21, 2025
c1ed759
Update sky/provision/vast/utils.py
kristopolous Jan 22, 2025
85b167a
Vast: skipping a test
kristopolous Jan 22, 2025
f83d114
tests: removing a useful extratimeout feature
kristopolous Jan 22, 2025
6fc6a2b
Vast: ssh port code fix
kristopolous Jan 22, 2025
8280216
Vast: ssh port code fix
kristopolous Jan 22, 2025
e14548c
Vast: removing some unused code from the test
kristopolous Jan 25, 2025
efe90d6
Merge branch 'master' into vast.ai-support
kristopolous Jan 26, 2025
71062c2
Merge branch 'master' into vast.ai-support
kristopolous Jan 28, 2025
028e09c
Vastai: newer sdk with debugging ability
kristopolous Jan 28, 2025
6abf1f5
Vastai: force the cpu_ram to match in the search_offers
kristopolous Jan 28, 2025
5832ae6
Vast: formatting update
kristopolous Jan 28, 2025
595caaf
Vast: limiting the catalog to fewer matches
kristopolous Jan 28, 2025
abfc047
Vast: enforcing memory requirements when considering instances
kristopolous Jan 28, 2025
edda5f8
Vast: requiring only 80GB space instead of 256 for catalog inclusion
kristopolous Jan 28, 2025
10 changes: 10 additions & 0 deletions docs/source/getting-started/installation.rst
Original file line number Diff line number Diff line change
@@ -297,6 +297,16 @@ Paperspace
   mkdir -p ~/.paperspace
   echo "{'api_key' : <your_api_key_here>}" > ~/.paperspace/config.json

+Vast
+~~~~~~~~~~
+
+`Vast <https://vast.ai/>`__ is a cloud provider that offers low-cost GPUs. To configure Vast access, go to the `Account <https://cloud.vast.ai/account/>`_ page on your Vast console to get your **API key**. Then, run:
+
+.. code-block:: shell
+
+  pip install "vastai-sdk>=0.1.3"
+  echo "<your_api_key_here>" > ~/.vast_api_key
+
 RunPod
 ~~~~~~~~~~

2 changes: 2 additions & 0 deletions sky/__init__.py
@@ -133,6 +133,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]):
 OCI = clouds.OCI
 Paperspace = clouds.Paperspace
 RunPod = clouds.RunPod
+Vast = clouds.Vast
 Vsphere = clouds.Vsphere
 Fluidstack = clouds.Fluidstack
 optimize = Optimizer.optimize
@@ -150,6 +151,7 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]):
     'OCI',
     'Paperspace',
     'RunPod',
+    'Vast',
     'SCP',
     'Vsphere',
     'Fluidstack',
29 changes: 29 additions & 0 deletions sky/adaptors/vast.py
@@ -0,0 +1,29 @@
"""Vast cloud adaptor."""

import functools

_vast_sdk = None


def import_package(func):

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _vast_sdk

        if _vast_sdk is None:
            try:
                import vastai_sdk as _vast  # pylint: disable=import-outside-toplevel
                _vast_sdk = _vast.VastAI()
            except ImportError:
                raise ImportError('Failed to import dependencies for vast. '
                                  'Try pip install "skypilot[vast]"') from None
        return func(*args, **kwargs)

    return wrapper


@import_package
def vast():
    """Return the vast package."""
    return _vast_sdk
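
The adaptor above defers importing the SDK until the first call. A minimal sketch of that lazy-import pattern follows. It substitutes the standard-library `json` module for `vastai_sdk` (an assumption made only so the snippet runs without the SDK installed), and the names `_sdk` and `sdk` are hypothetical.

```python
import functools
import importlib

_sdk = None  # Cached module; stays None until the first decorated call.


def import_package(func):
    """Import the backing package on first use, then reuse the cached copy."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _sdk
        if _sdk is None:
            try:
                # 'json' is a stand-in for the real SDK module (assumption).
                _sdk = importlib.import_module('json')
            except ImportError:
                raise ImportError('Failed to import dependencies. '
                                  'Try installing the optional extra.') from None
        return func(*args, **kwargs)

    return wrapper


@import_package
def sdk():
    """Return the lazily imported module."""
    return _sdk
```

Because the import happens inside `wrapper`, importing the adaptor module itself never fails when the optional dependency is missing; the error only surfaces when the cloud is actually used.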
18 changes: 18 additions & 0 deletions sky/authentication.py
@@ -43,6 +43,7 @@
 from sky.adaptors import ibm
 from sky.adaptors import kubernetes
 from sky.adaptors import runpod
+from sky.adaptors import vast
 from sky.provision.fluidstack import fluidstack_utils
 from sky.provision.kubernetes import utils as kubernetes_utils
 from sky.provision.lambda_cloud import lambda_utils
@@ -485,6 +486,23 @@ def setup_runpod_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
     return configure_ssh_info(config)


+def setup_vast_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
+    """Sets up SSH authentication for Vast.
+    - Generates a new SSH key pair if one does not exist.
+    - Adds the public SSH key to the user's Vast account.
+    """
+    _, public_key_path = get_or_generate_keys()
+    with open(public_key_path, 'r', encoding='UTF-8') as pub_key_file:
+        public_key = pub_key_file.read().strip()
+        current_key_list = vast.vast().show_ssh_keys()  # pylint: disable=assignment-from-no-return
+        # Only add an ssh key if it hasn't already been added
+        if not any(x['public_key'] == public_key for x in current_key_list):
+            vast.vast().create_ssh_key(ssh_key=public_key)
+
+    config['auth']['ssh_public_key'] = PUBLIC_SSH_KEY_PATH
+    return configure_ssh_info(config)
+
+
 def setup_fluidstack_authentication(config: Dict[str, Any]) -> Dict[str, Any]:

     get_or_generate_keys()
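
The duplicate check in `setup_vast_authentication` makes key registration idempotent. The sketch below isolates that check against an in-memory stand-in. `FakeKeyStore` and `ensure_key_registered` are hypothetical names; only the method names `show_ssh_keys` and `create_ssh_key` and the `public_key` field are taken from the diff, and the real SDK's return shapes are not verified here.

```python
from typing import Dict, List


class FakeKeyStore:
    """Hypothetical in-memory stand-in for an account's SSH key API."""

    def __init__(self) -> None:
        self._keys: List[Dict[str, str]] = []

    def show_ssh_keys(self) -> List[Dict[str, str]]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._keys)

    def create_ssh_key(self, ssh_key: str) -> None:
        self._keys.append({'public_key': ssh_key})


def ensure_key_registered(store: FakeKeyStore, public_key: str) -> bool:
    """Register public_key only if absent; return True when it was added."""
    public_key = public_key.strip()
    if any(k['public_key'] == public_key for k in store.show_ssh_keys()):
        return False
    store.create_ssh_key(ssh_key=public_key)
    return True


store = FakeKeyStore()
added_first = ensure_key_registered(store, 'ssh-ed25519 AAAA user@host\n')
added_again = ensure_key_registered(store, 'ssh-ed25519 AAAA user@host')
```

Stripping whitespace before comparing matters: a key file usually ends with a newline, and without the strip the same key would be registered again on every launch.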
5 changes: 4 additions & 1 deletion sky/backends/backend_utils.py
@@ -1056,6 +1056,8 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
         config = auth.setup_ibm_authentication(config)
     elif isinstance(cloud, clouds.RunPod):
         config = auth.setup_runpod_authentication(config)
+    elif isinstance(cloud, clouds.Vast):
+        config = auth.setup_vast_authentication(config)
     elif isinstance(cloud, clouds.Fluidstack):
         config = auth.setup_fluidstack_authentication(config)
     else:
@@ -2135,7 +2137,8 @@ def run_ray_status_to_check_ray_cluster_healthy() -> bool:
     except exceptions.CommandError as e:
         success = False
         if e.returncode == 255:
-            logger.debug(f'The cluster is likely {noun}ed.')
+            word = 'autostopped' if noun == 'autostop' else 'autodowned'
+            logger.debug(f'The cluster is likely {word}.')
             reset_local_autostop = False
     except (Exception, SystemExit) as e:  # pylint: disable=broad-except
         success = False
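
The second hunk above replaces string concatenation with an explicit word choice. A small sketch of why: appending `ed` to `autostop` drops the doubled consonant. The helper name `past_participle` is hypothetical, and `'autodown'` as the other possible value of `noun` is an assumption, since the diff only shows the `'autostop'` comparison.

```python
def past_participle(noun: str) -> str:
    """Mirror the conditional in the diff: pick the correctly spelled word."""
    return 'autostopped' if noun == 'autostop' else 'autodowned'


# The old f-string approach simply appended 'ed' to the noun.
old_autostop = f"{'autostop'}ed"
old_autodown = f"{'autodown'}ed"

new_autostop = past_participle('autostop')
new_autodown = past_participle('autodown')
```

Only the `autostop` case changes; for the assumed `autodown` value the old and new spellings agree.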
1 change: 1 addition & 0 deletions sky/backends/cloud_vm_ray_backend.py
@@ -187,6 +187,7 @@ def _get_cluster_config_template(cloud):
         clouds.RunPod: 'runpod-ray.yml.j2',
         clouds.Kubernetes: 'kubernetes-ray.yml.j2',
         clouds.Vsphere: 'vsphere-ray.yml.j2',
+        clouds.Vast: 'vast-ray.yml.j2',
         clouds.Fluidstack: 'fluidstack-ray.yml.j2'
     }
     return cloud_to_template[type(cloud)]
2 changes: 2 additions & 0 deletions sky/clouds/__init__.py
@@ -25,6 +25,7 @@
 from sky.clouds.paperspace import Paperspace
 from sky.clouds.runpod import RunPod
 from sky.clouds.scp import SCP
+from sky.clouds.vast import Vast
 from sky.clouds.vsphere import Vsphere

 __all__ = [
@@ -39,6 +40,7 @@
     'Paperspace',
     'SCP',
     'RunPod',
+    'Vast',
     'OCI',
     'Vsphere',
     'Kubernetes',
2 changes: 1 addition & 1 deletion sky/clouds/service_catalog/constants.py
@@ -3,5 +3,5 @@
 CATALOG_SCHEMA_VERSION = 'v6'
 CATALOG_DIR = '~/.sky/catalogs'
 ALL_CLOUDS = ('aws', 'azure', 'gcp', 'ibm', 'lambda', 'scp', 'oci',
-              'kubernetes', 'runpod', 'vsphere', 'cudo', 'fluidstack',
+              'kubernetes', 'runpod', 'vast', 'vsphere', 'cudo', 'fluidstack',
               'paperspace', 'do')
112 changes: 112 additions & 0 deletions sky/clouds/service_catalog/data_fetchers/fetch_vast.py
@@ -0,0 +1,112 @@
"""A script that generates the Vast Cloud catalog."""

#
# Due to the design of the sdk, pylint has a false
# positive for the functions.
#
# pylint: disable=assignment-from-no-return
import collections
import csv
import json
import math
import re
import sys
from typing import Any, Dict, List

from sky.adaptors import vast

_map = {
    'TeslaV100': 'V100',
    'TeslaT4': 'T4',
    'TeslaP100': 'P100',
    'QRTX6000': 'RTX6000',
    'QRTX8000': 'RTX8000'
}


def create_instance_type(obj: Dict[str, Any]) -> str:
    stubify = lambda x: re.sub(r'\s', '_', x)
    return '{}x-{}-{}-{}'.format(obj['num_gpus'], stubify(obj['gpu_name']),
                                 obj['cpu_cores'], obj['cpu_ram'])


def dot_get(d: dict, key: str) -> Any:
    for k in key.split('.'):
        d = d[k]
    return d


if __name__ == '__main__':
    # InstanceType and GpuInfo are basically just stubs
    # so that the DictWriter is happy without weird
    # code.
    mapped_keys = (('gpu_name', 'InstanceType'), ('gpu_name',
                                                  'AcceleratorName'),
                   ('num_gpus', 'AcceleratorCount'), ('cpu_cores', 'vCPUs'),
                   ('cpu_ram', 'MemoryGiB'), ('gpu_name', 'GpuInfo'),
                   ('search.totalHour', 'Price'), ('min_bid', 'SpotPrice'),
                   ('geolocation', 'Region'))
    writer = csv.DictWriter(sys.stdout, fieldnames=[x[1] for x in mapped_keys])
    writer.writeheader()

    # Vast has a wide variety of machines, some of
    # which will have less disk space and network
    # bandwidth than others.
    offerList = vast.vast().search_offers(
        query='inet_down >= 100 disk_space >= 80', limit=10000)
    priceMap: Dict[str, List] = collections.defaultdict(list)
    for offer in offerList:
        entry = {}
        for ours, theirs in mapped_keys:
            field = dot_get(offer, ours)
            entry[theirs] = field

        instance_type = create_instance_type(offer)
        entry['InstanceType'] = instance_type

        # The documentation gives GpuInfo in this shape:
        # "{'gpus': [{
        #   'name': 'v100',
        #   'manufacturer': 'nvidia',
        #   'count': 8.0,
        #   'memoryinfo': {'sizeinmib': 16384}
        # }],
        # 'totalgpumemoryinmib': 16384}",
        # so the GpuInfo entry built below follows it.
        # cpu_ram is divided by 1024 to fill the MemoryGiB column.
        entry['MemoryGiB'] /= 1024

        gpu = re.sub('Ada', '-Ada', re.sub(r'\s', '', offer['gpu_name']))
        gpu = re.sub(r'(Ti|PCIE|SXM4|SXM|NVL)$', '', gpu)
        gpu = re.sub(r'(RTX\d0\d0)(S|D)$', r'\1', gpu)

        if gpu in _map:
            gpu = _map[gpu]

        entry['AcceleratorName'] = gpu
        entry['GpuInfo'] = json.dumps({
            'Gpus': [{
                'Name': gpu,
                'Count': offer['num_gpus'],
                'MemoryInfo': {
                    'SizeInMiB': offer['gpu_total_ram']
                }
            }],
            'TotalGpuMemoryInMiB': offer['gpu_total_ram']
        }).replace('"', '\'')

        priceMap[instance_type].append(entry)

    for instanceList in priceMap.values():
        priceList = sorted([x['Price'] for x in instanceList])
        index = math.ceil(0.8 * len(priceList)) - 1
        priceTarget = priceList[index]
        toList: List = []
        for instance in instanceList:
            if instance['Price'] <= priceTarget:
                instance['Price'] = '{:.2f}'.format(priceTarget)
                toList.append(instance)

        maxBid = max([x.get('SpotPrice') for x in toList])
        for instance in toList:
            instance['SpotPrice'] = '{:.2f}'.format(maxBid)
            writer.writerow(instance)
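
Two steps in the fetcher above are easier to follow in isolation: normalizing GPU names and picking a flat price per instance type. The sketch below copies the regular expressions, the alias table, and the 0.8 index formula from the diff. The function names `normalize_gpu_name` and `price_target` are hypothetical, and the sample GPU strings and prices are illustrative inputs, not confirmed Vast API output.

```python
import math
import re
from typing import List

# Alias table copied from the fetcher's _map.
_ALIASES = {
    'TeslaV100': 'V100',
    'TeslaT4': 'T4',
    'TeslaP100': 'P100',
    'QRTX6000': 'RTX6000',
    'QRTX8000': 'RTX8000',
}


def normalize_gpu_name(raw: str) -> str:
    """Apply the fetcher's three substitutions, then the alias table."""
    # Drop whitespace and set the 'Ada' suffix apart with a hyphen.
    name = re.sub('Ada', '-Ada', re.sub(r'\s', '', raw))
    # Strip trailing form-factor or variant suffixes.
    name = re.sub(r'(Ti|PCIE|SXM4|SXM|NVL)$', '', name)
    # Fold RTX 'S'/'D' variants into the base model.
    name = re.sub(r'(RTX\d0\d0)(S|D)$', r'\1', name)
    return _ALIASES.get(name, name)


def price_target(prices: List[float], quantile: float = 0.8) -> float:
    """Return the price at the given quantile position of the sorted list."""
    ordered = sorted(prices)
    index = math.ceil(quantile * len(ordered)) - 1
    return ordered[index]


names = [normalize_gpu_name(n) for n in
         ('Tesla V100', 'A100 SXM4', 'RTX 6000Ada', 'RTX 4090D')]
target = price_target([0.10, 0.20, 0.30, 0.40, 1.50])
```

In the fetcher, offers priced above the target are dropped and the rest are all listed at the target price, so one outlier offer (1.50 in the sample) does not set the catalog price for the whole instance type.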
104 changes: 104 additions & 0 deletions sky/clouds/service_catalog/vast_catalog.py
@@ -0,0 +1,104 @@
""" Vast | Catalog

This module loads the service catalog file and can be used to
query instance types and pricing information for Vast.ai.
"""

import typing
from typing import Dict, List, Optional, Tuple, Union

from sky.clouds.service_catalog import common
from sky.utils import ux_utils

if typing.TYPE_CHECKING:
    from sky.clouds import cloud

_df = common.read_catalog('vast/vms.csv')


def instance_type_exists(instance_type: str) -> bool:
    return common.instance_type_exists_impl(_df, instance_type)


def validate_region_zone(
        region: Optional[str],
        zone: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    if zone is not None:
        with ux_utils.print_exception_no_traceback():
            raise ValueError('Vast does not support zones.')
    return common.validate_region_zone_impl('vast', _df, region, zone)


def get_hourly_cost(instance_type: str,
                    use_spot: bool = False,
                    region: Optional[str] = None,
                    zone: Optional[str] = None) -> float:
    """Returns the cost, or the cheapest cost among all zones for spot."""
    if zone is not None:
        with ux_utils.print_exception_no_traceback():
            raise ValueError('Vast does not support zones.')
    return common.get_hourly_cost_impl(_df, instance_type, use_spot, region,
                                       zone)


def get_vcpus_mem_from_instance_type(
        instance_type: str) -> Tuple[Optional[float], Optional[float]]:
    return common.get_vcpus_mem_from_instance_type_impl(_df, instance_type)


def get_default_instance_type(cpus: Optional[str] = None,
                              memory: Optional[str] = None,
                              disk_tier: Optional[str] = None) -> Optional[str]:
    del disk_tier
    # NOTE: After expanding catalog to multiple entries, you may
    # want to specify a default instance type or family.
    return common.get_instance_type_for_cpus_mem_impl(_df, cpus, memory)


def get_accelerators_from_instance_type(
        instance_type: str) -> Optional[Dict[str, Union[int, float]]]:
    return common.get_accelerators_from_instance_type_impl(_df, instance_type)


def get_instance_type_for_accelerator(
        acc_name: str,
        acc_count: int,
        cpus: Optional[str] = None,
        memory: Optional[str] = None,
        use_spot: bool = False,
        region: Optional[str] = None,
        zone: Optional[str] = None) -> Tuple[Optional[List[str]], List[str]]:
    """Returns a list of instance types that have the given accelerator."""
    if zone is not None:
        with ux_utils.print_exception_no_traceback():
            raise ValueError('Vast does not support zones.')
    return common.get_instance_type_for_accelerator_impl(df=_df,
                                                         acc_name=acc_name,
                                                         acc_count=acc_count,
                                                         cpus=cpus,
                                                         memory=memory,
                                                         use_spot=use_spot,
                                                         region=region,
                                                         zone=zone)


def get_region_zones_for_instance_type(instance_type: str,
                                       use_spot: bool) -> List['cloud.Region']:
    df = _df[_df['InstanceType'] == instance_type]
    return common.get_region_zones(df, use_spot)


# TODO: this differs from the fluffy catalog version
def list_accelerators(
        gpus_only: bool,
        name_filter: Optional[str],
        region_filter: Optional[str],
        quantity_filter: Optional[int],
        case_sensitive: bool = True,
        all_regions: bool = False,
        require_price: bool = True) -> Dict[str, List[common.InstanceTypeInfo]]:
    """Returns all instance types in Vast offering GPUs."""
    del require_price  # Unused.
    return common.list_accelerators_impl('Vast', _df, gpus_only, name_filter,
                                         region_filter, quantity_filter,
                                         case_sensitive, all_regions)
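
Several catalog functions above share one guard: reject any `zone` argument, then delegate to a shared lookup. The sketch below shows that shape with a plain list in place of the catalog DataFrame. The rows in `_ROWS`, their instance type and region values, and the choice of `min` over matching rows are all assumptions for illustration; the behavior of `common.get_hourly_cost_impl` is not shown in the diff.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, float]]

# Hypothetical rows; the real module reads its data from 'vast/vms.csv'.
_ROWS: List[Row] = [
    {'InstanceType': '1x-RTX_4090-16-64', 'Region': 'US',
     'Price': 0.40, 'SpotPrice': 0.20},
    {'InstanceType': '1x-RTX_4090-16-64', 'Region': 'SE',
     'Price': 0.35, 'SpotPrice': 0.25},
]


def get_hourly_cost(instance_type: str,
                    use_spot: bool = False,
                    region: Optional[str] = None,
                    zone: Optional[str] = None) -> float:
    """Reject zones first, then return the cheapest matching price."""
    if zone is not None:
        raise ValueError('Vast does not support zones.')
    column = 'SpotPrice' if use_spot else 'Price'
    prices = [
        float(row[column])
        for row in _ROWS
        if row['InstanceType'] == instance_type and
        (region is None or row['Region'] == region)
    ]
    if not prices:
        raise ValueError(f'No catalog entry for {instance_type!r}.')
    # Taking the minimum is an assumption of this sketch.
    return min(prices)
```

Failing fast on `zone` keeps the error message specific to Vast instead of letting a generic "zone not found" error surface from the shared lookup.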