Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support Tailscale VPN #4025

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

Conless
Copy link
Contributor

@Conless Conless commented Oct 2, 2024

This pull request is the successor of #3989. It integrates the Tailscale VPN in the core logic of SkyPilot, and designs a cloud-independent implementation that allows users to connect to any cloud instance they launch via SkyPilot through Tailscale.

Now you only need to set these environment variables:

TAILSCALE_AUTH_KEY=tskey-auth-xxx
TAILSCALE_API_KEY=tskey-api-xxx
[email protected]

and add the vpn field in config yaml:

# example.yaml
name: example

vpn:
  tailscale: true

run: |
  echo "Hello, SkyPilot!"

or the same in SkyServe:

# hello-sky-serve.yaml
service:
  readiness_probe: /
  replicas: 2

vpn:
  tailscale: true

resources:
  ports: 8080
  cpus: 2

workdir: .
  
run: python -m http.server 8080

Then all the launched instances will be within the private network. For example, in SkyServe, the sky serve status output will be like this if VPN is enabled.

Services
NAME              VERSION  UPTIME   STATUS  REPLICAS  ENDPOINT            
sky-service-1848  1        13m 51s  READY   2/2       100.xx.xxx.xx:30001  

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                   LAUNCHED     RESOURCES       STATUS  REGION       
sky-service-1848  1   1        http://100.xx.xxx.xx:8080  13 mins ago  1x GCP(vCPU=2)  READY   us-central1  
sky-service-1848  2   1        http://100.xx.xxx.xx:8080  14 mins ago  1x GCP(vCPU=2)  READY   us-central1

Then if you run

ssh sky-serve-controller-xxxxxx
curl http://100.xx.xxx.xx:30001
curl http://100.xx.xxx.xx:8080

they all go through private network and work well.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cblmemo cblmemo self-requested a review October 2, 2024 18:18
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature @Conless ! This would be really helpful for our users. Several things to double-check:

  • We might not want to open any ports if the vpn is enabled
  • Make sure multi-node cluster works
  • Investigate if it is possible to reduce the number of API key. Earlier this year it requires two types of API key to perform all operations we need, but not sure if the expand their api library.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/serve/core.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
@Conless
Copy link
Contributor Author

Conless commented Oct 3, 2024

Thanks for your comment and suggestions @cblmemo! I've resolved most of them, but a few remain for further discussion:

  • Should we allow users to provide only an authentication key? My thinking is that some users might not want to provide an API key due to privacy and security concerns. That's why, in the current design, I use the API key only in the auto-remove process (where it's necessary) and don't raise an error if it's not provided. In the part of getting the VPN IP, I first attempt to use the API, and if it's disabled, I fall back to the local Tailscale command. However, if you think it's rare that users will avoid providing an API key, I could simplify the logic to rely on the API only.
  • Should we consolidate the VPN configuration into a single class for simplicity, or should we keep the current design for the sake of extensibility?

@cblmemo
Copy link
Collaborator

cblmemo commented Oct 4, 2024

Thanks for your comment and suggestions @cblmemo! I've resolved most of them, but a few remain for further discussion:

  • Should we allow users to provide only an authentication key? My thinking is that some users might not want to provide an API key due to privacy and security concerns. That's why, in the current design, I use the API key only in the auto-remove process (where it's necessary) and don't raise an error if it's not provided. In the part of getting the VPN IP, I first attempt to use the API, and if it's disabled, I fall back to the local Tailscale command. However, if you think it's rare that users will avoid providing an API key, I could simplify the logic to rely on the API only.
  • Should we consolidate the VPN configuration into a single class for simplicity, or should we keep the current design for the sake of extensibility?

Thanks for the quick fix @Conless ! For the first one, lets use API only as currently we also need the user to provide this key anyway. I dont think we can skip the removal process without raise an error as this will leave some unhandled remains in user's infra and our user does not like that. Providing fewer keys is both more secure and more simple for the users. We can wait for user's feedback on if we want to move it back ;)
For the second one, let's keep the current design to keep the extensibility. Please add some docstr on the two classes you use and explain why you need both of them :))

@Conless
Copy link
Contributor Author

Conless commented Oct 6, 2024

Hi @cblmemo ! I've updated the implementation based on your suggestions, and finished those tests

  • We might not want to open any ports if the vpn is enabled
  • Make sure multi-node cluster works

Specifically, I tried to launch a cluster with two nodes and serve a task on AWS, and (1) they can be launched and accessed with the VPN IP shown in sky status; (2) their ports are not accessible via the public IP; (3) their VPN configs are correctly removed after they are shut down.

Are there any other things that you think may need double-check?

@cblmemo
Copy link
Collaborator

cblmemo commented Oct 6, 2024

Hi @cblmemo ! I've updated the implementation based on your suggestions, and finished those tests

  • We might not want to open any ports if the vpn is enabled
  • Make sure multi-node cluster works

Specifically, I tried to launch a cluster with two nodes and serve a task on AWS, and (1) they can be launched and accessed with the VPN IP shown in sky status; (2) their ports are not accessible via the public IP; (3) their VPN configs are correctly removed after they are shut down.

Are there any other things that you think may need double-check?

Thanks for the test @Conless ! It looks great to me. One last thing that might need to double check is that if the distributed inference on multi-node is still working on this setting ;)

@Conless
Copy link
Contributor Author

Conless commented Oct 7, 2024

One last thing that might need to double check is that if the distributed inference on multi-node is still working on this setting ;)

Test done @cblmemo ! I've tried to train ResNet model on 2 nodes and serve vicuna on V100:4 using vLLM, and it seems that they all work well.

@Conless Conless requested a review from cblmemo October 7, 2024 15:55
@cblmemo
Copy link
Collaborator

cblmemo commented Oct 7, 2024

Lets add a todo for adding this config to sky launch / sky serve up cli, like --gpus.

@cblmemo
Copy link
Collaborator

cblmemo commented Oct 7, 2024

I tried this and check the cloud console, and seems like we still created a dedicated security group for this instance, though no extra port is opened. Can we make it to reuse the universal security group? If we are using a dedicated security group, we need to wait for terminate resource group, which is very time-consuming. Same goes for GCP and other cloud implementations

resources:
  cloud: aws
  cpus: 2
  ports: 18273

vpn:
  tailscale: true

run: python -m http.server 18273

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature @Conless ! Left some discussion :))

sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/templates/sky-serve-controller.yaml.j2 Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
sky/utils/vpn_utils.py Outdated Show resolved Hide resolved
@cblmemo
Copy link
Collaborator

cblmemo commented Oct 7, 2024

Also, please investigate if TailScale has a standard env var names - I vaguely remember there is sth like TS_AUTHKEY. Lets align with them

@Conless
Copy link
Contributor Author

Conless commented Oct 28, 2024

After conducting lots of tests, I believe the hostname issue on TPU VMs has been fixed. This is the tailscale status output after launching a tpu-v2-32:1 cluster:

<ip> sky-bfe8-conless-xxx-tpu-0-xxxx conlesspan@  linux   -
<ip>   sky-bfe8-conless-xxx-tpu-1-xxxx conlesspan@  linux   -
<ip>  sky-bfe8-conless-xxx-tpu-2-xxxx conlesspan@  linux   -
<ip>   sky-bfe8-conless-xxx-tpu-3-xxxx conlesspan@  linux   -

and these entries are also correctly removed after sky down.

@Conless Conless requested a review from cblmemo October 28, 2024 06:22
@cblmemo
Copy link
Collaborator

cblmemo commented Oct 28, 2024

Hi @Conless ! Could you resolve the merge conflict when you have time?

sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/utils/controller_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @Conless ! Mostly looks good to me. Will do some test later. Left some nits first ;)

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/provision/common.py Show resolved Hide resolved
@cblmemo
Copy link
Collaborator

cblmemo commented Nov 18, 2024

Hi @Conless , I tried this YAML and it seems like our code still created a dedicated security groups for this cluster. Could you help fix this?

Also, it would be great to resolve the merge conflict :))

resources:
  cloud: aws
  cpus: 2
  ports: 18273

vpn:
  tailscale: true

run: python -m http.server 18273
$ sky launch @temp/a.yaml -c t-vpn 
Task from YAML spec: @temp/a.yaml
Considered resources (1 node):
----------------------------------------------------------------------------------------
 CLOUD   INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
----------------------------------------------------------------------------------------
 AWS     m6i.large   2       8         -              us-east-1     0.10          ✔     
----------------------------------------------------------------------------------------
Launching a new cluster 't-vpn'. Proceed? [Y/n]: 
⚙︎ Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
└── Instance is up.
✓ Cluster launched: t-vpn.  View logs at: ~/sky_logs/sky-2024-11-18-14-40-15-754049/provision.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(task, pid=2465) Serving HTTP on 0.0.0.0 port 18273 (http://0.0.0.0:18273/) ...
(task, pid=2465) 100.77.86.89 - - [18/Nov/2024 22:42:43] "GET / HTTP/1.1" 200 -
image

@Conless
Copy link
Contributor Author

Conless commented Nov 19, 2024

Hi @Conless , I tried this YAML and it seems like our code still created a dedicated security groups for this cluster. Could you help fix this?

Hi @cblmemo ! After investigating this issue, I found that the creation of the security group is not because of the _open_ports() logic mentioned before, but because of

skypilot/sky/clouds/aws.py

Lines 440 to 443 in 6c02197

if resources.ports is not None:
# Already checked in Resources._try_validate_ports
security_group = USER_PORTS_SECURITY_GROUP_NAME.format(
cluster_name.display_name)

I pushed a quick fix that can solve this issue, by removing resources.ports when VPN is enabled.

sky/task.py Outdated
Comment on lines 492 to 493
if task.vpn_config is not None:
resources_config.pop('ports', None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we should pop the resources section as a whole - this will also be displayed at sky status and user might get confused about it. Could we change our provision logic instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump on this comment: it seems like we are still not showing the ports field in the resources after provisioning.

(sky-serve) ➜  skypilot git:(vpn-enhanced) sky launch @temp/vpn.yaml
Task from YAML spec: @temp/vpn.yaml
Considered resources (1 node):
----------------------------------------------------------------------------------------
 CLOUD   INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
----------------------------------------------------------------------------------------
 AWS     m6i.large   2       8         -              us-east-1     0.10          ✔     
----------------------------------------------------------------------------------------
Launching a new cluster 'sky-8672-txia'. Proceed? [Y/n]: 
⚙︎ Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
└── Instance is up.
⠧ Preparing SkyPilot runtime (3/3 - runtime)  View logs at: ~/sky_logs/sky-2024-12-21-03-18
⠏ Preparing SkyPilot runtime (3/3 - runtime)  View logs at: ~/sky_logs/sky-2024-12-21-03-18
⠋ Preparing SkyPilot runtime (3/3 - runtime)  View logs at: ~/sky_logs/sky-2024-12-21-03-18
✓ Cluster launched: sky-8672-txia.  View logs at: ~/sky_logs/sky-2024-12-21-03-18-33-905960/provision.log
Run commands not specified or empty.

Cluster name: sky-8672-txia
├── To log into the head VM:    ssh sky-8672-txia
├── To submit a job:            sky exec sky-8672-txia yaml_file
├── To stop the cluster:        sky stop sky-8672-txia
└── To teardown the cluster:    sky down sky-8672-txia

(sky-serve) ➜  skypilot git:(vpn-enhanced) sky status sky-8672-txia
Clusters
NAME           LAUNCHED        RESOURCES          STATUS  AUTOSTOP  COMMAND                    
sky-8672-txia  a few secs ago  1x AWS(m6i.large)  UP      -         sky launch @temp/vpn.yaml
# @temp/vpn.yaml
resources:
  cloud: aws
  cpus: 2
  ports:
    - 8080
    - 7000-7034
    - 60000
vpn:
  tailscale: true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems that the resources shown in sky status are exactly the resources we pass to the provisioned... let me find another solution

@cblmemo cblmemo mentioned this pull request Jan 8, 2025
17 tasks
@cblmemo cblmemo self-requested a review January 8, 2025 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants