[Core] Support Tailscale VPN #4025
Thanks for adding this feature @Conless! This would be really helpful for our users. Several things to double-check:
- We might not want to open any ports if the VPN is enabled.
- Make sure a multi-node cluster works.
- Investigate whether it is possible to reduce the number of API keys. Earlier this year it required two types of API key to perform all the operations we need, but I am not sure whether they have expanded their API library since.
Thanks for your comment and suggestions @cblmemo! I've resolved most of them, but a few remain for further discussion:
Thanks for the quick fix @Conless! For the first one, let's use the API key only, since we currently need the user to provide this key anyway. I don't think we can skip the removal process without raising an error, as this would leave unhandled remnants in the user's infra, and our users do not like that. Providing fewer keys is both more secure and simpler for the users. We can wait for user feedback on whether we want to move it back ;)
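To make the "do not skip removal silently" point concrete, here is a minimal sketch, not the PR's actual code. It assumes a hypothetical `remove_device` callable that deletes one VPN device and raises on failure; `VpnCleanupError` and `remove_vpn_devices` are illustrative names as well. The idea is to attempt every removal, then raise if any of them failed, so that leftovers are reported rather than ignored.

```python
from typing import Callable, Iterable, List


class VpnCleanupError(RuntimeError):
    """Raised when one or more VPN devices could not be removed."""


def remove_vpn_devices(device_ids: Iterable[str],
                       remove_device: Callable[[str], None]) -> None:
    """Try to remove every device; raise if any removal failed.

    `remove_device` is a hypothetical callable (an assumption for this
    sketch) that removes a single device and raises on failure.
    """
    failed: List[str] = []
    for device_id in device_ids:
        try:
            remove_device(device_id)
        except Exception as e:  # Collect failures and keep cleaning up.
            failed.append(f'{device_id}: {e}')
    if failed:
        # Surface the leftovers instead of silently skipping them.
        raise VpnCleanupError('Failed to remove VPN devices: ' +
                              '; '.join(failed))
```

Collecting failures before raising means one bad device does not prevent the others from being cleaned up.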
Hi @cblmemo! I've updated the implementation based on your suggestions and finished those tests.
Specifically, I tried to launch a cluster with two nodes and serve a task on AWS, and (1) they can be launched and accessed with the VPN IP shown in
Are there any other things that you think may need a double-check?
Thanks for the test @Conless! It looks great to me. One last thing that might need a double-check is whether distributed inference on multi-node still works with this setting ;)
Test done @cblmemo! I've tried training a ResNet model on 2 nodes and serving Vicuna on V100:4 using vLLM, and both seem to work well.
Let's add a TODO for adding this config to
I tried this and checked the cloud console, and it seems like we still create a dedicated security group for this instance, though no extra port is opened. Can we make it reuse the universal security group? If we use a dedicated security group, we need to wait for that group to be terminated, which is very time-consuming. The same goes for GCP and other cloud implementations.
resources:
  cloud: aws
  cpus: 2
  ports: 18273
vpn:
  tailscale: true
run: python -m http.server 18273
Thanks for adding this feature @Conless! Left some discussion :))
Also, please investigate whether Tailscale has standard env var names - I vaguely remember there is something like
After conducting lots of tests, I believe the hostname issue on TPU VMs has been fixed. This is the
and these entries are also correctly removed after
Hi @Conless! Could you resolve the merge conflict when you have time?
Thanks for adding this @Conless! Mostly looks good to me. Will do some tests later. Left some nits first ;)
Hi @Conless, I tried this YAML and it seems like our code still creates a dedicated security group for this cluster. Could you help fix this? Also, it would be great to resolve the merge conflict :))
resources:
  cloud: aws
  cpus: 2
  ports: 18273
vpn:
  tailscale: true
run: python -m http.server 18273
$ sky launch @temp/a.yaml -c t-vpn
Task from YAML spec: @temp/a.yaml
Considered resources (1 node):
----------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------
AWS m6i.large 2 8 - us-east-1 0.10 ✔
----------------------------------------------------------------------------------------
Launching a new cluster 't-vpn'. Proceed? [Y/n]:
⚙︎ Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
└── Instance is up.
✓ Cluster launched: t-vpn. View logs at: ~/sky_logs/sky-2024-11-18-14-40-15-754049/provision.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(task, pid=2465) Serving HTTP on 0.0.0.0 port 18273 (http://0.0.0.0:18273/) ...
(task, pid=2465) 100.77.86.89 - - [18/Nov/2024 22:42:43] "GET / HTTP/1.1" 200 -
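The client address in the last log line (100.77.86.89) falls inside the CGNAT range 100.64.0.0/10 that Tailscale uses for node IPv4 addresses, which suggests the request arrived over the VPN rather than through a public port. A small sketch of that check follows; the helper name `is_tailscale_ip` is made up for illustration and is not part of SkyPilot.

```python
import ipaddress

# Tailscale assigns node IPv4 addresses from the CGNAT range 100.64.0.0/10.
TAILSCALE_CGNAT = ipaddress.ip_network('100.64.0.0/10')


def is_tailscale_ip(addr: str) -> bool:
    """Return True if `addr` lies in the Tailscale IPv4 (CGNAT) range."""
    return ipaddress.ip_address(addr) in TAILSCALE_CGNAT


# The client address from the http.server log above.
print(is_tailscale_ip('100.77.86.89'))  # True
```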
Hi @cblmemo ! After investigating this issue, I found that the creation of the security group is not because of the Lines 440 to 443 in 6c02197
I pushed a quick fix that can solve this issue, by removing
sky/task.py
Outdated
if task.vpn_config is not None:
    resources_config.pop('ports', None)
I don't think we should pop the resources section as a whole - this will also be displayed in sky status, and users might get confused about it. Could we change our provisioning logic instead?
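One way to read this suggestion, sketched under assumptions: keep `ports` in the resources config so it still shows up in status output, and decide at provision time whether any ports actually need to be opened. The function name `ports_to_open` and the dict shapes below are hypothetical and not SkyPilot's real provisioner API.

```python
from typing import Any, Dict, List, Optional


def ports_to_open(resources_config: Dict[str, Any],
                  vpn_config: Optional[Dict[str, Any]]) -> List[str]:
    """Return the ports the provisioner should open on the cloud firewall.

    Hypothetical helper: the resources config is left untouched, so the
    user-specified ports stay visible; only the provisioning decision
    depends on whether a VPN is configured.
    """
    if vpn_config is not None:
        # Traffic goes through the VPN, so no public ports are opened.
        return []
    ports = resources_config.get('ports')
    if ports is None:
        return []
    if not isinstance(ports, list):
        ports = [ports]  # Accept a single port such as `ports: 18273`.
    return [str(p) for p in ports]
```

With this shape, the same config can be displayed to the user and passed to the provisioner, and no security-group rule needs to be created when the VPN is enabled.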
Bump on this comment: it seems like we are still not showing the ports field in the resources after provisioning.
(sky-serve) ➜ skypilot git:(vpn-enhanced) sky launch @temp/vpn.yaml
Task from YAML spec: @temp/vpn.yaml
Considered resources (1 node):
----------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------
AWS m6i.large 2 8 - us-east-1 0.10 ✔
----------------------------------------------------------------------------------------
Launching a new cluster 'sky-8672-txia'. Proceed? [Y/n]:
⚙︎ Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
└── Instance is up.
⠧ Preparing SkyPilot runtime (3/3 - runtime) View logs at: ~/sky_logs/sky-2024-12-21-03-18
⠏ Preparing SkyPilot runtime (3/3 - runtime) View logs at: ~/sky_logs/sky-2024-12-21-03-18
⠋ Preparing SkyPilot runtime (3/3 - runtime) View logs at: ~/sky_logs/sky-2024-12-21-03-18
✓ Cluster launched: sky-8672-txia. View logs at: ~/sky_logs/sky-2024-12-21-03-18-33-905960/provision.log
Run commands not specified or empty.
Cluster name: sky-8672-txia
├── To log into the head VM: ssh sky-8672-txia
├── To submit a job: sky exec sky-8672-txia yaml_file
├── To stop the cluster: sky stop sky-8672-txia
└── To teardown the cluster: sky down sky-8672-txia
(sky-serve) ➜ skypilot git:(vpn-enhanced) sky status sky-8672-txia
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-8672-txia a few secs ago 1x AWS(m6i.large) UP - sky launch @temp/vpn.yaml
# @temp/vpn.yaml
resources:
  cloud: aws
  cpus: 2
  ports:
    - 8080
    - 7000-7034
    - 60000
vpn:
  tailscale: true
It seems that the resources shown in sky status are exactly the resources we pass to the provisioner... let me find another solution.
This pull request is the successor of #3989. It integrates Tailscale VPN into the core logic of SkyPilot with a cloud-independent implementation that lets users connect to any cloud instance they launch via SkyPilot through Tailscale.
Now you only need to set these environment variables:
and add the vpn field in the config YAML:
or the same in SkyServe:
Then all the launched instances will be within the private network. For example, in SkyServe, the sky serve status output will look like this if VPN is enabled.
Then if you run
they all go through the private network and work well.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh