[AWS] Support EFA for P5/P5e instances #4062

Draft: wants to merge 6 commits into master

Michaelvll (Collaborator) commented Oct 10, 2024

To enable EFA:

aws:
    enable_efa: true
    use_internal_ips: true
    vpc_name: my-vpc

sky launch --gpus A100:8 --num-nodes 2 -c test-efa
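
Once the cluster is up, one way to check whether a node actually sees an EFA device is libfabric's `fi_info` utility. This is only a sketch: it assumes the AWS EFA software (which ships `fi_info`) is installed on the node, something this PR description does not state, and `check_efa` is a hypothetical helper name.

```shell
# Assumption: run this on a cluster node (for example after `ssh test-efa`).
# Assumption: the AWS EFA software, which provides fi_info, is installed and on PATH.
check_efa() {
  if ! command -v fi_info >/dev/null 2>&1; then
    echo "fi_info not found; the EFA software may not be installed on this node" >&2
    return 1
  fi
  # List the libfabric endpoints served by the EFA provider.
  # A non-zero exit status means no EFA provider was found.
  fi_info -p efa -t FI_EP_RDM
}
```

Running `check_efa` on each node should list at least one `efa` provider entry if the interface is attached; an error from `fi_info` suggests the instance was launched without EFA.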

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Michaelvll mentioned this pull request Oct 23, 2024

zaptrem commented Oct 24, 2024

Thanks for this! A few issues:

  1. After setting up the VPC and proxy node, I was able to create the cluster, but it failed to set itself up. It complained that the SkyPilot activation script did not exist, but when I SSH'ed in the script was there, so I'm not sure why this is happening: https://gist.github.com/zaptrem/1c544ebc5b5958b8d5c6eefe2616de91 . As a result, I can't run any jobs.

Edit: it appears the first node was set up correctly but the second was not. It seems the rsync command that copies the SkyPilot runtime over is failing, but after a few hours I can't figure out why. The most verbose output I could get: https://gist.github.com/zaptrem/9da876d43963f118a587ea9eb030d812

  2. It seems a little unwieldy that we must turn off private IPs/EFA to create the head node through which we proxy into the VPC, then turn the setting back on afterwards. Could the creation of the head node be automated? It also wasn't clear that we needed to add name tags to the VPC/security group after running the creation command from the FAQ guide. Also, the guide doesn't mention that we need to add a rule to allow SSH connections to the head node.
  3. Could the EFA/VPC/security group settings be moved to the task YAML instead of the global YAML? I'd like to keep running spot jobs like normal while my capacity block is going brr.
  4. I wasn't able to launch H200 jobs (which were part of my capacity block) until I manually added a fake price to the catalog.
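
For readers hitting the same gaps in the guide (the name tags and the SSH rule), the manual steps might look roughly like the following with the AWS CLI. This is a sketch, not the guide's own commands: the resource IDs, the security group name `my-sg`, and the CIDR are placeholders, and the `run` and `tag_and_open_ssh` helpers are hypothetical. Only the VPC name `my-vpc` comes from the config in the PR description.

```shell
# Hypothetical placeholders; substitute your own values.
VPC_ID="vpc-0123456789abcdef0"
SG_ID="sg-0123456789abcdef0"
VPC_NAME="my-vpc"            # matches `vpc_name` in the PR description's config
SG_NAME="my-sg"              # assumption: the thread does not say which name is expected
MY_CIDR="203.0.113.10/32"    # assumption: the address you will SSH from

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

tag_and_open_ssh() {
  # Add Name tags to the VPC and the security group.
  run aws ec2 create-tags --resources "$VPC_ID" --tags "Key=Name,Value=$VPC_NAME"
  run aws ec2 create-tags --resources "$SG_ID" --tags "Key=Name,Value=$SG_NAME"
  # Allow inbound SSH (TCP port 22) to instances in the security group.
  run aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
    --protocol tcp --port 22 --cidr "$MY_CIDR"
}
```

Calling `tag_and_open_ssh` with the default `DRY_RUN=1` only prints the three commands so they can be reviewed first; restricting the CIDR to a single address keeps the head node from being reachable over SSH from everywhere.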


zaptrem commented Oct 25, 2024

@Michaelvll apologies for the double ping. Do you know what may be causing the rsync issue above? I'd like to try this with our training runs this week, but unfortunately I haven't yet cracked the error.
