[AWS] Support EFA for P5/P5e instances #4062

Michaelvll · 2024-10-10T17:55:27Z

To enable EFA:

aws:
    enable_efa: true
    use_internal_ips: true
    vpc_name: my-vpc

sky launch --gpus A100:8 --num-nodes 2 -c test-efa
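
The three settings above are shown together, and the discussion in this thread suggests EFA is used with internal IPs inside a named VPC. As a minimal sketch under that assumption, a pre-launch check could flag a config that turns on `enable_efa` without its companion keys. The helper name `missing_efa_settings` and the "jointly required" rule are hypothetical and are not part of this PR:

```python
from typing import Any, Dict, List

# Keys copied from the example config above. Treating them as jointly
# required with enable_efa is an assumption of this sketch, not a rule
# stated by the PR.
REQUIRED_WITH_EFA = ("use_internal_ips", "vpc_name")


def missing_efa_settings(aws_config: Dict[str, Any]) -> List[str]:
    """Return companion keys that are unset or empty while enable_efa is true."""
    if not aws_config.get("enable_efa", False):
        # EFA is off, so nothing else is checked.
        return []
    return [key for key in REQUIRED_WITH_EFA if not aws_config.get(key)]


# The `aws:` section from the example, written as a plain dict.
example = {"enable_efa": True, "use_internal_ips": True, "vpc_name": "my-vpc"}
print(missing_efa_settings(example))  # prints []
```

An empty list means the `aws:` section matches the shape of the example; any returned key names point at what to fill in before running `sky launch`.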

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

zaptrem · 2024-10-24T06:56:45Z

Thanks for this! A few issues:

After setting up the VPC and proxy node I was able to create the cluster but it failed to set itself up. It was complaining about the skypilot activation script not existing but when I ssh'ed in it did exist, so I'm not sure why this is happening: https://gist.github.com/zaptrem/1c544ebc5b5958b8d5c6eefe2616de91 . As a result I can't run any jobs.

Edit: it appears the first node was set up correctly but the second was not? It seems like the rsync command to move the skypilot runtime over is failing but after a few hours I can't figure out why. Most verbose output I could get: https://gist.github.com/zaptrem/9da876d43963f118a587ea9eb030d812

It seems a little unwieldy that we must turn off private IPs/EFA to create the head node through which we will proxy to access the VPC, then turn the setting back on afterwards. Could the creation of the head node be automated? It also wasn't clear that we needed to add name tags to the VPC/security group after the creation command from the FAQ guide. Also, the guide doesn't mention that we need to add a rule to enable connecting to the head node via SSH.
Could EFA/VPC/security group be moved to the task YAML instead of the global YAML? I'd like to keep running spot jobs like normal while my capacity block is going brr.
I wasn't able to launch H200 jobs (that were part of my capacity block) until I manually added a fake price to the catalog.
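
For anyone reproducing the manual steps mentioned above (name tags on the VPC and security group, plus an inbound SSH rule), here is a rough sketch using boto3's EC2 client. The resource IDs, names, CIDR, and the helper names `name_tag_request`, `ssh_ingress_request`, and `apply_manual_setup` are all placeholders chosen for illustration; which names SkyPilot expects on each resource is not stated in this thread, so they are left as parameters:

```python
from typing import Any, Dict, List


def name_tag_request(resource_ids: List[str], name: str) -> Dict[str, Any]:
    """Build keyword arguments for EC2 create_tags that set the Name tag."""
    return {"Resources": resource_ids, "Tags": [{"Key": "Name", "Value": name}]}


def ssh_ingress_request(group_id: str, cidr: str) -> Dict[str, Any]:
    """Build keyword arguments for an inbound rule allowing TCP port 22."""
    return {
        "GroupId": group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": cidr}],
        }],
    }


def apply_manual_setup(vpc_id: str, vpc_name: str, group_id: str,
                       group_name: str, ssh_cidr: str, region: str) -> None:
    """Tag the VPC and security group, then open SSH from ssh_cidr."""
    import boto3  # imported lazily so the request builders work without it

    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_tags(**name_tag_request([vpc_id], vpc_name))
    ec2.create_tags(**name_tag_request([group_id], group_name))
    ec2.authorize_security_group_ingress(
        **ssh_ingress_request(group_id, ssh_cidr))
```

Keeping the request builders separate from the AWS calls makes it possible to inspect exactly what would be sent before touching the account. Restricting `ssh_cidr` to a single address (a `/32`) is generally safer than opening port 22 to everyone.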

zaptrem · 2024-10-25T19:47:35Z

@Michaelvll apologies for the double ping. Do you know what may be causing the issue with rsync above? I'd like to try this with our training runs this week but unfortunately haven't yet cracked the above error.

Michaelvll added 6 commits October 10, 2024 03:09

EFA wip

46892cf

Only support EFA for p5

dea259e

format

59c8835

Add docs

2d0841b

fix setting for network index

019c874

fix unittest

3110c42

Michaelvll mentioned this pull request Oct 23, 2024

add how to use EFA to faq.rst #3818

Open

5 tasks
