Commit
feat: [INFRA-2749] change labels to match for gpu to use gpu-based in…
…stances for build (#119) * feat: [INFRA-2749] change label to gpu for build on gpu-based instances * feat: [INFRA-2749] change labels to match for gpu to use gpu-based instances for build * feat: [INFRA-2749] change labels to match for gpu to use gpu-based instances for tests * ci: try running E2E test on self-hosted gpu runner * ci: install some missing packages * ci: also apt update * install cuda and nccl * cuda 12.1 * new try * more deps * more * more * manual * quiet * deb * quiet * another try * up * remove existing * up * up * up * install all deps * up * up * up * up * up * up * nccl * don't install nccl from source * up * up * up * up * cuda home * export path and cache * fix * use 12.1 * cuda 12.2 * up * up * chore: Use base docker image (#132) * Use base docker image * add curl * incorrect version fix * do not use gpu on cis where not needed * cleanup --------- Co-authored-by: Philipp Sippl <[email protected]> * dbg print cuda * up * . * up * new try * dbg * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * test * revert container options * test * Trigger Build * remove fstab changes * try NCCL_P2P_DIRECT_DISABLE * disable shm * also run normal tests * remove tests again * dbg * dbg * revert dbg * limit db * revert limit * only random * no insertions * sync * add compute-sanitizer test to ci * fix broken shell command * install compute-sanitizer as well * test * test * hardcode compute-sanitizer path * disable write * dbg * dbg * test waiting for all work to finish per job * only 1 batch * 2 batches * 3 batches * reducing complexity, db_sizes no longer need to be mutexed * make e2e test deterministic, add super sync after each iteration * only use 1 stream atm * alloc on 
streams * another sync after alloc * sync after each op * tracing * try not freeing stuff * Revert "try not freeing stuff" This reverts commit 987fa80. * Revert "tracing" This reverts commit 7569f98. * test replacing ptr casts with normal cuda types * Revert "test replacing ptr casts with normal cuda types" This reverts commit c480df7. * try dirty hack to use default streams * don't access null ptr streams * Revert "don't access null ptr streams" This reverts commit 4e04d49. * Revert "try dirty hack to use default streams" This reverts commit 7972c61. * Revert "sync after each op" This reverts commit 03aadd7. * log mem addresses * Revert "log mem addresses" This reverts commit 331257b. * remove bind thread * dbg * dbg * dbg * dbg * dbg * dbg * up * dbg * dbg * dbg * dbg * dbg * up * up * up * up * up * dbg: max for all * up * up * up * up * up * up * up * 2 byte aligned * odd len in phase 2 * dbg * up * up * cublas test * up * up * up * up * add asserts * cublasGetStatusString * remove cuda test * update batch size in server * PR feedback * Revert "alloc on streams" This reverts commit 898eb13. * fmt --------- Co-authored-by: Daniel Kales <[email protected]> Co-authored-by: philsippl <[email protected]> Co-authored-by: wojciechsromek <[email protected]> Co-authored-by: wojciechsromek <[email protected]>
1 parent d6bc404, commit d725f84
Showing 11 changed files with 239 additions and 84 deletions.
@@ -0,0 +1,133 @@
name: Rust GPU Tests

on:
  push:

concurrency:
  group: "${{ github.workflow }} @ ${{ github.event.pull_request.head.label || github.head_ref || github.ref }}"
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: gpu
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Validate presence of GPU devices
        run: nvidia-smi

      - name: Check shared memory size
        run: df -h

      - name: Install OpenSSL && pkg-config
        run: sudo apt-get update && sudo apt-get install -y pkg-config libssl-dev

      - name: Install CUDA and NCCL dependencies
        if: steps.cache-cuda-nccl.outputs.cache-hit != 'true'
        env:
          DEBIAN_FRONTEND: noninteractive
        run: |
          wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
          sudo dpkg -i cuda-keyring_1.1-1_all.deb
          sudo apt update
          sudo apt install -y cuda-toolkit-12-2 libnccl2 libnccl-dev
      - name: Find libs
        run: find /usr -name "libnvrtc*" && find /usr -name libcuda.so

      - name: Cache Rust build
        uses: actions/cache@v3
        id: cache-rust
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: rust-build-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
          restore-keys: |
            rust-build-${{ runner.os }}-
      - name: Find libs
        run: find /usr -name "libnvrtc*" && find /usr -name libcuda.so

      - name: Install Rust nightly
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: nightly

      - name: E2E Tests
        run: cargo test --release e2e
        shell: bash
        env:
          NCCL_P2P_LEVEL: LOC
          NCCL_NET: Socket
          NCCL_P2P_DIRECT_DISABLE: 1
          NCCL_SHM_DISABLE: 1

  e2e-sanitizer:
    runs-on: gpu
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Validate presence of GPU devices
        run: nvidia-smi

      - name: Check shared memory size
        run: df -h

      - name: Install OpenSSL && pkg-config
        run: sudo apt-get update && sudo apt-get install -y pkg-config libssl-dev

      - name: Install CUDA and NCCL dependencies
        if: steps.cache-cuda-nccl.outputs.cache-hit != 'true'
        env:
          DEBIAN_FRONTEND: noninteractive
        run: |
          wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
          sudo dpkg -i cuda-keyring_1.1-1_all.deb
          sudo apt update
          sudo apt install -y cuda-toolkit-12-2 cuda-command-line-tools-12-2 libnccl2 libnccl-dev
      - name: Find libs
        run: find /usr -name "libnvrtc*" && find /usr -name libcuda.so

      - name: Cache Rust build
        uses: actions/cache@v3
        id: cache-rust
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: rust-build-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
          restore-keys: |
            rust-build-${{ runner.os }}-
      - name: Find libs
        run: find /usr -name "libnvrtc*" && find /usr -name libcuda.so

      - name: Find compute-sanitizer
        run: find /usr -name "compute-sanitizer"

      - name: Install Rust nightly
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: nightly

      - name: Build e2e test
        run: cargo test --release e2e --no-run

      - name: Build e2e test and grab executable name
        run: echo TEST_NAME=$(cargo --color=never test --release e2e --no-run 2>&1 | grep "Executable tests/e2e.rs" | sed "s/.*(\(.*\))/\1/") >> $GITHUB_OUTPUT
        id: build-e2e

      - name: E2E Tests w/ compute-sanitizer
        run: /usr/local/cuda-12.2/bin/compute-sanitizer --tool=memcheck ${{ steps.build-e2e.outputs.TEST_NAME }} --nocapture
        env:
          NCCL_DEBUG: info
          NCCL_P2P_LEVEL: LOC
          NCCL_NET: Socket
          NCCL_P2P_DIRECT_DISABLE: 1
          NCCL_SHM_DISABLE: 1
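
Reviewer note: the "Build e2e test and grab executable name" step relies on a grep/sed pipeline to pull the test binary path out of cargo's `--no-run` output. The sketch below isolates that pipeline so it can be tried locally without a GPU runner. The helper name `extract_test_path`, the sample cargo output, and the hash suffix are assumptions made for illustration; they are not part of the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): same grep/sed pipeline as the
# workflow step, reading cargo's output from stdin.
extract_test_path() {
  grep "Executable tests/e2e.rs" | sed "s/.*(\(.*\))/\1/"
}

# Made-up sample in the shape cargo prints for `cargo test --no-run`.
# The exact wording of the first line varies between cargo versions;
# it is only here to show that non-matching lines are filtered out.
sample_output='    Finished release [optimized] target(s) in 0.10s
  Executable tests/e2e.rs (target/release/deps/e2e-0123456789abcdef)'

printf '%s\n' "$sample_output" | extract_test_path
# prints target/release/deps/e2e-0123456789abcdef
```

The sed expression keeps only the text inside the last pair of parentheses on the matching line, which is why the workflow can pass the result straight to compute-sanitizer as the executable to run.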