Add a new MultiNodeEnvironment CRD to set up the environment needed for running GPU workloads across multiple nodes #225

Open
klueska wants to merge 32 commits into main from add-multi-node-crd

@klueska (Collaborator) commented Jan 9, 2025

No description provided.

@klueska force-pushed the add-multi-node-crd branch 10 times, most recently from 38065b4 to 3e51cd8, January 14, 2025 08:15
@klueska force-pushed the add-multi-node-crd branch 20 times, most recently from 4ce9bdb to 0d435d8, January 22, 2025 15:56
For now they are just clones of each other, but subsequent commits will
fine tune them to either the GPU driver or the IMEX driver.

Signed-off-by: Kevin Klues <[email protected]>
github.com/elezar/nvidia-container-toolkit/tree/add-imex-binaries

Signed-off-by: Kevin Klues <[email protected]>
@klueska force-pushed the add-multi-node-crd branch from c8b9975 to 0a96b95, January 25, 2025 23:06
This is in preparation for calling RemoveFinalizer on all components in
one standardized location.

Signed-off-by: Kevin Klues <[email protected]>
@klueska force-pushed the add-multi-node-crd branch 2 times, most recently from 800d905 to e598f17, January 27, 2025 23:48
We no longer need this code now that we have a finalizer on the
ComputeDomain object itself that only gets removed if all other linked
objects have been removed.

Signed-off-by: Kevin Klues <[email protected]>
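
The finalizer behavior described in the commit message above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code from this PR: the `Object` type, the `releaseIfUnlinked` helper, and the finalizer name `example.com/compute-domain` are all hypothetical stand-ins for the real Kubernetes types and identifiers.

```go
package main

import "fmt"

// computeDomainFinalizer is a hypothetical finalizer name used only for
// illustration; the real name used by the PR is not shown in this thread.
const computeDomainFinalizer = "example.com/compute-domain"

// Object is a simplified stand-in for a Kubernetes object's metadata.
type Object struct {
	Name       string
	Finalizers []string
}

// removeFinalizer returns a copy of finalizers with every occurrence of
// name filtered out.
func removeFinalizer(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

// releaseIfUnlinked removes the finalizer from owner only when no linked
// objects remain. It reports whether the finalizer was removed.
func releaseIfUnlinked(owner *Object, linked []Object) bool {
	if len(linked) > 0 {
		// Linked objects still exist, so the owner must not be released yet.
		return false
	}
	owner.Finalizers = removeFinalizer(owner.Finalizers, computeDomainFinalizer)
	return true
}

func main() {
	cd := &Object{Name: "cd-1", Finalizers: []string{computeDomainFinalizer}}

	// One linked object is still present: the finalizer stays in place.
	fmt.Println(releaseIfUnlinked(cd, []Object{{Name: "linked-1"}}), len(cd.Finalizers))

	// All linked objects are gone: the finalizer is removed.
	fmt.Println(releaseIfUnlinked(cd, nil), len(cd.Finalizers))
}
```

Keeping the check in one place on the owning object means that cleanup code elsewhere does not have to track deletion order itself, which appears to be the motivation for dropping the older code.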
@klueska force-pushed the add-multi-node-crd branch from e598f17 to 89e085f, January 28, 2025 12:46
For now we are conditioning it on the hardcoded value of 'true', but in
the future we want to condition it on the absence of workloads running
in the compute domain being deleted.

Signed-off-by: Kevin Klues <[email protected]>
@klueska force-pushed the add-multi-node-crd branch from 89e085f to bf742a7, January 28, 2025 12:58
@klueska force-pushed the add-multi-node-crd branch from 76f751a to 970fb95, January 28, 2025 17:13
@klueska force-pushed the add-multi-node-crd branch from 970fb95 to 3b7a65b, January 28, 2025 17:18