Add gpu uuids to node labels #1015

Open
xiongzubiao opened this issue Oct 26, 2024 · 6 comments · May be fixed by #1116

Comments

@xiongzubiao

I know it is easy to get them with nvidia-smi. It would be nice if gpu-feature-discovery exposed the GPU UUIDs as a node label, so that one doesn't need to SSH into the node.

@elezar
Member

elezar commented Oct 31, 2024

@xiongzubiao could you describe how you would want to use these labels? In general, the labels are intended to allow selection of specific nodes through node selectors or affinity. Is there a use case you have that requires you to match nodes by UUID?
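
For reference, a typical use of these labels is a node selector (or affinity rule) along the lines of the sketch below; the product value, image, and names are illustrative rather than taken from this issue.

    # Minimal sketch of the usual selection pattern: the pod only lands on nodes
    # whose GFD product label matches. All concrete values here are illustrative.
    apiVersion: v1
    kind: Pod
    metadata:
      name: a100-workload                                    # hypothetical pod name
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB        # example GFD product label
      containers:
        - name: main
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 # any CUDA image
          command: ["nvidia-smi", "-L"]
          resources:
            limits:
              nvidia.com/gpu: 1                              # one GPU from the device plugin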

@xiongzubiao
Author

It is mainly for metering and diagnostic purposes. We'd like to monitor the usage and health status of each GPU. Having the UUIDs in a node label helps us search for data in Prometheus.

We don't have a use case for selecting a particular GPU right now. I guess that could be useful if there are multiple GPUs on a node but their models are not exactly the same?

@xiongzubiao
Author

@elezar Would you be interested if I submitted a PR? I found that it is not that difficult to expose the UUIDs by leveraging existing functions. The label would look like: nvidia.com/gpu.uuid=GPU-d46f8b5f-76b0-e058-74a8-f82243117fd7,GPU-2871653f-019a-db66-ee74-bbcaece54c8b.
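
For illustration only, the labels section of a two-GPU node would then contain something like the following; the surrounding GFD labels are just for context, and the gpu.uuid value is the comma-joined format above.

    # Hypothetical excerpt of a node's metadata once the proposed label exists;
    # gpu.count and gpu.product are existing GFD labels shown only for context.
    metadata:
      labels:
        nvidia.com/gpu.count: "2"
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # illustrative
        nvidia.com/gpu.uuid: GPU-d46f8b5f-76b0-e058-74a8-f82243117fd7,GPU-2871653f-019a-db66-ee74-bbcaece54c8b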

@wangweihong

> It is mainly for metering and diagnostic purposes. We'd like to monitor the usage and health status of each GPU. Having the UUIDs in a node label helps us search for data in Prometheus.
>
> We don't have a use case for selecting a particular GPU right now. I guess that could be useful if there are multiple GPUs on a node but their models are not exactly the same?

In my case, we need to schedule pods to specific GPUs. We keep a map recording all GPUs (by UUID) and assign GPUs to different jobs.

We run the pod with an env like the one below to make sure it uses a specific GPU:

         env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: GPU-7dfa7b70-34fb-bcce-e2df-130ebb7fd047

However, if the pod is scheduled to a host that does not own the specified GPU, the pod will fail to run and return an error as follows:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: device error: GPU-7dfa7b70-34fb-bcce-e2df-130ebb7fd047: unknown device: unknown

Therefore, we must record the mapping between node names and GPU UUIDs ourselves. When running a pod, both the node-name label and the GPU UUID environment variable have to be specified. It would be even better if the GPU UUIDs could be read directly from a node label.
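
Roughly, the workaround is a pod like the one sketched below: pin it to the node that our mapping says owns the GPU, and pass the UUID through the environment (pod name, node name, and image are placeholders).

    # Sketch of the current workaround: a node selector on the host that our own
    # mapping says owns the GPU, plus NVIDIA_VISIBLE_DEVICES set to that GPU's UUID.
    # Pod name, node name, and image are placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: pinned-gpu-job
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-a                       # node that owns the GPU below
      containers:
        - name: main
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 # any CUDA-capable image
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: GPU-7dfa7b70-34fb-bcce-e2df-130ebb7fd047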

@shan100github

shan100github commented Jan 8, 2025

@xiongzubiao
I believe the DCGM exporter provides the GPU UUID as part of its metrics; that could probably be used for GPU monitoring.

@xiongzubiao xiongzubiao changed the title Add gpu uuids to node lables Add gpu uuids to node labels Jan 9, 2025
@xiongzubiao
Author

> @xiongzubiao I believe the DCGM exporter provides the GPU UUID as part of its metrics; that could probably be used for GPU monitoring.

@shan100github Thanks, I am aware of that. In my case I need to know the device UUIDs without querying the DCGM exporter or Prometheus. It is best if they come from a node label, since the UUIDs are a property of the node.

@xiongzubiao xiongzubiao linked a pull request Jan 9, 2025 that will close this issue