Add gpu uuids to node labels #1015

Open
xiongzubiao opened this issue Oct 26, 2024 · 6 comments · May be fixed by #1116

Comments

@xiongzubiao

I know it is easy to get them with nvidia-smi. It would be nice if gpu-feature-discovery exposed the GPU UUIDs as a node label, so that one doesn't need to SSH into the node.

@elezar
Member

elezar commented Oct 31, 2024

@xiongzubiao could you describe how you would want to use these labels? In general, the labels are intended to allow selection of specific nodes through node selectors or affinity. Is there a use case you have that requires you to match nodes by UUID?
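
For reference, a typical use of these labels is a node selector (or affinity rule) along the lines of the sketch below; the product value, image, and names are illustrative rather than taken from this issue.

    # Minimal sketch of the usual selection pattern: the pod only lands on nodes
    # whose GFD product label matches. All concrete values here are illustrative.
    apiVersion: v1
    kind: Pod
    metadata:
      name: a100-workload                                    # hypothetical pod name
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB        # example GFD product label
      containers:
        - name: main
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 # any CUDA image
          command: ["nvidia-smi", "-L"]
          resources:
            limits:
              nvidia.com/gpu: 1                              # one GPU from the device plugin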

@xiongzubiao
Author

It is mainly for metering and diagnostic purposes. We'd like to monitor the usage and health status of each GPU. Having the UUIDs in a node label helps us search for data in Prometheus.

We don't have a use case for selecting a particular GPU right now. I guess that could be useful if there are multiple GPUs on a node but their models are not exactly the same?

@xiongzubiao
Author

@elezar Would you be interested if I submitted a PR? I found that it is not that difficult to expose the UUIDs by leveraging existing functions. The label would look like: nvidia.com/gpu.uuid=GPU-d46f8b5f-76b0-e058-74a8-f82243117fd7,GPU-2871653f-019a-db66-ee74-bbcaece54c8b.
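
For illustration only, the labels section of a two-GPU node would then contain something like the following; the surrounding GFD labels are just for context, and the gpu.uuid value is the comma-joined format above.

    # Hypothetical excerpt of a node's metadata once the proposed label exists;
    # gpu.count and gpu.product are existing GFD labels shown only for context.
    metadata:
      labels:
        nvidia.com/gpu.count: "2"
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # illustrative
        nvidia.com/gpu.uuid: GPU-d46f8b5f-76b0-e058-74a8-f82243117fd7,GPU-2871653f-019a-db66-ee74-bbcaece54c8b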

@wangweihong

> It is mainly for metering and diagnostic purposes. We'd like to monitor the usage and health status of each GPU. Having the UUIDs in a node label helps us search for data in Prometheus.
>
> We don't have a use case for selecting a particular GPU right now. I guess that could be useful if there are multiple GPUs on a node but their models are not exactly the same?

In my case, we need to schedule pods to specific GPUs. We keep a map recording all GPUs (by UUID) and assign GPUs to different jobs.

We run the pod with an env like the one below to make sure it uses a specific GPU:

         env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: GPU-7dfa7b70-34fb-bcce-e2df-130ebb7fd047

However, if the pod is scheduled to a host that does not own the specified GPU, the pod will fail to run and return an error as follows:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: device error: GPU-7dfa7b70-34fb-bcce-e2df-130ebb7fd047: unknown device: unknown

Therefore, we must record the mapping between node names and GPU UUIDs ourselves. When running a pod, both the node-name label and the GPU UUID environment variable have to be specified. It would be even better if the GPU UUIDs could be read directly from a node label.
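
Roughly, the workaround is a pod like the one sketched below: pin it to the node that our mapping says owns the GPU, and pass the UUID through the environment (pod name, node name, and image are placeholders).

    # Sketch of the current workaround: a node selector on the host that our own
    # mapping says owns the GPU, plus NVIDIA_VISIBLE_DEVICES set to that GPU's UUID.
    # Pod name, node name, and image are placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: pinned-gpu-job
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-a                       # node that owns the GPU below
      containers:
        - name: main
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 # any CUDA-capable image
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: GPU-7dfa7b70-34fb-bcce-e2df-130ebb7fd047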

@shan100github

shan100github commented Jan 8, 2025

@xiongzubiao
I believe the DCGM exporter provides the GPU UUID as part of its metrics; that could probably be used for GPU monitoring.

@xiongzubiao xiongzubiao changed the title Add gpu uuids to node lables Add gpu uuids to node labels Jan 9, 2025
@xiongzubiao
Author

> @xiongzubiao I believe the DCGM exporter provides the GPU UUID as part of its metrics; that could probably be used for GPU monitoring.

@shan100github Thanks, I am aware of that. In my case I need to know the device UUIDs without querying the DCGM exporter or Prometheus. It is best if they come from a node label, since the UUIDs are a property of the node.

@xiongzubiao xiongzubiao linked a pull request Jan 9, 2025 that will close this issue