
"No devices found. Waiting indefinitely" should crash the pod #1080

Open
blame19 opened this issue Dec 2, 2024 · 0 comments

Comments

blame19 commented Dec 2, 2024

I'm facing a sporadic issue on Amazon EC2.

The NVIDIA device plugin DaemonSet (image nvcr.io/nvidia/k8s-device-plugin:v0.17.0) silently fails with these logs:

 "Starting NVIDIA Device Plugin" version=<
        d475b2cf
        commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
 >
I1129 11:18:42.706636       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1129 11:18:42.706683       1 main.go:245] Starting OS watcher.
I1129 11:18:42.706950       1 main.go:260] Starting Plugins.
I1129 11:18:42.706983       1 main.go:317] Loading configuration.
I1129 11:18:42.707810       1 main.go:342] Updating config with default resource matching patterns.
I1129 11:18:42.708036       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1129 11:18:42.708051       1 main.go:356] Retrieving plugins.
I1129 11:18:46.704792       1 main.go:381] No devices found. Waiting indefinitely.

Now, from the outside the pod still appears as 'Running' and produces no failure events. I only notice this is happening because my workloads fail to schedule (e.g. Karpenter's NodeClaims report that the requested resources are not there).

Monitoring-wise, I could probably set up something to watch the logs and alert us (see the sketch below), but I'd like to challenge the 'Waiting indefinitely' behavior itself. If the container were to report an error to Kubernetes, emit an event, or simply fall into a CrashLoopBackOff cycle, it would be (1) easier to detect and (2) more idiomatic for normal Kubernetes workflows.
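As a stopgap, I could alert on the symptom instead of the logs: check whether GPU nodes actually advertise nvidia.com/gpu capacity. A minimal sketch in Go using client-go; the accelerator=nvidia label selector is an assumption about how my GPU nodes happen to be labeled, not anything the plugin provides:

// Detection workaround sketch: list GPU-labeled nodes and flag any node that
// advertises zero nvidia.com/gpu capacity even though the device plugin pod
// looks "Running".
package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from KUBECONFIG (falls back to in-cluster config when empty).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to load kube config:", err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to create client:", err)
		os.Exit(1)
	}

	// Assumed label for GPU nodes in my cluster; adjust the selector as needed.
	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "accelerator=nvidia",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to list nodes:", err)
		os.Exit(1)
	}

	for _, n := range nodes.Items {
		qty, ok := n.Status.Capacity[corev1.ResourceName("nvidia.com/gpu")]
		if !ok || qty.IsZero() {
			// The plugin pod on this node is likely stuck in the "No devices found" state.
			fmt.Printf("ALERT: node %s advertises no nvidia.com/gpu capacity\n", n.Name)
		}
	}
}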

I imagine there's a reason behind the decision to wait, but it would be great to have a retry mechanism and/or a failure event.
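To make the proposal concrete, here is a minimal sketch of the behavior I have in mind. This is not the plugin's actual code: discoverDevices() is a hypothetical stand-in for the device enumeration main.go already performs, and the retry count and delay are arbitrary.

// Proposed behavior sketch: retry device discovery a bounded number of times,
// then exit non-zero so the kubelet restarts the container and surfaces a
// CrashLoopBackOff instead of a healthy-looking "Running" pod.
package main

import (
	"errors"
	"log"
	"os"
	"time"
)

// discoverDevices is a placeholder for the plugin's real device enumeration.
func discoverDevices() ([]string, error) {
	return nil, errors.New("no devices found")
}

func main() {
	const (
		maxAttempts = 5
		retryDelay  = 30 * time.Second
	)

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		devices, err := discoverDevices()
		if err == nil && len(devices) > 0 {
			log.Printf("found %d devices, starting plugin", len(devices))
			return // continue with normal plugin startup
		}
		log.Printf("no devices found (attempt %d/%d), retrying in %s", attempt, maxAttempts, retryDelay)
		time.Sleep(retryDelay)
	}

	// Exiting non-zero makes the failure visible: the pod goes into
	// CrashLoopBackOff and emits restart events instead of waiting forever.
	log.Print("no devices found after all retries, exiting")
	os.Exit(1)
}

The retry count and delay could sit behind a flag so the current wait-forever behavior remains available as a default for clusters that rely on it.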
