I'm facing this sporadic issue on Amazon EC2.
The NVIDIA device plugin DaemonSet (image nvcr.io/nvidia/k8s-device-plugin:v0.17.0) silently fails with these logs:
Now, from the outside the pod still appears as 'Running' and produces no failure events. I only realize this is happening because my workloads don't schedule properly (e.g. Karpenter's NodeClaims notice that the requested resources are not there).
Monitoring-wise I could probably set up something to watch for the symptom and alert us (a rough sketch of that kind of watchdog is below), but I'd like to challenge the 'Waiting indefinitely' part of the log. If the container were to send an error signal to Kubernetes, emit an event, or simply fall into a CrashLoopBackOff cycle, it would be 1. easier to detect and 2. more idiomatic to normal Kubernetes workflows.
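For concreteness, this is the kind of out-of-band watchdog I'd end up maintaining: poll GPU nodes and flag any where `nvidia.com/gpu` never shows up in allocatable even though the plugin pod reports Running. This is only a sketch with client-go, and the GPU-node label selector is an assumption about how my nodes are labelled, not something the plugin ships.

```go
// Out-of-band watchdog sketch: flag GPU nodes where the device plugin never
// advertised nvidia.com/gpu in allocatable. The label selector is an
// assumption about how GPU nodes are labelled in my cluster.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodes, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "karpenter.k8s.aws/instance-gpu-count", // hypothetical GPU-node label
	})
	if err != nil {
		panic(err)
	}

	for _, node := range nodes.Items {
		if _, ok := node.Status.Allocatable["nvidia.com/gpu"]; !ok {
			fmt.Printf("%s: nvidia.com/gpu missing from allocatable\n", node.Name)
		}
	}
}
```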
I imagine there's a reason behind the decision to wait, but it would be great to have a retry mechanism and/or a failure event.
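To make the ask concrete, here is roughly the shape of behaviour I have in mind, as a standalone sketch rather than the plugin's actual code (`initDevices` is a placeholder for whatever step currently ends in the indefinite wait): bounded retries with backoff, then a non-zero exit so the kubelet surfaces the problem as restarts / CrashLoopBackOff.

```go
// Sketch of the requested behaviour, not the plugin's real code: retry a
// bounded number of times, then exit non-zero so the failure becomes visible
// as CrashLoopBackOff instead of an indefinite, silent wait.
package main

import (
	"errors"
	"log"
	"os"
	"time"
)

// initDevices is a stand-in for whatever initialization step currently waits forever.
func initDevices() error {
	return errors.New("no NVIDIA devices found") // placeholder failure
}

func main() {
	const maxAttempts = 5
	backoff := 10 * time.Second

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := initDevices()
		if err == nil {
			log.Println("device plugin initialized")
			return
		}
		log.Printf("attempt %d/%d failed: %v", attempt, maxAttempts, err)
		time.Sleep(backoff)
		backoff *= 2
	}

	// Exiting non-zero is what turns the silent wait into a detectable failure.
	log.Println("giving up after bounded retries")
	os.Exit(1)
}
```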