Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94
Experienced this again during the week of Dec 16 - Dec 20th (primarily on Dec 17th). Here are some screenshots of what we observed.
One interesting thing to note was that one node entered a read-only filesystem state. We had to manually kill the node via the gcloud console.
Hey,

With regards to the following: do you remember if you've seen the error happening for long periods of time?

It turns out that we do have the proper diffing in place during the GC flow - e.g., during the volume collector's tick and, later, during the container collector's as well.

ps.: by default, that grace period is 5m. This means that as long as the worker gets up, after 5m (or whatever time is configured) the mismatch gets cleaned up.

To see this in practice, change docker-compose.yml:

```diff
diff --git a/docker-compose.yml b/docker-compose.yml
index 400598ee0..cf8e83957 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -41,12 +41,10 @@ services:
       - 7788:7788
     stop_signal: SIGUSR2
     environment:
+      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_BAGGAGECLAIM_DRIVER: naive
+      CONCOURSE_BIND_IP: 0.0.0.0
       CONCOURSE_LOG_LEVEL: debug
       CONCOURSE_TSA_HOST: web:2222
-
-      # avoid using loopbacks
-      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
-
-      # so we can reach Garden/Baggageclaim for debugging
-      CONCOURSE_BIND_IP: 0.0.0.0
-      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_WORK_DIR: /tmp
+      CONCOURSE_NAME: worker
```

then run a task - this will result in volumes being created. Now, ungracefully remove the worker and let it come back up; this will bring the worker back with a mismatch between its state (no volumes) and what the database has recorded for it. If you try to run the task again, that exact error will pop up, but after a few minutes the GC reconciles the two and the error goes away.
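For convenience, the same tweak can be expressed as a compose override file instead of editing docker-compose.yml in place - a sketch, assuming the service is named `worker` as in the diff above and that only the changed keys need overriding (the BIND_IP values are already set upstream):

```yaml
# docker-compose.override.yml - merged automatically by docker-compose.
# Sketch only: overrides just the keys changed in the diff above.
version: '3'

services:
  worker:
    environment:
      CONCOURSE_BAGGAGECLAIM_DRIVER: naive   # switch away from the loopback-backed overlay setup
      CONCOURSE_WORK_DIR: /tmp               # throwaway work dir so state is easy to blow away
      CONCOURSE_NAME: worker                 # fixed name, so the worker re-registers as "itself"
```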
Yeah, that's because prior to "gc containers and vols present in the worker but not on the db", leftovers on the worker side wouldn't get diffed against the database at all.

Some more context on the shutdown behavior: during a graceful shutdown, the worker retires, which removes its containers and volumes from the database. By always removing them all during initialization, we can ensure that they would never be left around on disk without the database knowing about them.

But then, if there's an ungraceful shutdown (anything that does not go through the retire flow), the worker comes back with a mismatch between what it has on disk and what the database expects.

As I've mentioned in the previous comment, the mismatch should be fine - in a few minutes the GC takes care of it.
oh, actually, a
hmmm, I don't think that's really the answer, as that's something that's up to the environment running the workers rather than to Concourse itself.
tl;dr: having both "remount volumes on restart" and "gc containers and vols present in the worker but not on the db", it seems sensible to me to consider going the route of not removing the volumes on initialization in any case (this can be activated in the chart via a values flag).

Given that we now diff both ways, and have the volumes properly mounted when getting up (in any case), I think we're safe to go without initial removals - Concourse will take care of removing any leftovers on either side.

ps.: while I'm aware of the fixes for doing the proper mounting for overlay, I'm not entirely sure if everything would still be alright for the other drivers.
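In chart terms, the idea would be a values toggle along these lines (the key name below is hypothetical, just to illustrate the shape - not necessarily the chart's actual setting):

```yaml
# Sketch only: a hypothetical values.yaml knob for "don't wipe volumes on init".
# The real chart key may be named differently (or not exist yet).
worker:
  cleanUpWorkDirOnStart: false   # hypothetical name: keep the work dir across pod restarts
```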
@cirocosta I think we might have talked about this previously, but my memory is failing me: workers coming back under the same name goes against the intent of the current worker lifecycle design, where the name is supposed to directly correspond to the lifecycle of both the processes and data (i.e. containers and volumes) managed by the worker. If they go away, the worker should just get a new name. These kinds of problems seem to be caused by not following that rule.

What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names?
heey @vito,
ooh, totally!
my impression is that with the way that we nowadays perform the diffs in both directions, reusing the name should be safe.

We've been using non-ephemeral names just because of the use of StatefulSets.
(from "Using StatefulSets")

As you mentioned, it diverges from what we want in terms of both the stable network identity and the ordered deployment. But, still, it gives us something that's very valuable: the ability to have stable, persistent storage tied to each worker.

Leveraging StatefulSets, then, we can declare that we want our workers to come back with a stable identity and their own disks.

1: we never tweaked those defaults.

2: that's partially true - it's possible to get a deployment to behave similarly in some of these respects.
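To make that concrete, here's a rough sketch of the shape a StatefulSet gives us (image tag, sizes, and storage class are placeholders, and worker-specific details like privileged mode and TSA configuration are omitted):

```yaml
# Sketch only: a trimmed-down StatefulSet for Concourse workers, showing the
# stable-name + per-pod-volume behavior discussed above. Names and values are
# illustrative, not the chart's actual ones.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: concourse-worker
spec:
  serviceName: concourse-worker        # headless service giving pods stable DNS identities
  replicas: 2                          # pods come up as concourse-worker-0, concourse-worker-1, ...
  selector:
    matchLabels:
      app: concourse-worker
  template:
    metadata:
      labels:
        app: concourse-worker
    spec:
      containers:
        - name: worker
          image: concourse/concourse:5.8.0   # placeholder tag
          args: ["worker"]
          env:
            - name: CONCOURSE_WORK_DIR
              value: /concourse-work-dir
          volumeMounts:
            - name: concourse-work-dir
              mountPath: /concourse-work-dir
  volumeClaimTemplates:                # one PVC per pod, reattached when the pod is recreated
    - metadata:
        name: concourse-work-dir
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard     # placeholder
        resources:
          requests:
            storage: 100Gi
```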
From what I understand, the best case would be to have PVCs (so that the worker's volumes survive the pod being recreated).

With the way that Kubernetes is extensible (e.g., see the operator pattern), a lot of that lifecycle handling can be built at the platform layer.

While I agree that by making Concourse very smart we can have all sorts of recovery behaviors, I'd rather lean on what the platform already gives us.

1: the IOPS part is partially true - depending on the vCPU count, the attainable IOPS can vary quite a bit.

damn, sorry for those walls of text 😬
It does appear, based on our GC logic, that volumes in the DB but not on the worker should be cleaned up eventually (worker GC tick to report volumes + 5 min grace time + next volume GC tick).

Our hypothesis was that the workers were in a degraded state (had 0 containers showing ~1-2 hrs after they seemed to have OOMed and been re-created by K8s) and that this was caused by missing volumes. Our assumption was that since volumes couldn't be found (even base resources), no containers could be created and all builds on those workers would continue failing indefinitely. However, since our GC should have been cleaning up the volumes from the DB, that assumption appears to be invalid.

If this were to occur again, we can investigate further to validate why the workers remain in an unusable state. Concrete next steps when a worker appears to be degraded for the same reason again:
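On the "5 min grace time" piece: that window should be configurable on the web node via the GC settings. A sketch, assuming `CONCOURSE_GC_MISSING_GRACE_PERIOD` is the relevant knob for the version we're running (worth double-checking):

```yaml
# Sketch: web-node environment in a compose/values-style file.
# CONCOURSE_GC_MISSING_GRACE_PERIOD is assumed here to be the setting behind the
# "5 min grace time" above - verify against the Concourse version in use.
web:
  environment:
    CONCOURSE_LOG_LEVEL: debug
    CONCOURSE_GC_MISSING_GRACE_PERIOD: 5m   # how long "missing" containers/volumes are kept before being reaped
```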
slightly related: Research Linux OOM killer behavior for cgroups #20
We most likely saw the same problem happen to hush-house again on the v5.8 upgrade. We investigated it further and found out that it was caused by the volume collector failing to clean up missing volumes. It was failing to clean up any missing volumes because of this error.
Seeing this again on Hush House today.
@clarafu mentioned that this has been addressed in v6.x.x; hence, we're not going to investigate this further.
Related: https://github.com/pivotal/concourse-ops/issues/170
We have observed on Hush House that sometimes, a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior
Concourse doesn't know about this restart, and when the pod is recreated, it re-registers itself using the same name. However, when k8s recreates a pod in this way, it deletes the volumes; Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC. The symptoms of this are that Hush House users will see error messages like this:
Short term pain reduction
In the short term, we should look at emitting more descriptive errors and adding some operational guidelines and documentation around how to manage this problem. We believe that restarting pods using `kubectl` won't fix the problem of ATC having an incorrect perception of workers' volumes, so we did the following: running `prune` on the affected worker will remove it from the ATC DB.

Long term fixes?
We could look into modifying the hard eviction limits or seeing if there's a way we can add a lifecycle hook to fire off a retire-worker request or something to ATC, to try to keep ATC DB and the k8s cluster synced up.
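For the lifecycle-hook idea, here's a rough sketch of what that could look like on the worker pod. It assumes the worker process is PID 1 and retires on SIGUSR2 (mirroring the `stop_signal: SIGUSR2` used in docker-compose); whether the hook actually gets a chance to run on an OOM-driven kill is exactly what we'd need to verify:

```yaml
# Sketch: ask the worker to retire before Kubernetes stops the container.
# preStop hooks run for normal terminations and API-initiated evictions, but
# not for hard node-level OOM kills. Worker details (privileged mode, TSA
# config, volumes) are omitted; names and tags are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: concourse-worker-example
spec:
  terminationGracePeriodSeconds: 3600    # give retirement time to drain containers/volumes
  containers:
    - name: worker
      image: concourse/concourse:5.8.0   # placeholder tag
      args: ["worker"]
      lifecycle:
        preStop:
          exec:
            # send the retire signal, then wait for the worker process to exit
            command: ["/bin/sh", "-c", "kill -s USR2 1 && while kill -0 1 2>/dev/null; do sleep 1; done"]
```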
We can also look into changing the behaviour of the init container so that volumes are not totally destroyed when restarts happen.
We really only want this to be true if a worker is ephemeral, but in the case of long-lived workers, it would be worthwhile to see if we can relocate and reattach those volumes when a pod comes back.

Spoke with @kcmannem - it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back.

Note: we are using anti-affinity so we only have one pod per node, but it looks like if that one pod's memory is too high it still gets "evicted".
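One mitigation on the scheduling side (a sketch, with made-up numbers): giving the worker container equal memory requests and limits puts the pod in the Guaranteed QoS class, which makes it the last candidate for kubelet eviction under node memory pressure (it can still be OOM-killed if it exceeds its own limit):

```yaml
# Sketch: resource settings for the worker container. Values are illustrative;
# requests == limits for cpu and memory => Guaranteed QoS, evicted last under
# node memory pressure.
resources:
  requests:
    cpu: "4"
    memory: 16Gi
  limits:
    cpu: "4"
    memory: 16Gi
```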