Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

deniseyu · 2019-12-17T17:11:10Z

Related: https://github.com/pivotal/concourse-ops/issues/170

We have observed on Hush House that sometimes, a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior

Concourse doesn't know about this restart, and when the pod is recreated, it re-registers itself using the same name. However, when k8s recreates a pod in this way, it deletes the volumes; Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC. The symptoms of this are that Hush House users will see error messages like this:

volume '7615ac07-7c88-4555-6888-9e87f49317bf' disappeared from worker 'workers-worker-10'

Short term pain reduction

In the short term, we should look at emitting more descriptive errors and adding some operational guidelines and documentation around how to manage this problem. We believe that restarting pods using kubectl won't fix the problem of ATC having an incorrect perception of workers' volumes; so we did the following:

fly -t <target> land-worker -w <worker-name>
fly -t <target> prune-worker -w <worker-name>

Running prune will remove it from the ATC DB.

Long term fixes?

We could look into modifying the hard eviction limits or seeing if there's a way we can add a lifecycle hook to fire off a retire-worker request or something to ATC, to try to keep ATC DB and the k8s cluster synced up.

We can also look into changing the behaviour of the init container so that volumes are not totally destroyed when restarts happen. ~~We really only want this to be true if a worker is ephemeral, but in the case of long-lived workers, it would be worthwhile to see if we can relocate and reattach those volumes when a pod comes back.~~ spoke with @kcmannem - it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back

Note: We are using anti-affinity so we only have one pod per node, but it looks like if that one pod's memory is too high it still gets "evicted".

The text was updated successfully, but these errors were encountered:

xtreme-sameer-vohra · 2019-12-23T15:36:21Z

Experienced this again week of Dec 16 - Dec 20th ( primarily on Dec 17th
and 20th )

Here are some screenshots with what we observed

Dec 17th

Dec 20th

Note the workers with 0 containers, these had been restarted/recreated by K8s and while they were successfully registered, they were not in a usable state.

Interesting ones to note was that one node entered a ReadOnly Filesystem mode. We had to manually kill the node via the gcloud console

cirocosta · 2019-12-24T17:03:46Z

Hey,

With regards to the following:

Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC.

do you remember if you've seen the error happening for long periods of time, and
never recovering?

it turns out that we do have the proper diffing in place during the flow of
reporting volumes and getting them marked as missing or not.

e.g., during the ReportWorkerVolumes (iniated from the worker through
`volumesweeper) flow:

    i. `missing_since` gets updated during the ReportWorkerVolumes flow


            this is supposed to account for the case where the database has volumes
            that the worker doesn't

                    github.com/concourse/concourse/atc/db.diff
                            /src/atc/db/container_repository.go:37 +0x4e
                    github.com/concourse/concourse/atc/db.(*volumeRepository).UpdateVolumesMissingSince
                            /src/atc/db/volume_repository.go:125 +0x9d5
                    github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
                            /src/atc/api/volumeserver/report.go:65 +0xc68
                    net/http.HandlerFunc.ServeHTTP
                            /usr/local/go/src/net/http/server.go:2007 +0x44


            on the db, we can verify that being set:

                    concourse=# select handle, worker_name, missing_since from volumes;

                                    handle                | worker_name |         missing_since
                    --------------------------------------+-------------+-------------------------------
                     2a2cfbb8-ae97-4ada-4e80-6d8d64031cb7 | worker      | 2019-12-24 16:00:52.270141+00
                     89503487-3816-4bac-7b3c-b7109cc8b946 | worker      | 2019-12-24 16:00:52.270141+00
                    (2 rows)



    ii. volumes not known by the db get transitioned to `destroying`


            on a later tick, `destroy`ing volumes get actually destroyed by
            the volume sweeper (in the worker itself)

                    github.com/concourse/concourse/atc/db.diff
                            /src/atc/db/container_repository.go:37 +0x4e
                    github.com/concourse/concourse/atc/db.(*volumeRepository).DestroyUnknownVolumes
                            /src/atc/db/volume_repository.go:577 +0x3d6
                    github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
                            /src/atc/api/volumeserver/report.go:47 +0x859

later, during gc:

    i. volumes with `missing_since > threshold` get removed
            (literally, `DELETE`d)

ps.: by default, that threshold is 5min.

This means that as long as the worker gets up, after 5m (or whatever time is
configured in the installation), we should have the references being cleaned up.

To see this in practice, change docker-compose.yml to have a fixed-name and use a non-persistent directory for its volumes:

diff --git a/docker-compose.yml b/docker-compose.yml
index 400598ee0..cf8e83957 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -41,12 +41,10 @@ services:
     - 7788:7788
     stop_signal: SIGUSR2
     environment:
+      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_BAGGAGECLAIM_DRIVER: naive
+      CONCOURSE_BIND_IP: 0.0.0.0
       CONCOURSE_LOG_LEVEL: debug
       CONCOURSE_TSA_HOST: web:2222
-
-      # avoid using loopbacks
-      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
-
-      # so we can reach Garden/Baggageclaim for debugging
-      CONCOURSE_BIND_IP: 0.0.0.0
-      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_WORK_DIR: /tmp
+      CONCOURSE_NAME: worker

then run task. this will incur in volumes being created.

now, ungracefully remove the worker (docker rm -f concourse_worker_1),
and then re-create it (docker-compose up -d worker would do it).

this will bring the worker back with a mismatch between its state (no volumes
in the worker) and the db (volumes marked as being in the worker).

if you try to run the task again, that exact error will pop up, but after few minutes,
web will get rid of the extra volumes in the db.

cirocosta · 2019-12-24T17:09:53Z

it looks like this was a deliberate engineering decision made for the Helm
chart design. Will have to chat with @cirocosta when he is back

Yeah, that's because prior to gc containers and vols present in the worker but
not on the db, the worker would come back with a set of volumes that would
differ what the db had, and that'd cause problems (as we'd never gc vols that
were in the worker but not in the db), thus, they'd accumulate forever, but it
seems like given that now we do have better logic for dealing with the
differences, we could think about not having the removal anymore (as the system
would go towards the right thing in any case).

Some more context on the shutdown behavior

During a graceful shutdown, the worker retires:

the preStop has a kill -s SIGUSR2 $concourse which leads to retire, which
means that in the db, the vols will go away, but not right away in the
worker - as that takes few ticks (it'll first become orphan, then marked as
for destruction, then destroyed by the worker - AFAIR)

by always removing them all during initialization, we can ensure that they would
not forever persist in that pvc

e.g.:

	worker-1
		vols: 1,2,3,4	(in the pvc: `vols/{1,2,3,4}`)



kubectl delete pod worker-1


t0		==> kill -s SIGUSR2 worker-1 (preStop)
t1		==> wait for the proc to die ...

t2	==> worker-1 retires
t3		==> container goes away
t4		==> pvc stays around
			vols/{1,2,3,4}

t5 	worker-1 gets scheduled to antoher node ...
t6		pvc-worker-1 gets attached to worker-1
			vols/{1,2,3,4} are there

t7	worker-1's initContainer gets started
		==> cleans `vols/` from the pvc

t8	worker-1 container starts
		==> state matches the `db` (no vols here, no vols there)

but then, if there's an ungraceful shutdown (anything that does not go
through preStop), then the db will not be cleaned up w/ regards to that
worker (as there'll never exist a kill -s SIGUSR2 $concourse that makes it
go through the retire process), and if it turns out to be a whole-pod shutdown, then
the initContainer will indeed remove the contents, whick will cause the mismatch:

	worker-1
		vols: 1,2,3,4	(in the pvc: `vols/{1,2,3,4}`)



	t0	==> destroy

As I've mentioned in the previous comment, the mismatch should be fine - on few
ticks, we'd see Concourse dealing with those volumes not existing in the worker
by marking them with missing_since, and later, having them deleted.

cirocosta · 2019-12-24T17:14:19Z

We believe that restarting pods using kubectl won't fix the problem of ATC
having an incorrect perception of workers' volumes; so we did the following:

oh, actually, a kubectl delete pod would end up putting the worker through
"the right termination process" which is going through preStop, which takes
care of doing the retire-worker before termination, which would've cleaned up
the database-side first, and then on re-initialization, cleaning up the pvc
(which would make vols_in_db=0 and vols_in_the_worker=0)

We could look into modifying the hard eviction limits

hmmm I don't think that's really the answer as that's something that's up to the
consumers of the Chart; even if we did remove those limits that we have set
under resources, it'd not really take the problem away - the kubelet could
still do a hard eviction on node-configured values that we don't know about as
pod spec authors (aside from any other form of ungraceful termination that will
eventually happen).

e.g.:

kubectl proxy --port=8001
curl -sSL "http://localhost:8001/api/v1/nodes/gke-hush-house-workers-3-687beb5a-3fgz/proxy/configz" | jq '.kubeletconfig.evictionHard'
{
  "memory.available": "100Mi",
  "nodefs.available": "10%",
  "nodefs.inodesFree": "5%",
  "pid.available": "10%"
}

cirocosta · 2019-12-24T17:28:20Z

tl;dr

Having both remount volumes on restart and gc containers and vols present in the worker but not on the db, it seems sensible to me to consider going the route of not removing the volumes on initialization in any case.

(this can be activated in the chart via --set worker.cleanUpWorkDirOnStart=false)

Given that we now diff on both ways, and have the volumes properly mounted when getting up (in any case), I think we're safe to go without initial removals - Concourse will take care of removing any leftovers on either sides.

ps.: while I'm aware of the fixes for doing the proper mounting for overlay, I'm not entirely sure if everything would still be alright for btrfs in the case where loopback devices are used (although I think it's alright already)

vito · 2019-12-25T03:35:11Z

@cirocosta I think we might have about this previously, but my memory is failing me: workers coming back under the same goes against the intent of the current worker lifecycle design, where the name is supposed to directly correspond to the lifecycle of both the processes and data (i.e. containers and volumes) managed by the worker. If they go away, the worker should just get a new name. These kinds of problems seem to be caused by not following that rule.

What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names + ephemeral: true? With the chart becoming the preferred method to deploy Concourse, it seems like we may want to either adjust the lifecycle design more fundamentally or change how the chart works to 'fall in line'. Sorry if this is reviving an old discussion, since I think it is. 😅 If you remember where we chatted about this previously feel free to just link it.

cirocosta · 2019-12-27T14:50:40Z

heey @vito,

These kinds of problems seem to be caused by not following that rule.

ooh, totally!

it seems like we may want to either adjust the lifecycle design more
fundamentally or change how the chart works to 'fall in line'.

my impression is that with the way that nowadays we perform the diffs in both
ways, we're actually very close (if not already fine) to handle state persisting
around 🤔 (I might be very wrong though lol)

What's the technical reason for the chart to prefer 'stable' names as opposed
to just ephemeral names + ephemeral: true?

We've been using non-ephemeral names just because of the use of StatefulSets
for our workers (see worker-statefulset.yaml):

`StatefulSets` are valuable for applications that require one or more of the
following:

- **Stable, unique network identifiers**.
- **Stable, persistent storage**.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.

(from Using StatefulSets)

As you mentioned, it diverges from what we want in terms of both the stable
naming across pod starts¹, and having the storage trully persistent (regardless
of where the pod goes, the volume + its contents goes together with it) -
properties of StatefulSets.

But, still, it gives us something that's very valuable: the ability to have
templated persistent volume claims. This is what allows each pod to get a
persistent volume attached to it. Without that, we'd end up consuming whatever
storage is set up for ephemeral storage in that machine (in the case of GKE,
e.g., that'd be the same storage used for the whole machine, including the storage
for container images, other workloads, etc)².

Leveraging StatefulSets then, we can declare that we want our workers to come
with a 1TB disk and have that disk just for the worker pods, regardless of
whatever is used for the kubelet and other workloads in the machine (i.e.,
dedicated IOPS for the worker). With a Deployment, that's not possible.

¹: we never tweaked CONCOURSE_NAME in the chart in any form, it's
using the hostname that gets generated by default ($statefulset-$index), but
that could be changed, for sure.

²: that's partially true - it's possible to get a deployment
bind-mounting a local disk that's separate from the regular disk used to the
kubelet and others, you'd only need to request that extra local disk and make
sure you're the only one bind-mounting it, having prior-knowledge of what the
device name is, and also having it properly formatted, etc.

cirocosta · 2019-12-27T15:20:13Z

From what I understand, the best case would be to have PVCs (so that the worker
pod can have all the IOPS and storage to itself, regardless of where it
lands¹), with pods being ephemeral, and the PVCs' lifecycle bound to
the pods lifecycle (when pods go away, PVCs would go away too, so that we don't
keep any state around for pods that were terminated). i.e., a blend of
properties from Deployments and StatefulSets.

With the way that kubernetes is extensible (e.g., see operator pattern),
that's something that we could persue - have a CRD like
"EphemeralStatefulSet" and then manually control the lifecycle of pods and pvcs
in a controller. should we? I don't think so 😅

While I agree that by making Concourse very smart we can have all sort
of edge cases and problems to deal with, shooting ourselves in the foot, I think
this kind of resiliency that we're talking about here is worth having baked
in. Even in those cases of ephemeral names + ephemeral disk, we can get things
like a container restart (where the pod does not get recreated and the container
comes up faster than web would notice a stall) where we'd need to fix the
state; or have scenarios where we can't even have ephemeral: true given the
nature of unreliable networks.

¹: the IOPS part is partially true - depending on vCPU, the IOPS can
be different

damn, sorry for those walls of text 😬

xtreme-sameer-vohra · 2019-12-31T23:35:23Z

It does appear that based on our GC logic that volumes in the DB but not on the worker should be cleaned up eventually ( worker GC tick to report volumes + 5 min grace time + next volume GC tick ).

Our hypothesis that the workers were in a degraded state (had 0 containers showing ~1-2 hrs after they seemed to have OOM and re-created by K8s) and was caused by missing volumes. Our assumption was that since volumes couldn't be found ( even base resources ) no containers could be created and all builds on those workers would continue failing indefinitely.

However, since our GC should have been cleaning up the volumes from the DB, our assumption appears to be invalid.

If this were to occur again, we can investigate further to validate why the workers remain in an unusable state;

Concrete next steps when the worker appears to be degraded due to the same reason again;

look at DB state for the worker's volumes. Is the missing_since flag being set ?
if so, do we observe the volumes being cleaned up eventually ( worker GC tick to report volumes + 5 min grace time + next volume GC tick on web ).
What errors are we seeing in the logs ( IIRC we were observing FindOrCreateContainer errors) ?

cirocosta · 2020-01-02T21:04:08Z

slightly related: Research Linux OOM killer behavior for cgroups #20

clarafu · 2020-01-09T17:23:35Z

We saw most likely the same problem happen to hush-house again on the v5.8 upgrade. We investigated it further and found out that it was caused by the volume collector failing to clean up missing volumes.

It was failing to clean up any missing volumes because of this error
"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes"" which was caused by a child volume that points to a parent base resource type volume that is missing. We killed all the orphans and all is well.

xtreme-sameer-vohra · 2020-02-03T16:15:59Z

Seeing this again on Hush House today.
3 workers experiencing what appear to be the same issue where they are not able to clean up volumes due to

"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes""

@clarafu mentioned that this has been addressed in v6.x.x, hence, we're not going to investigate this further. we're just retiring the workers to clean up state

deniseyu changed the title ~~Investigate k8s ungracefully restarting pods out-of-band due to high memory consumption~~ Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption Dec 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

deniseyu commented Dec 17, 2019 •

edited

Loading

xtreme-sameer-vohra commented Dec 23, 2019 •

edited

Loading

cirocosta commented Dec 24, 2019 •

edited

Loading

cirocosta commented Dec 24, 2019

cirocosta commented Dec 24, 2019 •

edited

Loading

cirocosta commented Dec 24, 2019 •

edited

Loading

vito commented Dec 25, 2019

cirocosta commented Dec 27, 2019 •

edited

Loading

cirocosta commented Dec 27, 2019 •

edited

Loading

xtreme-sameer-vohra commented Dec 31, 2019

cirocosta commented Jan 2, 2020 •

edited

Loading

clarafu commented Jan 9, 2020

xtreme-sameer-vohra commented Feb 3, 2020

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

Comments

deniseyu commented Dec 17, 2019 • edited Loading

Short term pain reduction

Long term fixes?

xtreme-sameer-vohra commented Dec 23, 2019 • edited Loading

cirocosta commented Dec 24, 2019 • edited Loading

cirocosta commented Dec 24, 2019

Some more context on the shutdown behavior

cirocosta commented Dec 24, 2019 • edited Loading

cirocosta commented Dec 24, 2019 • edited Loading

tl;dr

vito commented Dec 25, 2019

cirocosta commented Dec 27, 2019 • edited Loading

cirocosta commented Dec 27, 2019 • edited Loading

xtreme-sameer-vohra commented Dec 31, 2019

cirocosta commented Jan 2, 2020 • edited Loading

clarafu commented Jan 9, 2020

xtreme-sameer-vohra commented Feb 3, 2020

deniseyu commented Dec 17, 2019 •

edited

Loading

xtreme-sameer-vohra commented Dec 23, 2019 •

edited

Loading

cirocosta commented Dec 24, 2019 •

edited

Loading

cirocosta commented Dec 24, 2019 •

edited

Loading

cirocosta commented Dec 24, 2019 •

edited

Loading

cirocosta commented Dec 27, 2019 •

edited

Loading

cirocosta commented Dec 27, 2019 •

edited

Loading

cirocosta commented Jan 2, 2020 •

edited

Loading