Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

Open
deniseyu opened this issue Dec 17, 2019 · 12 comments

Comments

@deniseyu
Copy link
Contributor

deniseyu commented Dec 17, 2019

Related: https://github.com/pivotal/concourse-ops/issues/170

We have observed on Hush House that sometimes, a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior

Concourse doesn't know about this restart, and when the pod is recreated, it re-registers itself using the same name. However, when k8s recreates a pod in this way, it deletes the volumes; Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC. The symptoms of this are that Hush House users will see error messages like this:

volume '7615ac07-7c88-4555-6888-9e87f49317bf' disappeared from worker 'workers-worker-10'

Short term pain reduction

In the short term, we should look at emitting more descriptive errors and adding some operational guidelines and documentation around how to manage this problem. We believe that restarting pods using kubectl won't fix the problem of ATC having an incorrect perception of workers' volumes; so we did the following:

fly -t <target> land-worker -w <worker-name>
fly -t <target> prune-worker -w <worker-name>

Running prune will remove it from the ATC DB.

Long term fixes?

We could look into modifying the hard eviction limits or seeing if there's a way we can add a lifecycle hook to fire off a retire-worker request or something to ATC, to try to keep ATC DB and the k8s cluster synced up.

We can also look into changing the behaviour of the init container so that volumes are not totally destroyed when restarts happen. We really only want this to be true if a worker is ephemeral, but in the case of long-lived workers, it would be worthwhile to see if we can relocate and reattach those volumes when a pod comes back. spoke with @kcmannem - it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back

Note: We are using anti-affinity so we only have one pod per node, but it looks like if that one pod's memory is too high it still gets "evicted".

@deniseyu deniseyu changed the title Investigate k8s ungracefully restarting pods out-of-band due to high memory consumption Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption Dec 17, 2019
@xtreme-sameer-vohra
Copy link
Contributor

xtreme-sameer-vohra commented Dec 23, 2019

Experienced this again week of Dec 16 - Dec 20th ( primarily on Dec 17th
and 20th )

Here are some screenshots with what we observed

  • Dec 17th

Screen Shot 2019-12-17 at 11 52 06 AM

  • Dec 20th

Screen Shot 2019-12-20 at 3 06 11 PM

Note the workers with 0 containers, these had been restarted/recreated by K8s and while they were successfully registered, they were not in a usable state.

Screen Shot 2019-12-20 at 3 06 27 PM

Screen Shot 2019-12-20 at 3 06 37 PM

Screen Shot 2019-12-20 at 3 06 48 PM

Screen Shot 2019-12-20 at 3 08 15 PM

Screen Shot 2019-12-20 at 3 08 32 PM

Screen Shot 2019-12-20 at 3 08 47 PM

Screen Shot 2019-12-20 at 3 27 43 PM

Screen Shot 2019-12-20 at 3 31 33 PM

Screen Shot 2019-12-20 at 3 32 11 PM

Interesting ones to note was that one node entered a ReadOnly Filesystem mode. We had to manually kill the node via the gcloud console

@cirocosta
Copy link
Member

cirocosta commented Dec 24, 2019

Hey,

With regards to the following:

Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC.

do you remember if you've seen the error happening for long periods of time, and
never recovering?

it turns out that we do have the proper diffing in place during the flow of
reporting volumes and getting them marked as missing or not.

e.g., during the ReportWorkerVolumes (iniated from the worker through
`volumesweeper) flow:

    i. `missing_since` gets updated during the ReportWorkerVolumes flow


            this is supposed to account for the case where the database has volumes
            that the worker doesn't

                    github.com/concourse/concourse/atc/db.diff
                            /src/atc/db/container_repository.go:37 +0x4e
                    github.com/concourse/concourse/atc/db.(*volumeRepository).UpdateVolumesMissingSince
                            /src/atc/db/volume_repository.go:125 +0x9d5
                    github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
                            /src/atc/api/volumeserver/report.go:65 +0xc68
                    net/http.HandlerFunc.ServeHTTP
                            /usr/local/go/src/net/http/server.go:2007 +0x44


            on the db, we can verify that being set:

                    concourse=# select handle, worker_name, missing_since from volumes;

                                    handle                | worker_name |         missing_since
                    --------------------------------------+-------------+-------------------------------
                     2a2cfbb8-ae97-4ada-4e80-6d8d64031cb7 | worker      | 2019-12-24 16:00:52.270141+00
                     89503487-3816-4bac-7b3c-b7109cc8b946 | worker      | 2019-12-24 16:00:52.270141+00
                    (2 rows)



    ii. volumes not known by the db get transitioned to `destroying`


            on a later tick, `destroy`ing volumes get actually destroyed by
            the volume sweeper (in the worker itself)

                    github.com/concourse/concourse/atc/db.diff
                            /src/atc/db/container_repository.go:37 +0x4e
                    github.com/concourse/concourse/atc/db.(*volumeRepository).DestroyUnknownVolumes
                            /src/atc/db/volume_repository.go:577 +0x3d6
                    github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
                            /src/atc/api/volumeserver/report.go:47 +0x859

later, during gc:

    i. volumes with `missing_since > threshold` get removed
            (literally, `DELETE`d)

ps.: by default, that threshold is 5min.

This means that as long as the worker gets up, after 5m (or whatever time is
configured in the installation), we should have the references being cleaned up.

To see this in practice, change docker-compose.yml to have a fixed-name and use a non-persistent directory for its volumes:

diff --git a/docker-compose.yml b/docker-compose.yml
index 400598ee0..cf8e83957 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -41,12 +41,10 @@ services:
     - 7788:7788
     stop_signal: SIGUSR2
     environment:
+      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_BAGGAGECLAIM_DRIVER: naive
+      CONCOURSE_BIND_IP: 0.0.0.0
       CONCOURSE_LOG_LEVEL: debug
       CONCOURSE_TSA_HOST: web:2222
-
-      # avoid using loopbacks
-      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
-
-      # so we can reach Garden/Baggageclaim for debugging
-      CONCOURSE_BIND_IP: 0.0.0.0
-      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_WORK_DIR: /tmp
+      CONCOURSE_NAME: worker

then run task. this will incur in volumes being created.

now, ungracefully remove the worker (docker rm -f concourse_worker_1),
and then re-create it (docker-compose up -d worker would do it).

this will bring the worker back with a mismatch between its state (no volumes
in the worker) and the db (volumes marked as being in the worker).

if you try to run the task again, that exact error will pop up, but after few minutes,
web will get rid of the extra volumes in the db.

@cirocosta
Copy link
Member

it looks like this was a deliberate engineering decision made for the Helm
chart design. Will have to chat with @cirocosta when he is back

Yeah, that's because prior to gc containers and vols present in the worker but
not on the db
, the worker would come back with a set of volumes that would
differ what the db had, and that'd cause problems (as we'd never gc vols that
were in the worker but not in the db), thus, they'd accumulate forever, but it
seems like given that now we do have better logic for dealing with the
differences, we could think about not having the removal anymore (as the system
would go towards the right thing in any case).

Some more context on the shutdown behavior

During a graceful shutdown, the worker retires:

  • the preStop has a kill -s SIGUSR2 $concourse which leads to retire, which
    means that in the db, the vols will go away, but not right away in the
    worker - as that takes few ticks (it'll first become orphan, then marked as
    for destruction, then destroyed by the worker - AFAIR)

by always removing them all during initialization, we can ensure that they would
not forever persist in that pvc

e.g.:

	worker-1
		vols: 1,2,3,4	(in the pvc: `vols/{1,2,3,4}`)



kubectl delete pod worker-1


t0		==> kill -s SIGUSR2 worker-1 (preStop)
t1		==> wait for the proc to die ...

t2	==> worker-1 retires
t3		==> container goes away
t4		==> pvc stays around
			vols/{1,2,3,4}

t5 	worker-1 gets scheduled to antoher node ...
t6		pvc-worker-1 gets attached to worker-1
			vols/{1,2,3,4} are there

t7	worker-1's initContainer gets started
		==> cleans `vols/` from the pvc

t8	worker-1 container starts
		==> state matches the `db` (no vols here, no vols there)

but then, if there's an ungraceful shutdown (anything that does not go
through preStop), then the db will not be cleaned up w/ regards to that
worker (as there'll never exist a kill -s SIGUSR2 $concourse that makes it
go through the retire process), and if it turns out to be a whole-pod shutdown, then
the initContainer will indeed remove the contents, whick will cause the mismatch:

	worker-1
		vols: 1,2,3,4	(in the pvc: `vols/{1,2,3,4}`)



	t0	==> destroy

As I've mentioned in the previous comment, the mismatch should be fine - on few
ticks, we'd see Concourse dealing with those volumes not existing in the worker
by marking them with missing_since, and later, having them deleted.

@cirocosta
Copy link
Member

cirocosta commented Dec 24, 2019

We believe that restarting pods using kubectl won't fix the problem of ATC
having an incorrect perception of workers' volumes; so we did the following:

oh, actually, a kubectl delete pod would end up putting the worker through
"the right termination process" which is going through preStop, which takes
care of doing the retire-worker before termination, which would've cleaned up
the database-side first, and then on re-initialization, cleaning up the pvc
(which would make vols_in_db=0 and vols_in_the_worker=0)


We could look into modifying the hard eviction limits

hmmm I don't think that's really the answer as that's something that's up to the
consumers of the Chart; even if we did remove those limits that we have set
under resources, it'd not really take the problem away - the kubelet could
still do a hard eviction on node-configured values that we don't know about as
pod spec authors (aside from any other form of ungraceful termination that will
eventually happen).

e.g.:

kubectl proxy --port=8001
curl -sSL "http://localhost:8001/api/v1/nodes/gke-hush-house-workers-3-687beb5a-3fgz/proxy/configz" | jq '.kubeletconfig.evictionHard'
{
  "memory.available": "100Mi",
  "nodefs.available": "10%",
  "nodefs.inodesFree": "5%",
  "pid.available": "10%"
}

@cirocosta
Copy link
Member

cirocosta commented Dec 24, 2019

tl;dr

Having both remount volumes on restart and gc containers and vols present in the worker but not on the db, it seems sensible to me to consider going the route of not removing the volumes on initialization in any case.

(this can be activated in the chart via --set worker.cleanUpWorkDirOnStart=false)

Given that we now diff on both ways, and have the volumes properly mounted when getting up (in any case), I think we're safe to go without initial removals - Concourse will take care of removing any leftovers on either sides.

ps.: while I'm aware of the fixes for doing the proper mounting for overlay, I'm not entirely sure if everything would still be alright for btrfs in the case where loopback devices are used (although I think it's alright already)

@vito
Copy link
Member

vito commented Dec 25, 2019

@cirocosta I think we might have about this previously, but my memory is failing me: workers coming back under the same goes against the intent of the current worker lifecycle design, where the name is supposed to directly correspond to the lifecycle of both the processes and data (i.e. containers and volumes) managed by the worker. If they go away, the worker should just get a new name. These kinds of problems seem to be caused by not following that rule.

What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names + ephemeral: true? With the chart becoming the preferred method to deploy Concourse, it seems like we may want to either adjust the lifecycle design more fundamentally or change how the chart works to 'fall in line'. Sorry if this is reviving an old discussion, since I think it is. 😅 If you remember where we chatted about this previously feel free to just link it.

@cirocosta
Copy link
Member

cirocosta commented Dec 27, 2019

heey @vito,

These kinds of problems seem to be caused by not following that rule.

ooh, totally!

it seems like we may want to either adjust the lifecycle design more
fundamentally or change how the chart works to 'fall in line'.

my impression is that with the way that nowadays we perform the diffs in both
ways, we're actually very close (if not already fine) to handle state persisting
around 🤔 (I might be very wrong though lol)


What's the technical reason for the chart to prefer 'stable' names as opposed
to just ephemeral names + ephemeral: true?

We've been using non-ephemeral names just because of the use of StatefulSets
for our workers (see worker-statefulset.yaml):

`StatefulSets` are valuable for applications that require one or more of the
following:

- **Stable, unique network identifiers**.
- **Stable, persistent storage**.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.

(from Using StatefulSets)

As you mentioned, it diverges from what we want in terms of both the stable
naming across pod starts1, and having the storage trully persistent (regardless
of where the pod goes, the volume + its contents goes together with it) -
properties of StatefulSets.

But, still, it gives us something that's very valuable: the ability to have
templated persistent volume claims. This is what allows each pod to get a
persistent volume attached to it. Without that, we'd end up consuming whatever
storage is set up for ephemeral storage in that machine (in the case of GKE,
e.g., that'd be the same storage used for the whole machine, including the storage
for container images, other workloads, etc)2.

Leveraging StatefulSets then, we can declare that we want our workers to come
with a 1TB disk and have that disk just for the worker pods, regardless of
whatever is used for the kubelet and other workloads in the machine (i.e.,
dedicated IOPS for the worker). With a Deployment, that's not possible.



1: we never tweaked CONCOURSE_NAME in the chart in any form, it's
using the hostname that gets generated by default ($statefulset-$index), but
that could be changed, for sure.

2: that's partially true - it's possible to get a deployment
bind-mounting a local disk that's separate from the regular disk used to the
kubelet and others, you'd only need to request that extra local disk and make
sure you're the only one bind-mounting it, having prior-knowledge of what the
device name is, and also having it properly formatted, etc.

@cirocosta
Copy link
Member

cirocosta commented Dec 27, 2019

From what I understand, the best case would be to have PVCs (so that the worker
pod can have all the IOPS and storage to itself, regardless of where it
lands1), with pods being ephemeral, and the PVCs' lifecycle bound to
the pods lifecycle (when pods go away, PVCs would go away too, so that we don't
keep any state around for pods that were terminated). i.e., a blend of
properties from Deployments and StatefulSets.

With the way that kubernetes is extensible (e.g., see operator pattern),
that's something that we could persue - have a CRD like
"EphemeralStatefulSet" and then manually control the lifecycle of pods and pvcs
in a controller. should we? I don't think so 😅

While I agree that by making Concourse very smart we can have all sort
of edge cases and problems to deal with, shooting ourselves in the foot, I think
this kind of resiliency that we're talking about here is worth having baked
in. Even in those cases of ephemeral names + ephemeral disk, we can get things
like a container restart (where the pod does not get recreated and the container
comes up faster than web would notice a stall) where we'd need to fix the
state; or have scenarios where we can't even have ephemeral: true given the
nature of unreliable networks.


1: the IOPS part is partially true - depending on vCPU, the IOPS can
be different


damn, sorry for those walls of text 😬

@xtreme-sameer-vohra
Copy link
Contributor

It does appear that based on our GC logic that volumes in the DB but not on the worker should be cleaned up eventually ( worker GC tick to report volumes + 5 min grace time + next volume GC tick ).

Our hypothesis that the workers were in a degraded state (had 0 containers showing ~1-2 hrs after they seemed to have OOM and re-created by K8s) and was caused by missing volumes. Our assumption was that since volumes couldn't be found ( even base resources ) no containers could be created and all builds on those workers would continue failing indefinitely.

However, since our GC should have been cleaning up the volumes from the DB, our assumption appears to be invalid.

If this were to occur again, we can investigate further to validate why the workers remain in an unusable state;

Concrete next steps when the worker appears to be degraded due to the same reason again;

  • look at DB state for the worker's volumes. Is the missing_since flag being set ?
  • if so, do we observe the volumes being cleaned up eventually ( worker GC tick to report volumes + 5 min grace time + next volume GC tick on web ).
  • What errors are we seeing in the logs ( IIRC we were observing FindOrCreateContainer errors) ?

@cirocosta
Copy link
Member

cirocosta commented Jan 2, 2020

slightly related: Research Linux OOM killer behavior for cgroups #20

@clarafu
Copy link
Contributor

clarafu commented Jan 9, 2020

We saw most likely the same problem happen to hush-house again on the v5.8 upgrade. We investigated it further and found out that it was caused by the volume collector failing to clean up missing volumes.

It was failing to clean up any missing volumes because of this error
"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes"" which was caused by a child volume that points to a parent base resource type volume that is missing. We killed all the orphans and all is well.

@xtreme-sameer-vohra
Copy link
Contributor

Seeing this again on Hush House today.
3 workers experiencing what appear to be the same issue where they are not able to clean up volumes due to

"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes""

@clarafu mentioned that this has been addressed in v6.x.x, hence, we're not going to investigate this further. we're just retiring the workers to clean up state

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants