Worker termination stuck for 47d #128

Open
xtremerui opened this issue May 11, 2020 · 1 comment

@xtremerui (Contributor)

While @jomsie and I were deploying hush-house/ci, we noticed that `make deploy-ci` was stuck, and `kubectl get pods` showed a worker that had been in the Terminating state for 47 days.
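A quick way to spot pods stuck like this (a minimal sketch; the namespace and pod name are from our deployment):

```bash
# A pod stuck in Terminating keeps "Terminating" in the STATUS column
# of plain `kubectl get pods` output, so a grep is enough to find it:
kubectl get pods -n ci | grep Terminating

# The deletionTimestamp records when deletion was first requested,
# i.e. how long the pod has been stuck:
kubectl get pod ci-worker-0 -n ci -o jsonpath='{.metadata.deletionTimestamp}'
```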

The `kubectl describe` output:

Name:                      ci-worker-0
Namespace:                 ci
Priority:                  0
Node:                      gke-hush-house-ci-workers-79a0ea06-2crm/10.10.0.30
Start Time:                Tue, 24 Mar 2020 19:28:26 -0400
Labels:                    app=ci-worker
                           controller-revision-hash=ci-worker-67c499b88
                           release=ci
                           statefulset.kubernetes.io/pod-name=ci-worker-0
Annotations:               cni.projectcalico.org/podIP: 10.11.7.24/32
                           manual-update-revision: 1
Status:                    Terminating (lasts 9d)
Termination Grace Period:  3600s
IP:                        10.11.7.24
Controlled By:             StatefulSet/ci-worker
Init Containers:
  ci-worker-init-rm:
    Container ID:  docker://3bd89ed4bc3c452c977cae9a879aace70f8c06b9d15b6b19daa7bd734b5140e2
    Image:         concourse/concourse-rc:6.0.0-rc.62
    Image ID:      docker-pullable://concourse/concourse-rc@sha256:dc05e609fdcd4a59b2a34588b899664b5f85e747a8556300f5c48ca7042c7c06
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -ce
      for v in $((btrfs subvolume list --sort=-ogen "/concourse-work-dir" || true) | awk '{print $9}'); do
        (btrfs subvolume show "/concourse-work-dir/$v" && btrfs subvolume delete "/concourse-work-dir/$v") || true
      done
      rm -rf "/concourse-work-dir/*"
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 24 Mar 2020 19:28:58 -0400
      Finished:     Tue, 24 Mar 2020 19:28:58 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /concourse-work-dir from concourse-work-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ci-worker-token-k9sqg (ro)
Containers:
  ci-worker:
    Container ID:  docker://5ca54733ccbf8753e6b3e0581537d92c015d13174df3fbbdadc8e4bfaee9535b
    Image:         concourse/concourse-rc:6.0.0-rc.62
    Image ID:      docker-pullable://concourse/concourse-rc@sha256:dc05e609fdcd4a59b2a34588b899664b5f85e747a8556300f5c48ca7042c7c06
    Port:          8888/TCP
    Host Port:     0/TCP
    Args:
      worker
    State:          Running
      Started:      Tue, 24 Mar 2020 19:28:59 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     7500m
      memory:  14Gi
    Requests:
      cpu:     0
      memory:  0
    Liveness:  http-get http://:worker-hc/ delay=10s timeout=45s period=60s #success=1 #failure=10
    Environment:
      CONCOURSE_REBALANCE_INTERVAL:               2h
      CONCOURSE_SWEEP_INTERVAL:                   30s
      CONCOURSE_CONNECTION_DRAIN_TIMEOUT:         1h
      CONCOURSE_HEALTHCHECK_BIND_IP:              0.0.0.0
      CONCOURSE_HEALTHCHECK_BIND_PORT:            8888
      CONCOURSE_HEALTHCHECK_TIMEOUT:              40s
      CONCOURSE_DEBUG_BIND_IP:                    127.0.0.1
      CONCOURSE_DEBUG_BIND_PORT:                  7776
      CONCOURSE_WORK_DIR:                         /concourse-work-dir
      CONCOURSE_BIND_IP:                          127.0.0.1
      CONCOURSE_BIND_PORT:                        7777
      CONCOURSE_LOG_LEVEL:                        info
      CONCOURSE_TSA_HOST:                         ci-web:2222
      CONCOURSE_TSA_PUBLIC_KEY:                   /concourse-keys/host_key.pub
      CONCOURSE_TSA_WORKER_PRIVATE_KEY:           /concourse-keys/worker_key
      CONCOURSE_GARDEN_BIN:                       gdn
      CONCOURSE_BAGGAGECLAIM_LOG_LEVEL:           info
      CONCOURSE_BAGGAGECLAIM_BIND_IP:             127.0.0.1
      CONCOURSE_BAGGAGECLAIM_BIND_PORT:           7788
      CONCOURSE_BAGGAGECLAIM_DEBUG_BIND_IP:       127.0.0.1
      CONCOURSE_BAGGAGECLAIM_DEBUG_BIND_PORT:     7787
      CONCOURSE_BAGGAGECLAIM_DRIVER:              overlay
      CONCOURSE_BAGGAGECLAIM_BTRFS_BIN:           btrfs
      CONCOURSE_BAGGAGECLAIM_MKFS_BIN:            mkfs.btrfs
      CONCOURSE_VOLUME_SWEEPER_MAX_IN_FLIGHT:     5
      CONCOURSE_CONTAINER_SWEEPER_MAX_IN_FLIGHT:  5
      CONCOURSE_GARDEN_NETWORK_POOL:              10.254.0.0/16
      CONCOURSE_GARDEN_MAX_CONTAINERS:            500
      CONCOURSE_GARDEN_DENY_NETWORK:              169.254.169.254/32
    Mounts:
      /concourse-keys from concourse-keys (ro)
      /concourse-work-dir from concourse-work-dir (rw)
      /pre-stop-hook.sh from pre-stop-hook (rw,path="pre-stop-hook.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from ci-worker-token-k9sqg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  concourse-work-dir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  concourse-work-dir-ci-worker-0
    ReadOnly:   false
  pre-stop-hook:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ci-worker
    Optional:  false
  concourse-keys:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ci-worker
    Optional:    false
  ci-worker-token-k9sqg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ci-worker-token-k9sqg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  cloud.google.com/gke-nodepool=ci-workers
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                 From                                              Message
  ----     ------             ----                ----                                              -------
  Normal   Killing            41m (x236 over 9d)  kubelet, gke-hush-house-ci-workers-79a0ea06-2crm  Stopping container ci-worker
  Warning  FailedKillPod      41m (x235 over 9d)  kubelet, gke-hush-house-ci-workers-79a0ea06-2crm  error killing pod: failed to "KillContainer" for "ci-worker" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
  Warning  FailedPreStopHook  41m (x235 over 9d)  kubelet, gke-hush-house-ci-workers-79a0ea06-2crm  Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "ci-worker" in Pod "ci-worker-0_ci(eda22160-fcb7-4150-96d0-93827749746e)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown\r\n"

It might be worth checking that last error to help improve worker lifecycle management.
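For anyone who wants to dig in on the node itself, a sketch of what could be checked (the container ID is taken from the `kubectl describe` output above; this assumes the node still runs the Docker runtime, as these GKE nodes do):

```bash
# Does docker still consider the container running, and does the PID
# it recorded still exist on the host?
docker inspect --format '{{.State.Status}} pid={{.State.Pid}}' 5ca54733ccbf
ps -p "$(docker inspect --format '{{.State.Pid}}' 5ca54733ccbf)"

# An exec-based preStop hook has to setns(2) into the container's
# namespaces; if the container's process is already gone while docker
# still reports it running, every exec fails much like the
# FailedPreStopHook event above, and the kubelet keeps retrying.
# Removing the dead container by hand should let the kubelet finish:
docker rm -f 5ca54733ccbf
```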

@clarafu (Contributor) commented Sep 21, 2021

Seeing this now too with our hush-house deployment. It has caused an upgrade to 7.4.0 to take 54 days and counting, because the worker upgrade still hasn't finished: it is currently upgrading workers-worker-8 and still needs to do all the workers from 1-7.

[screenshot: worker upgrade progress]

Also seeing the same error:

[screenshot: the same FailedPreStopHook / OCI runtime exec error]
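A possible workaround (not a fix) to unblock a rollout stuck like this is the usual force delete, which lets the StatefulSet controller recreate the pod and move on; note that it skips the preStop hook entirely, so the worker never gets to retire cleanly. The namespace and fly target below are assumptions:

```bash
# --grace-period=0 --force removes the pod object from the API server
# immediately instead of waiting for the kubelet, which is stuck
# retrying the failing preStop hook.
kubectl delete pod workers-worker-8 -n workers --grace-period=0 --force

# The old registration will likely be left stalled on the Concourse
# side, so prune it so the cluster stops scheduling against it:
fly -t ci prune-worker -w workers-worker-8
```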
