When @jomsie and I were deploying hush-house/ci, we noticed that make deploy-ci was stuck, and kubectl get pods showed a worker that had been in the Terminating state for 47 days.
The kubectl describe output:
Name: ci-worker-0
Namespace: ci
Priority: 0
Node: gke-hush-house-ci-workers-79a0ea06-2crm/10.10.0.30
Start Time: Tue, 24 Mar 2020 19:28:26 -0400
Labels: app=ci-worker
controller-revision-hash=ci-worker-67c499b88
release=ci
statefulset.kubernetes.io/pod-name=ci-worker-0
Annotations: cni.projectcalico.org/podIP: 10.11.7.24/32
manual-update-revision: 1
Status: Terminating (lasts 9d)
Termination Grace Period: 3600s
IP: 10.11.7.24
Controlled By: StatefulSet/ci-worker
Init Containers:
ci-worker-init-rm:
Container ID: docker://3bd89ed4bc3c452c977cae9a879aace70f8c06b9d15b6b19daa7bd734b5140e2
Image: concourse/concourse-rc:6.0.0-rc.62
Image ID: docker-pullable://concourse/concourse-rc@sha256:dc05e609fdcd4a59b2a34588b899664b5f85e747a8556300f5c48ca7042c7c06
Port: <none>
Host Port: <none>
Command:
/bin/bash
Args:
-ce
for v in $((btrfs subvolume list --sort=-ogen "/concourse-work-dir" || true) | awk '{print $9}'); do
(btrfs subvolume show "/concourse-work-dir/$v" && btrfs subvolume delete "/concourse-work-dir/$v") || true
done
rm -rf "/concourse-work-dir/*"
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 24 Mar 2020 19:28:58 -0400
Finished: Tue, 24 Mar 2020 19:28:58 -0400
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/concourse-work-dir from concourse-work-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from ci-worker-token-k9sqg (ro)
Containers:
ci-worker:
Container ID: docker://5ca54733ccbf8753e6b3e0581537d92c015d13174df3fbbdadc8e4bfaee9535b
Image: concourse/concourse-rc:6.0.0-rc.62
Image ID: docker-pullable://concourse/concourse-rc@sha256:dc05e609fdcd4a59b2a34588b899664b5f85e747a8556300f5c48ca7042c7c06
Port: 8888/TCP
Host Port: 0/TCP
Args:
worker
State: Running
Started: Tue, 24 Mar 2020 19:28:59 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 7500m
memory: 14Gi
Requests:
cpu: 0
memory: 0
Liveness: http-get http://:worker-hc/ delay=10s timeout=45s period=60s #success=1 #failure=10
Environment:
CONCOURSE_REBALANCE_INTERVAL: 2h
CONCOURSE_SWEEP_INTERVAL: 30s
CONCOURSE_CONNECTION_DRAIN_TIMEOUT: 1h
CONCOURSE_HEALTHCHECK_BIND_IP: 0.0.0.0
CONCOURSE_HEALTHCHECK_BIND_PORT: 8888
CONCOURSE_HEALTHCHECK_TIMEOUT: 40s
CONCOURSE_DEBUG_BIND_IP: 127.0.0.1
CONCOURSE_DEBUG_BIND_PORT: 7776
CONCOURSE_WORK_DIR: /concourse-work-dir
CONCOURSE_BIND_IP: 127.0.0.1
CONCOURSE_BIND_PORT: 7777
CONCOURSE_LOG_LEVEL: info
CONCOURSE_TSA_HOST: ci-web:2222
CONCOURSE_TSA_PUBLIC_KEY: /concourse-keys/host_key.pub
CONCOURSE_TSA_WORKER_PRIVATE_KEY: /concourse-keys/worker_key
CONCOURSE_GARDEN_BIN: gdn
CONCOURSE_BAGGAGECLAIM_LOG_LEVEL: info
CONCOURSE_BAGGAGECLAIM_BIND_IP: 127.0.0.1
CONCOURSE_BAGGAGECLAIM_BIND_PORT: 7788
CONCOURSE_BAGGAGECLAIM_DEBUG_BIND_IP: 127.0.0.1
CONCOURSE_BAGGAGECLAIM_DEBUG_BIND_PORT: 7787
CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
CONCOURSE_BAGGAGECLAIM_BTRFS_BIN: btrfs
CONCOURSE_BAGGAGECLAIM_MKFS_BIN: mkfs.btrfs
CONCOURSE_VOLUME_SWEEPER_MAX_IN_FLIGHT: 5
CONCOURSE_CONTAINER_SWEEPER_MAX_IN_FLIGHT: 5
CONCOURSE_GARDEN_NETWORK_POOL: 10.254.0.0/16
CONCOURSE_GARDEN_MAX_CONTAINERS: 500
CONCOURSE_GARDEN_DENY_NETWORK: 169.254.169.254/32
Mounts:
/concourse-keys from concourse-keys (ro)
/concourse-work-dir from concourse-work-dir (rw)
/pre-stop-hook.sh from pre-stop-hook (rw,path="pre-stop-hook.sh")
/var/run/secrets/kubernetes.io/serviceaccount from ci-worker-token-k9sqg (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
concourse-work-dir:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: concourse-work-dir-ci-worker-0
ReadOnly: false
pre-stop-hook:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: ci-worker
Optional: false
concourse-keys:
Type: Secret (a volume populated by a Secret)
SecretName: ci-worker
Optional: false
ci-worker-token-k9sqg:
Type: Secret (a volume populated by a Secret)
SecretName: ci-worker-token-k9sqg
Optional: false
QoS Class: Burstable
Node-Selectors: cloud.google.com/gke-nodepool=ci-workers
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 41m (x236 over 9d) kubelet, gke-hush-house-ci-workers-79a0ea06-2crm Stopping container ci-worker
Warning FailedKillPod 41m (x235 over 9d) kubelet, gke-hush-house-ci-workers-79a0ea06-2crm error killing pod: failed to "KillContainer" for "ci-worker" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Warning FailedPreStopHook 41m (x235 over 9d) kubelet, gke-hush-house-ci-workers-79a0ea06-2crm Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "ci-worker" in Pod "ci-worker-0_ci(eda22160-fcb7-4150-96d0-93827749746e)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown\r\n"
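For context: "exited with 126" together with "executing setns process caused \"exit status 1\"" usually means the exec could not join the container's namespaces, typically because the container's init process is already gone even though the runtime still reports the container. A rough node-level check (assuming the Docker runtime implied by the container IDs above, run on gke-hush-house-ci-workers-79a0ea06-2crm):

# Does Docker still list the worker container, and does it have a live PID?
docker ps -a --filter name=ci-worker
docker inspect -f '{{.State.Status}} {{.State.Pid}}' 5ca54733ccbf
# If the status is still "running" but that PID no longer exists on the host,
# every exec-based hook (including /pre-stop-hook.sh) will keep failing like this.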
It might be worth checking that final error to help improve worker lifecycle management.
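A possible workaround (a sketch only, not something verified on this cluster) is to force-remove the stuck pod object so the StatefulSet controller can recreate ci-worker-0, and then prune the stale worker from Concourse (worker name assumed to match the pod name; replace the fly target with your own):

# Skip the preStop hook and grace period and delete the pod object immediately
kubectl delete pod ci-worker-0 -n ci --grace-period=0 --force
# Remove the stalled worker registration so the web node stops waiting on it
fly -t <target> prune-worker -w ci-worker-0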
Seeing this now too with our hush-house deployment. It has caused an upgrade to 7.4.0 to take 54 days and counting, because the worker upgrade is still not finished. It is currently upgrading workers-worker-8 and still needs to do workers 1 through 7.
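For anyone else watching a rollout crawl like this, progress can be checked with something like the following (StatefulSet name assumed from the pod names above; substitute your own namespace):

# Shows which ordinal the rolling update is currently blocked on
kubectl rollout status statefulset/workers-worker -n <namespace>
# Lists any worker pods stuck in Terminating and holding the update back
kubectl get pods -n <namespace> | grep Terminating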