You should be familiar with Kubernetes (k8s). We primarily use Service, Deployment, Ingress, and PersistentVolumeClaim objects, along with a few others where needed. Our clusters run with RBAC enabled on Google Kubernetes Engine (GKE).
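With RBAC enabled, it's worth checking what your credentials are allowed to do before debugging further. A minimal sketch (the infra namespace comes from the examples below; the verb and resource are illustrative):

$ kubectl auth can-i list pods -n infra
yes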
Links: infra-oss.moov.io | Google Cloud Status | GKE Dashboard
There are also several community guides for troubleshooting Kubernetes problems.
Useful Tools
kubespy: Tool for observing Kubernetes resources in real time - GitHub
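As a sketch of how it's used (the deployment name below is illustrative), kubespy can trace a rollout as it happens:

$ kubespy trace deployment kube-ingress-index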
To view a pod's logs, first list the pods in the namespace to find its name:

$ kubectl get pods -n infra | grep kube-ingress
kube-ingress-index-5cb86955ff-md64n 1/1 Running 0 18m
kube-ingress-index-5cb86955ff-xdb5m 1/1 Running 0 18m
# --tail only shows the last N log lines
# -f keeps tailing the pod/container stdout
$ kubectl logs -n infra [--tail 10] [-f] kube-ingress-index-5cb86955ff-xdb5m
...
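If the container has crashed and restarted, its current logs may be empty; --previous fetches logs from the prior container instance:

$ kubectl logs -n infra --previous kube-ingress-index-5cb86955ff-xdb5m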
See also: Viewing logs in Kubernetes
Loki is a new log aggregation platform which transforms logs into metric-like streams (with log information as labels). The project is new, but Grafana already supports exploring Loki data, building dashboards, and alerting on it. Check out the explore page showing paygate logs and the basic usage guide.
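If you prefer the command line over Grafana's Explore page, Loki's logcli can run the same label queries. A sketch, assuming the paygate logs carry an app label and that logcli is pointed at our Loki instance (via LOKI_ADDR or --addr):

$ logcli query '{app="paygate"}'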
If you need to restart a Pod/Container, simply list the pods and issue kubectl delete:
$ kubectl get pods -n infra | grep kube-ingress
kube-ingress-index-5cb86955ff-md64n 1/1 Running 0 18m
kube-ingress-index-5cb86955ff-xdb5m 1/1 Running 0 18m
$ kubectl delete pod -n infra kube-ingress-index-5cb86955ff-xdb5m
pod "kube-ingress-index-5cb86955ff-xdb5m" deleted
Currently our Kubernetes cluster runs on preemptible instances, which can be terminated with under 60 seconds of notice. We largely do this for cost savings before having a product, but we will likely run a combination of permanent and preemptible nodes going forward. It's important to keep several guidelines in mind (Source); a GKE sketch combining a few of them follows the list:
- Have a backup plan (permanent node pool)
- Find unpopular instance sizes
- If a new family comes out (e.g. m5), m4 instances might become cheaper and less requested.
- Set a maximum bid price
- Run multi-zone setups to avoid shortages in a single GCP zone
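As a sketch of how the backup-plan and multi-zone points combine on GKE (the cluster, pool name, and zones are illustrative), a preemptible node pool can be added alongside a permanent one and spread across zones:

$ gcloud container node-pools create preemptible-pool \
    --cluster our-cluster \
    --preemptible \
    --num-nodes 1 \
    --node-locations us-central1-a,us-central1-b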
chrisbarrett/kubernetes-el (an Emacs interface to Kubernetes) works with our setup. Talk to @adamdecaf for help.