You should be familiar with Kubernetes (k8s). We primarily use Service, Deployment, Ingress, and PersistentVolumeClaim objects, along with a few others where needed. Our clusters run with RBAC enabled on Google Kubernetes Engine (GKE).
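With RBAC enabled, it's worth checking what your credentials are allowed to do before debugging further. A minimal sketch (the infra namespace comes from the examples below; the verb and resource are illustrative):

$ kubectl auth can-i list pods -n infra
yes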
Links: infra-oss.moov.io | Google Cloud Status | GKE Dashboard
There are also several community guides for troubleshooting Kubernetes problems.
Useful Tools
kubespy: Tool for observing Kubernetes resources in real time - GitHub
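As a sketch of how it's used (the deployment name below is illustrative), kubespy can trace a rollout as it happens:

$ kubespy trace deployment kube-ingress-index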
To view a pod's logs, first list the pods in the namespace to find its name:

$ kubectl get pods -n infra | grep kube-ingress
kube-ingress-index-5cb86955ff-md64n 1/1 Running 0 18m
kube-ingress-index-5cb86955ff-xdb5m 1/1 Running 0 18m
# --tail only shows the last N log lines
# -f keeps tailing the pod/container stdout
$ kubectl logs -n infra [--tail 10] [-f] kube-ingress-index-5cb86955ff-xdb5m
...
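If the container has crashed and restarted, its current logs may be empty; --previous fetches logs from the prior container instance:

$ kubectl logs -n infra --previous kube-ingress-index-5cb86955ff-xdb5m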
See also: Viewing logs in Kubernetes
Loki is a new log aggregation platform which transforms logs into metric-like streams (with log information as labels). The project is new, but Grafana already supports exploring Loki data, building dashboards, and alerting on it. Check out the explore page showing paygate logs and the basic usage guide.
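If you prefer the command line over Grafana's Explore page, Loki's logcli can run the same label queries. A sketch, assuming the paygate logs carry an app label and that logcli is pointed at our Loki instance (via LOKI_ADDR or --addr):

$ logcli query '{app="paygate"}'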
If you need to restart a Pod/Container, simply list the pods and issue kubectl delete:
$ kubectl get pods -n infra | grep kube-ingress
kube-ingress-index-5cb86955ff-md64n 1/1 Running 0 18m
kube-ingress-index-5cb86955ff-xdb5m 1/1 Running 0 18m
$ kubectl delete pod -n infra kube-ingress-index-5cb86955ff-xdb5m
pod "kube-ingress-index-5cb86955ff-xdb5m" deleted
Currently our Kubernetes cluster runs on preemptible instances, which can be terminated with under 60 seconds of notice. We largely do this for cost savings before having a product, but we will likely run a combination of permanent and preemptible nodes going forward. It's important to keep several guidelines in mind (Source); a GKE sketch combining a few of them follows the list:
- Have a backup plan (permanent node pool)
- Find unpopular instance sizes
- If a new family comes out (e.g. m5), m4 instances might become cheaper and less requested.
- Set a maximum bid price
- Run multi-zone setups to avoid shortages in a single GCP zone
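As a sketch of how the backup-plan and multi-zone points combine on GKE (the cluster, pool name, and zones are illustrative), a preemptible node pool can be added alongside a permanent one and spread across zones:

$ gcloud container node-pools create preemptible-pool \
    --cluster our-cluster \
    --preemptible \
    --num-nodes 1 \
    --node-locations us-central1-a,us-central1-b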
chrisbarrett/kubernetes-el (an Emacs interface to Kubernetes) works with our setup. Talk to @adamdecaf for help.