Gracefully handle Azure Virtual Machines shutdown within Kubernetes
This tool ensures that the Kubernetes cluster responds appropriately to events that can make your Azure Virtual Machines unavailable, such as Azure Spot Virtual Machine evictions or reboots. If these events are not handled, your application code may not stop gracefully, may take longer to recover full availability, or work may accidentally be scheduled onto nodes that are going down. The tool can also send a Telegram or Slack message before an Azure Virtual Machine is evicted.
Based on Azure Scheduled Events and the Kubernetes Safely Drain a Node procedure.
Supports Linux (amd64, arm64) and Windows (amd64) nodes.
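To make the mechanism concrete, here is a rough manual equivalent of what the handler automates: look at the node's Scheduled Events document and, if a disruptive event is pending, drain the node. This is only a sketch, not the tool's actual implementation. The function names are hypothetical, the KUBECTL override exists only to allow a dry run, and the drain flags are one reasonable choice rather than the handler's defaults.

```shell
# Sketch only: hypothetical helpers, not the handler's real code.

# Succeeds if the Scheduled Events JSON on stdin contains an event that
# takes the VM away (Preempt, Reboot, Redeploy or Terminate).
has_disruptive_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"(Preempt|Reboot|Redeploy|Terminate)"'
}

# Drain a node so that pods are evicted gracefully before the VM goes away.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
drain_node() {
  "${KUBECTL:-kubectl}" drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Usage (the metadata endpoint is only reachable from inside the Azure VM):
#   curl -s -H Metadata:true \
#     "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" \
#     | has_disruptive_event && drain_node <node-name>
```

Running these steps by hand is mainly useful for understanding or debugging; the handler exists so that nobody has to react within the short notice period manually.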
Create basic AKS cluster with Azure CLI
# https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-cli
# Azure CLI version is 2.50.0
az --version
# Create resource group
az group create \
--name test-aks-group-eastus \
--location eastus
# Create AKS cluster with regular (non-Spot) instances
az aks create \
--resource-group test-aks-group-eastus \
--name MyManagedCluster \
--node-count 1 \
--node-vm-size Standard_DS2_v2 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
# Create Linux nodepool with Spot Virtual Machines and autoscaling
az aks nodepool add \
--resource-group test-aks-group-eastus \
--cluster-name MyManagedCluster \
--name spotpool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--node-vm-size Standard_DS2_v2 \
--min-count 0 \
--max-count 10
# Create Windows nodepool with Spot Virtual Machines and autoscaling
az aks nodepool add \
--resource-group test-aks-group-eastus \
--cluster-name MyManagedCluster \
--os-type Windows \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--name spot01 \
--min-count 1 \
--max-count 3
# Get config to connect to cluster
az aks get-credentials \
--resource-group test-aks-group-eastus \
--name MyManagedCluster
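Once the credentials are in place, it can help to confirm that the Spot node pools actually produced nodes. The sketch below relies on the kubernetes.azure.com/scalesetpriority=spot label that AKS puts on Spot nodes; the function name is hypothetical and the KUBECTL override is only there for dry runs.

```shell
# Sketch only: list the nodes that are backed by Spot VMs.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
list_spot_nodes() {
  "${KUBECTL:-kubectl}" get nodes -l kubernetes.azure.com/scalesetpriority=spot -o wide
}

# Usage:
#   list_spot_nodes
#
# Clean up the test resources when you are done:
#   az group delete --name test-aks-group-eastus --yes --no-wait
```

An empty result for the Linux pool is expected at first, because it was created with a minimum count of 0 and only scales up when pods are scheduled onto it.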
helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm repo update
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical
You can compose your payload using the template markers that are described here
Send Telegram notification
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--set priorityClassName=system-node-critical \
--set 'args[0]=-telegram.token=<telegram token>' \
--set 'args[1]=-telegram.chatID=<telegram chatid>'
Send Slack notification
# create values file with the webhook settings and the payload template
cat <<EOF | tee values.yaml
priorityClassName: system-node-critical
args:
- -webhook.url=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
- -webhook.template-file=/files/slack-payload.json
- -webhook.contentType=application/json
- -webhook.method=POST
- -webhook.timeout=30s
configMap:
  data:
    slack-payload.json: |
      {
        "channel": "#mychannel",
        "username": "webhookbot",
        "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}",
        "icon_emoji": ":ghost:"
      }
EOF
# install/upgrade helm chart
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values values.yaml
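Before waiting for a real eviction, the payload can be previewed locally. The sketch below imitates the marker substitution with sed; this is a stand-in for whatever template engine the handler uses, the function name is hypothetical, and it assumes the substituted values contain no "/" or "&" characters. SLACK_WEBHOOK_URL is a placeholder variable.

```shell
# Sketch only: substitute the three markers used in the example payload.
# Reads the template on stdin.
# Arguments: node name, instance type, node region.
render_payload() {
  sed -e "s/{{ \.NodeName }}/$1/g" \
      -e "s/{{ \.InstanceType }}/$2/g" \
      -e "s/{{ \.NodeRegion }}/$3/g"
}

# Usage: render a message and post it to the Slack webhook by hand.
#   printf '%s' '{"text": "This is message for {{ .NodeName }}"}' \
#     | render_payload my-node Standard_DS2_v2 eastus \
#     | curl -X POST -H 'Content-Type: application/json' --data @- "$SLACK_WEBHOOK_URL"
```

If the manual post shows up in the channel, the webhook URL and the JSON shape are fine, and any remaining problem is on the handler side.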
Send Prometheus Pushgateway event
cat <<EOF | tee values.yaml
priorityClassName: system-node-critical
args:
- -webhook.url=http://prometheus-pushgateway.prometheus.svc.cluster.local:9091/metrics/job/aks-node-termination-handler
- -webhook.template-file=/files/prometheus-pushgateway-payload.txt
- -webhook.contentType=text/plain
- -webhook.method=POST
- -webhook.timeout=30s
configMap:
  data:
    prometheus-pushgateway-payload.txt: |
      node_termination_event{node="{{ .NodeName }}"} 1
EOF
# install/upgrade helm chart
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values values.yaml
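To check that the Pushgateway accepts this payload format, the same line can be pushed by hand. This is a sketch with a hypothetical helper name; the URL is the one from the values file above and is only resolvable from inside the cluster.

```shell
# Sketch only: print the payload line for a given node name,
# in the same shape as the template in values.yaml.
pushgateway_payload() {
  printf 'node_termination_event{node="%s"} 1\n' "$1"
}

# Usage (run from a pod inside the cluster):
#   pushgateway_payload test-node | curl --data-binary @- \
#     http://prometheus-pushgateway.prometheus.svc.cluster.local:9091/metrics/job/aks-node-termination-handler
```

The trailing newline matters: the Pushgateway expects the text exposition format, in which every sample line ends with a line feed.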
You can test the handler with the Simulate Eviction API. Change the API endpoint so that it targets the virtualMachineScaleSets resources that are used in AKS:
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}/virtualMachines/{instanceId}/simulateEviction?api-version=2021-11-01
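One possible way to issue this request is through the Azure CLI. The sketch below builds the URL from its four parameters and shows, in comments, how the values might be looked up; the function name is hypothetical, and it assumes the scale sets live in the cluster's node resource group.

```shell
# Sketch only: build the Simulate Eviction URL.
# Arguments: subscription ID, resource group, scale set name, instance ID.
simulate_eviction_url() {
  printf 'https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachineScaleSets/%s/virtualMachines/%s/simulateEviction?api-version=2021-11-01\n' \
    "$1" "$2" "$3" "$4"
}

# Usage:
#   SUB=$(az account show --query id --output tsv)
#   NODE_RG=$(az aks show --resource-group test-aks-group-eastus \
#     --name MyManagedCluster --query nodeResourceGroup --output tsv)
#   az vmss list --resource-group "$NODE_RG" --query '[].name' --output tsv
#   az rest --method post \
#     --url "$(simulate_eviction_url "$SUB" "$NODE_RG" <vmScaleSetName> <instanceId>)"
```

After the request is accepted, the targeted Spot node should receive a Preempt event, which makes it a convenient end-to-end check for the notification settings.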
The application exposes Prometheus metrics at the /metrics endpoint. Installing the latest chart adds the following annotations to the pods:
annotations:
  prometheus.io/port: "17923"
  prometheus.io/scrape: "true"
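The metrics can also be inspected without a Prometheus server. The sketch below assumes the port in the annotation is the port the pod listens on; the function name is hypothetical, and the CURL override is only there for dry runs.

```shell
# Sketch only: fetch the metrics page from a base URL.
# CURL can be overridden (for example CURL=echo) for a dry run.
scrape_metrics() {
  "${CURL:-curl}" -fsS "$1/metrics"
}

# Usage: forward the pod port locally, then scrape it.
#   kubectl --namespace kube-system port-forward pod/<handler-pod-name> 17923:17923 &
#   scrape_metrics http://localhost:17923
```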