DOC-875 Add section on disabling automatic node maintenance and OS upgrades (#941)

Co-authored-by: Joyce Fee <[email protected]>
redpanda-data · Jan 17, 2025 · b264be4
1 parent 97a0e3a

Showing 2 changed files with 49 additions and 15 deletions.
diff --git a/...eploy/pages/deployment-option/self-hosted/kubernetes/k-deployment-overview.adoc b/...eploy/pages/deployment-option/self-hosted/kubernetes/k-deployment-overview.adoc
@@ -96,6 +96,8 @@ Managed Kubernetes services, such as Google Kubernetes Engine (GKE) and Amazon E
 
 You remain responsible for deploying and maintaining Redpanda instances on worker nodes.
 
+IMPORTANT: Deploy Kubernetes clusters with *unmanaged (manual) node updates*. Managed (automatic) updates during cluster deployment can lead to service downtime, data loss, or quorum instability. Transitioning from managed updates to unmanaged updates after deployment may require downtime. To avoid these disruptions, plan for unmanaged node updates from the start. See xref:deploy:deployment-option/self-hosted/kubernetes/k-requirements.adoc#node-updates[Kubernetes Cluster Requirements and Recommendations].
+
 === Bare-metal Kubernetes environments
 
 Bare-metal Kubernetes environments give you complete control over both the control plane and the worker nodes, which can be advantageous when you want the following:
@@ -113,14 +115,15 @@ This documentation follows conventions to help users easily identify Kubernetes
 
 == Next steps
 
-Whether you're deploying locally or in the cloud, choose one of the following guides to get started:
+- Get started
+** xref:./local-guide.adoc[Local Deployment Guide] (kind and minikube)
+** xref:./aks-guide.adoc[Azure Kubernetes Service Guide] (AKS)
+** xref:./eks-guide.adoc[Elastic Kubernetes Service Guide] (EKS)
+** xref:./gke-guide.adoc[Google Kubernetes Engine Guide] (GKE)
 
-* xref:./local-guide.adoc[Local Deployment Guide] (kind and minikube)
-* xref:./aks-guide.adoc[Azure Kubernetes Service Guide] (AKS)
-* xref:./eks-guide.adoc[Elastic Kubernetes Service Guide] (EKS)
-* xref:./gke-guide.adoc[Google Kubernetes Engine Guide] (GKE)
+- xref:deploy:deployment-option/self-hosted/kubernetes/k-requirements.adoc[Kubernetes Cluster Requirements and Recommendations]
 
-Or, explore our xref:./k-production-workflow.adoc[production workflow] to learn about requirements and best practices.
+- xref:./k-production-workflow.adoc[Production deployment workflow]
 
 include::shared:partial$suggested-reading.adoc[]
 

diff --git a/modules/deploy/partials/requirements.adoc b/modules/deploy/partials/requirements.adoc
@@ -31,17 +31,17 @@ https://helm.sh/docs/intro/install/[Install Helm^].
 endif::[]
 
 [[number-of-workers]]
-== Number of {node}s
+== Number of nodes
 
 Provision one physical node or virtual machine (VM) for each Redpanda broker that you plan to deploy in your Redpanda cluster.
-Each Redpanda broker requires its own dedicated {node} for the following reasons:
+Each Redpanda broker requires its own dedicated node for the following reasons:
 
-- *Resource isolation*: Redpanda brokers are designed to make full use of available system resources, including CPU and memory. By dedicating a {node} to each broker, you ensure that these resources aren't shared with other applications or processes, avoiding potential performance bottlenecks or contention.
-- *External networking*: External clients should connect directly to the broker that owns the partition they're interested in. This means that each broker must be individually addressable. As clients must connect to the specific broker that is the leader of the partition, they need a mechanism to directly address each broker in the cluster. Assigning each broker to its own dedicated {node} makes this direct addressing feasible, since each {node} will have a unique address. See <<External networking>>.
+- *Resource isolation*: Redpanda brokers are designed to make full use of available system resources, including CPU and memory. By dedicating a node to each broker, you ensure that these resources aren't shared with other applications or processes, avoiding potential performance bottlenecks or contention.
+- *External networking*: External clients should connect directly to the broker that owns the partition they're interested in. This means that each broker must be individually addressable. As clients must connect to the specific broker that is the leader of the partition, they need a mechanism to directly address each broker in the cluster. Assigning each broker to its own dedicated node makes this direct addressing feasible, since each node will have a unique address. See <<External networking>>.
 - *Fault tolerance*: Ensuring each broker operates on a separate node enhances fault tolerance. If one node experiences issues, it won't directly impact the other brokers.
 
 ifdef::env-kubernetes[]
-NOTE: The Redpanda Helm chart configures xref:reference:k-redpanda-helm-spec.adoc#statefulset-podantiaffinity[`podAntiAffinity` rules] to make sure that each Redpanda broker runs on its own {node}.
+NOTE: The Redpanda Helm chart configures xref:reference:k-redpanda-helm-spec.adoc#statefulset-podantiaffinity[`podAntiAffinity` rules] to make sure that each Redpanda broker runs on its own node.
 
 
 *Recommendations*: xref:./kubernetes-deploy.adoc#pod-replicas[Deploy at least three Pod replicas].
@@ -51,11 +51,42 @@ ifndef::env-kubernetes[]
 *Recommendations*: Deploy at least three Redpanda brokers.
 endif::[]
 
+[[node-updates]]
+== Node maintenance and operating system upgrades
+
+Ensure that node and operating system (OS) upgrades are manually managed when running Redpanda in production. Manual control avoids unplanned reboots or replacements that disrupt Redpanda brokers, causing service downtime, data loss, or quorum instability.
+
+=== Limitations of automatic updates
+
+Redpanda is stateful. Redpanda brokers manage partition data and leadership, making them sensitive to disruptions. Proper handling during maintenance is required to:
+
+- Avoid data loss, especially for nodes with ephemeral or local storage.
+- Ensure smooth leadership transitions by decommissioning brokers before removing a node.
+- Minimize service downtime by upgrading nodes one at a time during planned maintenance windows.
+
+However, automatic update mechanisms provided by cloud platforms may not meet Redpanda's stateful requirements. Common issues include:
+
+- Hard timeouts for graceful shutdowns that may not allow Redpanda brokers enough time to complete decommissioning or leadership transitions.
+- Replacements or reboots without ensuring data has been safely migrated or replicated, risking data loss.
+- Parallel upgrades across multiple nodes, which can disrupt quorum or reduce cluster availability.
+
+*Recommendations*:
+
+- Disable automatic node maintenance or upgrades.
+ifdef::env-kubernetes[]
+To prevent managed Kubernetes services from automatically rebooting or upgrading nodes:
+** **Azure AKS**: Set the OS upgrade channel to `None`. https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-node-os-image[Azure Documentation^].
+** **Google GKE**: Disable GKE auto-upgrades for node pools. https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-upgrades[GCP Documentation^].
+** **Amazon EKS**: Avoid enabling EKS node auto-upgrades. https://docs.aws.amazon.com/eks/latest/userguide/worker.html[AWS Documentation^].
+- xref:upgrade:k-upgrade-kubernetes.adoc[Manually manage node upgrades].
+endif::[]
+
+
 == CPU and memory
 
 *Requirements*:
 
-- Two physical, not virtual, cores for each {node}.
+- Two physical, not virtual, cores for each node.
 
 - x86_64 (Westmere or newer) and AWS Graviton family processors are supported.
 
@@ -65,7 +96,7 @@ endif::[]
 
 *Recommendations*:
 
-- Four physical cores for each {node} are strongly recommended.
+- Four physical cores for each node are strongly recommended.
 
 ifdef::env-kubernetes[]
 - xref:./kubernetes-deploy.adoc#resources[Set resource requests and limits for memory and CPU].
@@ -106,7 +137,7 @@ endif::[]
 
 == External networking
 
-- For external access, each {node} in your cluster must have a static, externally accessible IP address.
+- For external access, each node in your cluster must have a static, externally accessible IP address.
 
 - Minimum 10 GigE (10 Gigabit Ethernet) connection to ensure:
 
@@ -120,7 +151,7 @@ endif::[]
 
 == Tuning
 
-Before deploying Redpanda to production, each {node} that runs Redpanda must be tuned to optimize the Linux kernel for Redpanda processes.
+Before deploying Redpanda to production, each node that runs Redpanda must be tuned to optimize the Linux kernel for Redpanda processes.
 
 ifdef::env-kubernetes[]
 See xref:deploy:deployment-option/self-hosted/kubernetes/k-tune-workers.adoc[].
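
Reviewer note on the new "Node maintenance and operating system upgrades" section: the warning that parallel upgrades "can disrupt quorum" can be illustrated with a little arithmetic. The following is a minimal sketch, not part of this commit, and it assumes a majority-based (Raft-style) quorum with each replica of a partition placed on a different node. The function names and the replica counts of 3 and 5 are hypothetical values chosen for illustration.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def max_unavailable(replicas: int) -> int:
    """How many replicas can be offline while a majority remains."""
    return replicas - majority(replicas)


def keeps_quorum(replicas: int, nodes_down: int) -> bool:
    """True if a majority of replicas is still online.

    Assumes every replica lives on its own node, so taking down
    `nodes_down` nodes removes at most that many replicas.
    """
    return replicas - nodes_down >= majority(replicas)


if __name__ == "__main__":
    # Hypothetical replica counts, used only to show the pattern.
    for replicas in (3, 5):
        for nodes_down in (1, 2, 3):
            status = "ok" if keeps_quorum(replicas, nodes_down) else "quorum lost"
            print(f"replicas={replicas} nodes_down={nodes_down}: {status}")
```

Under these assumptions, a partition with three replicas tolerates one node being offline but not two, which is consistent with the section's advice to upgrade nodes one at a time rather than letting a cloud platform upgrade several in parallel.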