[eks/cluster] Document Migration to Managed Node Group. Bugfix. (#910)
Nuru authored Nov 21, 2023
1 parent 6a05227 commit 38fc86b
Showing 2 changed files with 166 additions and 2 deletions.
166 changes: 165 additions & 1 deletion modules/eks/cluster/CHANGELOG.md
@@ -1,4 +1,168 @@
## Components PR [#910](https://github.com/cloudposse/terraform-aws-components/pull/910)

Bug fix and updates to Changelog, no action required.

Fixed: Error about managed node group ARNs list being null, which could happen
when adding a managed node group to an existing cluster that never had one.

## Upgrading to `v1.303.0`

Components PR [#852](https://github.com/cloudposse/terraform-aws-components/pull/852)

This is a bug fix and feature enhancement update. No action is necessary to upgrade.
However, with the new features and new recommendations, you may want to change
your configuration.

## Recommended (optional) changes

Previously, we recommended deploying Karpenter to Fargate and not provisioning
any nodes. However, this causes issues with add-ons that require compute power
to fully initialize, such as `coredns`, and it can reduce the cluster to a
single node, removing the high availability that comes from having a node
per Availability Zone and replicas of pods spread across those nodes.

As a result, we now recommend deploying a minimal node group with a single
instance (currently recommended to be a `c6a.large`) in each of 3 Availability
Zones. This will provide the compute power needed to initialize add-ons, and
will provide high availability for the cluster. As a bonus, it will also
remove the need to deploy Karpenter to Fargate.

**NOTE about instance type**: The `c6a.large` instance type is relatively
new. If you have deployed an old version of our ServiceControlPolicy
`DenyEC2NonNitroInstances`, `DenyNonNitroInstances` (obsolete, replaced by
`DenyEC2NonNitroInstances`), and/or `DenyEC2InstancesWithoutEncryptionInTransit`,
you will want to update them to v0.12.0 or choose a different instance type.

### Migration procedure

To perform the recommended migration, follow these steps:

#### 1. Deploy a minimal node group, move addons to it

Change your `eks/cluster` configuration to set `deploy_addons_to_fargate: false`.

Add the following to your `eks/cluster` configuration, but
copy the block device name, volume size, and volume type from your existing
Karpenter provisioner configuration. Also select the correct `ami_type`
according to the `ami_family` in your Karpenter provisioner configuration.

```yaml
node_groups:
  # will create 1 node group for each item in map
  # Provision a minimal static node group for add-ons and redundant replicas
  main:
    # EKS AMI version to use, e.g. "1.16.13-20200821" (no "v").
    ami_release_version: null
    # Type of Amazon Machine Image (AMI) associated with the EKS Node Group
    # Typically AL2_x86_64 or BOTTLEROCKET_x86_64
    ami_type: BOTTLEROCKET_x86_64
    # Additional name attributes (e.g. `1`) for the node group
    attributes: []
    # will create 1 auto scaling group in each specified availability zone
    # or all AZs with subnets if none are specified anywhere
    availability_zones: null
    # Whether to enable Node Group to scale its AutoScaling Group
    cluster_autoscaler_enabled: false
    # True (recommended) to create new node_groups before deleting old ones, avoiding a temporary outage
    create_before_destroy: true
    # Configure storage for the root block device for instances in the Auto Scaling Group
    # For Bottlerocket, use /dev/xvdb. For all others, use /dev/xvda.
    block_device_map:
      "/dev/xvdb":
        ebs:
          volume_size: 125 # in GiB
          volume_type: gp3
          encrypted: true
          delete_on_termination: true
    # Set of instance types associated with the EKS Node Group. Terraform will only perform drift detection if a configuration value is provided.
    instance_types:
      - c6a.large
    # Desired number of worker nodes when initially provisioned
    desired_group_size: 3
    max_group_size: 3
    min_group_size: 3
    resources_to_tag:
      - instance
      - volume
    tags: null
```
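If your Karpenter provisioner instead uses `ami_family: AL2`, the corresponding keys would change roughly as follows. This is a sketch only; the volume size shown is a placeholder, so copy the real values from your provisioner configuration:

```yaml
main:
  # AL2-family AMIs use /dev/xvda as the root device (see the note above)
  ami_type: AL2_x86_64
  block_device_map:
    "/dev/xvda":
      ebs:
        volume_size: 100 # placeholder; copy from your Karpenter provisioner
        volume_type: gp3
        encrypted: true
        delete_on_termination: true
```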
You do not need to apply the above changes yet, although you can if you
want to. To reduce overhead, you can apply them together with the changes
in the next step.

#### 2. Move Karpenter to the node group, remove legacy support

Delete the `fargate_profiles` section from your `eks/cluster` configuration,
or at least remove the `karpenter` profile from it. Disable legacy support
by adding:

```yaml
legacy_fargate_1_role_per_profile_enabled: false
```
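For reference, the block to delete typically looks something like the following. The exact shape here is illustrative and may differ from your configuration:

```yaml
# Delete this from your eks/cluster configuration
# (or at least the `karpenter` entry within it)
fargate_profiles:
  karpenter:
    kubernetes_namespace: karpenter
    kubernetes_labels: null
```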

#### 2.a Optional: Move Karpenter instance profile to `eks/cluster` component

If you have the patience to manually import and remove a Terraform
resource, you should move the Karpenter instance profile to the `eks/cluster`
component. This fixes an issue where the Karpenter instance profile
could be broken by certain sequences of Terraform operations.
However, if you have multiple clusters to migrate, this can be tedious,
and the issue is not a serious one, so you may want to skip this step.

To do this, add the following to your `eks/cluster` configuration:

```yaml
legacy_do_not_create_karpenter_instance_profile: false
```


**BEFORE APPLYING CHANGES**:
Run `atmos terraform plan` (with the appropriate arguments) to see the changes
that will be made. Among the resources to be created will be
`aws_iam_instance_profile.default[0]`. Using the same arguments as before, run
`atmos`, but replace `plan` with `import 'aws_iam_instance_profile.default[0]' <profile-name>`,
where `<profile-name>` is the name of the profile the plan indicated it would create.
It will be something like `<cluster-name>-karpenter`.
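Put together, the sequence looks roughly like this. The stack, cluster, and profile names below are made up for illustration; substitute your own, and take the profile name from your actual plan output:

```shell
# Review the plan and note the name of the instance profile
# that would be created as aws_iam_instance_profile.default[0]
atmos terraform plan eks/cluster -s plat-use1-prod

# Import the existing profile instead of letting Terraform create it
# (profile name copied from the plan output, e.g. <cluster-name>-karpenter)
atmos terraform import eks/cluster 'aws_iam_instance_profile.default[0]' \
  'acme-plat-use1-prod-karpenter' -s plat-use1-prod
```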

**NOTE**: If you perform this step, you must also perform 3.a below.

#### 2.b Apply the changes

Apply the changes with `atmos terraform apply`.

#### 3. Upgrade Karpenter

Upgrade the `eks/karpenter` component to the latest version. Follow the upgrade
instructions to enable the new `karpenter-crd` chart by setting `crd_chart_enabled: true`.

Upgrade to at least Karpenter v0.30.0, which is the first version that
factors the existing node group into its decision about how many nodes to
provision. This prevents Karpenter from provisioning nodes that are not
needed because the existing node group already has enough capacity. Be
careful about upgrading to v0.32.0 or later, as that version introduces
significant breaking changes. As a first step, we recommend upgrading to
v0.31.2 or a later v0.31.x release, but not to v0.32.0 or later; this
provides a safe (revertible) path to a later upgrade to v0.32.0.
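In stack configuration terms, this step might look something like the following sketch. The `chart_version` value and variable names here are assumptions based on this guide; check the `eks/karpenter` component's own upgrade instructions for the authoritative settings:

```yaml
components:
  terraform:
    eks/karpenter:
      vars:
        # Enable the new karpenter-crd chart per the upgrade instructions
        crd_chart_enabled: true
        # Stay on v0.31.x first for a revertible upgrade path
        chart_version: "v0.31.2"
```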

#### 3.a Finish Move of Karpenter instance profile to `eks/cluster` component

If you performed step 2.a above, you must also perform this step. If you did
not perform step 2.a, you must NOT perform this step.

In the `eks/karpenter` stack, set `legacy_create_karpenter_instance_profile: false`.

**BEFORE APPLYING CHANGES**: Remove the Karpenter instance profile from the Terraform state, since
it is now managed by the `eks/cluster` component, or else Terraform will delete it.

```shell
atmos terraform state eks/karpenter rm 'aws_iam_instance_profile.default[0]' -s=<stack-name>
```

#### 3.b Apply the changes

Apply the changes with `atmos terraform apply`.

## Changes included in `v1.303.0`

This is a bug fix and feature enhancement update. No action is necessary to upgrade.

2 changes: 1 addition & 1 deletion modules/eks/cluster/main.tf
@@ -41,7 +41,7 @@ locals {
)

# Existing managed worker role ARNs
-  managed_worker_role_arns = local.eks_outputs.eks_managed_node_workers_role_arns
+  managed_worker_role_arns = coalesce(local.eks_outputs.eks_managed_node_workers_role_arns, [])

# If Karpenter IAM role is enabled, add it to the `aws-auth` ConfigMap to allow the nodes launched by Karpenter to join the EKS cluster
karpenter_role_arn = one(aws_iam_role.karpenter[*].arn)