From 38fc86b6ce60d78d7494aa7d7c5cca64f10437da Mon Sep 17 00:00:00 2001
From: Nuru
Date: Tue, 21 Nov 2023 08:20:24 -0800
Subject: [PATCH] [eks/cluster] Document Migration to Managed Node Group. Bugfix. (#910)

---
 modules/eks/cluster/CHANGELOG.md | 166 ++++++++++++++++++++++++++++++-
 modules/eks/cluster/main.tf      |   2 +-
 2 files changed, 166 insertions(+), 2 deletions(-)

diff --git a/modules/eks/cluster/CHANGELOG.md b/modules/eks/cluster/CHANGELOG.md
index 351f53e73..50beb2380 100644
--- a/modules/eks/cluster/CHANGELOG.md
+++ b/modules/eks/cluster/CHANGELOG.md
@@ -1,4 +1,168 @@
-## Components PR [#852](https://github.com/cloudposse/terraform-aws-components/pull/852)
+## Components PR [#910](https://github.com/cloudposse/terraform-aws-components/pull/910)

Bug fix and Changelog updates; no action required.

Fixed: an error about the managed node group ARNs list being null, which could
happen when adding a managed node group to an existing cluster that never had
one.

## Upgrading to `v1.303.0`

Components PR [#852](https://github.com/cloudposse/terraform-aws-components/pull/852)

This is a bug fix and feature enhancement update. No action is necessary to
upgrade. However, with the new features and new recommendations, you may want
to change your configuration.

## Recommended (optional) changes

Previously, we recommended deploying Karpenter to Fargate and not provisioning
any nodes. However, this causes issues with add-ons that require compute power
to fully initialize, such as `coredns`, and it can reduce the cluster to a
single node, removing the high availability that comes from having a node per
Availability Zone and replicas of pods spread across those nodes.

As a result, we now recommend deploying a minimal node group with a single
instance (currently recommended to be a `c6a.large`) in each of 3 Availability
Zones. This provides the compute power needed to initialize add-ons, and
provides high availability for the cluster. As a bonus, it also removes the
need to deploy Karpenter to Fargate.

**NOTE about instance type**: The `c6a.large` instance type is relatively new.
If you have deployed an old version of our ServiceControlPolicies
`DenyEC2NonNitroInstances`, `DenyNonNitroInstances` (obsolete, replaced by
`DenyEC2NonNitroInstances`), and/or `DenyEC2InstancesWithoutEncryptionInTransit`,
you will want to update them to v0.12.0 or choose a different instance type.

### Migration procedure

To perform the recommended migration, follow these steps:

#### 1. Deploy a minimal node group, move add-ons to it

Change your `eks/cluster` configuration to set `deploy_addons_to_fargate: false`.

Add the following to your `eks/cluster` configuration, but copy the block
device name, volume size, and volume type from your existing Karpenter
provisioner configuration. Also select the correct `ami_type` according to the
`ami_family` in your Karpenter provisioner configuration; for reference, a
hypothetical provisioner snippet follows the block below.

```yaml
  node_groups:
    # will create 1 node group for each item in map
    # Provision a minimal static node group for add-ons and redundant replicas
    main:
      # EKS AMI version to use, e.g. "1.16.13-20200821" (no "v").
      ami_release_version: null
      # Type of Amazon Machine Image (AMI) associated with the EKS Node Group
      # Typically AL2_x86_64 or BOTTLEROCKET_x86_64
      ami_type: BOTTLEROCKET_x86_64
      # Additional name attributes (e.g. `1`) for the node group
      attributes: []
      # will create 1 auto scaling group in each specified availability zone
      # or all AZs with subnets if none are specified anywhere
      availability_zones: null
      # Whether to enable Node Group to scale its AutoScaling Group
      cluster_autoscaler_enabled: false
      # True (recommended) to create new node_groups before deleting old ones, avoiding a temporary outage
      create_before_destroy: true
      # Configure storage for the root block device for instances in the Auto Scaling Group
      # For Bottlerocket, use /dev/xvdb. For all others, use /dev/xvda.
      block_device_map:
        "/dev/xvdb":
          ebs:
            volume_size: 125 # in GiB
            volume_type: gp3
            encrypted: true
            delete_on_termination: true
      # Set of instance types associated with the EKS Node Group.
      # Terraform will only perform drift detection if a configuration value is provided.
      instance_types:
        - c6a.large
      # Desired number of worker nodes when initially provisioned
      desired_group_size: 3
      max_group_size: 3
      min_group_size: 3
      resources_to_tag:
        - instance
        - volume
      tags: null
```
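For reference, the values to copy typically live in your Karpenter provisioner
stack configuration. The snippet below is a hypothetical example, not taken
from this PR; the key names and values are illustrative assumptions, so consult
your actual provisioner configuration.

```yaml
  # HYPOTHETICAL example of an existing Karpenter provisioner configuration.
  # ami_family determines the ami_type and root block device name above:
  #   Bottlerocket -> ami_type BOTTLEROCKET_x86_64, device /dev/xvdb
  #   AL2          -> ami_type AL2_x86_64,          device /dev/xvda
  provisioners:
    default:
      ami_family: Bottlerocket
      block_device_mappings:
        - device_name: /dev/xvdb
          ebs:
            volume_size: 125Gi
            volume_type: gp3
            encrypted: true
```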
You do not need to apply the above changes yet, although you can if you want
to. To reduce overhead, you can apply them along with the changes in the next
step.

#### 2. Move Karpenter to the node group, remove legacy support

Delete the `fargate_profiles` section from your `eks/cluster` configuration,
or at least remove the `karpenter` profile from it. Disable legacy support by
adding:

```yaml
  legacy_fargate_1_role_per_profile_enabled: false
```

#### 2.a Optional: Move Karpenter instance profile to `eks/cluster` component

If you have the patience to manually import and remove a Terraform resource,
you should move the Karpenter instance profile to the `eks/cluster` component.
This fixes an issue where the Karpenter instance profile could be broken by
certain sequences of Terraform operations. However, if you have multiple
clusters to migrate, this can be tedious, and the issue is not a serious one,
so you may want to skip this step.

To do this, add the following to your `eks/cluster` configuration:

```yaml
  legacy_do_not_create_karpenter_instance_profile: false
```

**BEFORE APPLYING CHANGES**: Run `atmos terraform plan` (with the appropriate
arguments) to see the changes that will be made. Among the resources to be
created will be `aws_iam_instance_profile.default[0]`. Using the same
arguments as before, run `atmos`, but replace `plan` with
`import 'aws_iam_instance_profile.default[0]' <profile-name>`, where
`<profile-name>` is the name of the profile the plan indicated it would
create. It will be something like `<cluster-name>-karpenter`. A sketch of this
workflow follows the note below.

**NOTE**: If you perform this step, you must also perform step 3.a below.
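As a sketch of that workflow, assuming a hypothetical stack name
`plat-use1-dev` and profile name `acme-plat-use1-dev-karpenter` (substitute
the values from your own plan output):

```shell
# Review the plan; note the name of the aws_iam_instance_profile.default[0]
# profile it would create (something like "acme-plat-use1-dev-karpenter").
atmos terraform plan eks/cluster -s plat-use1-dev

# Import the existing profile under that address, using the same arguments.
atmos terraform import eks/cluster 'aws_iam_instance_profile.default[0]' \
  'acme-plat-use1-dev-karpenter' -s plat-use1-dev
```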
#### 2.b Apply the changes

Apply the changes with `atmos terraform apply`.

#### 3. Upgrade Karpenter

Upgrade the `eks/karpenter` component to the latest version. Follow the
upgrade instructions to enable the new `karpenter-crd` chart by setting
`crd_chart_enabled: true`.

Upgrade to at least Karpenter v0.30.0, which is the first version to support
factoring the existing node group into its decision about how many nodes to
provision. This prevents Karpenter from provisioning nodes when they are not
needed because the existing node group already has enough capacity. Be careful
about upgrading to v0.32.0 or later, as that version introduces significant
breaking changes. We recommend first updating to v0.31.2 (or a later v0.31.x
release), but not to v0.32.0 or later; this provides a safe (revertible)
upgrade path to v0.32.0 or later.

#### 3.a Finish move of Karpenter instance profile to `eks/cluster` component

If you performed step 2.a above, you must also perform this step. If you did
not perform step 2.a, you must NOT perform this step.

In the `eks/karpenter` stack, set `legacy_create_karpenter_instance_profile: false`.

**BEFORE APPLYING CHANGES**: Remove the Karpenter instance profile from the
Terraform state, since it is now managed by the `eks/cluster` component, or
else Terraform will delete it:

```shell
atmos terraform state eks/karpenter rm 'aws_iam_instance_profile.default[0]' -s=<stack-name>
```

#### 3.b Apply the changes

Apply the changes with `atmos terraform apply`.

## Changes included in `v1.303.0`

This is a bug fix and feature enhancement update. No action is necessary to
upgrade.

diff --git a/modules/eks/cluster/main.tf b/modules/eks/cluster/main.tf
index 229935498..2415c74c5 100644
--- a/modules/eks/cluster/main.tf
+++ b/modules/eks/cluster/main.tf
@@ -41,7 +41,7 @@ locals {
   )

   # Existing managed worker role ARNs
-  managed_worker_role_arns = local.eks_outputs.eks_managed_node_workers_role_arns
+  managed_worker_role_arns = coalesce(local.eks_outputs.eks_managed_node_workers_role_arns, [])

   # If Karpenter IAM role is enabled, add it to the `aws-auth` ConfigMap to allow the nodes launched by Karpenter to join the EKS cluster
   karpenter_role_arn = one(aws_iam_role.karpenter[*].arn)
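The `coalesce()` guard above is the bug fix described at the top of this
changelog. A minimal standalone sketch of the behavior (the local names here
are stand-ins, not the component's actual code):

```hcl
locals {
  # Stand-in for eks_outputs.eks_managed_node_workers_role_arns, which is
  # null when the cluster has never had a managed node group.
  worker_role_arns_maybe_null = null

  # coalesce() returns its first non-null argument, so downstream code that
  # iterates over this value sees an empty list instead of failing on null.
  managed_worker_role_arns = coalesce(local.worker_role_arns_maybe_null, [])
}
```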