[eks/cluster] Document Migration to Managed Node Group. Bugfix. (#910)
Nuru authored Nov 21, 2023
1 parent 6a05227 commit 38fc86b
Showing 2 changed files with 166 additions and 2 deletions.
166 changes: 165 additions & 1 deletion modules/eks/cluster/CHANGELOG.md
@@ -1,4 +1,168 @@
## Components PR [#910](https://github.com/cloudposse/terraform-aws-components/pull/910)

Bug fix and updates to Changelog, no action required.

Fixed: Error about managed node group ARNs list being null, which could happen
when adding a managed node group to an existing cluster that never had one.

## Upgrading to `v1.303.0`

Components PR [#852](https://github.com/cloudposse/terraform-aws-components/pull/852)

This is a bug fix and feature enhancement update. No action is necessary to upgrade.
However, with the new features and new recommendations, you may want to change
your configuration.

## Recommended (optional) changes

Previously, we recommended deploying Karpenter to Fargate and not provisioning
any nodes. However, this causes issues with add-ons that require compute power
to fully initialize, such as `coredns`, and it can reduce the cluster to a
single node, removing the high availability that comes from having a node
per Availability Zone and replicas of pods spread across those nodes.

As a result, we now recommend deploying a minimal node group with a single
instance (currently recommended to be a `c6a.large`) in each of 3 Availability
Zones. This will provide the compute power needed to initialize add-ons, and
will provide high availability for the cluster. As a bonus, it will also
remove the need to deploy Karpenter to Fargate.

**NOTE about instance type**: The `c6a.large` instance type is relatively
new. If you have deployed an old version of our ServiceControlPolicy
`DenyEC2NonNitroInstances`, `DenyNonNitroInstances` (obsolete, replaced by
`DenyEC2NonNitroInstances`), and/or `DenyEC2InstancesWithoutEncryptionInTransit`,
you will want to update them to v0.12.0 or choose a different instance type.

### Migration procedure

To perform the recommended migration, follow these steps:

#### 1. Deploy a minimal node group, move addons to it

Change your `eks/cluster` configuration to set `deploy_addons_to_fargate: false`.

Add the following to your `eks/cluster` configuration, but
copy the block device name, volume size, and volume type from your existing
Karpenter provisioner configuration. Also select the correct `ami_type`
according to the `ami_family` in your Karpenter provisioner configuration.

```yaml
node_groups:
  # will create 1 node group for each item in map
  # Provision a minimal static node group for add-ons and redundant replicas
  main:
    # EKS AMI version to use, e.g. "1.16.13-20200821" (no "v").
    ami_release_version: null
    # Type of Amazon Machine Image (AMI) associated with the EKS Node Group
    # Typically AL2_x86_64 or BOTTLEROCKET_x86_64
    ami_type: BOTTLEROCKET_x86_64
    # Additional name attributes (e.g. `1`) for the node group
    attributes: []
    # will create 1 auto scaling group in each specified availability zone
    # or all AZs with subnets if none are specified anywhere
    availability_zones: null
    # Whether to enable Node Group to scale its AutoScaling Group
    cluster_autoscaler_enabled: false
    # True (recommended) to create new node_groups before deleting old ones, avoiding a temporary outage
    create_before_destroy: true
    # Configure storage for the root block device for instances in the Auto Scaling Group
    # For Bottlerocket, use /dev/xvdb. For all others, use /dev/xvda.
    block_device_map:
      "/dev/xvdb":
        ebs:
          volume_size: 125 # in GiB
          volume_type: gp3
          encrypted: true
          delete_on_termination: true
    # Set of instance types associated with the EKS Node Group. Terraform will only perform drift detection if a configuration value is provided.
    instance_types:
      - c6a.large
    # Desired number of worker nodes when initially provisioned
    desired_group_size: 3
    max_group_size: 3
    min_group_size: 3
    resources_to_tag:
      - instance
      - volume
    tags: null
```
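If your Karpenter provisioner instead uses `ami_family: AL2`, the corresponding keys would change roughly as follows. This is a sketch only; the volume size shown is a placeholder, so copy the real values from your provisioner configuration:

```yaml
main:
  # AL2-family AMIs use /dev/xvda as the root device (see the note above)
  ami_type: AL2_x86_64
  block_device_map:
    "/dev/xvda":
      ebs:
        volume_size: 100 # placeholder; copy from your Karpenter provisioner
        volume_type: gp3
        encrypted: true
        delete_on_termination: true
```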
You do not need to apply the above changes yet, although you can if you
want to. To reduce overhead, you can apply them together with the changes
in the next step.

#### 2. Move Karpenter to the node group, remove legacy support

Delete the `fargate_profiles` section from your `eks/cluster` configuration,
or at least remove the `karpenter` profile from it. Disable legacy support
by adding:

```yaml
legacy_fargate_1_role_per_profile_enabled: false
```
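For reference, the block to delete typically looks something like the following. The exact shape here is illustrative and may differ from your configuration:

```yaml
# Delete this from your eks/cluster configuration
# (or at least the `karpenter` entry within it)
fargate_profiles:
  karpenter:
    kubernetes_namespace: karpenter
    kubernetes_labels: null
```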

#### 2.a Optional: Move Karpenter instance profile to `eks/cluster` component

If you have the patience to manually import and remove a Terraform
resource, you should move the Karpenter instance profile to the `eks/cluster`
component. This fixes an issue where the Karpenter instance profile
could be broken by certain sequences of Terraform operations.
However, if you have multiple clusters to migrate, this can be tedious,
and the issue is not a serious one, so you may want to skip this step.

To do this, add the following to your `eks/cluster` configuration:

```yaml
legacy_do_not_create_karpenter_instance_profile: false
```


**BEFORE APPLYING CHANGES**:
Run `atmos terraform plan` (with the appropriate arguments) to see the changes
that will be made. Among the resources to be created will be
`aws_iam_instance_profile.default[0]`. Using the same arguments as before, run
`atmos`, but replace `plan` with `import 'aws_iam_instance_profile.default[0]' <profile-name>`,
where `<profile-name>` is the name of the profile the plan indicated it would create.
It will be something like `<cluster-name>-karpenter`.
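Put together, the sequence looks roughly like this. The stack, cluster, and profile names below are made up for illustration; substitute your own, and take the profile name from your actual plan output:

```shell
# Review the plan and note the name of the instance profile
# that would be created as aws_iam_instance_profile.default[0]
atmos terraform plan eks/cluster -s plat-use1-prod

# Import the existing profile instead of letting Terraform create it
# (profile name copied from the plan output, e.g. <cluster-name>-karpenter)
atmos terraform import eks/cluster 'aws_iam_instance_profile.default[0]' \
  'acme-plat-use1-prod-karpenter' -s plat-use1-prod
```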

**NOTE**: If you perform this step, you must also perform 3.a below.

#### 2.b Apply the changes

Apply the changes with `atmos terraform apply`.

#### 3. Upgrade Karpenter

Upgrade the `eks/karpenter` component to the latest version. Follow the upgrade
instructions to enable the new `karpenter-crd` chart by setting `crd_chart_enabled: true`.

Upgrade to at least Karpenter v0.30.0, which is the first version that
factors the existing node group into its decision about how many nodes to
provision. This prevents Karpenter from provisioning nodes that are not
needed because the existing node group already has enough capacity. Be
careful about upgrading to v0.32.0 or later, as that version introduces
significant breaking changes. As a first step, we recommend upgrading to
v0.31.2 or a later v0.31.x release, but not to v0.32.0 or later; this
provides a safe (revertible) path to a later upgrade to v0.32.0.
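In stack configuration terms, this step might look something like the following sketch. The `chart_version` value and variable names here are assumptions based on this guide; check the `eks/karpenter` component's own upgrade instructions for the authoritative settings:

```yaml
components:
  terraform:
    eks/karpenter:
      vars:
        # Enable the new karpenter-crd chart per the upgrade instructions
        crd_chart_enabled: true
        # Stay on v0.31.x first for a revertible upgrade path
        chart_version: "v0.31.2"
```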

#### 3.a Finish Move of Karpenter instance profile to `eks/cluster` component

If you performed step 2.a above, you must also perform this step. If you did
not perform step 2.a, you must NOT perform this step.

In the `eks/karpenter` stack, set `legacy_create_karpenter_instance_profile: false`.

**BEFORE APPLYING CHANGES**: Remove the Karpenter instance profile from the Terraform state, since
it is now managed by the `eks/cluster` component, or else Terraform will delete it.

```shell
atmos terraform state eks/karpenter rm 'aws_iam_instance_profile.default[0]' -s=<stack-name>
```

#### 3.b Apply the changes

Apply the changes with `atmos terraform apply`.

## Changes included in `v1.303.0`

This is a bug fix and feature enhancement update. No action is necessary to upgrade.

2 changes: 1 addition & 1 deletion modules/eks/cluster/main.tf
@@ -41,7 +41,7 @@ locals {
)

# Existing managed worker role ARNs
-  managed_worker_role_arns = local.eks_outputs.eks_managed_node_workers_role_arns
+  managed_worker_role_arns = coalesce(local.eks_outputs.eks_managed_node_workers_role_arns, [])

# If Karpenter IAM role is enabled, add it to the `aws-auth` ConfigMap to allow the nodes launched by Karpenter to join the EKS cluster
karpenter_role_arn = one(aws_iam_role.karpenter[*].arn)