fix: adjust rollout document. (#616)
Co-authored-by: Ryan Zhang <[email protected]>
Co-authored-by: Britania Rodriguez Reyes <[email protected]>
3 people authored Nov 15, 2023
1 parent 59c8f07 commit bfd82c9
Showing 1 changed file with 53 additions and 60 deletions.
113 changes: 53 additions & 60 deletions docs/howtos/crp.md
@@ -358,83 +358,76 @@ After a `ClusterResourcePlacement` is created, you may want to

* Add, update, or remove the resources that have been selected by the
`ClusterResourcePlacement` in the hub cluster
* Add, update, or remove resource selectors in the `ClusterResourcePlacement`
* Add or update the scheduling policy in the `ClusterResourcePlacement`
* Update the resource selectors in the `ClusterResourcePlacement`
* Update the scheduling policy in the `ClusterResourcePlacement`

These changes may trigger the following outcomes:

* New resources may need to placed on all picked clusters
* New resources may need to be placed on all picked clusters
* Resources already placed on a picked cluster may get updated or deleted
* Some clusters picked previously are now unpicked, and resources must be removed from such clusters
* Some clusters are newly picked, and resources must be added to them

Most of these outcomes may lead to service interruptions. Your apps running on the member clusters
may become unavailable temporarily, while Fleet sends out updated resources; clusters that are
now unpicked lose all the placed resources, and traffic sent to these clusters will be lost;
if there are too many newly picked clusters and Fleet places resources on them at the same time,
your backend may get overloaded. The exact pattern of interruption may vary,
depending on the set of resources you place using Fleet.
Most outcomes can lead to service interruptions. Apps running on member clusters may temporarily become
unavailable as Fleet dispatches updated resources. Clusters that are no longer selected will lose all placed resources,
resulting in lost traffic. If too many new clusters are selected and Fleet places resources on them simultaneously,
your backend may become overloaded. The exact interruption pattern may vary depending on the resources you place using Fleet.

To help minimize the interruption, Fleet provides rollout strategy configuration to help you
transition between changes as smoothly as possible. Currently, Fleet supports only one rollout
strategy, rolling update; with this strategy, Fleet will apply changes, including the addition or
removal of picked clusters and resource refreshes, in an incremental manner with a number of phases
at a pace most appropriate for you. This is the default option and will apply to all
changes you initiated.
To minimize interruption, Fleet allows users to configure the rollout strategy, similar to a native Kubernetes Deployment,
to transition between changes as smoothly as possible. Currently, Fleet supports only one rollout strategy: rolling update.
This strategy ensures changes, including the addition or removal of selected clusters and resource refreshes,
are applied incrementally in a phased manner at a pace suitable for you. This is the default option and applies to all changes you initiate.

This rollout strategy can be configured with the following parameters:

* `maxUnavailable` controls that, for the selected set of resources, how many clusters may become
unavailable during a change. It can be set as an absolute number or a percentage. Default is 25%,
and you should not use zero for this value.
* `maxUnavailable` determines how many clusters may become unavailable during a change for the selected set of resources.
It can be set as an absolute number or a percentage. The default is 25%, and zero should not be used for this value.

- Setting this parameter to a lower value will result in less interruption during a change but will lead to slower rollouts.

**The less value you set this parameter with, the less interruption you will experience during
a change**; however, this would lead to slower rollouts.
- Fleet considers a cluster as unavailable if resources have not been successfully applied to the cluster.

Note that Fleet considers a cluster as unavailable if resources have not been successfully
applied to the cluster.

<details><summary>How Fleet interprets this value</summary>
<p></p>

Fleet, in actuality, makes sure that at any time, there are **at least** N - `maxUnavailable`
number of clusters available, where N is:

* for scheduling policies of the `PickN` placement type, the `numberOfClusters` value given;
* for scheduling policies of the `PickFixed` placement type, the number of cluster names given;
* for scheduling policies of the `PickAll` placement type, the number of clusters Fleet picks.

If you use a percentage for the `maxUnavailable` parameter, it is calculated against N as
well.

</details>

* `maxSurge` controls how many newly picked clusters will receive resource placements. It can
also be set as an absolute number or a percentage. Default is 25%, and you should not use zero for
this value.

**The less value you set this parameter with, the less new resource placements Fleet will run
at the same time**; however, this would lead to slower rollouts.

<details><summary>How Fleet interprets this value</summary>
<p></p>

Fleet, in actuality, makes sure that at any time, there are **at most** N + `maxSurge`
number of clusters available, where N is:

* for scheduling policies of the `PickN` placement type, the `numberOfClusters` value given;
* for scheduling policies of the `PickFixed` placement type, the number of cluster names given;
* for scheduling policies of the `PickAll` placement type, the number of clusters Fleet picks.
- <details><summary>How Fleet interprets this value</summary>
Fleet, in actuality, makes sure that at any time, there are **at least** N - `maxUnavailable`
number of clusters available, where N is:

* for scheduling policies of the `PickN` placement type, the `numberOfClusters` value given;
* for scheduling policies of the `PickFixed` placement type, the number of cluster names given;
* for scheduling policies of the `PickAll` placement type, the number of clusters Fleet picks.

If you use a percentage for the `maxUnavailable` parameter, it is calculated against N as
well.

</details>

If you use a percentage for the `maxSurge` parameter, it is calculated against N as
well.
* `maxSurge` determines the number of additional clusters, beyond the required number, that will receive resource placements.
It can also be set as an absolute number or a percentage. The default is 25%, and zero should not be used for this value.

</details>
- Setting this parameter to a lower value will result in fewer resource placements on additional
clusters by Fleet, which may slow down the rollout process.

* `unavailablePeriodSeconds` controls the frequency of rollout phases. Default is 60 seconds.
- <details><summary>How Fleet interprets this value</summary>
Fleet, in actuality, makes sure that at any time, there are **at most** N + `maxSurge`
number of clusters available, where N is:

**The less value you set this parameter with, the quicker rollout will become**. However, using
a value that is too little may lead to unexpected service interruptions.
* for scheduling policies of the `PickN` placement type, the `numberOfClusters` value given;
* for scheduling policies of the `PickFixed` placement type, the number of cluster names given;
* for scheduling policies of the `PickAll` placement type, the number of clusters Fleet picks.

If you use a percentage for the `maxSurge` parameter, it is calculated against N as well.

</details>

* `unavailablePeriodSeconds` tells Fleet how long to wait after resources are applied before treating them as "ready".
The default value is 60 seconds.

- Fleet only considers newly applied resources on a cluster as "ready" once `unavailablePeriodSeconds` seconds
have passed **after** the resources have been **successfully** applied to that cluster.
- Setting a lower value for this parameter will result in faster rollouts. However, we **strongly**
recommend setting it to a value long enough for all initialization/preparation tasks to complete within
that time frame, so that resources are typically ready once `unavailablePeriodSeconds` seconds have passed.
- We are currently designing a generic "ready gate" for resources being applied to clusters. Please feel free to raise
issues or provide feedback if you have any thoughts on this.
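
As an illustration of the bounds described above (at least N - `maxUnavailable` and at most N + `maxSurge` clusters available), the following minimal Python sketch works through the arithmetic. It is not Fleet's implementation: the helper names `resolve` and `availability_bounds` are hypothetical, and the rounding of percentages (down for `maxUnavailable`, up for `maxSurge`, as native Kubernetes Deployments do) is an assumption that the text above does not confirm.

```python
def resolve(value, n, round_up):
    """Resolve an absolute number or a percentage string such as "25%" against n.

    The rounding direction is an assumption made for this sketch.
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            # Integer ceiling division avoids floating-point error.
            return -(-percent * n // 100)
        return percent * n // 100
    return int(value)


def availability_bounds(n, max_unavailable="25%", max_surge="25%"):
    """Return (minimum, maximum) number of available clusters during a rollout.

    n is the target cluster count: `numberOfClusters` for PickN, the number
    of cluster names for PickFixed, or the number of picked clusters for PickAll.
    """
    lower = n - resolve(max_unavailable, n, round_up=False)
    upper = n + resolve(max_surge, n, round_up=True)
    return max(lower, 0), upper


if __name__ == "__main__":
    # Defaults (25% each) against a target of 10 clusters.
    print(availability_bounds(10))
    # An absolute maxUnavailable combined with a percentage maxSurge.
    print(availability_bounds(4, max_unavailable=1, max_surge="50%"))
```

Under these assumptions, a target of 10 clusters with the default values keeps between 8 and 13 clusters available at any time during a rollout.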

> Note
>