
Autonomous control plane #235

Open · wants to merge 23 commits into master from autonomous-control-plane
Conversation

@majst01 (Contributor) commented Dec 4, 2024

Description

@mwindower @robertvolkmann @Gerrit91 @vknabel @simcod first version, would love to get initial feedback if this is at least on the right track

TODO:

  • Add a section describing which failures are possible and which services are affected.
  • Add a kind cluster upgrade description and test.

@metal-robot (bot) commented Dec 4, 2024

Thanks for contributing a pull request to the metal-stack docs!

A rendered preview of your changes will be available at: https://docs.metal-stack.io/previews/PR235/

@robertvolkmann (Contributor) commented:
I struggle to understand why we want to deploy two separate single-node kind clusters instead of one multi-node cluster. DRBD and Pacemaker are new technologies to me, whereas multi-node clusters are well known to me.

My gut feeling leans more towards creating just a bootstrap cluster that will be deleted after everything is set up, but I didn't have time to think it through.
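For reference, the multi-node alternative mentioned above can be described in a single kind config file. This is only an illustrative sketch; the node roles and counts are made up and not part of the proposal:

```yaml
# kind-config.yaml — hypothetical multi-node layout, not part of the proposal
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: control-plane
  - role: control-plane
  - role: worker
  - role: worker
```

Such a cluster would be created with `kind create cluster --config kind-config.yaml`, with HA for the control plane coming from the replicated control-plane nodes rather than from DRBD/Pacemaker.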

@majst01 (Contributor, Author) commented Dec 4, 2024

The two-node cluster setup is just one possibility I proposed. I tried to give a minimal example that provides some sort of HA. This could of course be improved. The most important point here is that the etcd of the kind cluster and the garden-api-server are backed up at a regular, short interval to an external S3 storage. Losing this kind cluster, or the machine it runs on, leads to the inability to manage the control plane of the upper layer, but not the main partition.
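The periodic etcd backup to S3 described above could be sketched as a CronJob inside the kind cluster. This is only an assumption of how it might look; the image name, schedule, endpoints, bucket, and credentials handling are all hypothetical and not part of the proposal:

```yaml
# Hypothetical sketch of the periodic etcd-to-S3 backup described above.
# Image, schedule, endpoint, and bucket are assumptions, not the proposed setup.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "*/5 * * * *"   # a short interval, as described in the comment
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: example.org/etcd-backup:latest  # hypothetical image with etcdctl + aws cli
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # take a consistent snapshot of etcd, then ship it off-host
                  ETCDCTL_API=3 etcdctl snapshot save /tmp/snap.db \
                    --endpoints=https://127.0.0.1:2379
                  aws s3 cp /tmp/snap.db "s3://example-backup-bucket/etcd/snap-$(date +%s).db"
```

With such a job in place, losing the kind machine means restoring the latest snapshot into a fresh cluster rather than losing the control-plane state.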

@majst01 majst01 marked this pull request as ready for review December 9, 2024 07:19
@majst01 majst01 requested a review from a team as a code owner December 9, 2024 07:19
@majst01 majst01 requested a review from Gerrit91 December 9, 2024 07:19
@Gerrit91 (Contributor) left a comment:

For now only typos. :D

(6 review comments on docs/src/installation/autonomous-control-plane.md — outdated, resolved)
@majst01 (Contributor, Author) commented Dec 9, 2024

> For now only typos. :D

Thanks

@chbmuc commented Dec 9, 2024

Good proposal. Two considerations from my side:

  • I'm a bit reluctant regarding the persistent storage for the needle nodes. I think we should prefer an external storage system over the suggested DRBD solution. It would be best to define a protocol (maybe something basic like iSCSI) and a set of volumes that the storage system should provide. This way we would not require a specific solution and would let the datacenter provide the volumes on whatever system it has available. If there is no existing storage system, we could still go for a (Synology) appliance.

  • It isn't clear to me what happens when the kind node fails. Don't we need an extra failover mechanism? Or is this something you wanted to solve with Pacemaker?
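A datacenter-provided iSCSI volume as suggested in the first point could be handed to Kubernetes as a pre-provisioned PersistentVolume. This is only a sketch; the portal address, IQN, and size are made up:

```yaml
# Hypothetical pre-provisioned iSCSI volume; portal, IQN, and size are made up.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: needle-vol-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  iscsi:
    targetPortal: 10.0.0.10:3260
    iqn: iqn.2024-12.org.example:storage.needle-vol-0
    lun: 0
    fsType: ext4
```

Defining only the protocol and a list of such volumes would keep the proposal independent of any particular storage vendor, as suggested above.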

@majst01 (Contributor, Author) commented Dec 10, 2024

> Good proposal. Two considerations from my side:
>
>   • I'm a bit reluctant regarding the persistent storage for the needle nodes. I think we should prefer an external storage system over the suggested DRBD solution. It would be best to define a protocol (maybe something basic like iSCSI) and a set of volumes that the storage system should provide. This way we would not require a specific solution and would let the datacenter provide the volumes on whatever system it has available. If there is no existing storage system, we could still go for a (Synology) appliance.

The DRBD solution was only meant as a redundancy solution for the kind stateful sets like etcd, metal-db and others, but not for the needle nodes. For those I completely agree with you that storage should be provided externally. But OTOH it would also be possible to use the local NVMe drives of the needle nodes for storage if we always create the shoots with an HA control plane.

>   • It isn't clear to me what happens when the kind node fails. Don't we need an extra failover mechanism? Or is this something you wanted to solve with Pacemaker?

If the kind node fails, the managed needle clusters keep working, but managing them is not possible: access to the needle seed is lost, while needle shoot access still works.

I need to add a description of the possible failure scenarios to make this clearer.

@vknabel (Contributor) left a comment:

Great summary, though it feels a little bit complicated given that we are trying to eat our own dog food. It's just a feeling that we are misusing something, which isn't necessarily true.

Unsure about kind vs k3s or alternatives, but this decision can be postponed until the general architecture is decided on.

I haven't looked into DRBD and Pacemaker, but this feels like a whole new field that needs to be learned and handled by anyone on emergency teams.

(3 review comments on docs/src/installation/autonomous-control-plane.md — outdated, resolved)

## Open Topics

- Naming of the metal-stack chain elements: are `needle` and `nail` appropriate?
Inline review comment (Contributor):
What about a semantic naming scheme? Like bootstrap cluster that creates the control cluster?

@majst01 (Contributor, Author) commented Dec 11, 2024

> Great summary, though it feels a little bit complicated for the fact that we are trying to eat our own dog food. Just a feeling that we are misusing something, which isn't necessarily true.
>
> Unsure about kind vs k3s or alternatives, but this decision can be postponed until the general architecture is decided on.

My actual feeling tends more towards k3s TBH, but as you said, this decision can be made later.

> I haven't looked into DRBD and Pacemaker, but this feels like a whole new field that needs to be learned and handled by anyone on emergency teams.

This is the case for any solution we introduce to make the needle HA. This is my first attempt at a quite simple one. I think we could even burn an ISO for the needle servers.

@majst01 majst01 force-pushed the autonomous-control-plane branch from 609e362 to 674ecf7 Compare January 20, 2025 07:53
@majst01 majst01 force-pushed the autonomous-control-plane branch from b1eff15 to 76016e5 Compare January 20, 2025 08:45
5 participants