
Autonomous control plane #235

Open · wants to merge 23 commits into master from autonomous-control-plane
Conversation

@majst01 (Contributor) commented Dec 4, 2024

Description

@mwindower @robertvolkmann @Gerrit91 @vknabel @simcod first version, would love to get initial feedback if this is at least on the right track

TODO:

  • Add a section describing which failures are possible and which services are affected.
  • Add a kind cluster upgrade description and test.

@metal-robot (bot) commented Dec 4, 2024

Thanks for contributing a pull request to the metal-stack docs!

A rendered preview of your changes will be available at: https://docs.metal-stack.io/previews/PR235/

@robertvolkmann (Contributor) commented:
I struggle to understand why we want to deploy two separate single-node kind clusters instead of one multi-node cluster. DRBD and Pacemaker are new technologies to me, whereas multi-node clusters are well known to me.

My gut feeling leans more towards creating just a bootstrap cluster that will be deleted after everything is set up, but I didn't have time to think it through.
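For reference, the multi-node alternative mentioned above can be described in a single kind config file. This is only an illustrative sketch; the node roles and counts are made up and not part of the proposal:

```yaml
# kind-config.yaml — hypothetical multi-node layout, not part of the proposal
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: control-plane
  - role: control-plane
  - role: worker
  - role: worker
```

Such a cluster would be created with `kind create cluster --config kind-config.yaml`, with HA for the control plane coming from the replicated control-plane nodes rather than from DRBD/Pacemaker.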

@majst01 (Contributor, Author) commented Dec 4, 2024

The two-node cluster setup is just one possibility I proposed. I tried to give a minimal example that provides some sort of HA. This could of course be improved. The most important point here is that the etcd of the kind cluster and the garden-api-server are backed up at a regular, short interval to an external S3 storage. Losing this kind cluster, or the machine it runs on, leads to the inability to manage the control plane of the upper layer, but not the main partition.
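The periodic etcd backup to S3 described above could be sketched as a CronJob inside the kind cluster. This is only an assumption of how it might look; the image name, schedule, endpoints, bucket, and credentials handling are all hypothetical and not part of the proposal:

```yaml
# Hypothetical sketch of the periodic etcd-to-S3 backup described above.
# Image, schedule, endpoint, and bucket are assumptions, not the proposed setup.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "*/5 * * * *"   # a short interval, as described in the comment
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: example.org/etcd-backup:latest  # hypothetical image with etcdctl + aws cli
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # take a consistent snapshot of etcd, then ship it off-host
                  ETCDCTL_API=3 etcdctl snapshot save /tmp/snap.db \
                    --endpoints=https://127.0.0.1:2379
                  aws s3 cp /tmp/snap.db "s3://example-backup-bucket/etcd/snap-$(date +%s).db"
```

With such a job in place, losing the kind machine means restoring the latest snapshot into a fresh cluster rather than losing the control-plane state.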

@majst01 majst01 marked this pull request as ready for review December 9, 2024 07:19
@majst01 majst01 requested a review from a team as a code owner December 9, 2024 07:19
@majst01 majst01 requested a review from Gerrit91 December 9, 2024 07:19
@Gerrit91 (Contributor) left a comment:

For now only typos. :D

(6 review comments on docs/src/installation/autonomous-control-plane.md — outdated, resolved)
@majst01 (Contributor, Author) commented Dec 9, 2024

> For now only typos. :D

Thanks

@chbmuc commented Dec 9, 2024

Good proposal. Two considerations from my side:

  • I'm a bit reluctant regarding the persistent storage for the needle nodes. I think we should prefer an external storage system over the suggested DRBD solution. It would be best to define a protocol (maybe something basic like iSCSI) and a set of volumes that the storage system should provide. This way we would not require a specific solution and would let the datacenter provide the volumes on whatever system it has available. If there is no existing storage system, we could still go for a (Synology) appliance.

  • It isn't clear to me what happens when the kind node fails. Don't we need an extra failover mechanism? Or is this something you wanted to solve with Pacemaker?
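A datacenter-provided iSCSI volume as suggested in the first point could be handed to Kubernetes as a pre-provisioned PersistentVolume. This is only a sketch; the portal address, IQN, and size are made up:

```yaml
# Hypothetical pre-provisioned iSCSI volume; portal, IQN, and size are made up.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: needle-vol-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  iscsi:
    targetPortal: 10.0.0.10:3260
    iqn: iqn.2024-12.org.example:storage.needle-vol-0
    lun: 0
    fsType: ext4
```

Defining only the protocol and a list of such volumes would keep the proposal independent of any particular storage vendor, as suggested above.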

@majst01 (Contributor, Author) commented Dec 10, 2024

> Good proposal. Two considerations from my side:
>
>   • I'm a bit reluctant regarding the persistent storage for the needle nodes. I think we should prefer an external storage system over the suggested DRBD solution. It would be best to define a protocol (maybe something basic like iSCSI) and a set of volumes that the storage system should provide. This way we would not require a specific solution and would let the datacenter provide the volumes on whatever system it has available. If there is no existing storage system, we could still go for a (Synology) appliance.

The DRBD solution was only meant as a redundancy solution for the kind stateful sets like etcd, metal-db and others, but not for the needle nodes. For those I completely agree with you that storage should be provided externally. But OTOH it would also be possible to use the local NVMe drives of the needle nodes for storage if we always create the shoots with an HA control plane.

>   • It isn't clear to me what happens when the kind node fails. Don't we need an extra failover mechanism? Or is this something you wanted to solve with Pacemaker?

If the kind node fails, the managed needle clusters keep working, but managing them is not possible: access to the needle seed is lost, while needle shoot access still works.

I need to add a description of the possible failure scenarios to make this clearer.

@vknabel (Contributor) left a comment:

Great summary, though it feels a little bit complicated given that we are trying to eat our own dog food. It's just a feeling that we are misusing something, which isn't necessarily true.

Unsure about kind vs k3s or alternatives, but this decision can be postponed until the general architecture is decided on.

I haven't looked into DRBD and Pacemaker, but this feels like a whole new field that needs to be learned and handled by anyone on emergency teams.

(3 review comments on docs/src/installation/autonomous-control-plane.md — outdated, resolved)

## Open Topics

- Naming of the metal-stack chain elements: are `needle` and `nail` appropriate?
Inline review comment (Contributor):
What about a semantic naming scheme? Like bootstrap cluster that creates the control cluster?

@majst01 (Contributor, Author) commented Dec 11, 2024

> Great summary, though it feels a little bit complicated for the fact that we are trying to eat our own dog food. Just a feeling that we are misusing something, which isn't necessarily true.
>
> Unsure about kind vs k3s or alternatives, but this decision can be postponed until the general architecture is decided on.

My actual feeling tends more towards k3s TBH, but as you said, this decision can be made later.

> I haven't looked into DRBD and Pacemaker, but this feels like a whole new field that needs to be learned and handled by anyone on emergency teams.

This is the case for any solution we introduce to make the needle HA. This is my first attempt at a quite simple one. I think we could even burn an ISO for the needle servers.

@majst01 majst01 force-pushed the autonomous-control-plane branch from 609e362 to 674ecf7 Compare January 20, 2025 07:53
@majst01 majst01 force-pushed the autonomous-control-plane branch from b1eff15 to 76016e5 Compare January 20, 2025 08:45
5 participants