
Restore renders downstream cluster unusable if it resides in a non-fleet-default namespace #574

Open
hwo-wd opened this issue Sep 9, 2024 · 3 comments

hwo-wd commented Sep 9, 2024

Rancher Server Setup

  • Rancher version: 2.9.1
  • Installation option (Docker install/Helm Chart): helm, 104.0.1+up5.0.1
  • Kubernetes Version and Engine: v1.28.12, rke1

Describe the bug
Creating a provisioning.v2 cluster (e.g., via GitOps) in a namespace other than fleet-default, creating a backup, pruning all Rancher resources, and then restoring leaves said cluster in an irrecoverable (?) state:

Rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.

To Reproduce
Steps to reproduce the behavior:

  1. Create a cluster not residing in the fleet-default namespace and let it be provisioned using CAPI
  2. Create a backup (see the Backup/Restore sketch after this list)
  3. Clean up all Rancher resources: kubectl apply -f https://raw.githubusercontent.com/rancher/rancher-cleanup/main/deploy/rancher-cleanup.yaml; note that this will delete the machine-plan secret, even though it resides in a non-Rancher-default namespace
  4. Restore from the backup created in step 2 above
  5. Observe that the downstream cluster's system agent (which is working just fine) is no longer able to connect to upstream; investigation shows this stems from the -machine-plan$ secret not residing in the fleet-default namespace.
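
For reference, steps 2 and 4 go through the rancher-backup operator's Backup and Restore custom resources; a minimal sketch, assuming the operator's default rancher-resource-set and with purely illustrative names and backup filename, might look like this:

# step 2: back up everything matched by the referenced ResourceSet
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: pre-cleanup-backup              # illustrative name
spec:
  resourceSetName: rancher-resource-set # the default ResourceSet shipped with rancher-backup
---
# step 4: restore from the archive produced by the Backup above
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-pre-cleanup             # illustrative name
spec:
  backupFilename: pre-cleanup-backup-<timestamp>.tar.gz # placeholder, use the actual archive name
  prune: false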

A possible fix, which is tough to maintain over time until #487 becomes a thing, is to broaden the backup of the machine-plan secrets by creating a new ResourceSet or enhancing the existing one with the following entry:

- apiVersion: v1
  kindsRegexp: ^secrets$
  namespaceRegexp: "^.*"
  resourceNameRegexp: machine-plan$|rke-state$|machine-state$|machine-driver-secret$|machine-provision$|^harvesterconfig|^registryconfig-auth

This way, the important machine-plan secret is part of the backup and gets restored, so the downstream cluster's system agent can connect just fine.
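
For illustration, the separate-ResourceSet route could look roughly like the following; this is a sketch assuming the resources.cattle.io/v1 ResourceSet schema used by rancher-backup, and the object name is made up:

apiVersion: resources.cattle.io/v1
kind: ResourceSet
metadata:
  name: machine-plan-secrets   # hypothetical name
resourceSelectors:
- apiVersion: v1
  kindsRegexp: ^secrets$
  namespaceRegexp: "^.*"       # any namespace, not just fleet-default
  resourceNameRegexp: machine-plan$|rke-state$|machine-state$|machine-driver-secret$|machine-provision$|^harvesterconfig|^registryconfig-auth

Since a Backup references exactly one ResourceSet via spec.resourceSetName, extending the existing rancher-resource-set is probably the more practical option.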

Expected behavior

  • machine-plan secrets are essential and should be backed up regardless of the namespace they reside in

Note: I'd be happy to contribute a PR; I just don't know whether namespaceRegexp: "^.*" might be too generic for your taste, although the resource name selectors are still quite specific.

mallardduck (Member) commented Sep 11, 2024

Hey @hwo-wd - I don't need the secret itself, but can you provide a YAML representation (names and relevant metadata) of the situation you are encountering? A bit more context would help get us up to speed without having to reproduce it.

Also, if you are a paying Rancher subscriber, please open a support case and give your Support Engineer a reference to this issue. They can create an internal issue to mirror it, which can help expedite the investigation and resolution of issues like this, since your Support Engineer can securely collect more specific details and provide them to us.

hwo-wd (Author) commented Sep 12, 2024

Thanks for coming back to me, Dan.
Sorry for being unclear; I'll elaborate on my situation a bit more:
I have a setup using Flux CD to create a new cluster from scratch: basically I'm creating two custom resources, (1) a provisioning.cattle.io/Cluster and (2) an rke-machine-config.cattle.io/VmwarevsphereConfig, which makes Rancher kick in and provision the cluster (a rough sketch follows below the screenshot). The thing is, in order to support multiple clusters, I'm provisioning each cluster in its own namespace, which allows for easy separation without having to deal with naming conflicts etc. (since ServiceAccounts etc. come into play, too).
From what I've seen it's not possible to have a separate namespace per cluster via the UI: there is no namespace selector and each cluster gets implicitly provisioned in fleet-default.
Anyway, everything is working nicely until it comes to the backup scenario: the essential secrets, shown in the screenshot below, are NOT backed up by the default ResourceSet since they don't reside in fleet-default but e.g. in the namespace spongebob instead; the fix is easy enough (#575) and makes the restore procedure work like a charm.

[screenshot: machine-plan and related secrets located in the spongebob namespace rather than fleet-default]
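
For concreteness, the per-cluster layout looks roughly like this; the namespace, resource names, version and machine-pool details are illustrative rather than the exact manifests from my environment:

apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: spongebob               # illustrative cluster name
  namespace: spongebob          # per-cluster namespace instead of fleet-default
spec:
  kubernetesVersion: v1.28.12+rke2r1
  rkeConfig:
    machinePools:
    - name: pool1
      etcdRole: true
      controlPlaneRole: true
      workerRole: true
      quantity: 1
      machineConfigRef:
        kind: VmwarevsphereConfig
        name: spongebob-nodes   # points at the machine config below
---
apiVersion: rke-machine-config.cattle.io/v1
kind: VmwarevsphereConfig
metadata:
  name: spongebob-nodes
  namespace: spongebob
# vSphere-specific fields (datacenter, cpuCount, memorySize, ...) omitted

Rancher then creates the *-machine-plan, *-machine-state, etc. secrets in that same namespace, which is exactly what the default ResourceSet does not cover.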

mallardduck (Member) commented:

@hwo-wd - I think that for the time being you should feel comfortable modifying your Rancher Backups ResourceSet to work around this difficulty. Editing it is acceptable both for workarounds of this nature and for backing up resources not created directly by Rancher, so your use case with Flux falls a little under both.

Our team will triage the issue further and will likely investigate it as part of our ongoing effort to audit and improve Fleet-related integrations. While the fix you propose seems easy enough, it could have unintended effects for other users.

mallardduck added the fleet (Related to fleet integration) label Sep 23, 2024
github-actions bot closed this as not planned Dec 7, 2024
mallardduck reopened this Dec 18, 2024
rancher deleted a comment from github-actions bot Dec 18, 2024