[occm] Ability to delay node removal #2213
How does this work? Which component is going to force the update of providerID on a node which already exists? The providerID has the property of being unique, so once set it's treated as canonical everywhere I'm aware of. What mechanism would we use to indicate that this should not happen?

Incidentally, anything involving a delay isn't going to be a robust solution. If this use case is accepted then we would need something explicit.

The single-node master case is interesting, though. I wonder if there could be some way to cause the node lifecycle manager to ignore specific nodes, allowing the administrator to delete them manually if required. Looking at the code, there doesn't seem to be one currently: https://github.com/kubernetes/kubernetes/blob/08755fe2490beb91b5c2866c3413dec043fd72f9/staging/src/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go#L130-L197

This is not an OpenStack-specific issue. I don't think we would address it in occm. There appears to be some related discussion in kubernetes/cloud-provider#35. I would recommend closing this issue and re-opening the discussion there.
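To make the "ignore specific nodes" idea concrete: there is no such mechanism in the controller today, but a purely hypothetical annotation check inside the shared node lifecycle loop might look something like this (the annotation name and helper are invented for illustration):

```go
// Hypothetical sketch only: cloud-provider's node lifecycle controller has no
// such check today, and the annotation name below is invented for illustration.
package sketch

import v1 "k8s.io/api/core/v1"

// skipDeletionAnnotation would mark a node that the administrator wants to
// delete manually even if the backing cloud instance has disappeared.
const skipDeletionAnnotation = "node.example.io/skip-cloud-deletion" // made up

// shouldSkipCloudDeletion would be consulted by the lifecycle loop before it
// deletes a Node object whose instance no longer exists in the cloud.
func shouldSkipCloudDeletion(node *v1.Node) bool {
	return node.Annotations[skipDeletionAnnotation] == "true"
}
```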
@mdbooth Thanks for the feedback. I dug deeper into the topic:
Instead, I would suggest that occm does not remove nodes carrying the proposed annotation. kubernetes/cloud-provider#35 sounds different to me (nodes with the suggested annotation should still be supervised, just not deleted).
I think you are proposing a way to keep the original node even though the underlying instance is gone (with the flag above set to false), but I don't fully understand why it needs to rejoin as a deleted member. Maybe the new node has exactly the same member name, etc.?
We don't actually implement most of the 'architectural' behaviour in occm. For consistency between cloud providers, the majority of the controller, including all of its high-level control flow, is implemented in cloud-provider. We essentially just implement some interfaces that provide cloud-specific behaviour. You can see how that works in the code I linked above: that is the node lifecycle control loop, and you can see that it calls cloud-specific functions via an interface.

This use case does sound interesting, but it's not specific to OpenStack. The logic to implement it would almost certainly have to be in cloud-provider, and it would apply to all cloud providers. It might have to extend beyond just 'don't delete this node', because for the cloud provider to continue managing node addresses after the replacement is created we would have to change the provider ID, and we'd also need a mechanism to do that safely.

It's also possible that with more discussion at the right level it might be possible to do what you need at a different level. If you find the right place to ask the question (I suggest trying cloud-provider next, but they may forward you again), I wouldn't be surprised to discover that there's already an accepted way to achieve it, or at least a number of other people with the same requirements.
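For context, the cloud-specific hooks that the shared node lifecycle controller calls are essentially the InstancesV2 interface from k8s.io/cloud-provider. The sketch below is paraphrased and trimmed; see cloud.go in that repository for the authoritative definition:

```go
// Paraphrased and trimmed from k8s.io/cloud-provider (cloud.go). occm only
// fills in these cloud-specific hooks; the decision to delete a Node object
// lives in the shared controller loop linked above.
package cloudprovider

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

type InstancesV2 interface {
	// InstanceExists reports whether the instance backing the node still
	// exists in the cloud; the shared controller deletes the Node object
	// when this returns false.
	InstanceExists(ctx context.Context, node *v1.Node) (bool, error)
	// InstanceShutdown reports whether the instance is shut down.
	InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error)
	// InstanceMetadata returns the provider ID, instance type and node
	// addresses that the controller applies to the Node object.
	InstanceMetadata(ctx context.Context, node *v1.Node) (*InstanceMetadata, error)
}

// InstanceMetadata (trimmed) is what the provider reports back for a node.
type InstanceMetadata struct {
	ProviderID    string
	InstanceType  string
	NodeAddresses []v1.NodeAddress
	Zone          string
	Region        string
}
```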
I think this is the right direction and a good suggestion.
The root issue is that an excluded etcd member cannot rejoin the same etcd cluster. In short, this rules out any persistence at the node-replacement level. I agree this is out of scope for this repo. However, going further with that architecture, I am not convinced this is a strong enough need to open an upstream issue; I will rather adapt the logic to work around it. Thanks for the comments.
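For anyone hitting the same etcd symptom: the usual manual path is to remove the stale member and add a fresh one, but a re-added member has to start from an empty data directory, which is exactly why the persisted-data approach above breaks. A minimal sketch with the official Go client, assuming endpoint, member name and peer URL values:

```go
// Minimal sketch of the manual etcd member replacement flow, using the
// official client (go.etcd.io/etcd/client/v3). Endpoint, member name and
// peer URL are assumptions for illustration.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // assumed endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Remove the member entry left behind by the destroyed instance.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		if m.Name == "master-0" { // assumed name of the replaced node
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
		}
	}

	// Re-add the member; the rejoining node must then start etcd with an
	// empty data directory and --initial-cluster-state=existing.
	if _, err := cli.MemberAdd(ctx, []string{"https://10.0.0.10:2380"}); err != nil { // assumed peer URL
		log.Fatal(err)
	}
}
```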
/kind feature
What happened:
Currently, occm monitors nodes with a `providerID` and deletes the node from the cluster when the corresponding instance no longer exists in OpenStack.

What you expected to happen:
In most of our setups, we persist the Kubernetes data directory (RKE2 in our case) for worst-case scenarios and/or single-master clusters. This means that during an update of the nodes, the OpenStack instance is destroyed and a new one is respawned with the same disk/data. The new node then cannot rejoin the cluster because it is identified as a deleted node (e.g. etcd does not accept a deleted member).
How to reproduce it:
NA
Anything else we need to know?:
The best implementation would be to have a configurable delay before a node is deleted, together with support for updating the `providerID`.
Environment: