
cloud-provider-vsphere should ignore nodes with providerID prefix != 'vsphere://' #677

Closed
tommasopozzetti opened this issue Dec 9, 2022 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tommasopozzetti

What happened?

When running with the in-tree cloud provider integration, each kubelet deals only with its own node, which allows different kubelets in a cluster to rely on different cloud provider implementations.
When moving to this external implementation, the controller assumes that every node in the cluster is a vSphere VM. If that is not the case, it deletes nodes whose VMs it cannot find as soon as they become NotReady, and it continuously logs warnings that the VMs cannot be found while the nodes are Ready.

What did you expect to happen?

The controller should allow multiple cloud providers to coexist in the same cluster. For example, in the ClusterAPI case, a cluster might contain both VMs and bare-metal nodes, with the bare-metal nodes managed by a different provider such as BYOH or metal3.

Each provider already differentiates its nodes by setting a unique providerID prefix, so it would be enough for cloud-provider-vsphere to (a) give a newly appeared node some time to be initialized before assuming it must be deleted (in case no corresponding VM is found and the node therefore does not need to be initialized by this provider), and (b) simply ignore initialized nodes whose providerID prefix is not the vSphere one, since that indicates another cloud controller manager is managing them.
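A minimal Go sketch of the prefix check this asks for, assuming a hypothetical helper inside the provider (the names below are illustrative, not actual cloud-provider-vsphere code):

```go
package vsphere

import (
	"strings"

	v1 "k8s.io/api/core/v1"
)

// vsphereProviderIDPrefix is the prefix cloud-provider-vsphere sets on nodes it manages.
const vsphereProviderIDPrefix = "vsphere://"

// managedByThisProvider reports whether a node should be handled by this
// provider. An initialized node whose providerID carries a different prefix
// (for example a BYOH or metal3 node) is assumed to belong to another cloud
// controller manager and is ignored.
func managedByThisProvider(node *v1.Node) bool {
	providerID := node.Spec.ProviderID
	if providerID == "" {
		// Not initialized yet: this provider may still need to claim it,
		// so it cannot be ignored outright (see the timeout discussion below).
		return true
	}
	return strings.HasPrefix(providerID, vsphereProviderIDPrefix)
}
```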

How can we reproduce it (as minimally and precisely as possible)?

Create a ClusterAPI cluster with nodes from the CAPV provider and install the cloud-provider-vsphere. Add a node from a different provider (such as BYOH provider).

Observe cloud-provider-vsphere emitting warning logs that the VM cannot be found, and deleting the node if it does not become Ready in time.

Anything else we need to know (please consider providing level 4 or above logs of CPI)?

A current workaround consists in disabling node deletion in cloud-provider-vsphere by setting the SKIP_NODE_DELETION environment variable, as explained here. This allows the nodes to coexist by avoiding deletion when the bare-metal nodes are NotReady, but cloud-provider-vsphere still generates lots of warning logs about VMs not being found. Furthermore, this workaround gives up the useful feature of deleting a node when its VM is removed from the provider (although similar behavior can be achieved, at least in the ClusterAPI case, through the CAPI primitives and the CAPV provider).
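For illustration only, a hypothetical Go sketch of how a SKIP_NODE_DELETION switch might gate the existence check that drives deletion; the real cloud-provider-vsphere mechanism may differ, so refer to the linked documentation for the actual behavior:

```go
package vsphere

import (
	"os"
	"strconv"
)

// skipNodeDeletion reports whether the SKIP_NODE_DELETION environment variable
// is set to a true-ish value ("true", "1", ...). Hypothetical helper name.
func skipNodeDeletion() bool {
	v, err := strconv.ParseBool(os.Getenv("SKIP_NODE_DELETION"))
	return err == nil && v
}

// instanceExists sketches how such a gate could short-circuit the existence
// check: with the flag on, a missing VM is still reported as existing, so the
// node lifecycle controller never deletes the node. This is an assumed
// mechanism, not the project's actual code path.
func instanceExists(vmFound bool) bool {
	if skipNodeDeletion() {
		return true
	}
	return vmFound
}
```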

Kubernetes version

1.23.5

Cloud provider or hardware configuration

vSphere 7.0.3.00700

OS version

No response

Kernel (e.g. uname -a)

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

Others

No response

@tommasopozzetti tommasopozzetti added the kind/bug Categorizes issue or PR as related to a bug. label Dec 9, 2022
@lubronzhan
Contributor

lubronzhan commented Dec 16, 2022

Hi @tommasopozzetti, this is a common issue with the out-of-tree Cloud Provider Interface, not just with CP vSphere.
Check out kubernetes/kubernetes#88820 and kubernetes/cloud-provider#35.
It was proposed before but never got worked on.
I would say this is a feature improvement, since we never claimed that CPI supports a single cluster with machines from multiple infrastructures. By default, each cloud provider only manages the mapping between nodes and machines on its own infrastructure.

Let me check if it's possible to ignore the node. Probably not, since CPI will just delete those nodes unless we throw an error.
https://github.com/kubernetes/cloud-provider/blob/master/controllers/nodelifecycle/node_lifecycle_controller.go#L155-L197
We can't simply have CP vSphere tell CPI that the node exists, since CPV is not the one responsible for the third-party node. This requires a change in CPI (https://github.com/kubernetes/cloud-provider), not just at the cloud-provider-vsphere level.
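For reference, the linked section behaves roughly as in the simplified Go paraphrase below; the interface and function names are illustrative, not the actual kubernetes/cloud-provider source:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// existsChecker abstracts the single question the controller asks the cloud
// provider: does the backing instance for this node still exist?
type existsChecker interface {
	InstanceExistsByProviderID(ctx context.Context, providerID string) (bool, error)
}

// monitorNode paraphrases the per-node decision: healthy nodes are skipped,
// and a NotReady node is deleted as soon as the provider reports its instance
// gone. An error from the provider leaves the node alone, which is the only
// hook the provider has to prevent deletion.
func monitorNode(ctx context.Context, cloud existsChecker, node *v1.Node, deleteNode func(*v1.Node)) error {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
			return nil // node is Ready; nothing to do
		}
	}
	exists, err := cloud.InstanceExistsByProviderID(ctx, node.Spec.ProviderID)
	if err != nil {
		return err // provider errored: the node is kept for now
	}
	if !exists {
		deleteNode(node) // provider says the VM is gone: the node is removed
	}
	return nil
}
```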

I will bring this up in the sig-cloud-provider meeting. Thanks for mentioning this.

@tommasopozzetti
Author

Hi @lubronzhan, thank you for taking a look and for the extra info!

I think your original proposal of an alpha feature would still be reasonable, though. While I agree that this is a shared effort between CPI and CP vSphere, the assumption is that CPI will also be running in the other provider's cloud controller manager, so I think it is still CP vSphere's responsibility to ignore nodes that are not specific to this implementation, to avoid throwing errors for nodes that are not registered with this cloud provider.

The issue that remains, and that would indeed fall more on the CPI side, is node deletion (i.e., what to do with a node that is not yet initialized and for which vSphere cannot find an existing VM). In the grand scheme of things, I think implementing something like providerClasses would be the right way to go, but in the meantime even a simple timeout before deleting nodes (reasonable, and possibly configurable as another alpha feature disabled by default) would solve the issue by giving nodes managed by another provider time to initialize.
CP vSphere can influence the node-deletion decision, as you said, by throwing the proper errors like here. It could define two errors, say NodeInitializing and NodeNotManaged, to throw when a node's VM is not found and we are still within the timeout, or when a node has been initialized with a different providerID. After the timeout, if no other providerID has been set on the node, CP vSphere can proceed with normal deletion.
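A rough Go sketch of what that proposal could look like on the CP vSphere side; every name here (the errors, the grace period, the helper) is hypothetical and only meant to illustrate the idea:

```go
package vsphere

import (
	"errors"
	"strings"
	"time"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical sentinel errors the provider could return instead of reporting
// "instance does not exist", so the node lifecycle controller keeps the node.
var (
	ErrNodeInitializing = errors.New("node is still initializing; no vSphere VM found yet")
	ErrNodeNotManaged   = errors.New("node is managed by a different cloud provider")
)

// initializationGracePeriod is the proposed (ideally configurable) timeout
// during which a node without a matching VM is given time to be claimed by
// another provider. The value here is an arbitrary example.
const initializationGracePeriod = 5 * time.Minute

// classifyUnknownNode decides what to do with a node for which the usual
// vCenter lookup found no VM (vmFound == false).
func classifyUnknownNode(node *v1.Node, vmFound bool) error {
	if vmFound {
		return nil // normal case: CP vSphere manages this node
	}
	if id := node.Spec.ProviderID; id != "" && !strings.HasPrefix(id, "vsphere://") {
		return ErrNodeNotManaged // another provider claimed it: ignore it
	}
	if time.Since(node.CreationTimestamp.Time) < initializationGracePeriod {
		return ErrNodeInitializing // still within the grace period: do not delete yet
	}
	return nil // timeout expired and no other providerID appeared: proceed with normal deletion
}
```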

Let me know what you think!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 16, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 15, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale May 15, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
