
cloud-provider-vsphere should ignore nodes with providerID prefix != 'vsphere://' #677

Closed
tommasopozzetti opened this issue Dec 9, 2022 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@tommasopozzetti

What happened?

When running with the in-tree cloud provider integration, each kubelet deals only with its own node, which allows different kubelets in a cluster to rely on different cloud provider implementations.
When moving to this external implementation, the controller assumes that every node in the cluster is a vSphere VM. If that is not the case, it deletes nodes whose VMs it cannot find as soon as they become NotReady, and it continuously logs warnings that the VMs cannot be found while the nodes are Ready.

What did you expect to happen?

The controller should allow multiple cloud providers to coexist in the same cluster. For example, in the ClusterAPI case, a cluster might contain both VMs and bare-metal nodes, with the bare-metal nodes managed by a different provider such as BYOH or metal3.

Each provider already differentiates its nodes by setting a unique providerID prefix, so it would be enough for cloud-provider-vsphere to (a) give a newly appeared node some time to be initialized before assuming it must be deleted (in case no corresponding VM is found and the node therefore does not need to be initialized by this provider), and (b) simply ignore initialized nodes whose providerID prefix is not the vSphere one, since that indicates another cloud controller manager is managing them.
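A minimal Go sketch of the prefix check this asks for, assuming a hypothetical helper inside the provider (the names below are illustrative, not actual cloud-provider-vsphere code):

```go
package vsphere

import (
	"strings"

	v1 "k8s.io/api/core/v1"
)

// vsphereProviderIDPrefix is the prefix cloud-provider-vsphere sets on nodes it manages.
const vsphereProviderIDPrefix = "vsphere://"

// managedByThisProvider reports whether a node should be handled by this
// provider. An initialized node whose providerID carries a different prefix
// (for example a BYOH or metal3 node) is assumed to belong to another cloud
// controller manager and is ignored.
func managedByThisProvider(node *v1.Node) bool {
	providerID := node.Spec.ProviderID
	if providerID == "" {
		// Not initialized yet: this provider may still need to claim it,
		// so it cannot be ignored outright (see the timeout discussion below).
		return true
	}
	return strings.HasPrefix(providerID, vsphereProviderIDPrefix)
}
```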

How can we reproduce it (as minimally and precisely as possible)?

Create a ClusterAPI cluster with nodes from the CAPV provider and install the cloud-provider-vsphere. Add a node from a different provider (such as BYOH provider).

Observe cloud-provider-vsphere emitting warning logs that the VM cannot be found, and deleting the node if it does not become Ready in time.

Anything else we need to know (please consider providing level 4 or above logs of CPI)?

A current workaround consists in disabling node deletion in cloud-provider-vsphere by setting the SKIP_NODE_DELETION environment variable, as explained here. This allows the nodes to coexist by avoiding deletion when the bare-metal nodes are NotReady, but cloud-provider-vsphere still generates lots of warning logs about VMs not being found. Furthermore, this workaround gives up the useful feature of deleting a node when its VM is removed from the provider (although similar behavior can be achieved, at least in the ClusterAPI case, through the CAPI primitives and the CAPV provider).
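For illustration only, a hypothetical Go sketch of how a SKIP_NODE_DELETION switch might gate the existence check that drives deletion; the real cloud-provider-vsphere mechanism may differ, so refer to the linked documentation for the actual behavior:

```go
package vsphere

import (
	"os"
	"strconv"
)

// skipNodeDeletion reports whether the SKIP_NODE_DELETION environment variable
// is set to a true-ish value ("true", "1", ...). Hypothetical helper name.
func skipNodeDeletion() bool {
	v, err := strconv.ParseBool(os.Getenv("SKIP_NODE_DELETION"))
	return err == nil && v
}

// instanceExists sketches how such a gate could short-circuit the existence
// check: with the flag on, a missing VM is still reported as existing, so the
// node lifecycle controller never deletes the node. This is an assumed
// mechanism, not the project's actual code path.
func instanceExists(vmFound bool) bool {
	if skipNodeDeletion() {
		return true
	}
	return vmFound
}
```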

Kubernetes version

1.23.5

Cloud provider or hardware configuration

vSphere 7.0.3.00700

OS version

No response

Kernel (e.g. uname -a)

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

Others

No response

@tommasopozzetti tommasopozzetti added the kind/bug Categorizes issue or PR as related to a bug. label Dec 9, 2022
@lubronzhan
Contributor

lubronzhan commented Dec 16, 2022

Hi @tommasopozzetti, this is a common issue with the out-of-tree Cloud Provider Interface, not just with CP vSphere.
Check out kubernetes/kubernetes#88820 and kubernetes/cloud-provider#35.
It was proposed before but never got worked on.
I would say this is a feature improvement, since we never claimed that CPI supports a single cluster with machines from multiple infrastructures. By default, each cloud provider only manages the mapping between nodes and machines on its own infrastructure.

Let me check if it's possible to ignore the node. Probably not, since CPI will just delete those nodes unless we throw an error.
https://github.com/kubernetes/cloud-provider/blob/master/controllers/nodelifecycle/node_lifecycle_controller.go#L155-L197
We can't simply have CP vSphere tell CPI that the node exists, since CPV is not the one responsible for the third-party node. This requires a change in CPI (https://github.com/kubernetes/cloud-provider), not just at the cloud-provider-vsphere level.
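For reference, the linked section behaves roughly as in the simplified Go paraphrase below; the interface and function names are illustrative, not the actual kubernetes/cloud-provider source:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// existsChecker abstracts the single question the controller asks the cloud
// provider: does the backing instance for this node still exist?
type existsChecker interface {
	InstanceExistsByProviderID(ctx context.Context, providerID string) (bool, error)
}

// monitorNode paraphrases the per-node decision: healthy nodes are skipped,
// and a NotReady node is deleted as soon as the provider reports its instance
// gone. An error from the provider leaves the node alone, which is the only
// hook the provider has to prevent deletion.
func monitorNode(ctx context.Context, cloud existsChecker, node *v1.Node, deleteNode func(*v1.Node)) error {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
			return nil // node is Ready; nothing to do
		}
	}
	exists, err := cloud.InstanceExistsByProviderID(ctx, node.Spec.ProviderID)
	if err != nil {
		return err // provider errored: the node is kept for now
	}
	if !exists {
		deleteNode(node) // provider says the VM is gone: the node is removed
	}
	return nil
}
```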

I will bring this up in the sig-cloud-provider meeting. Thanks for mentioning this.

@tommasopozzetti
Author

Hi @lubronzhan, thank you for taking a look and for the extra info!

I think your original proposal of an alpha feature would still be reasonable, though. While I agree that this is a shared effort between CPI and CP vSphere, the assumption is that CPI will also be running in the other provider's cloud controller manager, so I think it is still CP vSphere's responsibility to ignore nodes that are not specific to this implementation, to avoid throwing errors for nodes that are not registered with this cloud provider.

The issue that remains, and that would indeed fall more on the CPI side, is node deletion (i.e., what to do with a node that is not yet initialized and for which vSphere cannot find an existing VM). In the grand scheme of things, I think implementing something like providerClasses would be the right way to go, but in the meantime even a simple timeout before deleting nodes (reasonable, and possibly configurable as another alpha feature disabled by default) would solve the issue by giving nodes managed by another provider time to initialize.
CP vSphere can influence the node-deletion decision, as you said, by throwing the proper errors like here. It could define two errors, say NodeInitializing and NodeNotManaged, to throw when a node's VM is not found and we are still within the timeout, or when a node has been initialized with a different providerID. After the timeout, if no other providerID has been set on the node, CP vSphere can proceed with normal deletion.
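A rough Go sketch of what that proposal could look like on the CP vSphere side; every name here (the errors, the grace period, the helper) is hypothetical and only meant to illustrate the idea:

```go
package vsphere

import (
	"errors"
	"strings"
	"time"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical sentinel errors the provider could return instead of reporting
// "instance does not exist", so the node lifecycle controller keeps the node.
var (
	ErrNodeInitializing = errors.New("node is still initializing; no vSphere VM found yet")
	ErrNodeNotManaged   = errors.New("node is managed by a different cloud provider")
)

// initializationGracePeriod is the proposed (ideally configurable) timeout
// during which a node without a matching VM is given time to be claimed by
// another provider. The value here is an arbitrary example.
const initializationGracePeriod = 5 * time.Minute

// classifyUnknownNode decides what to do with a node for which the usual
// vCenter lookup found no VM (vmFound == false).
func classifyUnknownNode(node *v1.Node, vmFound bool) error {
	if vmFound {
		return nil // normal case: CP vSphere manages this node
	}
	if id := node.Spec.ProviderID; id != "" && !strings.HasPrefix(id, "vsphere://") {
		return ErrNodeNotManaged // another provider claimed it: ignore it
	}
	if time.Since(node.CreationTimestamp.Time) < initializationGracePeriod {
		return ErrNodeInitializing // still within the grace period: do not delete yet
	}
	return nil // timeout expired and no other providerID appeared: proceed with normal deletion
}
```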

Let me know what you think!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 16, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 15, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale May 15, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
