cloud-controller-manager should be able to ignore nodes #35
Comments
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
/lifecycle frozen
@andrewsykim There are some scenarios where the CCM should ignore nodes, e.g. virtual-kubelet, edge nodes, or datacenter nodes in a hybrid cluster.
@timoreimann I recall having a conversation about supporting multiple CCMs in a cluster; this is somewhat related. Are you interested in doing this work?
@andrewsykim yes, I'm very interested as it'd help us at DigitalOcean to ease testing. Though my intent is to go beyond just nodes and include load balancers as well. kubernetes/kubernetes#88820 is the ticket I filed for the wider purpose, and kubernetes/kubernetes#88820 (comment) has the summary of our discussion in one of the SIG meetings. Feel free to assign me to either / all tickets.
Hi everybody. I'm also looking for a way to ignore some nodes on AWS. May I ask, do you know of a solution to that?
AFAIK there's still none!
Bumping this up as it hasn't seen any love in a while. This is super useful to my company, as we would like to be able to operate hybrid clusters (OpenStack and bare metal in our case) while still being able to use cloud-controller-manager. I'd be happy to contribute to this effort; I just don't know where to start. A KEP, perhaps?
It comes down to identifying which CCM owns a given node. AWS has some notion that node names should be prefixed in a particular way. It may be that a KEP is needed to introduce a kubelet flag that adds something to the created Node object hinting at which CCM should own it, with all CCMs then implementing support for ignoring nodes whose hint is set to a value other than their own. Not too unrelated to this is the ability to run multiple AWS CCMs for having nodes in multiple regions or accounts.
That's a good point, it does mesh really nicely with allowing multiple CCMs (AWS or otherwise) to manage a single cluster. I was more approaching the idea of having an annotation on a node that indicates which CCM it should belong to, but we'd need a reproducible(?) way to identify CCMs... could be done as a simple argument to the CCM, or...?
Sounds like something similar to LoadBalancerClass and IngressClass.
Yeah, feels very similar. I like that parallel a lot.
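For reference, here is a minimal sketch of the existing LoadBalancerClass pattern that the parallel points at; the proposal would apply the same matching rule to Nodes. `spec.loadBalancerClass` is a real Service field, while `matchesClass`, `wantClass`, and the class name are purely illustrative.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// matchesClass mirrors how a load-balancer implementation decides whether a
// Service is "for" it: an unset spec.loadBalancerClass belongs to the default
// (cloud) implementation, otherwise the class string must match exactly.
func matchesClass(svc *v1.Service, wantClass string) bool {
	if svc.Spec.LoadBalancerClass == nil {
		return wantClass == "" // unset class: default implementation claims it
	}
	return *svc.Spec.LoadBalancerClass == wantClass
}

func main() {
	class := "example.com/internal-lb" // illustrative class name
	svc := &v1.Service{}
	svc.Spec.LoadBalancerClass = &class

	fmt.Println(matchesClass(svc, "example.com/internal-lb")) // true
	fmt.Println(matchesClass(svc, ""))                        // false
}
```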
Hi all, any update on this issue?
How are people doing multi-cloud Kubernetes clusters without this solved?
Why attempt to do it based on node name? Instead, do it based on a label or annotation; see the sketch after this exchange.
I don't think the underlying machine should be trusted to set this correctly for the same reason other k8s-namespaced labels are not allowed. It should be done by the provisioning/installer mechanism that handles things like the role labels.
If one wanted multi-region AWS, one would need multiple AWS CCMs, so this alone doesn't quite work. But some similar flag certainly could.
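A minimal sketch of the annotation idea from this exchange, assuming a hypothetical annotation key (`ccm.example.com/managed-by`) and an operator-chosen identity string per CCM instance; giving each regional CCM its own identity would also cover the multi-region point above. None of these names exist in any shipped CCM.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// managedByAnnotation is a hypothetical key; per the comment above, it would
// be set by the provisioning/installer mechanism, not by the node itself.
const managedByAnnotation = "ccm.example.com/managed-by"

// shouldManage reports whether this CCM instance, identified by an
// operator-chosen ownID (e.g. "aws-us-east-1" vs "aws-eu-west-1" in a
// multi-region setup), is responsible for the node. Nodes without the
// annotation keep today's behaviour.
func shouldManage(node *v1.Node, ownID string) bool {
	owner, ok := node.Annotations[managedByAnnotation]
	if !ok {
		return true // no ownership claim: fall back to current behaviour
	}
	return owner == ownID
}

func main() {
	node := &v1.Node{}
	node.Annotations = map[string]string{managedByAnnotation: "aws-us-east-1"}

	fmt.Println(shouldManage(node, "aws-us-east-1")) // true: this CCM owns it
	fmt.Println(shouldManage(node, "openstack"))     // false: ignore the node
}
```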
How I solved this issue:

I did not try to use routing/load balancing through Kubernetes resources, and I think it would be very complicated.
Interesting idea @sergelogvinov; I'm already using Talos, so I'm trying to figure out how that would work.
Looking at the AWS v2 code at least. But in the cloud-node-lifecycle controller, doesn't it proceed to delete the node as soon as the instance isn't found?
Do you actually have this working?
I did not try AWS yet; it is next on my to-do list.
Never mind the v2 code in the AWS CCM. That one is on ice and probably should be removed, though I doubt v1 is any better. I am happy to support changes in this direction. However, the more generic support (the mentioned flag and the logic for whether the CCM interface is being interacted with) should be added to this repo. If we are lucky, it might be that no CCMs using this lib need any changes then.
Oh? I didn't realise it wasn't ready for use. Could you share some info on that?
Indeed. If the instance is not found for the current cloud provider, then the node is deleted; see cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go, lines 235 to 236 in 97fdc45, for the logic in question.
v2 was an idea to make the CCM more modern, using CRDs for configuration and such. But as you can see from the git history, pretty much nothing has happened to it, while v1 is more actively maintained. The AWS CCM should absolutely be used in favour of the in-tree provider; kOps has been using it by default since 1.24.
I am thinking this should not be called if the node has a different label/class than what's passed in the flag.
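To illustrate where such a guard could sit, here is a sketch that paraphrases the cloud-node-lifecycle deletion path rather than quoting the real controller; the label key, the class value passed in, and the stub functions are all hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// ccmClassLabel is a hypothetical key, analogous to LoadBalancerClass.
const ccmClassLabel = "ccm.example.com/class"

// Stubs standing in for the controller's cloud lookup and node deletion.
func instanceExists(node *v1.Node) bool { return false } // pretend the cloud lost it
func deleteNode(node *v1.Node)          { fmt.Println("deleting", node.Name) }

// monitorNodes paraphrases the cloud-node-lifecycle deletion path with the
// proposed guard placed in front of it.
func monitorNodes(nodes []*v1.Node, ccmClass string) {
	for _, node := range nodes {
		// Proposed guard: skip nodes claimed by a different CCM class.
		if class, ok := node.Labels[ccmClassLabel]; ok && class != ccmClass {
			fmt.Println("ignoring", node.Name, "owned by class", class)
			continue
		}
		if !instanceExists(node) {
			deleteNode(node) // today this runs unconditionally once the instance is gone
		}
	}
}

func main() {
	onPrem := &v1.Node{}
	onPrem.Name = "onprem-1"
	onPrem.Labels = map[string]string{ccmClassLabel: "baremetal"}

	aws := &v1.Node{}
	aws.Name = "aws-1"
	aws.Labels = map[string]string{ccmClassLabel: "aws"}

	monitorNodes([]*v1.Node{onPrem, aws}, "aws") // deletes only aws-1
}
```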
I have been thinking about this issue, as I currently run an on-prem-plus-Vultr setup and every so often the Vultr CCM deletes all the on-prem nodes. I want to add additional providers/regions. I think the core issue comes down to node provenance and attestation in the node lifecycle controller, i.e. whether it can trust node-supplied data about being in the cloud, being managed by another CCM, or being manually configured and to be left alone. This led me to thinking about SPIRE/SPIFFE attestation; however, a pre-joined node may not have spire-agent deployed, and creating a hard dependency on SPIFFE might not fit all environments. Still, attaching information to the node object and validating it can, I think, be done with the existing machinery of an admission/validating webhook with object filtering (i.e. routing delete requests via a finalizer), possibly hijacking the token review to validate that the node object was modified by a trusted CCM. Thoughts?
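A rough sketch of that validating-webhook idea, under heavy assumptions: the label key, the service-account allowlist, and the endpoint path are illustrative, TLS and the ValidatingWebhookConfiguration are omitted, and the webhook would need to be registered for DELETE operations on Nodes. Only the admission-review core is shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// managedByLabel is hypothetical; set by the provisioner, as discussed above.
const managedByLabel = "ccm.example.com/managed-by"

// trustedDeleter maps an owner value to the only service account allowed to
// delete that owner's nodes. Both sides are illustrative.
var trustedDeleter = map[string]string{
	"vultr": "system:serviceaccount:kube-system:vultr-ccm",
}

func validateNodeDelete(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	req := review.Request
	resp := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}

	if req.Operation == admissionv1.Delete {
		var node corev1.Node
		// For DELETE requests the object being removed arrives in OldObject.
		if err := json.Unmarshal(req.OldObject.Raw, &node); err == nil {
			owner := node.Labels[managedByLabel]
			if want, ok := trustedDeleter[owner]; ok && req.UserInfo.Username != want {
				resp.Allowed = false
				resp.Result = &metav1.Status{Message: fmt.Sprintf(
					"node owned by %q; delete denied for %s", owner, req.UserInfo.Username)}
			}
		}
	}

	review.Response = resp
	json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/validate", validateNodeDelete)
	// Real deployments need TLS and webhook registration; plain HTTP here
	// only keeps the sketch short.
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```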
I’ve got a stretch cluster running across AWS and on-prem. The AWS CCM keeps deleting the on-prem nodes when they join the cluster because they’re not part of the cloud provider. Currently, I'm patching the AWS CCM DaemonSet to stop the CCM from running whenever an on-prem node needs to join, then re-patching to start the CCM back up, e.g.:

```
# Stop CCM
kubectl -n kube-system patch daemonset aws-cloud-controller-manager -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'

# Start CCM
kubectl -n kube-system patch daemonset aws-cloud-controller-manager --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
```
Continuing the discussion from kubernetes/kubernetes#73171, the CCM should have a mechanism to "ignore" a node in a cluster, either because it doesn't belong to a cloud provider or is not a node in the traditional sense (e.g. virtual kubelet). See the PR for more discussion.