Running hybrid cluster #167

Closed
Atoms opened this issue Jan 7, 2025 · 4 comments

Comments

@Atoms

Atoms commented Jan 7, 2025

Bug Report

Bare-metal node cannot join cluster if Proxmox CCM is used.

Description

Our use case requires both virtual machines and bare-metal machines in our Kubernetes cluster deployments.

Logs

E0107 15:04:48.914152       1 node_controller.go:244] "Unhandled Error" err="error syncing 'lv01-k8s-gpu-node01': failed to get instance metadata for node lv01-k8s-gpu-node01: instances.InstanceMetadata() - failed to find instance by name/uuid lv01-k8s-gpu-node01: vm with uuid '00000000-0000-0000-0000-3cecef922302' not found, skipped, requeuing" logger="UnhandledError"

Environment

Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.30.6+rke2r1
# cat /etc/os-release
NAME="AlmaLinux"
VERSION="9.5 (Teal Serval)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.5"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.5 (Teal Serval)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.5"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.5"
SUPPORT_END=2032-06-01
  • Plugin version:
  • Kubernetes version: [kubectl version --short]
  • Node describe: [kubectl describe node <node>]
  • OS version [cat /etc/os-release]
@sergelogvinov
Owner

Hello, could you please share your thoughts on how to approach the solution?

Based on the log message ("not found, skipped"), the CCM does nothing with this node.

@Atoms
Author

Atoms commented Jan 8, 2025

I'm not sure what to do here. I'm not even sure it's the CCM's fault that the node cannot join; I'm shooting in the dark.

So far, what I have found is this error. My guess is that it means the CCM cannot adjust the kubelet config the way it needs to for Proxmox virtual machines with providerID: proxmox://promxox-host01/118

There is a taint left on the node:
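To make the providerID shape concrete, here is a minimal sketch that splits such a value into its parts. It assumes the format is `proxmox://<segment>/<VM ID>`, based only on the example above; the `ProxmoxProviderID` type, the `region` field name, and the `parse_provider_id` helper are illustrative names, not part of Proxmox CCM.

```python
from typing import NamedTuple, Optional


class ProxmoxProviderID(NamedTuple):
    region: str  # assumed meaning: the cluster/host segment after the scheme
    vm_id: int   # assumed meaning: the numeric Proxmox VM ID


def parse_provider_id(provider_id: str) -> Optional[ProxmoxProviderID]:
    """Return the parsed parts, or None if the value is not a Proxmox-style ID."""
    prefix = "proxmox://"
    if not provider_id.startswith(prefix):
        return None
    region, sep, vm_id = provider_id[len(prefix):].partition("/")
    # Require both segments, and a purely numeric VM ID.
    if not region or not sep or not vm_id.isdecimal():
        return None
    return ProxmoxProviderID(region, int(vm_id))
```

A bare-metal node would normally have no such providerID, so a helper like this returns `None` for it.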

  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"

I suspect it is still there because this is a bare-metal node rather than a cloud node, so the CCM cannot finish its job: it does not find the node on the Proxmox hosts.
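A quick way to check for this taint is to inspect the node object as JSON. The sketch below assumes the node has been loaded into a plain dict, for example from the output of `kubectl get node <node> -o json`; the `is_uninitialized` helper name is illustrative.

```python
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"


def is_uninitialized(node: dict) -> bool:
    """True if the node still carries the cloud-provider 'uninitialized' taint."""
    # spec.taints may be missing or null on a node without taints.
    taints = node.get("spec", {}).get("taints") or []
    return any(taint.get("key") == UNINITIALIZED_TAINT for taint in taints)
```

For example, `is_uninitialized(json.loads(output))` would be true for the node shown above as long as the taint is present.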

I also don't know whether this would work in hybrid mode, where only some nodes are managed by the CCM.

Previously I ran Kubernetes in this hybrid model without a CCM, but since I need proxmox-csi, which needs that ProviderID magic to work, I have now switched to the Proxmox CCM.

The behaviour I currently see is that the node is in the Ready state but has no role, even though it was provisioned with the worker role. I use an RKE2 cluster, so rancher-system-agent sees that role, but later the kubelet loses it somewhere, and I suspect that is because the CCM is not configuring this node.

What if some kind of label on a node told the CCM not to try to configure it? If such a label exists, the CCM would just skip that node.

I see there is a Kubernetes CCM issue with some similarities, kubernetes/cloud-provider#35 (I would say it is a 100% match), and it has a comment about Alibaba using a label:

alibaba cloudprovider use service.beta.kubernetes.io/exclude-node in node labels to exclude node from ccm.

So I would like to see something similar...
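As a rough illustration of the idea, a controller could filter nodes by such a label before trying to initialize them. This is only a sketch of the proposal: the label name is taken from the Alibaba comment quoted above, nothing here is an existing Proxmox CCM feature, and the helper names are hypothetical.

```python
# Label name borrowed from the quoted Alibaba comment; treating it as an
# exclusion marker in another CCM is an assumption for illustration only.
EXCLUDE_LABEL = "service.beta.kubernetes.io/exclude-node"


def is_excluded(node: dict) -> bool:
    """True if the node carries the exclusion label, whatever its value."""
    labels = node.get("metadata", {}).get("labels") or {}
    return EXCLUDE_LABEL in labels


def nodes_to_initialize(nodes: list) -> list:
    """Names of the nodes a controller would still try to configure."""
    return [n["metadata"]["name"] for n in nodes if not is_excluded(n)]
```

With this scheme, a bare-metal node labeled with the exclusion key would simply never reach the lookup that fails in the log above.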

@sergelogvinov
Owner

Join the bare-metal node without the --cloud-provider=external flag. Please read this part on how CCM/kubelet works: https://github.com/sergelogvinov/proxmox-cloud-controller-manager/blob/main/docs/install.md#troubleshooting and https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-cloud-controller-manager-ccm-jdn
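The reason is that a kubelet started with --cloud-provider=external registers its node with the node.cloudprovider.kubernetes.io/uninitialized taint and waits for a CCM to remove it. A minimal sketch of choosing the flag per node type, where the `kubelet_cloud_args` helper and its `is_proxmox_vm` parameter are hypothetical names for whatever your provisioning tooling uses:

```python
def kubelet_cloud_args(is_proxmox_vm: bool) -> list:
    """Extra kubelet arguments for a node, depending on where it runs."""
    if is_proxmox_vm:
        # The CCM is expected to find this VM in Proxmox and initialize the node.
        return ["--cloud-provider=external"]
    # Bare metal: no external cloud provider, so the kubelet does not wait for a CCM.
    return []
```

How these arguments are passed to the kubelet depends on the distribution and is not shown here.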

Proxmox CCM does not change a node if it is not part of Proxmox VE. This helps the Kubernetes community create hybrid cloud setups.

If you notice that a self-hosted node disappears from the cluster, check the logs of all the cluster components (addons). Proxmox CCM will show a message like: "I am skipping this node completely." This behavior is supported only by Proxmox CCM and Azure. Other CCMs (as far as I know) will instantly remove the node from the cluster.

There are many issues and discussions about this. The short answer is: Kubernetes does NOT officially support hybrid clusters.

Proxmox CCM was designed to support multi-region, multi-zone, and hybrid clusters. It should work well; at least, it works in my production hybrid cluster setups across different clouds. You can check my research here: https://github.com/sergelogvinov/terraform-talos

Also, consider joining or following the Kubernetes Hybrid Cloud community group: https://github.com/kubernetes-hybrid-cloud. This helps us improve the hybrid cloud experience for Kubernetes.

Thanks for your interest in hybrid cloud installations 👍

@Atoms
Author

Atoms commented Jan 8, 2025

Thanks @sergelogvinov for the assistance outside GitHub. I finally got my initial Proxmox CSI setup to work without the CCM, which means I can ditch the CCM, as it was not needed for anything other than adding labels to nodes for CSI.

Atoms closed this as completed Jan 8, 2025