Calico and HCC #641

Closed
medicol69 opened this issue May 8, 2024 · 25 comments
Labels
enhancement (New feature or request), stale

Comments

@medicol69

TL;DR

This is more of an inquiry, since it's not clear from the documentation: does the Hetzner cloud controller manager work with the Calico CNI when using the private interfaces on Hetzner? Thanks

Expected behavior

This is an inquiry about the documentation.

@medicol69 medicol69 added the enhancement New feature or request label May 8, 2024
@apricote
Member

When you use the private networks from Hetzner Cloud with hcloud-cloud-controller-manager and enable the routes-controller (default), then you should be able to use Calico without any additional overlay networks. You can configure this in Calico with CALICO_NETWORKING_BACKEND=none

I have never personally tested this configuration though.
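
For a manifest-based Calico install, the setting mentioned above might be applied roughly as follows. This is only a sketch: the `kube-system` namespace and the `calico-node` DaemonSet name are assumptions taken from the stock Calico manifests (operator-based installs are configured differently), and the helper name `disable_calico_backend` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: switch off Calico's own networking backend so that
# pod-to-pod traffic relies on the routes created by the HCCM routes-controller.
# Assumption: calico-node runs as a DaemonSet in kube-system (manifest install).
disable_calico_backend() {
  kubectl -n kube-system set env daemonset/calico-node \
    CALICO_NETWORKING_BACKEND=none
}
```

Call `disable_calico_backend` with your kubeconfig pointing at the cluster; the DaemonSet then rolls out with the new environment variable.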

@simonostendorf
Contributor

I am also interested in this topic. If you have any knowledge, @medicol69, please let me know :)

@DeprecatedLuke

DeprecatedLuke commented Jun 2, 2024

Yes, it works fine with calico. To run a quick test use hetzner-k3s.

Important warning when running cloud together with bare metal over private networking: Calico requires a /24 VLAN address block per node, so when you create a subnet, make sure the VLAN subnet is at minimum a /23 (1 node max) or ideally a /17 (127 nodes max), allocating the first half to cloud instances and the second half to bare-metal instances.
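
The sizing above can be checked with a little shell arithmetic. This is a sketch, not something from the thread: the function names are hypothetical, and the "minus one" is an assumption chosen to reproduce the node counts quoted in the comment (one /24 block is treated as unavailable for per-node allocation).

```shell
#!/usr/bin/env bash
# Number of blocks of size /block that fit into a subnet of size /prefix.
blocks_in_subnet() {
  prefix=$1
  block=$2
  echo $(( 1 << (block - prefix) ))
}

# Assumption: one block is not usable for a node, hence the "- 1".
max_nodes() {
  echo $(( $(blocks_in_subnet "$1" "$2") - 1 ))
}

max_nodes 23 24   # prints 1
max_nodes 17 24   # prints 127
```

By the same arithmetic, a /16 holds 256 blocks of /24.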

@medicol69
Author

Thanks, but I don't think the Hetzner private network interfaces are stable enough to use in production. If anyone has gotten them to work and can share an example of how to use them in prod, I'm all ears.

@DeprecatedLuke

I am currently running it just fine with calico and even have ceph working over vlan with pretty good performance. You cannot advertise nodeip with internal so define hostendpoint instead for metrics and etcd to be protected. Load balancers also require you to use public net in this case.

@simonostendorf
Contributor

simonostendorf commented Jun 3, 2024

I am using calico without encapsulation and hccm with routes enabled. Calico uses BPF and replaces kube-proxy.

I think this works well, but I haven't tested it enough to be 100% sure.

If you have any feedback on this configuration, I would love to discuss it :)

calico-tigera-operator-values.yaml

installation:
  cni:
    type: Calico
    ipam:
      type: HostLocal # use podCIDR assigned by kube-controller-manager, that is also used by route-controller in hcloud-cloud-controller-manager
  calicoNetwork:
    bgp: Enabled
    linuxDataplane: BPF
    hostPorts: Disabled
    ipPools:
      - name: default-ipv4
        cidr: 10.0.0.0/16
        encapsulation: None
        blockSize: 24
        natOutgoing: Enabled
        nodeSelector: all()
defaultFelixConfiguration:
  enabled: true
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfKubeProxyIptablesCleanupEnabled: true
kubernetesServiceEndpoint:
  host: api.my-cluster.domain.tld
  port: 6443
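
With `HostLocal` IPAM as in the values above, each node's pod range comes from the `podCIDR` that kube-controller-manager assigns. A quick way to list those ranges, for comparison with the routes shown on the Hetzner network, might look like this (the helper name `show_pod_cidrs` is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each node together with its assigned pod CIDR.
show_pod_cidrs() {
  kubectl get nodes \
    -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
}
```

Run `show_pod_cidrs` against the cluster; a node with an empty `PODCIDR` column has not been allocated a range yet.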

@DeprecatedLuke

I am not sure why, but when using hetzner-k3s the internal network works just fine, however, a manually bootstrapped cluster has an issue with the cloud controller where it does not recognize the internal ip address so it never gets the taint removed and the labels added.

I spent a few hours trying to figure out why, without being able to find any difference between the two configurations. My only guess is that it is some internal order of configuration where the metadata/private network endpoints are not being parsed in order.

So to recap: allocate at least /16 vlan range and do not use the hcloud controller (will not be able to use the load balancer or resolve labels automatically).

@simonostendorf
Contributor

simonostendorf commented Jun 4, 2024

> I am not sure why, but when using hetzner-k3s the internal network works just fine, however, a manually bootstrapped cluster has an issue with the cloud controller where it does not recognize the internal ip address so it never gets the taint removed and the labels added.

What kubernetes version do you use? Kubernetes 1.29 had a change that the node ip will be left empty if cloud-provider is set to external and --node-ip is not set manually. Maybe this is the case here.

From CHANGELOG-1.29: kubelet, when using --cloud-provider=external, will now initialize the node addresses with the value of --node-ip, if it exists, or waits for the cloud provider to assign the addresses. (https://github.com/kubernetes/kubernetes/pull/121028, [@aojea](https://github.com/aojea))
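
To see which addresses a node currently reports (and so whether `--node-ip` or the cloud provider has populated them), something like the following could be used. The helper name `show_node_addresses` is hypothetical; the node name is passed as its only argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the address types and values reported for one node,
# one per line, for example InternalIP, ExternalIP, and Hostname entries.
show_node_addresses() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.addresses[*]}{.type}={.address}{"\n"}{end}'
}
```

For example, `show_node_addresses my-node` (with a real node name) lists that node's addresses; an empty result would be consistent with the kubelet still waiting for the cloud provider.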

@medicol69
Author

> I am currently running it just fine with calico and even have ceph working over vlan with pretty good performance. You cannot advertise nodeip with internal so define hostendpoint instead for metrics and etcd to be protected. Load balancers also require you to use public net in this case.

I was thinking on private networking on hetzner, if anyone is doing that in production please share your config, and what are your experiences.

@simonostendorf
Contributor

> I was thinking on private networking on hetzner, if anyone is doing that in production please share your config, and what are your experiences.

I am currently testing this. You can see my calico values above. HCCM configuration is normal with networks enabled.

@DeprecatedLuke

DeprecatedLuke commented Jun 4, 2024

> > I am not sure why, but when using hetzner-k3s the internal network works just fine, however, a manually bootstrapped cluster has an issue with the cloud controller where it does not recognize the internal ip address so it never gets the taint removed and the labels added.

> What kubernetes version do you use? Kubernetes 1.29 had a change that the node ip will be left empty if cloud-provider is set to external and --node-ip is not set manually. Maybe this is the case here.

> From CHANGELOG-1.29: kubelet, when using --cloud-provider=external, will now initialize the node addresses with the value of --node-ip, if it exists, or waits for the cloud provider to assign the addresses. (https://github.com/kubernetes/kubernetes/pull/121028, [@aojea](https://github.com/aojea))

I tried both 1.29 and 1.30, here's my init script:

k3sup install --host $SERVER_HOST --ip $PUBLIC_IP --user root --ssh-key=~/.ssh/id_ed25519 --cluster --local-path ~/.kube/config --merge --context $CLUSTER --no-extras --k3s-channel latest --k3s-extra-args "\
--disable local-storage \
--disable metrics-server \
--disable-cloud-controller \
--kubelet-arg='provider-id=hcloud://$PROVIDER_ID' \
--kubelet-arg='cloud-provider=external' \
--flannel-backend=none \
--disable-network-policy \
--write-kubeconfig-mode=644 \
--cluster-domain=$CLUSTER_DOMAIN \
--cluster-cidr=$CLUSTER_CIDR \
--service-cidr=$CLUSTER_SERVICE_CIDR \
--cluster-dns=$CLUSTER_DNS \
--node-name=$SERVER_HOSTNAME \
--node-ip=$PRIVATE_IP \
--node-external-ip=$PUBLIC_IP \
--tls-san=$CLUSTER_LB \
--tls-san=$PRIVATE_IP \
--tls-san=$PUBLIC_IP \
--tls-san=$CLUSTER_DOMAIN \
--node-taint=CriticalAddonsOnly=true:NoExecute \
--etcd-expose-metrics='true' \
--kube-controller-manager-arg='bind-address=0.0.0.0' \
--kube-proxy-arg='metrics-bind-address=0.0.0.0' \
--kube-scheduler-arg='bind-address=0.0.0.0' \
" --print-command

EDIT: added node-ip=$PRIVATE_IP, the configuration before is what I am currently using to get around the issue.

> > I am currently running it just fine with calico and even have ceph working over vlan with pretty good performance. You cannot advertise nodeip with internal so define hostendpoint instead for metrics and etcd to be protected. Load balancers also require you to use public net in this case.

> I was thinking on private networking on hetzner, if anyone is doing that in production please share your config, and what are your experiences.

Yes, it does work including networking and routes out of the box when using hetzner-k3s tool. But I had issues with getting HCCM to recognize the nodes when defining an internal ip as the node network when attempting to bootstrap the cluster manually. However, using the public ip works fine (and routes are still created for internal communication). Robot does not support networking from HCCM.

@simonostendorf
Contributor

> Yes, it does work including networking and routes out of the box when using hetzner-k3s tool. But I had issues with getting HCCM to recognize the nodes when defining an internal ip as the node network when attempting to bootstrap the cluster manually. However, using the public ip works fine (and routes are still created for internal communication). Robot does not support networking from HCCM.

I am using kubeadm only on hcloud nodes (currently no dedicated / robot nodes, maybe i will add them later) and this works fine.

@DeprecatedLuke

DeprecatedLuke commented Jun 4, 2024

Alright, here's the full guide to replicate the issue:
init_master.sh

#!/bin/bash

CLUSTER=$1
CLUSTER_DOMAIN=$2
SERVER_HOST=$3
CLUSTER_PRIVATE_NET=$4
CLUSTER_CIDR=$5
CLUSTER_SERVICE_CIDR=$6
CLUSTER_DNS=$7
CLUSTER_LB=$8

PUBLIC_IP=$(ssh $SERVER_HOST "curl checkip.amazonaws.com")
PRIVATE_IP=$(ssh $SERVER_HOST "ip route get $CLUSTER_PRIVATE_NET | awk '{print \$7}'")
PROVIDER_ID=$(ssh $SERVER_HOST "curl http://169.254.169.254/hetzner/v1/metadata/instance-id")

echo "Public IP: $PUBLIC_IP Private IP: $PRIVATE_IP"

kubectl config delete-cluster $CLUSTER
kubectl config delete-user $CLUSTER

SERVER_HOSTNAME=$(echo $SERVER_HOST | cut -d'.' -f1)

ssh -y $SERVER_HOST "curl https://packages.hetzner.com/hcloud/deb/hc-utils_0.0.4-1_all.deb -o /tmp/hc-utils_0.0.3-1_all.deb -s && apt -y install /tmp/hc-utils_0.0.3-1_all.deb"

k3sup install --host $SERVER_HOST --ip $PUBLIC_IP --user root --ssh-key=~/.ssh/id_ed25519 --cluster --local-path ~/.kube/config --merge --context $CLUSTER --no-extras --k3s-channel latest --k3s-extra-args "\
--disable local-storage \
--disable metrics-server \
--disable-cloud-controller \
--kubelet-arg='provider-id=hcloud://$PROVIDER_ID' \
--kubelet-arg='cloud-provider=external' \
--flannel-backend=none \
--disable-network-policy \
--write-kubeconfig-mode=644 \
--cluster-domain=$CLUSTER_DOMAIN \
--cluster-cidr=$CLUSTER_CIDR \
--service-cidr=$CLUSTER_SERVICE_CIDR \
--cluster-dns=$CLUSTER_DNS \
--node-name=$SERVER_HOSTNAME \
--node-ip=$PRIVATE_IP \
--node-external-ip=$PUBLIC_IP \
--tls-san=$CLUSTER_LB \
--tls-san=$PRIVATE_IP \
--tls-san=$PUBLIC_IP \
--tls-san=$CLUSTER_DOMAIN \
--node-taint=CriticalAddonsOnly=true:NoExecute \
--etcd-expose-metrics='true' \
--kube-controller-manager-arg='bind-address=0.0.0.0' \
--kube-proxy-arg='metrics-bind-address=0.0.0.0' \
--kube-scheduler-arg='bind-address=0.0.0.0' \
" --print-command

kubectl config set-cluster $CLUSTER --server=https://$CLUSTER_LB:6443
k3sup ready --context $CLUSTER # will fail since no CNI

bash init_master.sh test-cluster cluster.local IP_ADDRESS 10.224.0.0 10.222.0.0/16 10.223.0.0/16 10.223.0.10 IP_ADDRESS

kubectl config use-context test-cluster

Install calico:
helm repo add tigera https://docs.tigera.io/calico/charts
helm repo update tigera
helm install cni tigera/tigera-operator -n tigera-operator --create-namespace

Create HCCM secret with the network cidr and hcloud token.

Install hcloud:
helm repo add hcloud https://charts.hetzner.cloud
helm repo update hcloud
helm install hccm hcloud/hcloud-cloud-controller-manager -n kube-system --values values.yaml

nodeSelector:
  node-role.kubernetes.io/control-plane: "true"

Observe the following error:

error syncing '*node*': failed to get node modifiers from cloud provider: provided node ip for node "*node*" is not valid: failed to get node address from cloud provider that matches ip: 10.224.0.2, requeuing

edit: the actual name doesn't matter for the hostname since providerid is specified, usually the hostname would be a domain matching the name of the node and the calico step is optional.

@simonostendorf
Contributor

If you see "failed to get node address from cloud provider that matches ip: 10.x.x.x, requeuing", you have to enable the routes-controller with networking.enabled=true.
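
A sketch of how that setting might be applied to the Helm release from the reproduction steps above. The release name `hccm` and the `kube-system` namespace come from those steps; the helper name `enable_hccm_networking` is hypothetical, and the value is spelled `networking.enabled` as far as I know, so confirm it with `helm show values hcloud/hcloud-cloud-controller-manager` before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: (re)install HCCM with private network support enabled.
# Assumption: the chart exposes this switch as networking.enabled.
enable_hccm_networking() {
  helm upgrade --install hccm hcloud/hcloud-cloud-controller-manager \
    -n kube-system --set networking.enabled=true
}
```

The HCCM secret still has to reference the Hetzner network for this to take effect.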

@DeprecatedLuke

> If you see "failed to get node address from cloud provider that matches ip: 10.x.x.x, requeuing", you have to enable the routes-controller with networking.enabled=true.

Ah, that makes sense! You can't enable robot & network at the same time (refuses to start). However, if you change the label to get it to load it does work fine so it's still a weird limitation.

@simonostendorf
Contributor

simonostendorf commented Jun 7, 2024

What needs to be done to enable route controllers with robot support?

Is this generally supported by the underlying network and does the support need to be implemented in the hccm or are there any changes required to the Hetzner Cloud network?

(see https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/robot.md#unsupported)

Edit: We can move this to a new issue if needed, I am interested in this feature and could try to implement (parts of) it.

@DeprecatedLuke

> What needs to be done to enable route controllers with robot support?

> Is this generally supported by the underlying network and does the support need to be implemented in the hccm or are there any changes required to the Hetzner Cloud network?

> (see https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/robot.md#unsupported)

> Edit: We can move this to a new issue if needed, I am interested in this feature and could try to implement (parts of) it.

As far as I know the routes table in the network configuration is not compatible with vSwitch.

@simonostendorf
Contributor

> As far as I know the routes table in the network configuration is not compatible with vSwitch.

But I think it should be possible to use private ip addresses for the nodes (so this currently needs route controller enabled) and vswitch WITHOUT cidr routing.

@DeprecatedLuke

> > As far as I know the routes table in the network configuration is not compatible with vSwitch.

> But I think it should be possible to use private ip addresses for the nodes (so this currently needs route controller enabled) and vswitch WITHOUT cidr routing.

Yep, it's possible (with calico at least in VXLANCrossSubnet configuration). I've hacked it to recognize the nodes by setting the label alpha.kubernetes.io/provided-node-ip which was working for a short while before it got updated to the real one and broke pod scheduling.

@simonostendorf
Contributor

> As far as I know the routes table in the network configuration is not compatible with vSwitch.

I found the following configuration: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/network#expose_routes_to_vswitch.
This tells me that routes should be possible with vSwitch connected servers.

@DeprecatedLuke

> > As far as I know the routes table in the network configuration is not compatible with vSwitch.

> I found the following configuration: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/network#expose_routes_to_vswitch. This tells me that routes should be possible with vSwitch connected servers.

It's the option here: https://luk.cat/24/9L1WVG.png, but they are not assignable which is required for cni's to function: https://luk.cat/24/LJtyTr.png

@apricote
Member

The main problem with Robot & Routing is that there is no way to get the private IPs of the Robot server through the API (see #676 for an example).

IIUC there is also no way to have a Route with the Gateway being a private IP of a Robot server behind the vswitch.


It is possible to get the private IP info on the Cloud Servers without using the Routes feature. You need to set HCLOUD_NETWORK_ROUTES_ENABLED=false in the env variables. This will also work when enabling the Robot support.
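
One way to set that variable on a running deployment, as a sketch only: the deployment name `hcloud-cloud-controller-manager` and the `kube-system` namespace are assumptions, and the helper name `disable_hccm_routes` is hypothetical. If HCCM is managed through Helm, a later upgrade may overwrite a change made this way, so setting it through the chart values is probably the more durable route.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep private network support but turn off route management.
# Assumption: HCCM runs as deployment/hcloud-cloud-controller-manager in kube-system.
disable_hccm_routes() {
  kubectl -n kube-system set env deployment/hcloud-cloud-controller-manager \
    HCLOUD_NETWORK_ROUTES_ENABLED=false
}
```

Calling `disable_hccm_routes` triggers a rollout of the controller with the new environment variable.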

@olexiyb

olexiyb commented Aug 14, 2024

But is it possible to skip check for robot nodes?
I do have HCLOUD_NETWORK_ROUTES_ENABLED=false

These errors are very annoying

2024-08-14T14:38:04.532676333Z E0814 14:38:04.532366       1 node_controller.go:389] Failed to update node addresses for node "scd1": failed to get node address from cloud provider that matches ip: 10.100.0.2
2024-08-14T14:38:04.533375857Z E0814 14:38:04.533291       1 node_controller.go:389] Failed to update node addresses for node "scd2": failed to get node address from cloud provider that matches ip: 10.100.0.3
2024-08-14T14:38:04.534489902Z E0814 14:38:04.534347       1 node_controller.go:389] Failed to update node addresses for node "scd3": failed to get node address from cloud provider that matches ip: 10.100.0.4
2024-08-14T14:43:07.024394731Z E0814 14:43:07.022342       1 node_controller.go:389] Failed to update node addresses for node "scd2": failed to get node address from cloud provider that matches ip: 10.100.0.3
2024-08-14T14:43:07.024800413Z E0814 14:43:07.024527       1 node_controller.go:389] Failed to update node addresses for node "scd3": failed to get node address from cloud provider that matches ip: 10.100.0.4
2024

@apricote
Member

> But is it possible to skip check for robot nodes?

Which check are you talking about? Do you have Robot nodes in your cluster and robot.enabled: true (Helm) or ROBOT_ENABLED=true (Env Var) set in your deployment?

@github-actions
Contributor

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

@github-actions github-actions bot added the stale label Nov 14, 2024
@github-actions github-actions bot closed this as not planned Dec 15, 2024