Missing file cluster-info-discovery-kubeconfig.yaml for role kubeadm #11835

Open
ledroide opened this issue Dec 27, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ledroide
Contributor

ledroide commented Dec 27, 2024

What happened?

summary

Some worker nodes do not create the file cluster-info-discovery-kubeconfig.yaml, which a later task in the kubeadm role expects.

Running the cluster.yml playbook fails at the task "Create kubeadm client config", defined in roles/kubernetes/kubeadm/tasks/main.yml, with this error on 3 of the 5 worker nodes:

Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk

When checking on hosts, those that succeed have the expected /etc/kubernetes/cluster-info-discovery-kubeconfig.yaml. Nodes that fail don't.

environment

  • Kubespray version (commit): a6bc327 - Thu Dec 26 21:24:11 2024
  • Network plugin: cilium
  • Container runtime and engine: cri-o + crun
  • OS: Ubuntu Cloud 24.04 Minimal
  • Ansible: 2.16.14
  • Python: 3.12.3
  • Playbook: cluster.yml

output for role kubeadm

TASK [kubernetes_sigs.kubespray.kubernetes/kubeadm : Set kubeadm_token to generated token] **********************************************************************
skipping: [k8ststmaster-1]
skipping: [k8ststmaster-2]
skipping: [k8ststmaster-3]
ok: [k8ststworker-1]
ok: [k8ststworker-2]
ok: [k8ststworker-3]
ok: [k8ststworker-4]
ok: [k8ststworker-5]

TASK [kubernetes_sigs.kubespray.kubernetes/kubeadm : Get kubeconfig for join discovery process] *****************************************************************
changed: [k8ststmaster-1]

TASK [kubernetes_sigs.kubespray.kubernetes/kubeadm : Copy discovery kubeconfig] *********************************************************************************
skipping: [k8ststmaster-1]
skipping: [k8ststmaster-2]
skipping: [k8ststmaster-3]
skipping: [k8ststworker-1]
skipping: [k8ststworker-2]
skipping: [k8ststworker-3]
skipping: [k8ststworker-4]
skipping: [k8ststworker-5]

TASK [kubernetes_sigs.kubespray.kubernetes/kubeadm : Create kubeadm client config] ******************************************************************************
skipping: [k8ststmaster-1]
skipping: [k8ststmaster-2]
skipping: [k8ststmaster-3]
fatal: [k8ststworker-1]: FAILED! => {"changed": false, "checksum": "1181ae102a3ef189abcc93f8b686f14df2c2379d", "exit_status": 3, "msg": "failed to validate", "stderr": "discovery.file.kubeConfigPath: Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk\nTo see the stack trace of this error execute with --v=5 or higher\n", "stderr_lines": ["discovery.file.kubeConfigPath: Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "", "stdout_lines": []}
fatal: [k8ststworker-2]: FAILED! => {"changed": false, "checksum": "fda08bfcc11befe3efd50b8794c32974ba5e3f38", "exit_status": 3, "msg": "failed to validate", "stderr": "discovery.file.kubeConfigPath: Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk\nTo see the stack trace of this error execute with --v=5 or higher\n", "stderr_lines": ["discovery.file.kubeConfigPath: Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "", "stdout_lines": []}
fatal: [k8ststworker-3]: FAILED! => {"changed": false, "checksum": "58e81ca97c49ff42a38be613678b2943e085f979", "exit_status": 3, "msg": "failed to validate", "stderr": "discovery.file.kubeConfigPath: Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk\nTo see the stack trace of this error execute with --v=5 or higher\n", "stderr_lines": ["discovery.file.kubeConfigPath: Invalid value: \"/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml\": not a valid HTTPS URL or a file on disk", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "", "stdout_lines": []}
changed: [k8ststworker-5]
changed: [k8ststworker-4]

On k8ststworker-1, which fails:

$ sudo ls /etc/kubernetes/cluster-info-discovery-kubeconfig.yaml
ls: cannot access '/etc/kubernetes/cluster-info-discovery-kubeconfig.yaml': No such file or directory

On k8ststworker-5, which succeeds:

$ sudo cat /etc/kubernetes/cluster-info-discovery-kubeconfig.yaml
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS_[_snip_]_S0tLS0K
    server: https://144.98.xxx.xxx:6443
  name: ""
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null

additional info

  • Apart from their names and IP addresses, the worker nodes are identical.
  • I have checked that all worker nodes can reach the kube-apiserver on port 6443.
  • I was surprised to find that the kubeadm role has no tags at all, so I have no simple way to run this particular task for testing. Unfortunately, the failing step occurs near the end of the cluster.yml playbook, so testing any fix could take a very long time.
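
As a quicker check than a full cluster.yml run, the manual `ls` test above could be repeated on every worker with a small standalone playbook. This is only a sketch under assumptions: the play and task names are made up for illustration, `kube_node` is assumed to be the inventory group that holds the worker nodes, and the path is copied literally from the error message rather than built from role variables.

```yaml
# Sketch only: standalone diagnostic playbook, not part of Kubespray.
# Assumption: worker nodes are members of the inventory group "kube_node".
- name: Report which workers have the discovery kubeconfig
  hosts: kube_node
  become: true
  gather_facts: false
  tasks:
    - name: Stat the discovery kubeconfig
      ansible.builtin.stat:
        # Path taken verbatim from the kubeadm error message in this issue.
        path: /etc/kubernetes/cluster-info-discovery-kubeconfig.yaml
      register: discovery_file   # hypothetical variable name

    - name: Show whether the file exists on this host
      ansible.builtin.debug:
        msg: "discovery kubeconfig exists: {{ discovery_file.stat.exists }}"
```

It only reads file metadata, so it should be safe to run against the same inventory with `ansible-playbook` before and after trying a fix.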
@ledroide ledroide added the kind/bug Categorizes issue or PR as related to a bug. label Dec 27, 2024
@tico88612
Member

tico88612 commented Dec 27, 2024

@ledroide Could you provide your inventory and vars?

Please follow our issue template to report bugs.

That's why we disabled creating blank issues in the GitHub UI :)

@ledroide
Contributor Author

@tico88612: Here are my inventory variables. Hope it helps.

kubernetes_audit: true
kube_encrypt_secret_data: true
remove_anonymous_access: true
cilium_version: v1.16.5
cilium_kube_proxy_replacement: false
cilium_cni_exclusive: true
cilium_encryption_enabled: true
cilium_encryption_type: wireguard
cilium_tunnel_mode: vxlan
cilium_enable_bandwidth_manager: true
cilium_enable_hubble: true
cilium_enable_hubble_ui: true
cilium_hubble_install: true
cilium_hubble_tls_generate: true
cilium_enable_hubble_metrics: true
cilium_hubble_metrics:
  - dns
  - drop
  - tcp
  - flow
  - icmp
  - http
cilium_enable_host_firewall: true
cilium_policy_audit_mode: false
kubeconfig_localhost: true
system_reserved: true
kubelet_max_pods: 280
kubelet_systemd_wants_dependencies: ["rpc-statd.service"]
kube_network_node_prefix: 23
kube_network_node_prefix_ipv6: 120
kube_network_plugin: cilium
container_manager: crio
crun_enabled: true
kube_proxy_strict_arp: true
resolvconf_mode: host_resolvconf
upstream_dns_servers: [213.186.33.99]
serial: 2    # how many nodes are upgraded at the same time
unsafe_show_logs: true    # when need to debug kubespray output
metrics_server_enabled: true
metrics_server_replicas: 3
metrics_server_limits_cpu: 400m
metrics_server_limits_memory: 600Mi
metrics_server_metric_resolution: 20s
local_path_provisioner_enabled: true
local_path_provisioner_is_default_storageclass: "false"
local_path_provisioner_helper_image_repo: docker.io/library/busybox
ingress_nginx_enabled: true
ingress_nginx_host_network: true
ingress_nginx_class: nginx
csi_snapshot_controller_enabled: true
cert_manager_enabled: true
cephfs_provisioner_enabled: false
argocd_enabled: false
etcd_deployment_type: host
crio_enable_metrics: true
nri_enabled: true
download_container: false
skip_downloads: false

@tico88612
Member

I tested this in my environment, and I think remove_anonymous_access is causing the problem.

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 2, 2025
@ledroide
Contributor Author

ledroide commented Jan 2, 2025

@tico88612: Thanks for the clue, which now looks well supported by the evidence. I removed the hardening variable remove_anonymous_access from my inventory, and the cluster.yml playbook then ran to completion.

I will follow issue #11842 and try again with remove_anonymous_access: true once it is reported as fixed.

@schaenzer

@tico88612 I have just run into the same problem. If the worker node already exists and remove_anonymous_access: true is then set, the playbook aborts.

The task that creates cluster-info-discovery-kubeconfig.yaml [1] runs only if kubelet.conf does not already exist [2]. On an already functioning node kubelet.conf exists, so the file is not created; on a new node kubelet.conf is absent, so the condition passes and the file is created.

A possible solution would be to check whether cluster-info-discovery-kubeconfig.yaml exists instead of relying on the existence of kubelet.conf.

[1]

- name: Copy discovery kubeconfig
  copy:
    dest: "{{ kube_config_dir }}/cluster-info-discovery-kubeconfig.yaml"
    content: "{{ kubeconfig_file_discovery.stdout }}"
    owner: "root"
    mode: "0644"
  when:
    - ('kube_control_plane' not in group_names)
    - not kubelet_conf.stat.exists
    - kubeadm_use_file_discovery

[2]

- name: Check if kubelet.conf exists
  stat:
    path: "{{ kube_config_dir }}/kubelet.conf"
    get_attributes: false
    get_checksum: false
    get_mime: false
  register: kubelet_conf
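
The possible solution described above might look roughly like the following. This is an untested sketch, not the maintainers' fix: the first task and the register name `discovery_kubeconfig` are hypothetical additions, and it assumes `kubeconfig_file_discovery.stdout` is still populated on hosts where kubelet.conf already exists.

```yaml
# Sketch only: "discovery_kubeconfig" is a hypothetical register name,
# not a variable that exists in the kubeadm role.
- name: Check if discovery kubeconfig exists
  stat:
    path: "{{ kube_config_dir }}/cluster-info-discovery-kubeconfig.yaml"
    get_attributes: false
    get_checksum: false
    get_mime: false
  register: discovery_kubeconfig

- name: Copy discovery kubeconfig
  copy:
    dest: "{{ kube_config_dir }}/cluster-info-discovery-kubeconfig.yaml"
    content: "{{ kubeconfig_file_discovery.stdout }}"
    owner: "root"
    mode: "0644"
  when:
    - ('kube_control_plane' not in group_names)
    # Condition on the file this task writes, rather than on kubelet.conf.
    - not discovery_kubeconfig.stat.exists
    - kubeadm_use_file_discovery
```

With this condition, an existing worker that has kubelet.conf but no discovery kubeconfig would no longer be skipped. Since the copy module is idempotent, dropping the existence condition altogether may be another option; which approach fits the role is for the maintainers to decide.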

4 participants