Windows CNI broken after latest EKS image update #9043

Closed
davidgiga1993 opened this issue Jul 22, 2024 · 11 comments

@davidgiga1993
Contributor

We're using EKS on AWS with Calico VXLAN.

After updating the node image from ami-05b4e05d429e7759b (Windows_Server-2022-English-Core-EKS_Optimized-1.29-2024.06.17)
to ami-0f11d4c28a09d26d2 (Windows_Server-2022-English-Core-EKS_Optimized-1.29-2024.07.10)

it is no longer possible to reach any IP address:

new-object System.Net.Sockets.TcpClient("172.20.0.1", 443)
new-object : Exception calling ".ctor" with "2" argument(s): "An attempt was made to access a socket in a way forbidden by its access permissions 172.20.0.1:443"
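
For anyone who wants to repeat this check from a script, here is a minimal sketch of the same TCP probe in Python. It assumes a Python interpreter is available inside the pod, which the original report does not state; `can_connect` is a hypothetical helper name, not part of Calico or EKS tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Attempt a TCP connection and return (ok, detail)."""
    try:
        # create_connection resolves the address and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # On Windows, the "forbidden by its access permissions" message
        # quoted above corresponds to WSAEACCES (error 10013).
        return False, f"{type(exc).__name__}: {exc}"


# Example call, mirroring the PowerShell test above:
#   ok, detail = can_connect("172.20.0.1", 443)
```

Returning the exception text alongside the boolean makes it easier to tell a permission-style failure apart from a plain timeout or refusal.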

Expected Behavior

The IPs should be reachable

Current Behavior

No IPs are reachable from inside the container. On the node itself (and in host containers), network communication works fine.

Possible Solution

Steps to Reproduce (for bugs)

  1. Deploy EKS with the latest Windows AMI
  2. Deploy Calico
  3. Deploy a dummy pod
  4. Communication from inside the pod isn't working

Context

Downgrading the AMI resolves the issue, so I suspect it's somehow related to CVE-2024-5321, as this was (according to Amazon) the only change in this image.
Maybe related to #9019

Your Environment

  • Calico version v3.28.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): EKS 1.30
  • Operating System and version: Windows Server 2022 Core
  • Link to your project (optional):

@coutinhop
Member

@davidgiga1993 can you confirm if you have Windows patch KB5040437 installed?
There's currently a known issue going on with this Windows update (not caused by Calico):
microsoft/Windows-Containers#516
kubernetes/test-infra#33042

@davidgiga1993
Contributor Author


Yes, I can confirm. I just hope it's related to the Windows update issue, as my error message differs from the ones reported by others.

@avin3sh

avin3sh commented Jul 29, 2024

Do you want to share the behavior observed with your pods in the Windows-Containers issue linked above, so that folks at Microsoft are aware of the various ways this is affecting pod behavior (and that this is widespread)?

@JamesKehr

@davidgiga1993 Please follow these steps and let me know if it resolves the issue with the July or August update installed.

  1. Open regedit (Registry Editor).
  2. Go to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hns\State
  3. Add or update the following value under the State key:

Name : FwPerfImprovementChange
Type : DWORD
Value : 0

  4. Reboot [required].
  5. Test.
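
As a scripted alternative to the regedit steps above, the following is a minimal sketch that builds the equivalent `reg add` command and runs it only on Windows. The function names are hypothetical, it must run with administrator rights, and the reboot in step 4 is still needed afterwards.

```python
import subprocess
import sys

# Key and value name as given in the steps above
HNS_STATE_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\hns\State"
VALUE_NAME = "FwPerfImprovementChange"


def build_reg_add_command(value: int = 0) -> list[str]:
    """Return the argument list for `reg add` that sets the DWORD value."""
    return [
        "reg", "add", HNS_STATE_KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(value),
        "/f",  # overwrite an existing value without prompting
    ]


def apply_workaround() -> None:
    """Run the command; refuses to run anywhere but Windows."""
    if sys.platform != "win32":
        raise RuntimeError("this workaround only applies to Windows nodes")
    subprocess.run(build_reg_add_command(), check=True)
```

Keeping the command construction separate from execution lets the exact arguments be reviewed or logged before anything touches the registry.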

@davidgiga1993
Contributor Author

I'll try on Monday. However, I'm not sure I can actually reboot the machine, as the autoscaling group will detect the node as dead and terminate it.
I'll try anyway; maybe I can set the value during boot.

@avin3sh

avin3sh commented Aug 30, 2024

@JamesKehr you might want to share this in microsoft/Windows-Containers#516. There are a lot more folks subscribed there with different configurations.

@ilueckel

@davidgiga1993 Please follow these steps and let me know if it resolves the issue with the July or August update installed.

1. Open regedit (Registry Editor).

2. Go to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hns\State

3. Add or update the following value to the State key:

Name : FwPerfImprovementChange
Type : DWORD
Value : 0

4. Reboot [required].

5. Test

I'm not on AWS/EKS, but on self-hosted Rancher + RKE2. This worked for me.

@JamesKehr

JamesKehr commented Aug 30, 2024

@avin3sh Done! Thanks for the tip!

@ilueckel Thank you for the confirmation!

@davidgiga1993 You will likely need to work with AWS support to make that change. That registry value is read when the HNS service starts.

You can try, but no guarantees, to set the reg value, stop all the k8s/Calico containers and services, restart the Host Networking Service (HNS) in Windows, and then fire everything back up. Assuming you have that level of control over the node, that might work.
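
The ordering described above can be sketched as a small plan builder. This is only an illustration: it assumes the registry value has already been set, it uses `hns` as the Windows service name for the Host Networking Service, and the dependent service names passed in (for example `kubelet` and `kube-proxy`) are assumptions that will differ per node. It does not cover stopping running containers.

```python
def hns_restart_plan(dependent_services: list[str]) -> list[str]:
    """Return PowerShell commands in the order: stop dependents,
    restart HNS, start dependents again in reverse order."""
    stops = [f"Stop-Service -Name {name} -Force" for name in dependent_services]
    starts = [f"Start-Service -Name {name}" for name in reversed(dependent_services)]
    return stops + ["Restart-Service -Name hns -Force"] + starts


# Example with assumed service names:
#   for step in hns_restart_plan(["kubelet", "kube-proxy"]):
#       print(step)
```

Printing the plan first, rather than executing it directly, gives a chance to confirm the service list against what is actually installed on the node.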

Please let me know either way.

@Argannor

@JamesKehr following your comment with the restart-less fix (in the Windows-Containers issue), we were able to apply the hotfix on EKS with Calico, and the networking works again.

(I'm working together with @davidgiga1993)

@coutinhop
Member

@JamesKehr @Argannor @davidgiga1993 thanks for the fix and the updates, closing this now.

@nabbdl

nabbdl commented Jan 15, 2025

Hi @davidgiga1993. It seems that you were able to deploy EKS with Linux and Windows nodes using Calico overlay. I'm curious how you achieved this, as the documentation is sometimes ambiguous. On my side, Windows pods cannot obtain IP addresses from Calico IPAM. Any tips would be much appreciated.
