We're running Calico on an AWS EKS cluster with both Windows and Linux nodes. When running long-running downloads via HTTP(S), we observe connection timeouts, which are caused by missing ACK packets from the client. The client is running on Windows nodes inside the cluster.
It does not matter whether the server receiving the requests is outside of the cluster or on Linux nodes; the behavior is the same.
Any help on this one is greatly appreciated. I'm uncertain whether this problem is caused by Calico or something else.
Expected Behavior
Downloads of files should succeed and TCP packets should be acknowledged (ACKed).
Current Behavior
As outlined above, when downloading a larger file, e.g. 200 MB (the client does not matter; we tested curl.exe, C# code, and Invoke-WebRequest), the download starts, although slowly, and after a while (observed times between 30 seconds and 3 minutes) the progress stalls. Ultimately the server forcibly closes the connection because the client does not read the rest of the response body.
When capturing the traffic with netsh trace start capture=yes and netsh trace stop and analyzing the resulting trace with Wireshark, we can observe that the ACK packets from the client stop at some point:
192.168.35.164 is the IP of the server, in this case a pod running on a Linux node
172.20.121.29 is the IP of the client
192.168.114.23 is the IP of the pod in the VXLAN (from the Calico IP pool)
Additionally, we observe high CPU usage on the node, but it's hard to pin down. Some analysis suggests the CPU usage is caused by containerd, but my own analysis didn't yield any details. The issue persists even during times when the overall CPU usage on the node is below 50%.
Steps to Reproduce (for bugs)
EKS Cluster running Calico and both Windows and Linux nodes
HTTP server outside of the cluster or on a Linux node
Download a file
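For anyone trying to reproduce the stall without curl.exe or Invoke-WebRequest, here is a minimal sketch in Python. It is not the code we used. The function names (read_until_stall, download), the 30-second stall timeout, and the example URL are assumptions made for illustration only. It reads the response in chunks and reports how many bytes arrived before the socket stopped delivering data.

```python
import socket
import time
import urllib.request


def read_until_stall(stream, chunk_size=64 * 1024):
    """Read `stream` to the end; return (bytes_read, stalled, elapsed_seconds).

    `stream` is any object with a read(n) method, e.g. an HTTP response.
    """
    total = 0
    start = time.monotonic()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # an empty read means the body is complete
                return total, False, time.monotonic() - start
            total += len(chunk)
    except socket.timeout:
        # No data arrived within the socket timeout: treat this as a stall.
        return total, True, time.monotonic() - start


def download(url, stall_timeout=30.0):
    """Hypothetical helper: download `url` and report whether it stalled.

    `stall_timeout` (an assumed value) applies to each blocking socket
    operation, not to the transfer as a whole.
    """
    with urllib.request.urlopen(url, timeout=stall_timeout) as response:
        return read_until_stall(response)
```

Calling download("http://example.invalid/200mb.bin") (a placeholder URL) from a pod on a Windows node would return the byte count at which the transfer stopped and stalled=True if no data arrived for stall_timeout seconds. That makes it easier to line up the stall with the point in the netsh trace where the ACKs stop.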
Context
This affects file-transferring applications running on Windows nodes, which run in Windows containers for legacy reasons.
Your Environment
Calico version 3.27.3
Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes 1.29.7-eks-2f46c53
Operating System and version: Windows Server 2022 Datacenter (10.0.20348.2655)
Link to your project (optional): -
Edit: removed ambiguity in wording
@Argannor this doesn't look like a problem with Calico itself, and the symptoms look different, but from the Windows version number you're using, maybe it could be related to this Windows containers bug microsoft/Windows-Containers#516 ?
Please take a look at this issue #9019 (Windows build 20348.2582 and up seems to break networking), there is a fix on this comment: #9019 (comment)
Thank you for looking into this @coutinhop. We're aware of the other issue and have already applied the workarounds, which resolved that issue. I also suspect that this is not an error in Calico itself.
We decided that our way forward is to abandon Windows for containerized workloads and move everything to Linux, so I'll close this issue. We really don't see the production readiness of Windows Containers in general. This includes the experience in the microsoft/Windows-Containers#516 issue you already linked above. This is not to dump on Calico; in fact, we're very happy with Calico :)