[ci.jenkins.io] Keep Windows Container Agents or embrace Windows VM agents #4554

dduportal · 2025-02-21T15:21:37Z

Following #4318 and the migration of ci.jenkins.io to AWS, it looks like the EKS Windows container agents are not working as expected.

Our tests were successful, but real conditions (multiple builds of multiple plugins at the same time) show a lot of memory management errors: #4552.

Short term: we are reverting from containers to VMs for Windows agents. The rationale is that far fewer projects are impacted by build or test failures with Windows VM agents in the current state.

Medium term: additional work is required to fix the failing plugin builds. It includes the remoting component, which is really important.

Long term: we have to reconsider whether to use Windows containers for agents at all. It was a useful technique years ago to provide Windows agents (when using ACI), but Windows VMs are easier to operate (for the same cost as containers) and are even faster.

We are facing the following problems regarding Windows container agents:

- The operational cost in AWS (compared to Azure) is too high.
  - Kubernetes nodes do not behave the same between AWS and Azure when running JVM builds
  - Observability requires additional effort (see https://docs.datadoghq.com/agent/troubleshooting/windows_containers/#mixed-clusters-linux--windows for instance)
  - Check the permissions and setup of Windows Node Pools in [ci.jenkins.io] Move ACI agents to ephemeral Windows containers to AWS #4318
  - We have to improve performance with additional work (fast launch, custom node AMI, etc.) as per
- The performance of containers is not that good: need to spin up a node (5 to 9 min. currently), then pull the image (

wip

jglick · 2025-02-21T19:32:03Z

> The performance of containers is not that good: need to spin up a node (5 to 9 min. currently), then pull the image

Only if there has been no recent request for a Windows container agent, surely? Because you are running a Windows node pool that can scale to zero? Once a node is running, it ought to be able to create fresh pods very quickly.

dduportal · 2025-02-22T07:34:59Z

> The performance of containers is not that good: need to spin up a node (5 to 9 min. currently), then pull the image
>
> Only if there has been no recent request for a Windows container agent, surely? Because you are running a Windows node pool that can scale to zero? Once a node is running, it ought to be able to create fresh pods very quickly.

Only if the node is statically sized to handle multiple pods (so not with Karpenter, which selects an optimized instance for the current workload when required). Then it becomes a billing question: either you have some probability that a pod comes up quickly (e.g. by keeping the node running a few minutes, but it costs more) with the same image (i.e. no additional pull), or the node is recycled and you're back to step 0.
=> it is a non-deterministic cycle, and even ci.jenkins.io does not have the critical mass for the Windows case alone.
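
To make this trade-off a bit more concrete, here is a minimal sketch of the expected agent wait time as a function of the probability of landing on a warm node (one that is already running with the image pulled). Only the 5 to 9 minute node start-up range comes from this thread; the image pull time, the pod start time, and the warm-node probabilities are hypothetical placeholders, not measurements from ci.jenkins.io.

```python
def expected_wait_minutes(p_warm, node_boot_min, image_pull_min, pod_start_min):
    """Expected time before a Windows container agent is usable.

    p_warm is the probability that a node with the image already pulled
    is running when the pod is requested.
    Warm path: only the pod start.
    Cold path: node boot + image pull + pod start.
    """
    if not 0.0 <= p_warm <= 1.0:
        raise ValueError("p_warm must be between 0 and 1")
    cold = node_boot_min + image_pull_min + pod_start_min
    warm = pod_start_min
    return p_warm * warm + (1.0 - p_warm) * cold


# Node boot uses the midpoint of the 5 to 9 min range reported in this issue.
NODE_BOOT_MIN = 7.0
# ASSUMED values, for illustration only (not measured):
IMAGE_PULL_MIN = 3.0
POD_START_MIN = 0.5

for p in (0.0, 0.5, 0.9):
    wait = expected_wait_minutes(p, NODE_BOOT_MIN, IMAGE_PULL_MIN, POD_START_MIN)
    print(f"p_warm={p:.1f}: expected wait {wait:.2f} min")
```

Raising `p_warm` means keeping nodes alive longer between builds, which is exactly the extra billing mentioned above; with too few Windows builds, `p_warm` stays low whatever the setting.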

However, these elements could be optimized (Windows nodes that start fast, unified container images, pre-cached container images, etc.) with some effort. The main concern is Windows memory management: we are not sure why we hit these OOM errors on such huge machines.

dduportal added this to the infra-team-sync-2025-02-25 milestone Feb 21, 2025

dduportal assigned dduportal and smerle33 Feb 21, 2025

This was referenced Feb 21, 2025

Revert "feat(aws.ci.jenkins.io): enabling windows container and remove labels from ec2 VMs" jenkins-infra/jenkins-infra#3912

Merged

[ci.jenkins.io] Paging file too small reported for config-file-provider and email-ext plugin builds #4552

Closed

jglick mentioned this issue Feb 21, 2025

Implement afterDisconnect in SimpleCommandLauncher jenkinsci/jenkins-test-harness#922

Merged

smerle33 modified the milestones: infra-team-sync-2025-02-25, infra-team-sync-2025-03-04, infra-team-sync-2025-03-11 Feb 26, 2025
