[ci.jenkins.io] Keep Windows Container Agents or embrace Windows VM agents #4554

dduportal opened this issue Feb 21, 2025 · 2 comments

dduportal commented Feb 21, 2025

Following #4318 and the migration of ci.jenkins.io to AWS, it looks like the EKS Windows container agents are not working as expected.

Our tests were successful, but real conditions (multiple builds of multiple plugins at the same time) show a lot of memory management errors: #4552.

Short term: we are reverting from containers back to VMs for Windows agents. The rationale is that, in the current state, far fewer projects are impacted by build or test failures with Windows VM agents.

Medium term: additional work is required to fix the failing plugin builds. This includes the "remoting" component, which is really important.

Long term: we have to reconsider whether to use Windows containers for agents at all. It was a useful technique years ago to provide Windows agents (when using ACI), but Windows VMs are easier to operate (for the same cost as containers) and are even faster.

We are facing the following problems regarding Windows container agents:

wip


jglick commented Feb 21, 2025

> The performances of containers are not that good: need to spin up a node (5 to 9 min. currently), then pull the image

Only if there has been no recent request for a Windows container agent, surely? Because you are running a Windows node pool that can scale to zero? Once a node is running, it ought to be able to create fresh pods very quickly.

@dduportal

> > The performances of containers are not that good: need to spin up a node (5 to 9 min. currently), then pull the image
>
> Only if there has been no recent request for a Windows container agent, surely? Because you are running a Windows node pool that can scale to zero? Once a node is running, it ought to be able to create fresh pods very quickly.

That only holds if the node is statically sized to handle multiple pods (so not with Karpenter, which selects an optimized instance for the current workload when required). Then it comes down to billing: either you keep the node running for a few minutes, which costs more but gives some probability that the next pod comes up quickly with the same image (i.e. no additional pull), or the node is recycled and you're back to step 0.
=> it is a non-deterministic cycle, and even ci.jenkins.io does not have a large enough critical mass for the Windows case alone.
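
To make the trade-off above concrete, here is a rough back-of-the-envelope sketch. It assumes Windows agent requests arrive as a Poisson process (an assumption, not something measured on ci.jenkins.io) and that a node is kept alive for a fixed idle window. The function name and every number except the 5 to 9 min node start time are hypothetical.

```python
import math


def expected_wait_minutes(requests_per_hour, keep_alive_min,
                          node_start_min, image_pull_min, pod_start_min):
    """Estimate the average agent start latency (hypothetical model).

    Assumes Poisson arrivals: the next request hits a still-warm node
    with probability 1 - exp(-rate * keep_alive).
    """
    rate_per_min = requests_per_hour / 60.0
    # Probability that the next request lands before the idle node is recycled.
    p_warm = 1.0 - math.exp(-rate_per_min * keep_alive_min)
    # Cold path: new node, image pull, then pod start ("back to step 0").
    cold = node_start_min + image_pull_min + pod_start_min
    # Warm path: node and image are already there.
    warm = pod_start_min
    return p_warm * warm + (1.0 - p_warm) * cold


if __name__ == "__main__":
    # Illustrative values only: 7 min is the midpoint of the 5 to 9 min
    # node start time quoted above; the pull and pod times are made up.
    for keep_alive in (0, 5, 15, 30):
        wait = expected_wait_minutes(
            requests_per_hour=2, keep_alive_min=keep_alive,
            node_start_min=7, image_pull_min=3, pod_start_min=0.5)
        print(f"keep-alive {keep_alive:>2} min -> about {wait:.1f} min expected wait")
```

With a low request rate, the warm-node probability stays small unless the keep-alive window (and therefore the bill) grows, which is the "critical mass" problem described above.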

However, these elements could be optimized with some effort (faster Windows node startup, unified container images, pre-cached container images, etc.). The main concern is Windows memory management: we are not sure why we hit these OOM errors while running on huge machines.
