feat: use self-hosted runners to improve build performance #6
Thanks, @mariajgrimaldi! It’s good to know that the overall build time is similar between both tools. Regarding instance provisioning: Another thing that concerns me a bit is the Jenkins steps that haven’t been translated to GitHub Actions yet. Besides syntax checks, what other steps are still missing? About the action not supporting spot instances: I’m not sure this worries me too much, and I’m not inclined to make it a determining factor in the final decision. It’s not something we currently have implemented in Jenkins and, based on what I’ve read, while spot instances could significantly reduce costs, you also need a way to manage spot interruptions, and performance could be affected. So it generally seems better suited for flexible actions (?) We could also look for another action in the list that supports spot instances or is expected to support them in the near future (not for this issue). I found a couple, but none of them seem very reliable.
Thank you @magajh for your review!
So, with GHA, each build job requires a new instance. Each AWS EC2 instance costs X.Y$ per minute, depending on the type. On Jenkins, a build job provisions an instance, each with three idle processes to be used. Since we're using the same instance type, each instance costs the same. Let's say we trigger 6 builds 1 minute apart. GHA will provision 6 instances, while Jenkins will provision 2 (1 job for each idle process).
Now, in Jenkins, provisioning happens in the cluster, so that counts as part of the maintenance costs of the cluster itself. In GHA, it occurs in the GH-hosted runners, so each minute counts in the billing, ~4 minutes total (see GHA billing). We can use these formulas for computing the cost of using AWS EC2 for building, considering the situation above: GHA Jenkins
That is a very rough guess. Does it make sense? This is ONLY to compare how GH compares to Jenkins in this situation. It should not be taken as a legitimate cost estimate.
While Jenkins supports multiple builds on the same instance to leverage resources, the GH action I chose doesn't, due to a design choice. Therefore, this might increase AWS costs when triggering 6 build jobs sequentially with a delay since, in GH, they'd all be sequential. We could change the EC2 instance type to be small enough but still efficient. That's why I tried building with
Another difference in the setup I explained is that GHA will take longer when triggering multiple builds, like the building of 6 images above, since it needs to provision each instance. In contrast, after the first build, the provision time won't count in Jenkins.
Since each job provisions a new instance when using GHA, build times will remain the same even with multiple jobs running simultaneously. This doesn't happen with Jenkins: each job uses the same shared resources, so build time increases slightly. I tested these last three jobs on Jenkins to support that claim: 541, 542, 543. Compare them to a single job: https://picasso.dedalo.edunext.co/job/PICASSO_V2/533/
Only syntax validations. To be sure, you can review the steps here: https://github.com/eduNEXT/dedalo-scripts/blob/main/jenkins/picasso_v2. Here's what's missing: https://github.com/eduNEXT/dedalo-scripts/blob/main/jenkins/picasso_v2#L152-L157. I'll try implementing this missing step today. |
Thank you, @mariajgrimaldi , for the detailed PR and the thorough explanation. This is really helpful! I do have a few follow-up questions: Impact of Spot Instances: You mentioned that spot instances could be used in Jenkins with additional plugins. How viable would it be to implement a similar solution in GHA to reduce costs? Are there any plans for the GHA action to support spot instances in the future? Scalability: How do Jenkins and GHA compare when scaling the number of build jobs beyond 6? What happens to times and costs if, for example, 20 jobs are run simultaneously? Future Improvements: Are there plans to improve the GHA action to support instance reuse or other mechanisms to optimize costs and time? Could we propose or contribute to those improvements? Thanks again for your work on this! |
Thank you for your review, @Alec4r!
Some work is going on to use launch templates that would support spot instance configurations: machulav/ec2-github-runner#65. However, it's been a work in progress since 2021, and there is no guarantee that it will be finished any time soon. I mentioned it since there was a proposal for Jenkins to use spot instances instead. That said, I can't see how Picasso would support spot instances when the build cannot be interrupted, but we can discuss that later. For now, the action doesn't support provisioning spot instances, only on-demand ones.
Considering the formulas above, the estimated costs in a situation where 20 builds run simultaneously would be:
GHA
If we have GH-hosted runner minutes to spare from the account: 0.384 * (20 * 30)/60 = 3.84$
If not:
Jenkins
So, the total time building on an instance is less when there is overlap. Therefore, IN THEORY, costs should be lower only if this happens. In the worst-case scenario, all builds are sequential, each running on a new instance, so there's no overlap. We're assuming the following for these estimates to make sense:
As I mentioned in my comment above, this is ONLY to compare how GH compares to Jenkins in this situation, it shouldn't be taken as a cost estimate.
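The GHA figure quoted above for 20 builds can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes that 0.384 is the hourly price of the instance in dollars, that 20 is the number of builds, and that 30 is the minutes each build keeps its own instance running. The function name is hypothetical, and like the original numbers this is a comparison aid, not a cost estimate.

```python
def ec2_build_cost(hourly_rate_usd, builds, minutes_per_build):
    """Cost when every build runs on its own instance (no overlap)."""
    total_minutes = builds * minutes_per_build
    # Convert instance-minutes to instance-hours before applying the hourly rate.
    return hourly_rate_usd * total_minutes / 60


# Assumed reading of the figures in the comment: 0.384 $/hour, 20 builds, 30 min each.
cost = ec2_build_cost(0.384, 20, 30)
print(f"{cost:.2f}$")
```

Changing `builds` or `minutes_per_build` shows how the total scales linearly when no instance is shared between jobs.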
So, this issue addresses the instance reuse issue: machulav/ec2-github-runner#4 (comment), but it still needs to be implemented. I don't know how much of an effort that would be. Anyway, I do think we can propose changes, although maintainers seem to take some time to review incoming PRs: https://github.com/machulav/ec2-github-runner/pulls, and the reason why is concerning: machulav/ec2-github-runner#172. This other action is relatively similar: https://github.com/NextChapterSoftware/ec2-action-builder, with a faster maintainer response: https://github.com/NextChapterSoftware/ec2-action-builder/pulls?q=is%3Apr+is%3Aclosed, although it's not as mature (it was created in January of this year). In any case, we have limited options, so I'll test it either way to compare.
@eduNEXT/dedalo: While testing, I discovered that this action doesn't support fine-grained PATs, so that's another issue to consider (here's a PR implementing it: machulav/ec2-github-runner#196), besides what I mentioned in my last comment:
Therefore, as I mentioned, I'll be testing another action instead: https://github.com/NextChapterSoftware/ec2-action-builder. Build performance should remain the same since we're only changing how the runners are provisioned. Provisioning time might change, though. I'll let you know.
Thanks @mariajgrimaldi.
Thank you for the review, @MaferMazu!
The idea of making the workflow public for everyone to consume comes from this: https://docs.github.com/en/actions/creating-actions/sharing-actions-and-workflows-from-your-private-repository#about-github-actions-access-to-private-repositories
What's different about this approach is how many instances are provisioned while building multiple images, ie. concurrently. While Jenkins supports multiple builds on the same instance to leverage resources, those GH actions don't support it out of the box. Therefore, for 6 different builds triggered 1 minute apart, GH will provision a new instance for each (sequential build), while Jenkins will provision 2 (1 job for each idle process).
Yes, the build time is similar. What could change is provisioning time for the runners.
In the document, I presented the monthly billing for GH with GitHub-hosted runners and Jenkins with self-hosted runners. Considering that, the costs for the workflow using self-hosted runners change. The monthly cost for Jenkins in July was ~$22, meaning that using EC2 instances to build images costs $22. This is comparable to this setup's cost if we have GH runners minutes to spare from the main account. If not, we'll have to add those additional costs:
EC2_INSTANCE_COST_JULY = 22$
22 + 66 * 4 * 0.008 = 24.112$
This cost calculation is an estimate. EC2_INSTANCE_COST_JULY could be higher when using GHA since all builds are sequential.
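The monthly estimate above can be checked with a small Python sketch. The meaning of each factor is an assumption on my side: 66 is read as the number of builds in the month, 4 as the GitHub-hosted minutes billed per build while the runner is provisioned, and 0.008 as the per-minute rate in dollars. The function name is hypothetical.

```python
def monthly_cost_usd(ec2_cost_usd, builds, hosted_minutes_per_build, hosted_rate_usd):
    """EC2 cost plus the GitHub-hosted minutes spent provisioning runners."""
    hosted_cost = builds * hosted_minutes_per_build * hosted_rate_usd
    return ec2_cost_usd + hosted_cost


# Assumed reading of the comment: 22 $ of EC2 in July, 66 builds, 4 min each, 0.008 $/min.
total = monthly_cost_usd(22, 66, 4, 0.008)
print(f"{total:.3f}$")
```

If there are hosted minutes to spare in the account, the second term drops out and only the EC2 cost remains.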
I'll let you know when I have the correct setup to test again. Thanks for the patience!
I was looking for a GHA option during implementation to replicate what Jenkins currently does, but I couldn't find any. However, we could implement something similar to https://github.com/orgs/community/discussions/27097#discussioncomment-3254611. In our specific case, the analyze error script would read the output and find the keywords we're looking for.
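A minimal sketch of what such an analyze error script could look like, assuming the build output is available as text. The keyword list and the function name are hypothetical placeholders, not the keywords the Jenkins job actually looks for.

```python
# Hypothetical keywords; the real list would come from the Jenkins job.
ERROR_KEYWORDS = ("syntax error", "traceback", "build failed")


def find_error_lines(log_text, keywords=ERROR_KEYWORDS):
    """Return (line_number, line) pairs whose text contains any keyword."""
    matches = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()  # case-insensitive match
        if any(keyword in lowered for keyword in keywords):
            matches.append((number, line))
    return matches


sample_log = "Step 1 ok\nSyntax Error: bad config.yml\nStep 3 ok"
for number, line in find_error_lines(sample_log):
    print(f"line {number}: {line}")
```

A workflow step could feed the captured build output to this function and fail the job when the returned list is not empty.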
Thank you for all your comments, @magajh @Alec4r @MaferMazu. I really appreciate it! Here's where we currently stand:
Both actions work well, but These are the options I see, considering what I mentioned about the actions I tested:
I don't think we'll find a perfect fit for what we're looking for, so we must commit to a decision and work from there. To make a decision, we should consider:
If the answer is that we don't have the resources or the time, then we should stop the effort altogether. I like this setup, but I can't see a way of committing to this solution without compromising on one of the first 3 options I mentioned. Please let me know if you have any other ideas for moving forward. Thank you. |
@mariajgrimaldi thanks for all the work and thorough testing that's gone into this. Making this decision isn't easy; both tools have their pros and cons. However, we need to move forward. Here are my thoughts: Considering that:
Given these points, I think we should move forward with GitHub. Here are some next steps we could follow:
I’m also keen to hear the thoughts and opinions of others who are reviewing this PR |
Thank you, @magajh. I agree. Here's what I've been thinking we could do immediately:
Then, we'll be able to follow up with what Maga suggested. However, if adopting the workflow solution without immediately addressing the increase in building time is a big issue, as Maga mentioned, we should decide which action to use after merging PR#1. This consists of choosing which security concern to handle for provisioning self-hosted runners (classic PATs or Fine-grained PATs with admin permissions) while we find a better solution. @Alec4r: We discussed having a GH application with limited access for fine-grained PATs with admin permissions. Can we create one for this? Although IMO, that's a more considerable risk than using a classic PAT. So, can you help us create a classic PAT from a service account? If those are still too risky, as I mentioned previously, we should commit to the time increase for now. |
I believe it’s better to avoid using classic PATs for several reasons:
In summary, I believe the best long-term solution is to implement fine-grained PATs through a GitHub application, despite the initial setup effort this may require. This approach minimizes security risks and optimizes token management. |
I'll convert this to a draft until we can pick it up again. Thanks! |
So this PR was closed by accident... |
Description
This PR implements two GH jobs: provisioning and stopping self-hosted runners used to build docker images for the Open edX ecosystem. Each runner is provisioned as an AWS EC2 instance with enough resources to improve building performance without impacting the billing. In this current setup, a new instance is provisioned every time a new build workflow is executed.
For provisioning AWS EC2 instances, I chose the https://github.com/marketplace/actions/on-demand-self-hosted-aws-ec2-runner-for-github-actions action because of its number of stars, the documentation available, and because it's listed on the awesome-runners comparison table.
You can review here the configuration needed to use this workflow with self-hosted runners: https://github.com/eduNEXT/ednx-strains/blob/MJG/self-hosted-runners/.github/workflows/build.yml
Jenkins vs GHA
Setup
The main difference in setup lies in how many instances are provisioned per build job. Jenkins provisions a single instance for a new build, but if an instance with enough idle processes already exists (max number: 3), an idle process is used for the next build to leverage resources. GHA provisions a new instance for each build job instead. So, while Jenkins uses 2 instances for 6 build jobs, GHA uses 6, increasing costs when building multiple images concurrently. I tried using m5.xlarge instead of m6i.2xlarge, which is the default instance type for Jenkins agents, but performance was impacted considerably, since its resources are similar to those of the GitHub-hosted runners: https://github.com/eduNEXT/ednx-strains/actions/runs/10479622786
Also, while our Jenkins setup might support spot instances with additional plugins (ref), the action chosen for this implementation does not support it for the time being: machulav/ec2-github-runner#5
In this current setup, we're using the same instance type as Jenkins agents unless indicated otherwise.
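The instance counts in this section follow from how many build jobs each instance can take. A rough Python sketch, assuming all jobs overlap in time so that Jenkins can fill every idle process before provisioning another instance (the function name is hypothetical):

```python
import math


def instances_needed(jobs, jobs_per_instance):
    """Instances required when each one can host `jobs_per_instance` jobs."""
    return math.ceil(jobs / jobs_per_instance)


jobs = 6
gha_instances = instances_needed(jobs, 1)      # GHA: one instance per build job
jenkins_instances = instances_needed(jobs, 3)  # Jenkins: up to 3 processes per instance
print(gha_instances, jenkins_instances)
```

With the same instance type on both sides, the ratio between the two counts gives a rough idea of how much more instance time GHA needs for concurrent builds.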
Time comparison
Provision time
Provisioning time varies between the two. Provisioning an instance with Jenkins takes ~2 min (measured by hand, because I did not find any data for this in Jenkins), while in GHA it takes ~4 min: actions/runs/10458288173 (see usage > Start self-hosted EC2 runner). See other builds for comparison: actions/runs/10479549584, actions/runs/10479182910, actions/runs/10475792028
Build time
For building openedx images, the build step in GHA takes ~25 min, similar to this build stage in Jenkins with the same parameters. Those tests were executed with this simple configuration file: redwood/base/config.yml. For the latest redwood image redwood/base/config.yml, both take about the same time as well. You can go ahead and review other builds on Jenkins and GHA for comparison.
For building MFE images, the build step in GHA takes ~19 min, similar to the ~20 min this build stage takes in Jenkins with the same parameters. Those tests were executed with this simple configuration file:
Total execution time:
The total execution time recorded in GHA for building openedx images is 30m 59s actions/runs/10458288173/usage, while on Jenkins it is ~31 min from “scheduled” to “completion” https://picasso.dedalo.edunext.co/job/PICASSO_V2/526/
The total execution time recorded in GHA for building MFE images is 23m 36s actions/runs/10479182910/usage, while on Jenkins it is 25 min https://picasso.dedalo.edunext.co/job/PICASSO_V2/531/.
We must consider that not all Jenkins steps have a corresponding translation in GHA yet (for example, syntax checks), so GHA could take a little longer once they are added.
In this spreadsheet you can better review the time comparison across a considerable number of builds: https://docs.google.com/spreadsheets/d/1PtiRz4OBj3XxkY3xUUAi827vFtytWrox6ztK6XaqGkY/edit?usp=sharing
How to test
The build.yml workflow actively uses the picasso/build.yml workflow and sets the necessary configurations to run correctly.
To test the ednx-strains/.github/build.yml workflow execution, go to Actions > Build Open edX strain, fill in the necessary configuration, or use the default. For self-hosted runners, use the workflow from the branch MJG/self-hosted-runners, and also use the MJG/redwood-test-image strain branch to avoid overriding the current image on dockerhub.
You won't be able to test until we generate a fine-grained PAT for cloning ednx-strain.
See here a successful build: https://github.com/eduNEXT/ednx-strains/actions/runs/10479622786