feat: use self-hosted runners to improve build performance #6

Closed
wants to merge 4 commits into from


mariajgrimaldi
Contributor

@mariajgrimaldi mariajgrimaldi commented Aug 19, 2024

Description

This PR implements two GH jobs: provisioning and stopping self-hosted runners used to build docker images for the Open edX ecosystem. Each runner is provisioned as an AWS EC2 instance with enough resources to improve building performance without impacting the billing. In this current setup, a new instance is provisioned every time a new build workflow is executed.

For provisioning AWS EC2 instances, I chose the https://github.com/marketplace/actions/on-demand-self-hosted-aws-ec2-runner-for-github-actions action because of its number of stars, its available documentation, and because it's listed in the awesome-runners comparison table.

You can review here the configuration needed to use this workflow with self-hosted runners: https://github.com/eduNEXT/ednx-strains/blob/MJG/self-hosted-runners/.github/workflows/build.yml

Jenkins vs GHA

Setup

The main difference in setup lies in how many instances are provisioned per build job. Jenkins provisions an instance for a new build, but if an instance with idle processes already exists (maximum: 3 per instance), an idle process is used for the next build to leverage resources. GHA provisions a new instance for each build job instead. So, while Jenkins uses 2 instances for 6 build jobs, GHA uses 6, increasing costs when building multiple images concurrently. I tried using m5.xlarge instead of m6i.2xlarge, which is the default instance type for Jenkins agents, but performance dropped considerably (https://github.com/eduNEXT/ednx-strains/actions/runs/10479622786), since its resources are similar to those of the GitHub-hosted runners.
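The instance counts above follow from a simple ceiling division. Here is a minimal Python sketch, assuming all jobs overlap in time so that every available process on a Jenkins instance gets used; the function name `instances_needed` is hypothetical and only serves the illustration.

```python
import math


def instances_needed(jobs: int, jobs_per_instance: int) -> int:
    """Number of instances required when each instance can host
    `jobs_per_instance` concurrent build jobs."""
    if jobs < 0 or jobs_per_instance < 1:
        raise ValueError("jobs must be >= 0 and jobs_per_instance >= 1")
    return math.ceil(jobs / jobs_per_instance)


# GHA with the chosen action: one build job per instance.
gha_instances = instances_needed(6, 1)
# Jenkins: up to 3 build jobs share one instance.
jenkins_instances = instances_needed(6, 3)

print(f"GHA: {gha_instances} instances, Jenkins: {jenkins_instances} instances")
```

For 6 concurrent build jobs this gives 6 instances for GHA and 2 for Jenkins, matching the numbers in the description.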

Also, while our Jenkins setup might support spot instances with additional plugins (ref), the action chosen for this implementation does not support them for the time being: machulav/ec2-github-runner#5

In this current setup, we're using the same instance type as Jenkins agents unless indicated otherwise.

Time comparison

Provision time

Provisioning time varies between the two. Provisioning an instance with Jenkins takes ~2 min (I measured this by hand because I did not find any data for it in Jenkins), while in GHA it takes ~4 min: actions/runs/10458288173 (see usage > Start self-hosted EC2 runner). See other builds for comparison: actions/runs/10479549584, actions/runs/10479182910, actions/runs/10475792028

Build time

For building openedx images, the build step in GHA takes ~25 min, similar to the same build stage in Jenkins with the same parameters. Those tests were executed with this simple configuration file: redwood/base/config.yml. For the latest redwood image redwood/base/config.yml, both take about the same time as well. You can review other builds on Jenkins and GHA for comparison.

For building MFE images, the build step in GHA takes ~19 min, similar to the ~20 min this build stage takes in Jenkins with the same parameters. Those tests were executed with this simple configuration file:

Total execution time:
The total execution time recorded in GHA for building openedx images is 30m 59s (actions/runs/10458288173/usage), while on Jenkins it is ~31 min from “scheduled” to “completion”: https://picasso.dedalo.edunext.co/job/PICASSO_V2/526/

The total execution time recorded in GHA for building MFE images is 23m 36s (actions/runs/10479182910/usage), while on Jenkins it is 25 min: https://picasso.dedalo.edunext.co/job/PICASSO_V2/531/.

Keep in mind that not all Jenkins steps have an equivalent in GHA yet (for example, syntax checks), so GHA could take a little longer once they are added.

In this spreadsheet you can better review the time comparison across a considerable number of builds: https://docs.google.com/spreadsheets/d/1PtiRz4OBj3XxkY3xUUAi827vFtytWrox6ztK6XaqGkY/edit?usp=sharing

How to test

  1. We need to create a caller workflow that uses this reusable (called) workflow. For reference, see the implementation details. The build.yml workflow actively uses the picasso/build.yml workflow and sets the necessary configurations to run correctly.
  2. To manually trigger the ednx-strains/.github/build.yml workflow execution, go to Actions > Build Open edX strain, fill in the necessary configuration, or use the default. For self-hosted runners, use the workflow from the branch MJG/self-hosted-runners, also use the MJG/redwood-test-image strain branch to avoid overriding the current image on dockerhub.
  3. Press run workflow.
  4. Go to Actions > Runners > Self hosted, you'll see the new self-hosted runner there.

You won't be able to test until we generate a fine-grained PAT for cloning ednx-strain.

See here a successful build: https://github.com/eduNEXT/ednx-strains/actions/runs/10479622786

@mariajgrimaldi mariajgrimaldi changed the base branch from main to MJG/picasso-action August 19, 2024 19:35
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/self-hosted-runners branch from 6a4b6ce to 53c7410 Compare August 20, 2024 18:30
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/self-hosted-runners branch from 53c7410 to cbb00b2 Compare August 20, 2024 18:41
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/self-hosted-runners branch from a99d2f9 to 3fa35e5 Compare August 21, 2024 14:23
@mariajgrimaldi mariajgrimaldi changed the title Mjg/self hosted runners feat: use self-hosted runners to improve build performance Aug 21, 2024
@mariajgrimaldi mariajgrimaldi marked this pull request as ready for review August 21, 2024 18:34
@mariajgrimaldi mariajgrimaldi requested a review from a team August 21, 2024 19:39
@magajh

magajh commented Aug 22, 2024

Thanks, @mariajgrimaldi!

It’s good to know that the overall build time is similar between both tools.

Regarding instance provisioning:
Is the cost in GitHub charged per provisioned instance? I’m not sure if I understood this correctly and want to clarify. Would provisioning a single instance per job in GitHub affect us in both time and cost?
Did you manage to test with multiple jobs in GitHub and check how the total build times compared to Jenkins in this case?

Another thing that concerns me a bit is the Jenkins steps that haven’t been translated to GitHub Actions yet. Besides syntax checks, what other steps are still missing?


About the action not supporting spot instances: I’m not sure this worries me too much, and I’m not inclined to make it a determining factor in the final decision. It’s not something we currently have implemented in Jenkins and, based on what I’ve read, while spot instances could significantly reduce costs, you also need a way to manage spot interruptions, and performance could be affected. So they generally seem better suited for flexible actions (?)

We could also look for another action in the list that supports spot instances or is expected to support them in the near future (not for this issue). I found a couple, but none of them seem very reliable.

@mariajgrimaldi
Contributor Author

mariajgrimaldi commented Aug 22, 2024

Thank you @magajh for your review!

Is the cost in GitHub charged per provisioned instance? I’m not sure if I understood this correctly and want to clarify. Would the provisioning of a single instance in GitHub per job affect us in both time and cost?

So, with GHA, each build job requires a new instance. Each AWS EC2 instance costs X.Y$ per minute, depending on the type. On Jenkins, a build job provisions an instance with three processes available for builds. Since we're using the same instance type, each instance costs the same.

Let's say we trigger 6 builds 1 minute apart. GHA will provision 6 instances, while Jenkins will provision 2 (1 job for each idle process). In Jenkins, provisioning happens in the cluster, so it counts as part of the maintenance costs of the cluster itself. In GHA, it happens on the GH-hosted runners, so each minute counts toward billing: ~4 minutes in total per build (see GHA billing).

We can use these formulas for computing the cost of using AWS EC2 for building, considering the situation above:

GHA
X.Y * (∑ TOTAL_EXEC_TIME_FOR_JOB_i) + 6 * (~4min on GH hosted runners)

Jenkins
X.Y * (TOTAL_EXEC_TIME_FOR_JOB_1 + ∑ DELAY_JOB_i) + X.Y * 10min (Idle time before it's turned off)

That is a very rough guess. Does it make sense? This is ONLY to compare how GH compares to Jenkins in this situation. It should not be taken as an actual cost estimate.
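The two formulas can be written as small functions to make the units explicit. This is only a sketch under assumptions: the function and parameter names are hypothetical, the Jenkins function covers a single shared instance as the formula does, and the example values (an EC2 rate of 0.384 $ per hour, 30-minute jobs, 0.008 $ per GH-hosted minute) are the ones quoted later in this thread, used here as placeholders.

```python
def gha_cost(ec2_rate_per_min, job_minutes, gh_provision_minutes=4,
             gh_rate_per_min=0.0):
    """GHA: EC2 rate times the sum of all job times, plus the
    provisioning minutes spent on GH-hosted runners for every job.
    gh_rate_per_min stays 0 while the account has minutes to spare."""
    ec2 = ec2_rate_per_min * sum(job_minutes)
    gh_hosted = len(job_minutes) * gh_provision_minutes * gh_rate_per_min
    return ec2 + gh_hosted


def jenkins_instance_cost(ec2_rate_per_min, first_job_minutes,
                          delay_minutes, idle_minutes=10):
    """Jenkins, one shared instance: time of the first job plus the
    start delays of the jobs that joined it, plus idle time before
    the instance is turned off."""
    busy = first_job_minutes + sum(delay_minutes)
    return ec2_rate_per_min * busy + ec2_rate_per_min * idle_minutes


rate = 0.384 / 60  # assumed hourly rate converted to $ per minute

# Six 30-minute builds on GHA, each on its own instance.
print(round(gha_cost(rate, [30] * 6), 4))
# One Jenkins instance shared by three builds triggered 1 minute apart.
print(round(jenkins_instance_cost(rate, 30, [1, 1]), 4))
```

Multiplying the Jenkins result by the number of instances gives a rough total for that side, with the same caveat that this is a comparison aid and not a billing estimate.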

While Jenkins supports multiple builds on the same instance to leverage resources, the GH action I chose doesn't, due to a design choice. Therefore, AWS costs might increase when triggering 6 build jobs with a delay between them since, in GH, each one would run on its own instance. We could change the EC2 instance type to something smaller but still efficient. That's why I tried building with m5.xlarge, but the performance was terrible.

Another difference in the setup I explained is that GHA will take longer when triggering multiple builds, like the building of 6 images above, since it needs to provision each instance. In contrast, after the first build, the provision time won't count in Jenkins.

Did you manage to test with multiple jobs in GitHub and check how did the total build times compare to Jenkins in this case?

Since each job provisions a new instance when using GHA, build times will remain the same even with multiple jobs running simultaneously. This doesn't happen with Jenkins: each job uses the same shared resources, so build time increases slightly. I tested these last three jobs on Jenkins to support that claim: 541, 542, 543

image

Compared to when using a single job: https://picasso.dedalo.edunext.co/job/PICASSO_V2/533/

Another thing that concerns me a bit are the Jenkins steps that haven’t been translated to GitHub Actions yet. Besides syntax checks, what other steps are still missing?

Only syntax validations. To be sure, you can review the steps here: https://github.com/eduNEXT/dedalo-scripts/blob/main/jenkins/picasso_v2. Here's what's missing: https://github.com/eduNEXT/dedalo-scripts/blob/main/jenkins/picasso_v2#L152-L157. I'll try implementing this missing step today.

@Alec4r
Member

Alec4r commented Aug 23, 2024

Thank you, @mariajgrimaldi , for the detailed PR and the thorough explanation. This is really helpful! I do have a few follow-up questions:

Impact of Spot Instances: You mentioned that spot instances could be used in Jenkins with additional plugins. How viable would it be to implement a similar solution in GHA to reduce costs? Are there any plans for the GHA action to support spot instances in the future?

Scalability: How do Jenkins and GHA compare when scaling the number of build jobs beyond 6? What happens to times and costs if, for example, 20 jobs are run simultaneously?

Future Improvements: Are there plans to improve the GHA action to support instance reuse or other mechanisms to optimize costs and time? Could we propose or contribute to those improvements?

Thanks again for your work on this!

@mariajgrimaldi
Contributor Author

mariajgrimaldi commented Aug 26, 2024

Thank you for your review, @Alec4r!

Impact of Spot Instances: You mentioned that spot instances could be used in Jenkins with additional plugins. How viable would it be to implement a similar solution in GHA to reduce costs? Are there any plans for the GHA action to support spot instances in the future?

Some work is going on to use launch templates that would support spot instance configurations: machulav/ec2-github-runner#65. However, it's been a work in progress since 2021, and there is no guarantee that it will be finished any time soon. I mentioned it since there was a proposal for Jenkins to use spot instances instead.

I can't see how Picasso would support spot instances when the build cannot be interrupted, but we can discuss that later. For now, the action doesn't support provisioning spot instances, only on-demand ones.

Scalability: How do Jenkins and GHA compare when scaling the number of build jobs beyond 6? What happens to times and costs if, for example, 20 jobs are run simultaneously?

Considering the formulas above, the estimated costs in a situation where 20 builds run simultaneously would be:

GHA
X.Y * (∑ TOTAL_EXEC_TIME_FOR_JOB_i) + 20 * (~4min on GH hosted runners)

Jenkins
X.Y * (TOTAL_EXEC_TIME_FOR_JOB_1 + ∑ DELAY_JOB_i) + X.Y * 10min (Idle time before it's turned off)

If TOTAL_EXEC_TIME_FOR_JOB_i = 30 min (base time), X.Y = 0.384 $ per hour, and DELAY_JOB_i = 5 min, then:

GHA if we have gh hosted runner minutes to spare from the account:

0.384 * (20 * 30)/60 = 3.84$

If not:
0.384 * (20 * 30)/60 + 4 * 20 * 0.008 = 4.48$

Jenkins
0.384 * 7 * (32 + 2 * 5) / 60 + 0.384 * 10 / 60 ≈ 1.95$

So, the total time spent building on an instance is lower when there is overlap. Therefore, IN THEORY, costs should be lower only if this happens. In the worst-case scenario, all builds are sequential, each running on a new instance, so there's no overlap.

We're assuming the following for these estimates to make sense:

  1. GHA always provisions a new instance, so if there are 20 jobs, then we'll consume 20*30 minutes of execution
  2. Since Jenkins can share resources among 3 triggered jobs, we assume there are at most 3 jobs running on each instance, so Jenkins provisions 20 / 3 ≈ 6.67, rounded up to 7 instances
  3. Jenkins takes up to 2 more minutes when sharing resources between jobs
  4. Since EC2 instances have an hourly rate, the minute rate would be ~ 0.384 / 60
  5. Each job takes 30 min running on an instance

As I mentioned in my comment above, this is ONLY to compare how GH compares to Jenkins in this situation, it shouldn't be taken as a cost estimate.
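The arithmetic for the 20-job scenario can be reproduced with a short Python sketch. All values come from the assumptions listed above; the variable names are hypothetical, and the idle time is counted once, mirroring the Jenkins formula as written in this comment.

```python
import math

HOURLY_RATE = 0.384        # $ per hour for the EC2 instance type
JOBS = 20                  # build jobs triggered
JOB_MINUTES = 30           # base execution time per job
GH_MINUTE_RATE = 0.008     # $ per minute on GH-hosted runners
PROVISION_MINUTES = 4      # GH-hosted minutes spent provisioning per job
JOBS_PER_INSTANCE = 3      # jobs Jenkins can place on one instance
SHARING_OVERHEAD = 2       # extra minutes when jobs share an instance
DELAY_MINUTES = 5          # delay between jobs joining an instance
IDLE_MINUTES = 10          # idle time before Jenkins turns an instance off

rate_per_min = HOURLY_RATE / 60

# GHA: every job runs on its own instance for the full job time.
gha_ec2_only = rate_per_min * JOBS * JOB_MINUTES
gha_with_hosted = gha_ec2_only + JOBS * PROVISION_MINUTES * GH_MINUTE_RATE

# Jenkins: jobs are packed three per instance, rounding up.
instances = math.ceil(JOBS / JOBS_PER_INSTANCE)
minutes_per_instance = (JOB_MINUTES + SHARING_OVERHEAD
                        + (JOBS_PER_INSTANCE - 1) * DELAY_MINUTES)
jenkins = (rate_per_min * instances * minutes_per_instance
           + rate_per_min * IDLE_MINUTES)

print(f"GHA (spare GH minutes): {gha_ec2_only:.2f}$")
print(f"GHA (paid GH minutes):  {gha_with_hosted:.2f}$")
print(f"Jenkins ({instances} instances): {jenkins:.2f}$")
```

With these inputs the script reports 3.84$ and 4.48$ for GHA and about 1.95$ for Jenkins. Changing `JOBS` or `JOBS_PER_INSTANCE` shows how quickly the gap grows or shrinks with the amount of overlap.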

Future Improvements: Are there plans to improve the GHA action to support instance reuse or other mechanisms to optimize costs and time? Could we propose or contribute to those improvements?

So, this issue addresses instance reuse: machulav/ec2-github-runner#4 (comment), but it still needs to be implemented. I don't know how much of an effort that would be. Anyway, I do think we can propose changes, although maintainers seem to take some time to review incoming PRs: https://github.com/machulav/ec2-github-runner/pulls, and the reason why is concerning: machulav/ec2-github-runner#172.

This other action is relatively similar: https://github.com/NextChapterSoftware/ec2-action-builder, with faster maintainer responses (https://github.com/NextChapterSoftware/ec2-action-builder/pulls?q=is%3Apr+is%3Aclosed), although it's not as mature (it was created in January of this year). In any case, we have limited options, so I'll test it either way to compare.

@mariajgrimaldi
Contributor Author

mariajgrimaldi commented Aug 27, 2024

@eduNEXT/dedalo: While testing, I discovered that this action doesn't support fine-grained PATs (here's a PR implementing it: machulav/ec2-github-runner#196). That's another issue to consider, besides what I mentioned in my last comment:

although maintainers seem to take some time to review incoming PRs: https://github.com/machulav/ec2-github-runner/pulls, and the reason why is concerning: machulav/ec2-github-runner#172.

This other action is relatively similar: https://github.com/NextChapterSoftware/ec2-action-builder, with faster maintainer responses (https://github.com/NextChapterSoftware/ec2-action-builder/pulls?q=is%3Apr+is%3Aclosed), although it's not as mature (it was created in January of this year). In any case, we have limited options, so I'll test it either way to compare.

Therefore, as I mentioned, I'll be testing another action instead: https://github.com/NextChapterSoftware/ec2-action-builder. Build performance should remain the same since we're only changing how the runner is provisioned. Provisioning time might change, though. I'll let you know.

@MaferMazu
Contributor

Thanks @mariajgrimaldi.
In my understanding:

  • Permissions: Everyone at edunext has read permissions, but we can share the action if needed. Note: I think sharing the action in GA is more straightforward than sharing it in Jenkins, if required.
  • Maintenance: The team is more familiar with GA.
  • Concurrency: It shouldn't be a problem.
  • Time: It is not so different.
  • Billing: It is not so clear, but based on this https://docs.google.com/document/d/1ES6vyJcYgT4y6oaE_-XeofLhDxchBYA07zaZ--bDlxM/edit?usp=sharing, Jenkins costs ~$22 and GA costs ~$25. Should we add some extra cost to GA because we also need the AWS runners?
  • Usability (and Log parser): Can we test this or wait for fine-grained PAT? Also, it is important to see a failed job to see if we have false positives and what the logs look like.
    We added the log parser in Jenkins because we had false positives; I didn't expect to have something similar in GA. But catching false positives should be on our radar for future work if we have it.
    Jenkins fail: https://picasso.dedalo.edunext.co/job/PICASSO_V2/555/.
    Could you help me run a build with this strain branch mfmz/build-that-fails-redwood, strain redwood/base, service: openedx?

@mariajgrimaldi
Contributor Author

mariajgrimaldi commented Aug 28, 2024

Thank you for the review, @MaferMazu!

Permissions: Everyone at edunext has read permissions, but we can share the action if needed. Note: I think sharing the action in GA is more straightforward than sharing it in Jenkins, if required.

The idea of making the workflow public for everyone to consume comes from this: https://docs.github.com/en/actions/creating-actions/sharing-actions-and-workflows-from-your-private-repository#about-github-actions-access-to-private-repositories

Concurrency: It shouldn't be a problem.

What's different about this approach is how many instances are provisioned when building multiple images, i.e., concurrently. While Jenkins supports multiple builds on the same instance to leverage resources, those GH actions don't support it out of the box. Therefore, for 6 different builds triggered 1 minute apart, GH will provision a new instance for each (sequential build), while Jenkins will provision 2 (1 job for each idle process).

Time: It is not so different.

Yes, the build time is similar. What could change is provisioning time for the runners.

Billing: It is not so clear, but based on this https://docs.google.com/document/d/1ES6vyJcYgT4y6oaE_-XeofLhDxchBYA07zaZ--bDlxM/edit?usp=sharing, Jenkins costs ~$22 and GA costs ~$25. Should we add some extra cost to GA because we also need the AWS runners?

In the document, I presented the monthly billing for GH with GitHub-hosted runners and Jenkins with self-hosted runners. Considering that, the costs for the workflow using self-hosted runners change. The monthly cost for Jenkins in July was ~$22, meaning that using EC2 instances to build images costs $22. This is comparable to this setup's cost if we have GH runner minutes to spare from the main account. If not, we'll have to add those additional costs:

EC2_INSTANCE_COST_JULY = 22$
NUMBER_OF_BUILDS_JULY = 66
PROVISIONING_TIME = 4 (min)
COST_PER_MINUTE_GH = 0.008$

22 + 66 * 4 * 0.008 = 24.112$

This cost calculation is an estimate. EC2_INSTANCE_COST_JULY could be higher when using GHA since all builds are sequential.
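As a rough aid, the same estimate can be expressed as a small Python function so the inputs can be swapped for other months. The function name and the `free_minutes_available` flag are hypothetical; the input values are the July figures listed above.

```python
def monthly_estimate(ec2_cost, builds, provisioning_minutes,
                     gh_cost_per_minute, free_minutes_available=False):
    """EC2 cost for the month plus the GH-hosted minutes spent
    provisioning a runner for every build. When the account still has
    included minutes to spare, the provisioning part costs nothing."""
    if free_minutes_available:
        return ec2_cost
    return ec2_cost + builds * provisioning_minutes * gh_cost_per_minute


july = monthly_estimate(
    ec2_cost=22.0,            # EC2_INSTANCE_COST_JULY
    builds=66,                # NUMBER_OF_BUILDS_JULY
    provisioning_minutes=4,   # PROVISIONING_TIME
    gh_cost_per_minute=0.008  # COST_PER_MINUTE_GH
)
print(f"{july:.3f}$")
```

The sketch keeps the EC2 cost as a fixed input, so it does not capture the possible increase from builds no longer sharing instances.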

Usability (and Log parser): Can we test this or wait for fine-grained PAT? Also, it is important to see a failed job to see if we have false positives and what the logs look like.

I'll let you know when I have the correct setup to test again. Thanks for the patience!

We added the log parser in Jenkins because we had false positives; I didn't expect to have something similar in GA. But catching false positives should be on our radar for future work if we have it. Jenkins fail: https://picasso.dedalo.edunext.co/job/PICASSO_V2/555/. Could you help me run a build with this strain branch mfmz/build-that-fails-redwood, strain redwood/base, service: openedx?

I was looking for a GHA option during implementation to replicate what Jenkins currently does, but I couldn't find any. However, we could implement something similar to https://github.com/orgs/community/discussions/27097#discussioncomment-3254611. In our specific case, the analyze-error script would read the build output and look for the keywords we care about.
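A minimal sketch of what such an analyze-error script could look like in Python. Everything here is an assumption for illustration: the function name is hypothetical, and the keyword patterns are placeholders, not the list the Jenkins log parser actually uses.

```python
import re

# Placeholder patterns; the real list would come from the Jenkins
# log parser rules, which are not shown in this thread.
ERROR_PATTERNS = [
    r"\bERROR\b",
    r"npm ERR!",
    r"Traceback \(most recent call last\)",
]


def find_error_lines(log_text, patterns=ERROR_PATTERNS):
    """Return (line_number, line) pairs for every log line matching
    at least one pattern. Line numbers start at 1."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if any(regex.search(line) for regex in compiled)
    ]


sample_log = "Step 1/3 done\nnpm ERR! code ELIFECYCLE\nStep 3/3 done"
hits = find_error_lines(sample_log)
for number, line in hits:
    print(f"line {number}: {line}")
# A workflow step could fail the job when `hits` is not empty, even if
# the build command itself exited with status 0.
```

In a workflow, the build output would first have to be captured to a file or piped to the script; how to do that depends on how the build step is written.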

@mariajgrimaldi
Contributor Author

mariajgrimaldi commented Aug 29, 2024

Thank you for all your comments, @magajh @Alec4r @MaferMazu. I really appreciate it!

Here's where we currently stand:

  1. Build times in GH with self-hosted runners are similar to Jenkins'. For more details on the reports, please take a look at the spreadsheet I linked in the cover letter.
  2. Provisioning time varies in GH and will always be added to the total execution time due to the sequential nature of the implementation. Jenkins provisions an instance that could be used for up to 3 build jobs, which decreases time when building multiple images.
  3. Monthly billing with GH might increase since all builds are sequential. Jenkins allows multiple images to be built on the same instance, decreasing total build time by leveraging available resources.
  4. So far, we've tested two actions that implement self-hosted runners provisioning similar to our current setup: https://github.com/machulav/ec2-github-runner and https://github.com/NextChapterSoftware/ec2-action-builder. Here's a comparison between the two:
| | ec2-github-runner | ec2-action-builder |
|---|---|---|
| Usability | Easy to use. 1 instance per workflow run. Needs two jobs: start, stop runner. | Easy to use. 1 instance per workflow run. Needs one job: start runner. Runner stops by using user data configurations on the AWS EC2 instance. |
| Maintenance | Maintainer stepped down. Actively looking for maintainers, still no resolution: see latest discussion comment. 24 PRs, 40 issues. Latest PR was opened in June, and there is still no resolution. Last merge/release 6 months ago. | Maintained by the author. 1 PR, 2 issues. Latest PR opened in January (draft). Last merge/release 3 weeks ago. |
| Community support | Great. More than 3 people offered to maintain the project. 11 contributors. | 4 contributors. |
| Security | Uses classic tokens to register self-hosted runners, which raises security concerns. Support for fine-grained PAT support is underway: machulav/ec2-github-runner#196, no activity on the PR since opened (June). | Uses fine-grained tokens with Administration access level under Repository permissions, which raises security concerns: NextChapterSoftware/ec2-action-builder#14. Maintainer was going to work on a solution in May, but there's no resolution yet. |
| Maturity | Around since 2020. | Created in January 2024. |
| Gaps | No reuse of runners. But there's a roadmap for it: machulav/ec2-github-runner#65 and issues for other improvements: https://github.com/machulav/ec2-github-runner/issues | No reuse of runners. No clear roadmap for it. Improvements requested in issues: https://github.com/NextChapterSoftware/ec2-action-builder/issues |

Both actions work well, but ec2-github-runner looks more promising than ec2-action-builder: it has more contributors, more improvement requests, and more contributions. However, as of today, no clear maintainer team has been appointed. That leaves us with ec2-action-builder, which currently has a critical security concern that needs to be addressed.

These are the options I see, considering what I mentioned about the actions I tested:

  1. Email the maintainer of ec2-github-runner asking for an update on maintainers and commit to a temporarily unmaintained project.
  2. Commit to using ec2-action-builder, which has an active maintainer, and contribute to fixing the security issue.
  3. Continue researching a better way of provisioning on-demand runners, instead of using the actions we've tested so far. We can explore the remaining items here: https://github.com/jonico/awesome-runners, like those using terraform, k8s, docker. This might mean changing our current setup.
  4. Temporarily stop all efforts to migrate to GH until we figure out how to move forward.

I don't think we'll find a perfect fit for what we're looking for, so we must commit to a decision and work from there. To make a decision, we should consider:

  • whether we have enough resources to contribute to other open-source projects like ec2-github-runner or ec2-action-builder,
  • or have time to wait for those projects to move forward,
  • or have time to explore other, less obvious actions (like those using terraform or k8s)

If the answer is that we don't have the resources or the time, then we should stop the effort altogether. I like this setup, but I can't see a way of committing to this solution without compromising on one of the first 3 options I mentioned.

Please let me know if you have any other ideas for moving forward. Thank you.

@magajh

magajh commented Sep 2, 2024

@mariajgrimaldi thanks for all the work and thorough testing that's gone into this

Making this decision isn't easy, both tools have their pros and cons. However, we need to move forward. Here are my thoughts:

Considering that:

  • Maintaining the Jenkins implementation has been challenging, particularly in terms of defining clear maintenance tasks and spreading knowledge, with Alejandro being the only person truly familiar with it
  • According to the analysis, build times—which were a major concern—are similar to those in Jenkins, and they remain consistent even with multiple jobs running simultaneously.
  • GitHub provides a centralized communication platform, and most of the company is familiar with GitHub Actions
  • The company plans to phase out Jenkins from our production environment, so it will no longer be a tool that most of the tech team is familiar with or uses actively

Given these points, I think we should move forward with GitHub. Here are some next steps we could follow:

  • Investigate further into GitHub Actions to improve efficiency. Evaluate the options mentioned by Majo in her comment. Explore the best approach for self-hosted runners provisioning and commit to a solution, knowing that we will be staying with GitHub and doing what is necessary to make it work. We could initially use one of the two actions we've tested so far (possibly the currently implemented one) while we figure out our approach.
  • Implement the remaining changes to the GA to match the features of our current Jenkins setup (if necessary).
  • Review and merge the PR related to maintenance tasks in GitHub Actions (https://github.com/eduNEXT/edunext-internal-documentation/pull/206)

I’m also keen to hear the thoughts and opinions of others who are reviewing this PR

@mariajgrimaldi
Contributor Author

mariajgrimaldi commented Sep 2, 2024

Thank you, @magajh. I agree. Here's what I've been thinking we could do immediately:

  1. Merge "feat: add reusable workflow to build openedx images" (#1) with GitHub-hosted runners as a temporary solution, even though times will increase for now.
  2. Migrate from using tutor-contrib-edunext-distro to tutor-contrib-picasso.
  3. Implement critical features: AWS registry support and false positive detection.
  4. Promote the use of the action internally and discourage the use of Jenkins. We can start after PR#1 is merged.

Then, we'll be able to follow up with what Maga suggested. However, if adopting the workflow solution without immediately addressing the increase in building time is a big issue, as Maga mentioned, we should decide which action to use after merging PR#1. This consists of choosing which security concern to handle for provisioning self-hosted runners (classic PATs or Fine-grained PATs with admin permissions) while we find a better solution.

@Alec4r: We discussed having a GH application with limited access for fine-grained PATs with admin permissions. Can we create one for this? Although IMO, that's a more considerable risk than using a classic PAT. So, can you help us create a classic PAT from a service account? If those are still too risky, as I mentioned previously, we should commit to the time increase for now.

@Alec4r
Member

Alec4r commented Sep 3, 2024

@Alec4r: We discussed having a GH application with limited access for fine-grained PATs with admin permissions. Can we create one for this? Although IMO, that's a more considerable risk than using a classic PAT. So, can you help us create a classic PAT from a service account? If those are still too risky, as I mentioned previously, we should commit to the time increase for now.

I believe it’s better to avoid using classic PATs for several reasons:

  • Lack of Automatic Expiration: Classic PATs don’t have a defined expiration time, making them more vulnerable if not managed properly. While we can rotate them periodically, this adds administrative overhead and increases the risk of mishandling.

  • Granular Access: With a GitHub application, we can dynamically generate access tokens each time a workflow runs, with a limited lifespan, reducing the risk in case a token is compromised.

  • Security: Using fine-grained PATs with specific, managed permissions reduces the attack surface and is more secure than granting broad access with a classic PAT, which might have unnecessary permissions.

In summary, I believe the best long-term solution is to implement fine-grained PATs through a GitHub application, despite the initial setup effort this may require. This approach minimizes security risks and optimizes token management.

@mariajgrimaldi
Contributor Author

I'll convert this to a draft until we can pick it up again. Thanks!

@mariajgrimaldi mariajgrimaldi marked this pull request as draft September 13, 2024 10:11
@mariajgrimaldi mariajgrimaldi deleted the branch MJG/picasso-action September 30, 2024 13:04
@mariajgrimaldi
Contributor Author

So this PR was closed by accident...
