
🐛 Video processing speeds #852

Open
ShanaLMoore opened this issue Oct 12, 2023 · 7 comments
Labels: bug, Contribute Back, maintenance

Comments
ShanaLMoore commented Oct 12, 2023

Summary

⚠️ This temporary fix needs to be undone before this issue can be properly worked on:

ref slack convo: https://assaydepot.slack.com/archives/C0313NK5NMA/p1701391872277629
more context here

tl;dr: 1-hour-long videos take days to process.

Nic had a client demo and, the day before the meeting, noticed that the AV was not loading his video. He had assumed the 59-minute video had finished processing because he had uploaded it the day before.

After looking into it, Kirk and Shana discovered that the video was still processing.

In the meantime, we implemented a hack to help Nic be successful with his demo; however, Rob requested that we make a ticket for him to look into what appears to be a bug with processing.

Additionally, we wrote a script to remove mp4-related jobs from the queue, and Nic has asked their customers not to upload mp4s. Those jobs now have a schedule_at date 6 months from now.

After this work is done, the dev should respawn the mp4 jobs we took out of the queue in order to unblock everything else.

related

Questions

Is it OK to leave our current hack in place indefinitely? If not, create a ticket for us to undo the "hack" after this ticket has been addressed.

Screenshot

[Screenshots omitted]

Testing Instructions

  1. Log in as an admin
  2. Open a new tab and navigate to /sidekiq/busy. Click the Live Poll button in the top right (should be green)
  3. Create a new work. Give it a video file
    a) I (@bkiahstroud) recommend a video of ~5 minutes in length to test first (see Step 7a)
  4. Quickly switch tabs back to the Sidekiq dashboard (step 2). Watch the jobs as they process (i.e. appear and disappear)
  5. Verify that a job called CreateLargeDerivativesJob shows up at some point and that it gets put in the auxiliary queue
  6. Verify that the CreateLargeDerivativesJob does not fail (i.e. it should not go back and forth between the Retries queue and the Busy queue)
    a) If the same CreateLargeDerivativesJob keeps disappearing and reappearing on the Busy page, check the Retries tab of the Sidekiq dashboard to see if it keeps showing up there. This is an indication that it is failing and retrying over and over
  7. Verify that the CreateLargeDerivativesJob does not take "too long"
    a) A video between 5-10 minutes should not take longer than 6 hours to process
  8. (Dev only) Verify that the derivative file(s) get created successfully (see the console sketch after this list)
  9. Repeat steps 3-8 with an audio file instead of a video
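
For step 8, something like this Rails console check can help (a sketch that assumes file-based derivative storage; the FileSet id is a placeholder):

```ruby
# Hypothetical dev-only check: list the derivative files Hyrax knows about
file_set = FileSet.find('REPLACE_WITH_FILE_SET_ID')

Hyrax::DerivativePath.derivatives_for_reference(file_set).each do |path|
  puts "#{path} (#{File.size(path)} bytes)"
end
```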

Notes

related convo:
https://assaydepot.slack.com/archives/C0313NKC08L/p1697141000008749

ShanaLMoore changed the title from "Video processing speeds" to "🐛 Video processing speeds" on Oct 12, 2023
ShanaLMoore added the bug label on Oct 12, 2023

ndroark commented Oct 13, 2023


jeremyf commented Nov 30, 2023

One thing to consider is that Hyrax by default creates two derivatives for a video: webm and mp4. That’s twice the amount of derivative generation.

Maybe we could see about removing one of those format types. The following code is the reference: https://github.com/samvera/hyrax/blob/b8c4fa4c8fddbb4d4d4b89fc4b514bd6d5d83928/app/services/hyrax/file_set_derivatives_service.rb#L98-L103
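
For reference, a minimal sketch of what dropping the webm output could look like (assumptions: the decorator file name is hypothetical, and the outputs hash mirrors the linked Hyrax source rather than being verified against our pinned version):

```ruby
# app/services/hyrax/file_set_derivatives_service_decorator.rb (hypothetical)
# Sketch: skip the webm derivative so ffmpeg only transcodes each video once.
Hyrax::FileSetDerivativesService.class_eval do
  private

  def create_video_derivatives(filename)
    Hydra::Derivatives::VideoDerivatives.create(
      filename,
      outputs: [
        { label: :thumbnail, format: 'jpg', url: derivative_url('thumbnail') },
        # webm output intentionally omitted to halve the transcoding work
        { label: 'mp4', format: 'mp4', url: derivative_url('mp4') }
      ]
    )
  end
end
```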

jillpe added the maintenance label on Nov 30, 2023
ShanaLMoore (Contributor, Author) commented

I wrote and ran this script to unblock PALS in the meantime. It finds the mp4 jobs and reschedules them to a later date to clear the bottleneck.
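
For context, a rough reconstruction of that kind of script (the actual script isn't included here; the job matching rules and six-month offset below are assumptions based on the summary above):

```ruby
# Hypothetical sketch, run from the Rails console via Sidekiq's API.
require 'sidekiq/api'

reschedule_at = 6.months.from_now.to_f

Sidekiq::Queue.new('default').each do |job|
  # display_class/display_args unwrap ActiveJob wrappers; the matching is a guess
  next unless job.display_class.include?('Derivatives')
  next unless job.display_args.to_s.downcase.include?('mp4')

  # Re-enqueue the same payload ~6 months out, then drop it from the live queue
  Sidekiq::Client.push(job.item.merge('at' => reschedule_at))
  job.delete
end
```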

bkiahstroud commented Jan 3, 2024

The problem

The underlying issue appears to be a CPU bottleneck. Long-running ffmpeg commands are most likely being throttled.

Proposed fix

When an audio or video derivative job is triggered, we should put it into a separate Sidekiq queue (e.g. "ffmpeg"). After that, at an ops level, have the default 3 workers run all other queues only. Then have a 4th, separate worker that runs all the other queues plus the "ffmpeg" queue. The new, additional worker has a CPU limit of 4 (higher than default) and 1 thread (lower than default).

This effectively creates a powerful "slow lane". "ffmpeg" jobs will slow down the worker while they're running, but they won't bog down all the jobs since the other three workers are still running.

Make the "ffmpeg" queue priority equal to the default queue. If Steve starts processing a PDF at 10am, and Billy starts processing a video at 11am, it makes sense that the video should have to wait for the PDFs to finish. For Billy, the real-life difference between waiting 20 minutes and 60 minutes is negligible.
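
A sketch of the job-side half of that idea (the queue name follows the testing instructions above; the job body and the worker invocations in the comments are illustrative, not the actual Hyku/ops config):

```ruby
# app/jobs/create_large_derivatives_job.rb (sketch; real contents not shown in this issue)
class CreateLargeDerivativesJob < ApplicationJob
  # Long-running ffmpeg work goes to its own queue so the dedicated worker can
  # pick it up without starving the default workers.
  queue_as :auxiliary

  def perform(file_set, file_id)
    # ... hand off to Hydra::Derivatives / ffmpeg here ...
  end
end

# Ops side (illustrative):
#   3 default workers:   bundle exec sidekiq -q default
#   1 auxiliary worker:  bundle exec sidekiq -q default -q auxiliary -c 1   # higher CPU limit, 1 thread
```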

jillpe commented Jan 29, 2024

SoftServ QA: ✅

[Video created](https://dev.commons-archive.org/concern/oers/a2283fe4-e88f-4bdd-96a3-e5a8641218a4?locale=en)
Took a total of 1 hour and 15ish minutes to run the jobs.

jillpe moved this from SoftServ QA to PALs QA in palni-palci on Jan 29, 2024
ndroark moved this from PALs QA to Deploy to Production in palni-palci on Jan 31, 2024
jillpe moved this from Deploy to Production to Ready for Development in palni-palci on Feb 1, 2024
bkiahstroud moved this from Ready for Development to Code Review in palni-palci on Feb 14, 2024
bkiahstroud moved this from Code Review to Deploy to Staging in palni-palci on Feb 14, 2024
bkiahstroud (Contributor) commented

Blocked

Both worker and workerAuxiliary deployments fail. The pods spawn and immediately get stuck in an infinite loop trying to connect to redis. The GitHub deployment action itself fails (example) with this error:

Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: timed out waiting for the condition: no ConfigMap with the name "palni-palci-demo-redis" found

A potential cause of this issue could be that the Hyrax chart version was upgraded (diff). Worth noting: base Hyku is using the same chart version we are and doesn't seem to have this issue.

bkiahstroud moved this from Deploy to Staging to SoftServ QA in palni-palci on Feb 26, 2024
jillpe commented Feb 26, 2024

SoftServ QA: ✅

Work

[Screenshot omitted]

jillpe moved this from SoftServ QA to PALs QA in palni-palci on Feb 26, 2024
ndroark moved this from PALs QA to Deploy to Production in palni-palci on Feb 27, 2024
bkiahstroud moved this from Deploy to Production to Client Verification in palni-palci on Mar 4, 2024
ndroark moved this from Client Verification to Done in palni-palci on Apr 15, 2024
Projects
Status: Done
Development

No branches or pull requests

6 participants