Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(backend): reprocessing mechanism and processing pipeline versions #1506

Merged
merged 3 commits into from
Mar 31, 2024

Conversation

chaoran-chen
Copy link
Member

@chaoran-chen chaoran-chen commented Mar 27, 2024

resolves #846
preview URL: https://reprocessing.loculus.org

Summary

  • Processing pipelines must now specify their version when they retrieve or submit sequences.
  • The data from a newer processing pipeline will be used when all sequences that had been successfully processed by the current pipeline were also successfully processed by the newer pipeline.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.
  • Adapt processing pipelines
    • Dummy pipeline
    • Nextclade pipeline (thanks to Cornelius!)

@chaoran-chen chaoran-chen added the preview Triggers a deployment to argocd label Mar 28, 2024
@fengelniederhammer
Copy link
Contributor

Please mind #1500. Maybe rebase this on top of that to prevent further merge conflicts? The benefit is simpler code.

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Mar 28, 2024

Can you describe what happens when the pipeline version increases:

  • which data do you send preferentially, new entries that haven't been processed at all, or those that need reprocessing? It would be good to send new sequences preferentially, to not cause delays for current submitters.
  • you say reprocessing results are only accepted if all reprocessed data doesn't error, so what happens with new data when there is an error in reprocessed sequences? Will we hence need to have two versions running in parallel to be able to have new submissions be processed on case the new version errors?

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Mar 28, 2024

Pipeline version is the entirety of the processing pipeline, not just the processing image but also its configuration in values.yaml

In order to have multiple pipeline versions running for upgrading/testing, we need to adjust the k8s configuration to allow deployment of multiple pipelines, not just one as is the case at the moment.

Is the pipeline version surfaced in Silo Export? I suppose it should be to make it verifiable by end users.

@corneliusroemer
Copy link
Contributor

Any clues about how to test this stuff? I suppose any tests could live entirely in the backend (unit) tests with mocked pipeline requests.

On the other hand, it's not so bad to test manually: one just makes requests with a pipeline with version X, then more requests with version X+1 and sees what the backend does. Extensive debug/trace logs would be super useful there.

@chaoran-chen chaoran-chen force-pushed the reprocessing branch 2 times, most recently from 07f8827 to 0849eb9 Compare March 28, 2024 17:23
@chaoran-chen
Copy link
Member Author

(Cornelius and I had a short call and clarified the questions.)

@chaoran-chen chaoran-chen marked this pull request as ready for review March 28, 2024 21:55
@chaoran-chen
Copy link
Member Author

@corneliusroemer, you're right, the error handling was wrong! Now, it should throw a 422.

@theosanderson
Copy link
Member

(looks really nice!)

chaoran-chen and others added 3 commits March 31, 2024 22:12
* Processing pipelines must now specify their version when they retrieve or submit sequences.
* The data from a newer processing pipeline will be used when all sequences that had been successfully processed by the current pipeline were also successfully processed by the newer pipeline.
* Adjust nextclade pipeline with version param
* Set pipeline version in values.yaml
* Adjust values.yaml and helm to allow multiple processing pipelines
* fix config volume
Copy link
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Sorry, I didn't approve before because I thought there was still un-resolved stuff from Cornelius, but I was misunderstanding.

@chaoran-chen chaoran-chen merged commit b8aec2c into main Mar 31, 2024
13 checks passed
@chaoran-chen chaoran-chen deleted the reprocessing branch March 31, 2024 20:24
chaoran-chen added a commit that referenced this pull request Mar 31, 2024
* Processing pipelines must now specify their version when they retrieve or submit sequences.
* The data from a newer processing pipeline will be used when all sequences that had been successfully processed by the current pipeline were also successfully processed by the newer pipeline.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reprocessing
4 participants