
fix(ingest): decrease ingest memory usage for large datasets #3505

Open — wants to merge 13 commits into main
Conversation

@anna-parker (Contributor) commented Jan 8, 2025

resolves #

preview URL: https://ingest-memory-fixes.loculus.org/

Summary

When dealing with organisms with a large number of sequences (e.g. influenza A has 1.3M sequences), even just reading the entire metadata into memory is highly memory intensive. This PR switches to streaming metadata wherever possible, similar to the change @corneliusroemer made in #2277 to stream sequences.
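
For illustration, a minimal sketch of the streaming pattern, assuming NDJSON metadata; `stream_metadata` and the file layout are hypothetical and not the PR's actual code:

```python
import json
from collections.abc import Iterator

# Hypothetical sketch: yield one metadata record at a time instead of
# loading all records (e.g. 1.3M for influenza A) into memory at once.
def stream_metadata(path: str) -> Iterator[dict]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Downstream steps can then consume records one by one, keeping memory
# usage roughly constant regardless of dataset size.
```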

Note that when segments are grouped, the entire file must still be read into a dictionary.
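
To illustrate why grouping blocks streaming, here is a minimal sketch; the `group_segments` helper and the `isolate` key are hypothetical, not the PR's code:

```python
from collections import defaultdict

# Hypothetical illustration: segments belonging to the same isolate can be
# scattered anywhere in the file, so every record must be held in memory,
# keyed by its group id, before any complete group can be emitted.
def group_segments(records, group_key="isolate"):
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    return groups
```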


PR Checklist

  • Improve batching: this step still reads both the FASTA and the metadata fully into memory and is therefore the current memory bottleneck; removing it requires the input files to be sorted (see the sketch after this list)
  • Extend tests to NDJSON files
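
For the batching item above, a minimal sketch of what a sorted-input approach could look like: if both files are pre-sorted by accession, they can be joined in a single streaming pass (a merge join) instead of loading either file fully into memory. `merge_sorted` and the iterator contract (pairs of `(accession, payload)` in sorted order) are assumptions, not code from this PR:

```python
# Hypothetical merge-join sketch over two pre-sorted iterators.
def merge_sorted(fasta_iter, metadata_iter):
    fasta = next(fasta_iter, None)
    meta = next(metadata_iter, None)
    while fasta is not None and meta is not None:
        if fasta[0] == meta[0]:
            # Matching accession: emit the joined record.
            yield fasta[0], fasta[1], meta[1]
            fasta = next(fasta_iter, None)
            meta = next(metadata_iter, None)
        elif fasta[0] < meta[0]:
            fasta = next(fasta_iter, None)  # sequence without metadata
        else:
            meta = next(metadata_iter, None)  # metadata without sequence
```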

@anna-parker added the preview (triggers a deployment to argocd) label on Jan 8, 2025
@anna-parker changed the title from Ingest memory fixes to fix(ingest): lower ingest memory usage for large datasets on Jan 9, 2025
@anna-parker changed the title to fix(ingest): decrease ingest memory usage for large datasets on Jan 9, 2025
@anna-parker marked this pull request as ready for review on January 9, 2025 at 20:08