Remove redundant BAM file open in paired mode #32

Open
wants to merge 1 commit into
base: master

@siddharthab siddharthab commented Sep 4, 2024

Fixes #31.

Opening a BAM file is expensive because the index has to be read in
full. In paired reads mode, the file was reopened at every contig change
to iterate over all reads from the previous contig. This is usually not
a problem for genome alignments, but transcriptome alignments may have
~100k contigs, so the repeated opens add up to a significant cost.
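
To illustrate why the number of opens matters, here is a toy model in Python. It is not the project's code: `FakeBam`, `second_pass_reopening`, and `second_pass_shared` are hypothetical names, and the "index load" is only a counter standing in for the real cost of reading the index.

```python
class FakeBam:
    """Hypothetical stand-in for an indexed BAM reader.

    It only counts how many times a reader is constructed, which is
    where a real reader would load the index.
    """

    index_loads = 0

    def __init__(self, records):
        # The costly step in the real case: the index is read on open.
        FakeBam.index_loads += 1
        self._records = records

    def fetch(self, contig):
        # Return all (contig, read_name) records on the given contig.
        return [r for r in self._records if r[0] == contig]


def second_pass_reopening(records, contigs):
    """Pattern described above: a fresh open at every contig change."""
    out = []
    for contig in contigs:
        reader = FakeBam(records)  # one index load per contig
        out.extend(reader.fetch(contig))
    return out


def second_pass_shared(records, contigs):
    """Alternative pattern: open once and reuse the reader for every contig."""
    reader = FakeBam(records)  # a single index load in total
    out = []
    for contig in contigs:
        out.extend(reader.fetch(contig))
    return out
```

Both functions return the same reads; the difference is that the first loads the index once per contig, so its cost grows with the contig count, while the second loads it once.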

Ideally, the two-pass mode should not have to read the file again at
all, and should instead maintain a rolling window of reads in memory.
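
A minimal sketch of that idea, assuming the window can be as coarse as "all reads of the current contig" and that the input is sorted so each contig's reads are contiguous. The function `process_by_contig` and the `handle_contig` callback are hypothetical and not part of this PR.

```python
def process_by_contig(reads, handle_contig):
    """Single pass over (contig, read_name) pairs.

    Reads are buffered in memory until the contig changes, then the
    buffer is handed to `handle_contig` and a new buffer is started,
    so nothing has to be read from the file a second time.
    """
    current = None
    window = []
    for contig, name in reads:
        if contig != current and window:
            # Contig changed: flush the buffered reads of the previous contig.
            handle_contig(current, window)
            window = []
        current = contig
        window.append(name)
    if window:
        # Flush the final contig.
        handle_contig(current, window)
```

The trade-off is memory: the buffer holds every read of one contig at a time, which is small for transcriptome contigs but could be large for a whole chromosome.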

siddharthab commented Sep 4, 2024

With this change, the test case in the linked issue now takes 14 minutes instead of 6.3 hours, roughly a 27x speedup.

@siddharthab
Author

@Daniel-Liu-c0deb0t Can you please accept this PR?

@MatthiasZepper

I just wanted to express explicit support for this proposal!

While I am not familiar with the implementation details, I think it is a very important fix. Transcriptomic alignments and draft genome assemblies typically have numerous contigs, and if this fix speeds up the deduplication of those input files so dramatically, I would love to see it merged!

Successfully merging this pull request may close these issues.

Very slow paired reads mode for transcriptome
2 participants