batch_align.py loads up the whole query fasta into RAM #225

Open
leoisl opened this issue Apr 3, 2023 · 2 comments

@leoisl
Collaborator

leoisl commented Apr 3, 2023

See https://github.com/karel-brinda/mof-search/blob/e8e681b67538c3eadff2e577581a36183cd27303/scripts/batch_align.py#L150-L154

This clearly does not scale when the query FASTA is massive (e.g. read sets). A quick way to save some RAM is to load only the queries that map to the given batch. Should I implement this, @karel-brinda? It is pretty quick to do. Of course, if all or most of the read set still maps to the batch, we will still load a lot. The only way around this, I think, is to create a FASTA index on the query FASTA and load only the FASTA IDs, with the sequences loaded from disk on demand...

This does not matter much if the mof-search use case does not involve mapping read sets, which is what I assumed from the beginning, but I know you've been mapping ONT datasets with it...
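
To make the index idea concrete, here is a minimal sketch, assuming an uncompressed FASTA and using only the standard library. The function names `build_offset_index` and `fetch_sequence` are hypothetical and are not part of `batch_align.py`. Only the IDs and byte offsets stay in RAM; each sequence is read from disk when it is asked for.

```python
import os
import tempfile


def build_offset_index(fasta_path):
    """Map each record ID to the byte offset of its header line."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The ID is the first whitespace-delimited token of the header.
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def fetch_sequence(fasta_path, index, record_id):
    """Read one sequence from disk by seeking to its recorded offset."""
    chunks = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip())
    return b"".join(chunks).decode()


if __name__ == "__main__":
    # Tiny made-up FASTA, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "queries.fa")
        with open(path, "wb") as out:
            out.write(b">read1 first\nACGT\nAC\n>read2\nGGTT\n")
        idx = build_offset_index(path)
        print(fetch_sequence(path, idx, "read2"))
```

An existing `.fai` index (as produced by `samtools faidx`) could serve the same purpose; the sketch just avoids extra dependencies.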

@karel-brinda
Owner

Completely agree that the current implementation is not great, and this will have to be addressed if we want to support large query files. However, how specifically would you implement this?

Imagine you have, e.g., a nanopore sequencing experiment and there's one batch where essentially all reads go. Then every reference genome can have basically any subset of reads. So how would having them in a distinct file help? Or do you mean that it would help the other batches use fewer resources?

In this case, what about having just query names in these batch files? So that we don't store the same sequences too many times in hundreds of files.
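
A rough sketch of how names-only batch files could be consumed, assuming one query name per line in a per-batch text file (that file layout and the function names `read_query_names`, `iter_fasta`, and `load_batch_queries` are assumptions for illustration, not existing mof-search code). The query FASTA is streamed once and only the sequences named for the batch are kept.

```python
import os
import tempfile


def read_query_names(names_path):
    """Load the query names listed for one batch (one name per line)."""
    with open(names_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def iter_fasta(fasta_path):
    """Yield (name, sequence) pairs one record at a time."""
    name, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                fields = line[1:].split()
                name = fields[0] if fields else ""
                chunks = []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def load_batch_queries(fasta_path, names_path):
    """Keep only the sequences whose names appear in the batch file."""
    wanted = read_query_names(names_path)
    return {name: seq for name, seq in iter_fasta(fasta_path) if name in wanted}
```

As noted above, this only helps batches that receive a small share of the queries; a batch that receives nearly all reads would still hold nearly everything in memory.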

@leoisl
Collaborator Author

leoisl commented May 16, 2023

I don't quite remember this issue or the code well now, but there is definitely a way to avoid loading the whole FASTA file into RAM, which can easily be tens of GB (we are loading the uncompressed sequences into a Python dictionary, so it will take even more RAM than the uncompressed FASTA on disk). I think we can simply store an identifier for each read (e.g. <read file index, read index>, or the read header, etc.) and at least leave out the sequence, which would make it much more scalable. I have to get back into this code to know which options are feasible, but I think we should have a more scalable approach than loading the whole query into RAM...
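
One possible reading of the <read file index, read index> idea, as a hedged sketch: reads are referred to by a pair of integers, and the sequences are pulled from disk only for the pairs that are actually needed. The names `iter_sequences` and `fetch_reads` are made up for this example, and it assumes uncompressed FASTA files whose record order does not change between runs.

```python
import os
import tempfile
from collections import defaultdict


def iter_sequences(fasta_path):
    """Yield the sequences of a FASTA file in file order, one at a time."""
    chunks, seen_header = [], False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    yield "".join(chunks)
                seen_header, chunks = True, []
            elif seen_header:
                chunks.append(line)
    if seen_header:
        yield "".join(chunks)


def fetch_reads(fasta_paths, wanted_ids):
    """Resolve (file_index, read_index) pairs to sequences.

    Each file that is needed is streamed once; only requested reads are kept.
    """
    by_file = defaultdict(set)
    for file_index, read_index in wanted_ids:
        by_file[file_index].add(read_index)

    found = {}
    for file_index, read_indexes in by_file.items():
        sequences = iter_sequences(fasta_paths[file_index])
        for read_index, seq in enumerate(sequences):
            if read_index in read_indexes:
                found[(file_index, read_index)] = seq
    return found
```

The integer pairs are much smaller than the sequences they stand for, at the cost of one pass over each referenced file when the sequences are finally needed.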
