batch_align.py loads up the whole query fasta into RAM #225

leoisl · 2023-04-03T14:15:53Z

See https://github.com/karel-brinda/mof-search/blob/e8e681b67538c3eadff2e577581a36183cd27303/scripts/batch_align.py#L150-L154

This clearly does not scale well when the query fasta is massive (e.g. read sets). One quick and easy way to save some RAM is to load only the queries that map to the given batch. Should I implement this, @karel-brinda, as it is pretty quick to do? Of course, if all or most of the read set still maps to the batch, we will still load a lot. The only way around this, I think, is to create a fasta index on the query fasta and load only the fasta IDs, with the sequences being loaded from disk on demand...
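A minimal sketch of the first idea (keeping only the queries that map to the current batch), assuming the names of those queries are already known. The helper names `iter_fasta` and `load_queries_for_batch` are hypothetical and are not taken from `batch_align.py`:

```python
import io


def iter_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the query name is the first whitespace-delimited token.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def load_queries_for_batch(handle, wanted_names):
    """Stream the FASTA and keep only records whose name is in wanted_names."""
    wanted = set(wanted_names)
    return {name: seq for name, seq in iter_fasta(handle) if name in wanted}


# Tiny in-memory example; a real run would pass an open query file instead.
example = io.StringIO(">q1 desc\nACGT\nAC\n>q2\nGGGG\n>q3\nTT\n")
queries = load_queries_for_batch(example, {"q1", "q3"})
# queries == {"q1": "ACGTAC", "q3": "TT"}
```

Peak memory then depends on how many queries hit the batch rather than on the size of the query file, which is why it does not help when most reads map to the same batch.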

This does not matter much if the mof-search use case does not involve mapping read sets, which is what I thought from the beginning, but I know you've been mapping ONT datasets with it...

karel-brinda · 2023-05-16T12:46:50Z

Completely agree that the current implementation is not great and this will have to be somehow addressed if we want to support even large query files. However, how specifically would you implement this?

Imagine you have, e.g., a nanopore sequencing experiment and there's one batch where essentially all reads go. Then every ref genome can have basically any subset of reads. So how would having them in a distinct file help? Or do you mean that it would help the other batches use fewer resources?

In this case, what about having just query names in these batch files? So that we don't store the same sequences too many times in hundreds of files.
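One way the names-only batch files could look, as a rough sketch. The input format (an iterable of `(query_name, batch_name)` pairs), the `.names.txt` suffix, and both function names are assumptions made for illustration, not how mof-search currently stores its intermediate results:

```python
import os
import tempfile
from collections import defaultdict


def write_batch_name_files(hits, out_dir):
    """Write one '<batch>.names.txt' per batch, listing query names only (no sequences)."""
    names_by_batch = defaultdict(set)
    for query_name, batch in hits:
        names_by_batch[batch].add(query_name)
    paths = {}
    for batch, names in names_by_batch.items():
        path = os.path.join(out_dir, f"{batch}.names.txt")
        with open(path, "w") as fh:
            for name in sorted(names):
                fh.write(name + "\n")
        paths[batch] = path
    return paths


def read_batch_names(path):
    """Read a names file back into a set of query names."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


# Hypothetical hits: q1 maps to both batches, q2 only to batch_a.
hits = [("q1", "batch_a"), ("q2", "batch_a"), ("q1", "batch_b"), ("q1", "batch_a")]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_batch_name_files(hits, tmp)
    names_a = read_batch_names(paths["batch_a"])
    names_b = read_batch_names(paths["batch_b"])
```

Each sequence then lives only in the original query file, and a batch file costs a few bytes per query regardless of read length.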

leoisl · 2023-05-16T13:12:40Z

I don't quite remember this issue/the code well now, but there is definitely a way to avoid loading the whole fasta file into RAM, which can easily be tens of GB (we are loading the uncompressed sequences into a Python dictionary, so it will take even more RAM than the uncompressed fasta on disk). I think we can simply store the identifiers for each read (e.g. <read file index, read index>, or the read header, etc.) and at least skip the sequence, which would make it much more scalable. I have to get back into this code to see which options are feasible, but I think we should have a more scalable approach than loading the whole query into RAM...
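A sketch of the identifier-only approach, assuming an uncompressed query fasta: keep a small in-memory map from query name to the byte offset of its record, and read a sequence from disk only when it is needed. `build_offset_index` and `fetch_sequence` are hypothetical names, not existing functions in the repository:

```python
import os
import tempfile


def build_offset_index(path):
    """Map each query name to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumption: the name is the first whitespace-delimited token.
                fields = line[1:].split()
                index[fields[0].decode() if fields else ""] = offset
    return index


def fetch_sequence(path, index, name):
    """Seek to one record and read only its sequence lines."""
    chunks = []
    with open(path, "rb") as fh:
        fh.seek(index[name])
        fh.readline()  # skip the header line
        for line in fh:
            if line.startswith(b">"):
                break
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Small on-disk example; only names and offsets are held in memory.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "queries.fa")
    with open(fasta, "wb") as fh:
        fh.write(b">q1 desc\nACGT\nAC\n>q2\nGGGG\n")
    index = build_offset_index(fasta)
    seq = fetch_sequence(fasta, index, "q2")
```

Memory then grows with the number of queries rather than with total sequence length. An established on-disk index such as the `.fai` file produced by `samtools faidx` serves the same purpose; the version above only illustrates the principle and does not handle compressed input.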
