Not to buffer output sequences #20

lh3 · 2024-09-27T17:35:07Z

When we specify multiple assemblies with getset, agc seems to buffer all the output in memory. If we pipe the agc output to another program (e.g. ropebwt3) that consumes the output slowly, agc will take hundreds of GB of memory for a human pangenome. It would be great if agc used a fixed-size buffer so that it does not consume too much memory when the output is blocked.

Another related but more challenging use case is to replace each FASTA with a unix pipe. For example:

ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
  <(agc getset -pt1 genomes.agc asm2) \
  <(agc getset -pt1 genomes.agc asm3)

In this case, each agc instance may need to load the index into memory (is that right?). Is it possible to retrieve sequences without loading the entire index?

Using the agc APIs would avoid these problems, but for tools that do not use the APIs, it would be good to have a workaround.


sebastiandeorowicz · 2024-10-17T18:47:11Z

Hello. Sorry for the delay.
AGC decompresses in parallel but must output the contigs in their original order (as they appeared in the input). Each contig can be decompressed by a separate thread. Thus, if the first contig is long, the already decompressed contigs must wait in memory until the first one is ready for output. I can add a switch to output contigs out of order (but, of course, grouped by sample) if that would help.
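To illustrate why in-order output forces buffering, here is a minimal Python simulation. It is not AGC's code: the function name `emit_in_order` and the integer contig indices are hypothetical, and the only assumption is that contigs finish decompressing in some arbitrary order while output must follow the input order.

```python
def emit_in_order(completion_order):
    """Simulate in-order output when contigs finish in `completion_order`.

    Contigs are identified by their input position (0, 1, 2, ...).
    Returns (output order, peak number of finished contigs held in memory).
    This is an illustrative model, not AGC's actual implementation.
    """
    pending = set()      # finished contigs that cannot be written yet
    next_to_emit = 0     # input position of the next contig to write
    emitted = []
    peak = 0
    for idx in completion_order:
        pending.add(idx)
        peak = max(peak, len(pending))
        # Flush every contig that is now contiguous with the output so far.
        while next_to_emit in pending:
            pending.remove(next_to_emit)
            emitted.append(next_to_emit)
            next_to_emit += 1
    return emitted, peak


# Contig 0 is long and finishes last, so contigs 1-4 pile up behind it.
order, peak = emit_in_order([1, 2, 3, 4, 0])
print(order, peak)
```

When contigs happen to finish in input order, the peak stays at one; when the first contig finishes last, every other finished contig is held until it is done.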
On the second issue, agc loads only part of the index if the archive was prepared with AGC 3.x (older AGC releases use a different index organization that requires loading much more data). Nevertheless, maybe I am missing something. Can you share example agc files for the experiments?

lh3 · 2024-10-18T12:59:37Z

The contig order is necessary for some applications (e.g. ropebwt3). Multi-threading is less important. AGC on a single thread is already faster than many downstream tools. Memory is the main concern for me. My preference would be to have a single-thread mode with a small buffer.

You can find a human AGC file here for debugging. It was created with 3.1. The -a option is used because I might want to put primate genomes into the index in the future.

PS: to elaborate on the problem a bit further: when running minigraph or ropebwt3, we put 300+ genomes on a command line:

ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
  <(agc getset -pt1 genomes.agc asm2) \
  <(agc getset -pt1 genomes.agc asm3) \
  ...

If each agc instance takes ~2GB of memory, the total memory would be 300*2=600GB. It would be good if agc filled a small buffer and then waited until the buffered sequences are consumed by downstream tools.
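The behaviour requested here is essentially a bounded producer/consumer buffer with backpressure. A rough Python sketch of the idea follows. It is only an illustration under stated assumptions, not how agc implements anything: `stream_with_backpressure` and `max_buffered` are hypothetical names, and `records` stands in for decompressed sequences.

```python
import queue
import threading


def stream_with_backpressure(records, max_buffered=4):
    """Yield records in order while a producer thread fills a bounded queue.

    The producer blocks as soon as `max_buffered` records are waiting, so
    memory use stays bounded however slowly the consumer reads.
    """
    buf = queue.Queue(maxsize=max_buffered)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for rec in records:
            buf.put(rec)  # blocks while the buffer is full
        buf.put(done)

    # Daemon thread: a producer left blocked by an abandoned consumer
    # does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        rec = buf.get()
        if rec is done:
            return
        yield rec


# A slow consumer still sees every record, in the original order.
for rec in stream_with_backpressure(["contig1", "contig2", "contig3"], max_buffered=2):
    print(rec)
```

With this design the producer can run ahead of the consumer by at most `max_buffered` queued records plus the one it is currently trying to enqueue, which keeps order intact and caps memory per instance.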

sebastiandeorowicz · 2024-10-21T16:19:16Z

Thank you for the data and the elaboration. I'm starting to work on this.

sebastiandeorowicz · 2024-11-25T07:34:01Z

AGC 3.2 with streaming decompression is now published.
