Not to buffer output sequences #20

Open
lh3 opened this issue Sep 27, 2024 · 4 comments
Comments

@lh3
Collaborator

lh3 commented Sep 27, 2024

When we specify multiple assemblies with getset, agc seems to buffer all the output in memory. If we pipe the agc output to another program (e.g. ropebwt3) that consumes the output slowly, agc will take hundreds of GB of memory for a human pangenome. It would be great if agc used a fixed-size buffer so that it does not consume too much memory when the output is blocked.

Another related but more challenging use case is to replace each FASTA with a unix pipe. For example:

ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
  <(agc getset -pt1 genomes.agc asm2) \
  <(agc getset -pt1 genomes.agc asm3)

In this case, each agc instance may need to load the index into memory (is that right?). Is it possible to retrieve sequences without loading the entire index?

Using the agc APIs would avoid these problems, but for tools that do not use the APIs, it would be good to have a workaround.

@sebastiandeorowicz
Member

Hello. Sorry for the delay.
AGC decompresses in parallel but must output the contigs in their original order (as they appeared in the input). Each contig can be decompressed by a separate thread. Thus, if the first contig is long, the contigs that are already decompressed must wait in memory until the first one is ready for output. I can add a switch to output contigs out of order (but, of course, grouped by sample) if this helps.
As for the second issue, agc loads only part of the index if the archive was prepared with AGC 3.x (older AGC releases use a different index organization that requires loading much more data). Nevertheless, I may be missing something. Can you share example agc files for the experiments?
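
The buffering effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not AGC's code: `peak_buffered` is a hypothetical helper, and the list of indices stands in for the order in which worker threads happen to finish their contigs.

```python
def peak_buffered(finish_order):
    """Simulate in-order output of contigs that finish decompressing out of order.

    finish_order: contig indices (0-based input order) in the order the
    hypothetical worker threads finish them.
    Returns (emitted_order, peak), where peak is the largest number of
    finished contigs that had to be held in memory waiting for an earlier one.
    """
    waiting = set()      # finished contigs that cannot be written yet
    next_to_emit = 0     # index of the next contig allowed on the output
    emitted = []
    peak = 0
    for idx in finish_order:
        waiting.add(idx)
        # Flush every contig that is now next in line.
        while next_to_emit in waiting:
            waiting.remove(next_to_emit)
            emitted.append(next_to_emit)
            next_to_emit += 1
        peak = max(peak, len(waiting))
    return emitted, peak


# Contig 0 is long and finishes last, so contigs 1..3 all wait in memory.
print(peak_buffered([1, 2, 3, 0]))
# Contigs finish in input order, so nothing has to wait.
print(peak_buffered([0, 1, 2, 3]))
```

In this toy model the output order is always the input order; only the amount of waiting data changes with the finishing order.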

@lh3
Collaborator Author

lh3 commented Oct 18, 2024

The contig order is necessary for some applications (e.g. ropebwt3). Multi-threading is less important. AGC on a single thread is already faster than many downstream tools. Memory is the main concern for me. My preference would be to have a single-thread mode with a small buffer.

You can find a human AGC file here for debugging. It was created with 3.1. -a is used as I might want to put primate genomes into the index in the future.

PS: to elaborate on the problem a bit further, when running minigraph or ropebwt3 we put 300+ genomes on one command line:

ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
  <(agc getset -pt1 genomes.agc asm2) \
  <(agc getset -pt1 genomes.agc asm3) \
  ...

If each agc instance takes ~2GB of memory, the total memory would be 300*2=600GB. It would be good if agc filled only a small buffer and then waited until the buffered sequences are consumed by downstream tools.
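
The requested behaviour, a single producer that stalls once a small buffer is full, can be sketched with a bounded queue. This is a minimal illustration assuming a single producer thread; `stream` and its `maxsize` parameter are hypothetical names and do not correspond to any agc option or API.

```python
import queue
import threading


def stream(records, maxsize=2):
    """Yield records in their original order through a fixed-size buffer.

    The producer thread blocks in put() whenever the buffer already holds
    `maxsize` items, so memory use stays bounded however slowly the
    consumer reads.
    """
    buf = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for rec in records:
            buf.put(rec)   # blocks while the buffer is full (backpressure)
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            return
        yield item


# A slow consumer still sees every record, in order.
for rec in stream(["contig1", "contig2", "contig3"], maxsize=1):
    print(rec)
```

With one producer and a FIFO queue the order is preserved automatically, which matches the preference above for a single-thread mode with a small buffer.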

@sebastiandeorowicz
Member

Thank you for the data and the elaboration. I'm starting to work on this.

@sebastiandeorowicz
Member

AGC 3.2 with streaming decompression is now published.
