Not to buffer output sequences #20
Comments
Hello. Sorry for the delay.
The contig order is necessary for some applications (e.g. ropebwt3). Multi-threading is less important: AGC on a single thread is already faster than many downstream tools. Memory is the main concern for me. My preference would be a single-thread mode with a small buffer. You can find a human AGC file here for debugging. It was created with 3.1.

PS: to elaborate on the problem a bit further: when running minigraph or ropebwt3, we put 300+ genomes on one command line:

    ropebwt3 build -bo out.fmr <(agc getset -pt1 genomes.agc asm1) \
      <(agc getset -pt1 genomes.agc asm2) \
      <(agc getset -pt1 genomes.agc asm3) \
      ...

If each agc instance takes ~2GB of memory, the total would be 300*2=600GB. It would be good if agc filled only a small buffer and then waited until the buffered sequences are consumed by downstream tools.
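To make the request concrete, here is a minimal sketch of the bounded-buffer behaviour described above, written in Python for illustration only. It is not AGC's implementation, and the names `stream_with_bounded_buffer`, `records`, `consume`, and `max_buffered` are hypothetical. The producer blocks as soon as the buffer is full, so memory stays proportional to the buffer size no matter how slowly the consumer reads, and a single producer keeps the contig order intact.

```python
import queue
import threading


def stream_with_bounded_buffer(records, consume, max_buffered=4):
    """Hand records from a producer thread to `consume` via a bounded queue.

    Hypothetical helper for illustration; returns the largest queue size
    observed by the producer, which never exceeds `max_buffered`.
    """
    buf = queue.Queue(maxsize=max_buffered)  # put() blocks while the queue is full
    done = object()  # sentinel marking the end of the stream
    peak = 0

    def producer():
        nonlocal peak
        for rec in records:
            buf.put(rec)  # waits here until the consumer frees a slot
            peak = max(peak, buf.qsize())
        buf.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        rec = buf.get()
        if rec is done:
            break
        consume(rec)  # a slow consumer simply stalls the producer
    worker.join()
    return peak
```

With one producer and one consumer the queue is strictly first-in, first-out, so records reach `consume` in the order they were produced. When the downstream tool reads from a pipe, the same effect can be had by writing each decompressed sequence directly to the output instead of collecting everything first, because a write to a full pipe blocks.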
Thank you for the data and the elaboration. I'm starting to work on this.
AGC 3.2 with streaming decompression is now published.
When we specify multiple assemblies with getset, agc seems to buffer all the output in memory. If we pipe the agc output to another program (e.g. ropebwt3) that consumes the output slowly, agc will take hundreds of GB of memory for a human pangenome. It would be great if agc used a fixed-size buffer so that it does not consume too much memory when the output is blocked.
Another related but more challenging use case is to replace each FASTA with a unix pipe. For example
In this case, each agc instance may need to load the index into memory (is that right?). Is it possible to retrieve sequences without loading the entire index?
Using agc APIs wouldn't have these problems but for tools not using the APIs, it would be good to have a workaround.