Flag to overwrite existing samples? #6
Comments
Well, removing (or overwriting) existing samples is a difficult task. This is related to how agc handles contigs. Each one is split into segments (substrings a few (tens of) kbp long). Segments with the same edge k-mers are grouped together. One of them (the first one) is taken as the reference (usually it comes from the reference sample but, in general, it can also come from some other sample). All other segments in the group are then LZ-encoded using the reference. Finally, the LZ-encoded segments are divided into batches, and each batch is zstd-compressed.

Let's think about what happens when a single segment needs to be removed. If it is the reference segment of a group, all other segments (from other samples) must be decompressed, a new reference must be selected, all other segments must be LZ-parsed using the new reference and, finally, after being divided into new batches, they must be zstd-compressed. If the segment to remove is not a reference, it is a bit simpler. It suffices to zstd-decompress the batches in the group, remove the segment, form new batches, and zstd-compress them. In both cases, it would be necessary to decompress the metadata describing the collection and carefully update the segment ids, because the ids change for all segments after the removed one.

Summing up, this is not easy. I'm adding this feature to my TODO list. I hope we will be able to implement it in the future, but not in the forthcoming v.3.1 release.

There is also one more problem. Currently, you can construct an archive containing sample A, then add sample B, then C, then D, then E. The resulting archive will be exactly the same (if you do not use command-line storage) as if you had constructed it from samples A+B+C+D+E in a single run. Unfortunately, if the sample removal described above is implemented and you compare the archive built from A+B+C+E with the one built from A+B+C+D+E and then stripped of D, you will notice differences. The archives will contain exactly the same data, but the binary representation will not be the same. I do not know if this is an important issue. It can be solved, but the solution would be equivalent to decompressing the whole archive, removing one sample, and compressing from scratch. This would take time.
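The asymmetry between removing a reference segment and removing any other segment can be illustrated with a small toy model. This is only a sketch and not agc's actual code: `encode_group`, `decode_group`, and `remove_segment` are hypothetical names, zlib with a preset dictionary stands in for LZ-encoding against the reference, and the batching, zstd compression, and metadata updates described above are left out entirely.

```python
import zlib


def encode_group(segments):
    """Encode a group: segment 0 is the reference, the rest are coded against it."""
    if not segments:
        return []
    ref = segments[0]
    encoded = [zlib.compress(ref)]
    for seg in segments[1:]:
        # The reference acts as a preset dictionary (a stand-in for LZ-parsing
        # a segment against the reference segment).
        comp = zlib.compressobj(zdict=ref)
        encoded.append(comp.compress(seg) + comp.flush())
    return encoded


def decode_group(encoded):
    """Decode a group produced by encode_group."""
    if not encoded:
        return []
    ref = zlib.decompress(encoded[0])
    segments = [ref]
    for blob in encoded[1:]:
        decomp = zlib.decompressobj(zdict=ref)
        segments.append(decomp.decompress(blob) + decomp.flush())
    return segments


def remove_segment(encoded, index):
    """Remove one segment; return (new_encoded, number_of_segments_reencoded)."""
    if index == 0:
        # Removing the reference: decode everything, promote the next segment
        # to reference, and re-encode the whole group.
        remaining = decode_group(encoded)[1:]
        return encode_group(remaining), len(remaining)
    # Removing a non-reference segment: the others still decode against the
    # unchanged reference, so nothing has to be re-encoded in this toy model.
    return encoded[:index] + encoded[index + 1:], 0


if __name__ == "__main__":
    segs = [b"ACGT" * 50, b"ACGT" * 49 + b"ACGA", b"TTGA" + b"ACGT" * 49]
    enc = encode_group(segs)
    _, cost_non_ref = remove_segment(enc, 1)
    _, cost_ref = remove_segment(enc, 0)
    print("segments re-encoded (non-reference removed):", cost_non_ref)
    print("segments re-encoded (reference removed):", cost_ref)
```

In the real archive even the cheaper case still requires re-forming and recompressing the batches and rewriting segment ids, which this model does not capture.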
Thank you, this is super helpful context! I appreciate the thorough and thoughtful details, and it helps me better understand how
I think this specific issue (same data, different representation) is probably tolerable with respect to current state-of-the-art databases. For example, if I'm understanding correctly, SQLite runs into a similar issue (similar from the user perspective; I imagine quite different under the hood) when a user removes, adds, removes, etc. many entries. Performance degrades and file size is artificially large due to fragmented data, which is why SQLite provides the

Either way, I'll keep following your awesome work, and I look forward to trying out the new updates 😄
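The SQLite behaviour mentioned above can be observed with a short experiment. This is a minimal sketch using Python's standard `sqlite3` module. It assumes the default `auto_vacuum` setting (off), and it uses SQLite's `VACUUM` command as one way to rebuild the file compactly; whether that is the feature the comment had in mind is an assumption. The table name, column names, and the function `sizes_around_vacuum` are made up for illustration.

```python
import os
import sqlite3
import tempfile


def sizes_around_vacuum(rows=2000):
    """Return database file sizes: when full, after deleting all rows, after VACUUM."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE samples (name TEXT, seq BLOB)")
    con.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        ((f"s{i}", b"A" * 1000) for i in range(rows)),
    )
    con.commit()
    full = os.path.getsize(path)

    # Deleting rows marks their pages as free but does not shrink the file
    # (with auto_vacuum off, which is the usual default).
    con.execute("DELETE FROM samples")
    con.commit()
    after_delete = os.path.getsize(path)

    # VACUUM rebuilds the database file from scratch, dropping the free pages.
    con.execute("VACUUM")
    after_vacuum = os.path.getsize(path)
    con.close()
    return full, after_delete, after_vacuum


if __name__ == "__main__":
    full, after_delete, after_vacuum = sizes_around_vacuum()
    print("full:", full, "after delete:", after_delete, "after vacuum:", after_vacuum)
```

The parallel with the archive case is loose: in both, the compact representation is only recovered by rewriting the whole file.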
AGC 3.1 is ready. At the moment, removing/updating samples is not implemented. We are considering a provisional implementation in one of the next releases.
Thanks for the update! Looking forward to it 😄
Thanks again for this awesome tool! I'm trying to compress a large dataset of viral genomes, and sometimes we get updated genome sequences for an existing ID. I like that the default behavior is for agc to (seemingly) skip existing sample names and print an error message, but in some cases, it might be nice to have an "overwrite" flag to force replacing the old entry with the new one. Would such a feature be feasible?

More broadly, I see that there's no option to remove specific sequence(s) from the archive; would this be feasible to add, or would one have to extract everything, remove those specific entries, and recreate the archive?