Flag to overwrite existing samples? #6

Open
niemasd opened this issue Mar 22, 2023 · 4 comments

Comments

@niemasd

niemasd commented Mar 22, 2023

Thanks again for this awesome tool! I'm trying to compress a large dataset of viral genomes, and sometimes we get updated genome sequences for an existing ID. I like that the default behavior is for agc to (seemingly) skip existing sample names and print an error message, but in some cases, it might be nice to have an "overwrite" flag to force replacing the old entry with the new one. Would such a feature be feasible?

More broadly, I see that there's no option to remove specific sequence(s) from the archive; would this be feasible to add, or would one have to extract everything, remove those specific entries, and recreate the archive?

@sebastiandeorowicz
Member

Well, removing (or overwriting) existing samples is a difficult task because of how agc handles contigs. Each contig is split into segments (substrings a few kbp to a few tens of kbp long).

Segments with the same edge k-mers are grouped together. One of them (the first one) is taken as the reference; it usually comes from a reference sample but, in general, can also come from another sample. All other segments in the group are then LZ-encoded against the reference. Finally, the LZ-encoded segments are divided into batches, and each batch is zstd-compressed.
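The grouping and two-stage compression can be sketched as a toy model. This is not agc's actual code: the k-mer length `K`, the `BATCH_SIZE`, and all function names are hypothetical, `difflib` stands in for the LZ encoder, and `zlib` stands in for zstd so the sketch needs only the standard library.

```python
import json
import zlib
from difflib import SequenceMatcher

K = 3           # toy k-mer length (assumption, far shorter than a real k-mer)
BATCH_SIZE = 2  # LZ-encoded segments per compressed batch (assumption)

def lz_encode(ref, seg):
    """Describe seg as copies from ref plus literal text (stand-in for LZ encoding)."""
    ops = []
    matcher = SequenceMatcher(None, ref, seg, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2 - i1])   # position and length in ref
        elif j2 > j1:                           # "replace" or "insert"
            ops.append(["lit", seg[j1:j2]])
    return ops

def lz_decode(ref, ops):
    """Rebuild a segment from the reference and its copy/literal operations."""
    return "".join(
        ref[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1] for op in ops
    )

def build_groups(segments):
    """Group segments by (first k-mer, last k-mer), keeping arrival order."""
    groups = {}
    for seg in segments:
        groups.setdefault((seg[:K], seg[-K:]), []).append(seg)
    return groups

def compress_group(members):
    """The first member is the reference; the rest are LZ-encoded, batched, compressed."""
    ref = members[0]
    encoded = [lz_encode(ref, seg) for seg in members[1:]]
    batches = [
        zlib.compress(json.dumps(encoded[i:i + BATCH_SIZE]).encode())  # zlib stands in for zstd
        for i in range(0, len(encoded), BATCH_SIZE)
    ]
    return {"ref": ref, "batches": batches}

def decompress_group(group):
    """Undo both stages and return the group's segments in their original order."""
    segs = [group["ref"]]
    for blob in group["batches"]:
        for ops in json.loads(zlib.decompress(blob)):
            segs.append(lz_decode(group["ref"], ops))
    return segs

segments = ["ACGTTTGACCA", "ACGTATGACCA", "GGATTTCCTAC", "ACGTTTGCCCA", "ACGTTAGACCA"]
groups = build_groups(segments)
archive = {key: compress_group(members) for key, members in groups.items()}
```

Every non-reference segment depends on its group's reference, and every batch holds several segments, which is why touching a single segment affects more than that segment.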

Let's consider what happens when a single segment needs to be removed. If it is the reference segment for a group, all other segments (from other samples) must be decompressed, a new reference must be selected, all other segments must be LZ-parsed against the new reference and, finally, after being divided into new batches, zstd-compressed. If the segment to remove is not a reference, it is a bit simpler: it suffices to zstd-decompress the batches in the group, remove the segment, form new batches, and zstd-compress them. In both cases, it would be necessary to decompress the metadata describing the collection and carefully update the segment ids, because the ids of all segments after the removed one will change.
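The two removal cases can be illustrated on a single toy group. This is a sketch under stated assumptions, not agc's implementation: `difflib` stands in for the LZ encoder, `zlib` for zstd, all names and `BATCH_SIZE` are hypothetical, and the metadata and segment-id updates are not modeled.

```python
import json
import zlib
from difflib import SequenceMatcher

BATCH_SIZE = 2  # LZ-encoded segments per compressed batch (assumption)

def lz_encode(ref, seg):
    """Describe seg as copies from ref plus literal text (stand-in for LZ encoding)."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, seg, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2 - i1])
        elif j2 > j1:
            ops.append(["lit", seg[j1:j2]])
    return ops

def lz_decode(ref, ops):
    return "".join(ref[o[1]:o[1] + o[2]] if o[0] == "copy" else o[1] for o in ops)

def make_batches(encoded):
    """Split encoded segments into batches and compress each (zlib stands in for zstd)."""
    return [
        zlib.compress(json.dumps(encoded[i:i + BATCH_SIZE]).encode())
        for i in range(0, len(encoded), BATCH_SIZE)
    ]

def read_encoded(group):
    """Undo only the outer compression; the segments stay LZ-encoded."""
    return [ops for blob in group["batches"] for ops in json.loads(zlib.decompress(blob))]

def pack(members):
    ref = members[0]
    return {"ref": ref, "batches": make_batches([lz_encode(ref, s) for s in members[1:]])}

def unpack(group):
    return [group["ref"]] + [lz_decode(group["ref"], ops) for ops in read_encoded(group)]

def remove_segment(group, index):
    """Remove member `index` (0 is the reference).

    Returns (new_group, number_of_segments_that_had_to_be_LZ_parsed_again).
    """
    if index == 0:
        # Reference removed: fully decode the group, promote the next
        # segment to reference, and LZ-parse everything else again.
        members = unpack(group)[1:]
        if not members:
            return None, 0
        return pack(members), len(members) - 1
    # Non-reference removed: the LZ encodings stay valid, only re-batch.
    encoded = read_encoded(group)
    del encoded[index - 1]
    return {"ref": group["ref"], "batches": make_batches(encoded)}, 0

members = ["ACGTTTGACCA", "ACGTATGACCA", "ACGTTTGCCCA", "ACGTTAGACCA"]
group = pack(members)
without_ref, reparsed_ref = remove_segment(group, 0)
without_third, reparsed_other = remove_segment(group, 2)
```

In this toy, removing the reference forces a new LZ parse of every remaining non-reference segment, while removing any other segment only rewrites the compressed batches.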

Summing up, this is not easy. I'm adding this feature to my TODO list. I hope we will be able to implement it in the future, but not in the forthcoming v.3.1 release.

There is also one more problem. Currently, you can construct an archive containing sample A, then add sample B, then C, then D, then E. The resulting archive will be exactly the same (if you do not use command-line storage) as one constructed from samples A+B+C+D+E in a single run. Unfortunately, if the sample removal described above were implemented and you compared the archive built from A+B+C+E with the one built from A+B+C+D+E with D then removed, you would notice differences. The archives would contain exactly the same data, but their binary representations would differ. I do not know if this is an important issue. It could be solved, but only by the equivalent of decompressing the whole archive, removing one sample, and compressing from scratch. That would take time.
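One practical consequence is that two such archives would have to be compared by decoded content rather than by file bytes. The sketch below shows the idea on a toy container (zlib-compressed JSON, not the agc format); the differing record order is only a stand-in for whatever layout difference an in-place removal might leave behind, and all names are hypothetical.

```python
import hashlib
import json
import zlib

def pack(records):
    """Serialize (sample, sequence) records in the given order and compress them."""
    return zlib.compress(json.dumps(records).encode())

def unpack(blob):
    return [tuple(rec) for rec in json.loads(zlib.decompress(blob))]

def byte_digest(blob):
    """Hash of the stored bytes: sensitive to layout."""
    return hashlib.sha256(blob).hexdigest()

def content_digest(blob):
    """Hash of the decoded samples in sorted order: insensitive to layout."""
    h = hashlib.sha256()
    for name, seq in sorted(unpack(blob)):
        h.update(name.encode() + b"\0" + seq.encode() + b"\0")
    return h.hexdigest()

# Same four samples, stored in a different internal order.
built_fresh = pack([["A", "ACGTACGT"], ["B", "ACGTACGA"], ["C", "ACGAACGT"], ["E", "TTGTACGT"]])
after_removal = pack([["A", "ACGTACGT"], ["C", "ACGAACGT"], ["B", "ACGTACGA"], ["E", "TTGTACGT"]])

same_bytes = byte_digest(built_fresh) == byte_digest(after_removal)
same_content = content_digest(built_fresh) == content_digest(after_removal)
```

Here `same_bytes` is false while `same_content` is true, which mirrors the "same data, different binary representation" situation.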

@niemasd
Author

niemasd commented Mar 4, 2024

Thank you, this is super helpful context! I appreciate the thorough and thoughtful details, and they help me build a better mental model of how agc works.

> There is also one more problem. Currently, you can construct an archive containing sample A, then add sample B, then C, then D, then E. The resulting archive will be exactly the same (if you do not use command-line storage) as one constructed from samples A+B+C+D+E in a single run. Unfortunately, if the sample removal described above were implemented and you compared the archive built from A+B+C+E with the one built from A+B+C+D+E with D then removed, you would notice differences. The archives would contain exactly the same data, but their binary representations would differ. I do not know if this is an important issue. It could be solved, but only by the equivalent of decompressing the whole archive, removing one sample, and compressing from scratch. That would take time.

I think this specific issue (same data, different representation) is probably tolerable, judging by current state-of-the-art databases. For example, if I'm understanding correctly, SQLite runs into a similar issue (similar from the user's perspective; I imagine quite different under the hood) when a user repeatedly removes and adds many entries. Performance degrades and the file becomes artificially large due to fragmented data, which is why SQLite provides the VACUUM command to rebuild the database. I can imagine similar functionality in agc (e.g. "rebuild"): a time-consuming operation the user could choose to run (infrequently), which performs agc compression on the entire dataset from scratch.
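The SQLite behavior mentioned above can be observed with Python's standard `sqlite3` module. This is a small sketch: the table name, row contents, and row counts are arbitrary choices, and `auto_vacuum` is explicitly switched off so that deleted pages stay in the file until `VACUUM` runs.

```python
import os
import sqlite3
import tempfile

def page_stats(conn):
    """Return (total pages in the file, pages sitting unused on the freelist)."""
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    free = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return pages, free

def vacuum_demo(n_rows=2000):
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
        conn.execute("PRAGMA auto_vacuum = NONE")  # must be set before tables exist
        conn.execute("CREATE TABLE genomes (id INTEGER PRIMARY KEY, seq TEXT)")
        conn.executemany(
            "INSERT INTO genomes (seq) VALUES (?)",
            (("ACGT" * 250,) for _ in range(n_rows)),
        )
        conn.commit()
        full = page_stats(conn)

        # Delete three quarters of the rows: the file keeps its size,
        # and the emptied pages move to the freelist.
        conn.execute("DELETE FROM genomes WHERE id > ?", (n_rows // 4,))
        conn.commit()
        after_delete = page_stats(conn)

        # VACUUM rebuilds the database file and drops the unused pages.
        conn.execute("VACUUM")
        after_vacuum = page_stats(conn)
        conn.close()
    return full, after_delete, after_vacuum

full, after_delete, after_vacuum = vacuum_demo()
```

The page count stays the same after the delete and only drops after `VACUUM`, which is the kind of explicit, occasional rebuild step suggested here.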

Either way, I'll keep following your awesome work, and I look forward to trying out the new updates 😄

@sebastiandeorowicz
Member

AGC 3.1 is ready. At the moment, removing/updating samples is not implemented. We are considering a provisional implementation in one of the next releases.

@niemasd
Author

niemasd commented Mar 18, 2024

Thanks for the update! Looking forward to it 😄
