Proposal: alternative decompression interface #5

Open
DrMcCoy opened this issue Jan 27, 2019 · 3 comments

@DrMcCoy
Owner

DrMcCoy commented Jan 27, 2019

A proposal for an alternative decompression interface that doesn't collect all files on opening, before any decompression takes place. This would be useful when the whole archive is meant to be decompressed linearly, especially from a medium where skipping through the whole archive just to find all the files is too expensive.

  • dmc_unrar_archive_open_*_linear() functions to mirror the usual dmc_unrar_archive_open_*() functions, that don't fill in the file structures in the archive (and therefore, dmc_unrar_get_file_count() will return 0 for these).
  • A const dmc_unrar_file *dmc_unrar_next_file(dmc_unrar_archive *archive, const dmc_unrar_file *file) function that reads in one additional file block and returns the usual stats structure. If given NULL as the file parameter, it reads the first one. If NULL is returned, no further files are in the archive.
  • dmc_unrar_get_filename_file(), dmc_unrar_file_is_directory_file(), dmc_unrar_extract_file_to_*_file() functions that take a const dmc_unrar_file * instead of an index and do the usual.
  • (Additionally, since dmc_unrar_next_file() will grow the internal structures within the dmc_unrar_archive, the functions taking indices should still work on the files found so far. Also, a call to dmc_unrar_next_file() will invalidate any previously returned const dmc_unrar_file * pointers.)
  • struct dmc_unrar_file will be expanded to contain some internal pointer to the archive, and also more user-readable fields like the index, offset within the file (for an estimation on the archive extraction progress).
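To make the proposed call pattern concrete, here is a minimal sketch in C. None of it is real dmc_unrar code: the `sketch_*` types and functions are hypothetical stand-ins, and the "archive" is just an in-memory table of file blocks. It only illustrates the semantics listed above: the file count starts at 0, passing NULL returns the first entry, NULL is returned at the end, and each call reuses the same storage so older pointers go stale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the expanded stats structure (NOT the real dmc_unrar_file). */
typedef struct sketch_file {
    size_t      index;   /* position of this entry within the archive */
    uint64_t    offset;  /* byte offset of the entry's block, usable for progress estimates */
    uint64_t    size;    /* uncompressed size */
    const char *name;
} sketch_file;

/* Hypothetical stand-in for a linearly opened archive. */
typedef struct sketch_archive {
    const sketch_file *blocks;      /* pretend medium: one record per file block */
    size_t             block_count; /* how many blocks the medium holds */
    size_t             found;       /* how many blocks have been read so far */
    sketch_file        current;     /* reused on every call, so earlier pointers are invalidated */
} sketch_archive;

/* Open without scanning: no file block is read yet, so the file count is 0. */
void sketch_open_linear(sketch_archive *a, const sketch_file *blocks, size_t n) {
    assert(a != NULL);
    a->blocks = blocks;
    a->block_count = n;
    a->found = 0;
    memset(&a->current, 0, sizeof a->current);
}

/* Counts only the entries discovered so far. */
size_t sketch_get_file_count(const sketch_archive *a) {
    return a->found;
}

/* Reads one more file block. NULL as `file` yields the first entry;
 * a NULL return value means there are no further entries. */
const sketch_file *sketch_next_file(sketch_archive *a, const sketch_file *file) {
    assert(a != NULL);
    size_t next = (file == NULL) ? 0 : file->index + 1;
    if (next >= a->block_count)
        return NULL;
    a->current = a->blocks[next];
    a->current.index = next;
    if (next + 1 > a->found)
        a->found = next + 1;
    return &a->current;
}

/* Example walk: visit every entry front to back and sum the uncompressed sizes. */
uint64_t sketch_total_size(sketch_archive *a) {
    uint64_t total = 0;
    const sketch_file *f = NULL;
    while ((f = sketch_next_file(a, f)) != NULL)
        total += f->size;
    return total;
}
```

In a real implementation the body of the loop would call the proposed extraction function with the returned pointer before asking for the next entry, since that pointer does not survive the following call.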

@fasterthanlime, does that sound reasonable and like something you could use? Or am I going off in a totally wrong direction?

@fasterthanlime
Contributor

I was hesitant to open a similar issue just a few minutes ago!

What you outlined fits our use case exactly, with one important omission: extraction should be pausable/resumable, so it's important that the data structures be serializable.

Here's how resumable decompressors operate in butler:

  • Every N seconds, they write a "savepoint" to disk
    • This includes various file offsets (input, output)
    • ...along with entry information, both current and all entries so far (if the archive format doesn't have a "central directory")
    • and decompressor state
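As a rough illustration of the savepoint idea, the sketch below shows a tiny savepoint in C with a fixed little-endian byte encoding, so it can be written to disk and read back on any platform. This is not butler's or savior's actual format (those are written in Go); the struct, its fields, and the `savepoint_*` functions are assumptions made up for this example, and the entry list and decompressor state are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical savepoint: field names and layout are illustrative only. */
typedef struct savepoint {
    uint64_t input_offset;  /* how far into the archive we have read */
    uint64_t output_offset; /* how many bytes of the current entry were written */
    uint32_t entries_done;  /* number of entries fully extracted so far */
} savepoint;

#define SAVEPOINT_BYTES 20u /* 8 + 8 + 4 bytes, no padding */

/* Store the low n bytes of v in little-endian order. */
static void put_le(uint8_t *p, uint64_t v, int n) {
    for (int i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Load n little-endian bytes into an integer. */
static uint64_t get_le(const uint8_t *p, int n) {
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
size_t savepoint_encode(const savepoint *sp, uint8_t *buf, size_t cap) {
    assert(sp != NULL && buf != NULL);
    if (cap < SAVEPOINT_BYTES)
        return 0;
    put_le(buf, sp->input_offset, 8);
    put_le(buf + 8, sp->output_offset, 8);
    put_le(buf + 16, sp->entries_done, 4);
    return SAVEPOINT_BYTES;
}

/* Returns 1 on success, 0 if the input is too short to hold a savepoint. */
int savepoint_decode(savepoint *sp, const uint8_t *buf, size_t len) {
    assert(sp != NULL && buf != NULL);
    if (len < SAVEPOINT_BYTES)
        return 0;
    sp->input_offset = get_le(buf, 8);
    sp->output_offset = get_le(buf + 8, 8);
    sp->entries_done = (uint32_t)get_le(buf + 16, 4);
    return 1;
}
```

Encoding field by field, rather than dumping the struct's memory, keeps the on-disk form independent of compiler padding and byte order.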

Some examples of checkpoint structs (in Golang, sorry..), from simplest to hairiest:

These are usually saved as part of a larger structure: for .tar.gz and .tar.bz2, for example, there's tarextractor:

https://github.com/itchio/savior/blob/fa53ef6e95620d2f5583580af460386f1fcb7190/tarextractor/tarextractor.go#L22-L25

(*savior.ExtractResult gets filled progressively with entries containing path, size, permissions, etc. - this is what I need to keep track of).

In dmc_unrar's case, looking at the complexity of actual file decompression, I fear it might be unreasonable to shoot for ResumeSupportBlock, see https://github.com/itchio/savior/blob/a9f8c3af201ef807ef6107294cb36bc3893bb02e/extractor.go#L85-L95 - but ResumeSupportEntry might be easy to achieve with the interface you suggested.

There might not even be a need to save that much internal dmc_unrar data. It should be the minimum info that lets dmc_unrar resume from a given entry. From my understanding, that would be:

  • what the overall RAR format is (maybe info extracted from some headers?)
  • what offset the last entry finished at
  • that's it?

So, for example, const dmc_unrar_file *dmc_unrar_next_file(dmc_unrar_archive *archive, const dmc_unrar_file *file) works great if you do streaming/linear decompression in one execution. But if you have to stop and resume from a disk checkpoint, you have no dmc_unrar_file *file to pass.

Do you see what I'm getting at? Hopefully this is not too much!

@DrMcCoy
Owner Author

DrMcCoy commented Jan 27, 2019

Hmm, yes, makes sense.

And yes, continuing from the middle of a file would be really complex. Restarting from the beginning of a file within the archive sounds feasible, though.

It should be possible to extract a few key integer values (offset, mostly, yeah) from the internal state given a dmc_unrar_archive *archive, const dmc_unrar_file *file pair, which you can then squirrel away. Another function would then reconstruct some of the internal state from these values and a newly opened linear archive and spit out the last const dmc_unrar_file *file.
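A sketch of what that pair of functions could look like, in C. Everything here is hypothetical: `lin_archive`, `lin_file`, `resume_point` and both functions are invented for illustration and are not part of dmc_unrar. The `format` field stands for whatever format tag the archive headers yield. The idea shown is only the shape of the exchange: pull a few plain integers out of an (archive, file) pair, then later validate them against a freshly opened linear archive and rebuild the "last file" handle from them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-entry handle (NOT the real dmc_unrar_file). */
typedef struct lin_file {
    uint64_t index; /* position of the entry in the archive */
    uint64_t start; /* offset where the entry's block begins */
    uint64_t end;   /* offset just past the entry's data */
} lin_file;

/* Hypothetical linearly opened archive. */
typedef struct lin_archive {
    uint32_t format;   /* format tag detected from the archive headers */
    uint64_t length;   /* total archive size in bytes */
    lin_file last;     /* most recently returned entry */
    int      has_last; /* nonzero once `last` is valid */
} lin_archive;

/* The few integers a caller would squirrel away in its own checkpoint. */
typedef struct resume_point {
    uint32_t format;
    uint64_t index;
    uint64_t start;
    uint64_t end;
} resume_point;

/* Extract the key values from an (archive, file) pair.
 * Returns 1 on success, 0 if there is no file to record. */
int lin_save_position(const lin_archive *a, const lin_file *f, resume_point *rp) {
    assert(a != NULL && rp != NULL);
    if (f == NULL)
        return 0;
    rp->format = a->format;
    rp->index = f->index;
    rp->start = f->start;
    rp->end = f->end;
    return 1;
}

/* Rebuild the "last file" handle inside a newly opened linear archive.
 * Returns NULL if the saved values cannot belong to this archive. */
const lin_file *lin_restore_position(lin_archive *a, const resume_point *rp) {
    assert(a != NULL && rp != NULL);
    if (rp->format != a->format)
        return NULL; /* different format than when the point was saved */
    if (rp->start > rp->end || rp->end > a->length)
        return NULL; /* offsets fall outside this archive */
    a->last.index = rp->index;
    a->last.start = rp->start;
    a->last.end = rp->end;
    a->has_last = 1;
    return &a->last;
}
```

The restored pointer would then be handed to the proposed next-file call, which would continue reading at the saved end offset. The validation step matters because a checkpoint on disk can outlive the archive it was taken from.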

Does that sound good?

@fasterthanlime
Contributor

> Does that sound good?

Yep, that sounds reasonable!
