Crackle Format v1 (2?) #30

Open · 2 of 3 tasks
william-silversmith opened this issue Jan 3, 2025 · 0 comments
Labels: enhancement (New feature or request)

william-silversmith (Contributor) commented Jan 3, 2025

I should sit on this one for a little while as I try to make sure all the improvements stack into this update.

Possible improvements

  • Increase the label bytes field from uint32 to uint64. I have already encountered this limitation.
  • Add CRC checks
  • Extend format bitfield

Scratchpad for ideas

Features that would make the archival of large datasets safer.

  • Add crack code bytes field (there is already the crack index)

Now that pins are practical, are there ways to make using them on large datasets more efficient?

  • It should be possible to make pin decoding lower memory, but that's an implementation detail not a format change.

  • It would be possible to add a parity bit to the existing format without updating the format ID. This would enable detection of any odd number of bit flips (1, 3, ...) but not even numbers (2, 4, ...); a sketch follows this list.

  • Should the format bitfield be extended? There are two bits left.
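
As a minimal sketch of the parity idea (the function name is mine, not part of the format), the whole binary can be XOR-folded down to a single bit and stored in a spare bit:

```python
def parity_bit(data: bytes) -> int:
    """Even parity over the whole buffer: flips if an odd number of
    bits flip anywhere, but is blind to even counts (2, 4, ...)."""
    acc = 0
    for byte in data:
        acc ^= byte    # XOR-fold every byte into one accumulator
    acc ^= acc >> 4    # fold the remaining 8 bits down to 1
    acc ^= acc >> 2
    acc ^= acc >> 1
    return acc & 1
```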

CRC32

There is a fundamental conflict in some of the design goals here. For example, if a user calls "remap" and changes the coloring, do we have to decompress the changed slices and recompute the CRC? That defeats the purpose of having a fast remap.

What if we just checksummed the labels region and each VCG intermediate? The VCG is entirely internal, so pinning a CRC to it would prevent us from ever changing its representation, and a valid VCG is still no guarantee that the output is interpreted correctly (though it is somewhat more of a guarantee than the raw crack code bitstream). Another option is computing the CRC of the CCL output, which is consistent regardless of the label mapping.
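
As a sketch of the CCL option, assuming the cc3d package for multi-label connected components and that the CCL pass assigns component IDs deterministically in scan order, the CRC survives a remap because relabeling doesn't change which voxels group together:

```python
import zlib
import numpy as np
import cc3d  # assumed: multi-label CCL (seung-lab/connected-components-3d)

def slice_ccl_crc(slice_2d: np.ndarray) -> int:
    # The CCL output depends only on which voxels share a label,
    # not on the label values themselves.
    ccl = cc3d.connected_components(slice_2d, connectivity=4)
    return zlib.crc32(np.ascontiguousarray(ccl).tobytes())

labels = np.array([[1, 1, 2],
                   [1, 3, 2]], dtype=np.uint32)
remap = np.array([0, 7, 9, 8], dtype=np.uint32)  # 1->7, 2->9, 3->8
assert slice_ccl_crc(labels) == slice_ccl_crc(remap[labels])
```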

It seems more feasible to have CRCs of the crackle binary than of the final output.

But if we compute the CRC of the whole crackle binary, what if the binary is 50GB? That's a lot of extra computation for e.g. remapping a single ID. On the other hand, if we protect only the crack codes, then for a small file (e.g. 512x512x512 at ~3e6 bytes) the per-slice CRCs come to 4 bytes * 512 = 2048 bytes, so about 0.07% of the file is susceptible to false positives (a corrupted CRC flagging valid data).

  • Add CRC32 integrity check field (good for up to 16GB files?).
    • Should CRC32 be of compressed or decompressed data?
    • Adding a CRC is impractical unless a very fast implementation (tens of GB/sec) is available, since it can more than double decompression time: I measured this on flywire data with a 1 GB/sec hash. A second measurement in C++ with Google's CRC32 code gave about 4 GB/sec, adding about 10% to compression time (compiling in Release mode increased this to 23 GB/sec). The cost looms so large because the compression ratio for this data is so extreme. I tested https://github.com/corsix/fast-crc32 on an arm64 MacBook Pro (M3) and got a maximum speed of 98.2 GB/sec, about two orders of magnitude better, which would make the check nearly unnoticeable even on huge data.
    • CRC should be for the uncompressed data, so you know the decoder didn't mess up (or there wasn't a hardware glitch).
    • For data at rest, one could rely on an outer container with an efficient integrity check, such as lzip, which has nice recovery semantics. However, crackle is intended to sit on disk for potentially long periods of time while being accessed at random, so the format should carry its own protection.
    • It may not be possible to efficiently calculate the CRC of an entire dataset, as these files can represent arrays much larger than RAM (or even disk!), and decompressing the whole dataset can take a very long time (e.g. 4 sec * 7000 slices ≈ 7.8 hrs) even if done slice by slice. Storing one CRC for the whole dataset also makes it impossible to z-stack slices.
    • Storing one CRC per slice allows recombining the data efficiently and tells you when a single slice is damaged, leaving the rest of the data intact. This has implications for lateral merging, though, and it increases the false positive exposure across all slices (4 bytes * 7000 files = 28 kB of vulnerability).
    • Does the header need to be independently protected? Yes, the header contains the instructions for finding the per-slice CRCs.
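
A sketch of the per-slice scheme, assuming the crc32c PyPI package for hardware-accelerated CRC32c (slow pure-software fallbacks are the commodity case mentioned above):

```python
import numpy as np
import crc32c  # assumed: hardware-accelerated CRC32c bindings

def per_slice_crc32c(volume: np.ndarray) -> list[int]:
    """One CRC32c per z-slice of the decompressed volume. A damaged
    slice can be isolated while the rest stays usable, and z-stacking
    two files just concatenates their CRC lists."""
    return [
        crc32c.crc32c(np.ascontiguousarray(volume[:, :, z]).tobytes())
        for z in range(volume.shape[2])
    ]
```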

It's important to recognize that crackle is an array format, not a streaming format (or should it be a stream of 2D grids?).

Candidate CRC Proposal

Design goals:

(A) CRC calculation should be a small, negligible fraction of the total compression or decompression time even for very large files.
(B) Allow error detection such that the corrupt data region can be identified while minimizing false positives (corrupt CRC but valid data).
(C) Allow decompressing valid slices if possible.
(D) The design should allow for dynamic splitting and merging of z slices without significant recalculation.
(E) It should allow for relabeling without having to recalculate the CRC for the entire volume or decompress the cracks.

Example:

   example.ckl is corrupted, slice 53 may not be accurate.
   example.ckl header is corrupted.
   example.ckl label mapping is corrupted. (In some cases, a forbidden label (particularly mutation to a high bit in the CC keys) may give away the corruption location.)
   example.ckl crack index is corrupted. (This could potentially be recovered automatically by parsing the crack codes.)
  • Use CRC32c with fastest possible codec (70-100 GB/sec on common machines, but commodity implementations may only be 1 GB/sec!)
  • A CRC16 in a fixed position (end of header? end of file?) protects the header, which contains "sz", needed for locating the other CRCs.
  • A CRC16 protects the crack index.
  • sz CRC32s (end of file? end of each crack code? after the crack index?) computed on the CCL of each z slice provide a stable target for validation.
  • 1 CRC32 protects the labels region and is adjusted whenever labels are changed. (Why not one per z slice? For one, that is not compatible with pins.)
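
One possible serialization of this proposal, as a sketch only; the field order and the choice to group everything in one trailing block are my assumptions, not settled format decisions:

```python
import struct

def pack_crc_block(header_crc16: int, crack_index_crc16: int,
                   labels_crc32: int, slice_crc32s: list[int]) -> bytes:
    # 2 + 2 + 4 + 4*sz bytes = 4*(sz + 2) bytes, matching the total below.
    sz = len(slice_crc32s)
    return struct.pack(
        f"<HHI{sz}I",
        header_crc16, crack_index_crc16, labels_crc32, *slice_crc32s,
    )
```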

Total: 4 * (1 + sz) + 2 * 2 bytes = 4 * (sz + 2) bytes.
For 7000 sections: 4 * 7002 = 28008 bytes (out of many GB)

Let's say there is a single bit flip in a 50e9 bit (~6 GB) file; the probability that it lands in a CRC field (a false positive) is (28008 bytes * 8) / 50e9 bits ≈ 0.0004%.

What if it's a 256x256x64 file? 4 * (64 + 2) = 264 bytes.
264 bytes in a ~20kB file gives a false positive exposure of about 1.3%, which is relatively high. Tiny files are likely to be numerous.
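
The exposure arithmetic from the last few paragraphs as a small helper (the function name is mine):

```python
def crc_false_positive_exposure(sz: int, file_bytes: float) -> float:
    """Fraction of the file occupied by CRC fields, i.e. the chance a
    uniformly random single bit flip lands in a CRC rather than data."""
    return 4 * (sz + 2) / file_bytes

print(crc_false_positive_exposure(7000, 50e9 / 8))  # ~4.5e-06 (~0.0004%)
print(crc_false_positive_exposure(64, 20e3))        # ~1.3e-02 (~1.3%)
```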

william-silversmith added the enhancement label Jan 5, 2025
william-silversmith changed the title from Potential Updates to Crackle Format to Crackle Format v1 Jan 6, 2025
william-silversmith changed the title from Crackle Format v1 to Crackle Format v1 (2?) Jan 10, 2025