This document aims to describe the Pco format exactly.
All values encoded are unsigned integers. All bit packing (and thus integer encoding) is done in a little-endian fashion. Bit packing a component is completed by filling the rest of the byte with 0s.
Let dtype_size
be the data type's number of bits.
A "raw" value for a number is a dtype_size
value that maps to the number
via its from_unsigned
function.
The wrapped format consists of 3 components: header, chunk metadata, and data pages. Wrapping formats may encode these components any place they wish.
Pco is designed to have one header per file, possibly multiple chunks per header, and possibly multiple data pages per chunk.
The Pco format is versioned. It is typically expected that changes are made in a backwards-compatible way, so that decompressors can decompress any version of Pco up to their own version. This enables Pco to make small format changes in the future if necessary. The header simply consists of
- [8 bits] the format version
So far, these format versions exist:
format version | first Rust version | deviations from next format version |
---|---|---|
0 | 0.0.0 | int mult mode unsupported |
1 | 0.1.0 | float quant mode and 16-bit types unsupported |
2 | 0.3.0 | - |
It is expected that decompressors raise corruption errors if any part of metadata is out of range. For example, if the sum of bin weights does not equal the tANS size; or if a bin's offset bits exceed the data type size.
Each chunk metadata consists of
-
[4 bits]
mode
, using this table:| value | mode | n latent variables | 2nd latent uses delta? |
extra_mode_bits
| |-------|--------------|--------------------|------------------------|-------------------| | 0 | classic | 1 | | 0 | | 1 | int mult | 2 | no |dtype_size
| | 2 | float mult | 2 | no |dtype_size
| | 3 | float quant | 2 | no | 8 | | 4-15 | <reserved> | | | | -
[
extra_mode_bits
bits] for certain modes, extra data is parsed. See the mode-specific formulas below for how this is used, e.g. as themult
ork
values. -
[3 bits] the delta encoding order
delta_order
. -
per latent variable,
- [4 bits]
ans_size_log
, the log2 of the size of its tANS table. This may not exceed 14. - [15 bits] the count of bins
- per bin,
- [
ans_size_log
bits] 1 less thanweight
, this bin's weight in the tANS table - [
dtype_size
bits] the lower bound of this bin's numerical range, encoded as a raw value. - [
log2(dtype_size) + 1
bits] the number of offset bits for this bin e.g. for a 64-bit data type, this will be 7 bits long.
- [
- [4 bits]
Based on chunk metadata, 4-way interleaved tANS decoders should be initialized
using
the simple spread_state_tokens
algorithm from this repo.
If there are n
numbers in a data page, it will consist of ceil(n / 256)
batches. All but the final batch will contain 256 numbers, and the final
batch will contain the rest (<= 256 numbers).
Each data page consists of
- per latent variable,
- if delta encoding is applicable, for
i in 0..delta_order
,- [
dtype_size
bits] thei
th delta moment
- [
- for
i in 0..4
,- [
ans_size_log
bits] thei
th interleaved tANS state index
- [
- if delta encoding is applicable, for
- [0-7 bits] 0s until byte-aligned
- per batch of
k
numbers,- per latent variable,
- for
i in 0..k
,- [tANS state
i % 4
's bits] tANS encoded bin idx for thei
th latent. Store the bin asbin[i]
. Asymmetric Numeral System links: original paper, blog post explanation.
- [tANS state
- for
i in 0..k
,- [
bin[i].offset_bits
bits] offset fori
th latent
- [
- for
- per latent variable,
The standalone format is a minimal implementation of the wrapped format. It consists of
- [32 bits] magic header (ASCII for "pco!")
- [8 bits] standalone version
- [6 bits] 1 less than
n_hint_log2
- [
n_hint_log2
bits] the total count of numbers in the file, if known; 0 otherwise - [0-7 bits] 0s until byte-aligned
- a wrapped header
- per chunk,
- [8 bits] a byte for the data type
- [24 bits] 1 less than
chunk_n
, the count of numbers in the chunk - a wrapped chunk metadata
- a wrapped data page of
chunk_n
numbers
- [8 bits] a magic termination byte (0).
Based on the mode, unsigneds are decomposed into latents.
Let l0
and l1
be the primary and secondary latents respectively.
Let MID
be the middle value for the latent type (e.g. 2^31 for u32
).
mode | decoding formula |
---|---|
classic | from_latent_ordered(l0) |
int mult | from_latent_ordered(l0 * mult + l1) |
float mult | int_float_from_latent(l0) * mult + (l1 + MID) ULPs |
float quant | from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1) |
Here ULP refers to unit in the last place.
Each data type has an order-preserving bijection to an unsigned data type. For instance, floats have their first bit toggled, and the rest of their bits bits toggled if the float was originally negative:
fn from_unsigned(unsigned: u32) -> f32 {
if unsigned & (1 << 31) > 0 {
// positive float
f32::from_bits(unsigned ^ (1 << 31))
} else {
// negative float
f32::from_bits(!unsigned)
}
}
Signed integers have an order-preserving wrapping addition and wrapping conversion like so:
fn from_unsigned(unsigned: u32) -> i32 {
i32::MIN.wrapping_add(unsigned as i32)
}
Latents are converted to deltas by taking consecutive differences
delta_order
times, and decoded by taking a cumulative sum repeatedly.
Delta moments are emitted during encoding and consumed during decoding to
initialize the cumulative sum.
For instance, with 2nd order delta encoding, the delta moments [1, 2]
and the deltas [0, 10, 0]
would decode to the latents [1, 3, 5, 17, 29]
.
To dissect the deltas, we find the bin that contains each delta x
and compute
its offset as x - bin.lower
.
For instance, suppose we have these bins, where we have compute the upper bound
for convenience:
bin idx | lower | offset bits | upper (inclusive) |
---|---|---|---|
0 | 7 | 2 | 10 |
1 | 10 | 3 | 17 |
Then 8 would be in bin 0 with offset 1, and 15 would be in bin 1 with offset 5. 10 could be encoded either as bin 0 with offset 3 or bin 1 with offset 0.