This blog post is one of a seven part series.
- Part 1: Concepts
- Part 2: Structure
- Part 3: Bitstreams
- Part 4: Huffman Codes
- Part 5: Finite State Entropy Codes
- Part 6: Sequences
- Part 7: Dictionaries
Zstd or Zstandard (RFC 8478, first released in 2015) is a popular modern compression algorithm. It's smaller (better compression ratio) and faster than the ubiquitous Zlib/Deflate (RFC 1950 and RFC 1951, first released in 1995).
The production quality reference implementation is open source and its repository also includes a second educational decoder implementation, emphasizing clarity instead of performance. The file format specification (RFC 8478) is also freely available and well written.
Nonetheless, I like to learn a file format by studying a worked example at the bits and bytes level. I searched but couldn't find one, so I dissected my own zst file and wrote it up as this blog post series.
We'll use this romeo.txt
file for our original input.
$ cat romeo.txt
Romeo and Juliet
Excerpt from Act 2, Scene 2
JULIET
O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO
[Aside] Shall I hear more, or shall I speak at this?
JULIET
'Tis but thy name that is my enemy;
Thou art thyself, though not a Montague.
What's Montague? it is nor hand, nor foot,
Nor arm, nor face, nor any other part
Belonging to a man. O, be some other name!
What's in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call'd,
Retain that dear perfection which he owes
Without that title. Romeo, doff thy name,
And for that name which is no part of thee
Take all myself.
ROMEO
I take thee at thy word:
Call me but love, and I'll be new baptized;
Henceforth I never will be Romeo.
JULIET
What man art thou that thus bescreen'd in night
So stumblest on my counsel?
Applying Zstandard compression (with the default settings) produces
romeo.txt.zst
, 59% of the original file size.
$ zstd romeo.txt
$ ls -l romeo.txt* | awk '{print $5 " " $9}'
942 romeo.txt
559 romeo.txt.zst
Like Zlib and many other compression algorithms (but unlike e.g. Bzip2), a key mechanic in Zstandard has roots in Lempel and Ziv's influential LZ77 approach. It partitions the input bytes into two categories: literals and matches (matches are also called copies or back references). Encoding reduces the input to a series of operations from those two categories. Decoding executes those ops and each op emits some bytes (as well as updating other state). Concatenating those emissions reproduce the original decompressed bytes.
Here's the romeo.txt
partition of 942 input bytes as 551 literal and 391
match bytes. The '\n' new line bytes in the original text have been replaced by
'@' at signs to help distinguish them from ' ' spaces and '.' periods.
Offset Input Literals Matches
00000000 |Romeo and Juliet| |Romeo and Juliet| |----------------|
00000010 |@Excerpt from Ac| |@Excerpt from Ac| |----------------|
00000020 |t 2, Scene 2@@JU| |t 2, Scene 2@@JU| |----------------|
00000030 |LIET@O Romeo, Ro| |LIET@O -----,---| |-------00000-111|
00000040 |meo! wherefore a| |---! wherefore a| |111-------------|
00000050 |rt thou Romeo?@D| |rt thou------?@D| |-------222222---|
00000060 |eny thy father a| |eny thy father--| |--------------33|
00000070 |nd refuse thy na| |---refuse-----na| |333------44444--|
00000080 |me;@Or, if thou | |me;@Or, if------| |----------555555|
00000090 |wilt not, be but| |wilt not, be but| |----------------|
000000a0 | sworn my love,@| | sworn my love,@| |----------------|
000000b0 |And I'll no long| |And I'll no long| |----------------|
000000c0 |er be a Capulet.| |er----a Capulet.| |--6666----------|
000000d0 |@@ROMEO@[Aside] | |@@ROMEO@[Aside] | |----------------|
000000e0 |Shall I hear mor| |Shall I hear mor| |----------------|
000000f0 |e, or shall I sp| |e, or s-------sp| |-------7777777--|
00000100 |eak at this?@@JU| |eak at this?----| |------------8888|
00000110 |LIET@'Tis but th| |-----'Tis-------| |88888----9999900|
00000120 |y name that is m| |------ that is--| |000000--------11|
00000130 |y enemy;@Thou ar| |--enemy;@T------| |11--------222233|
00000140 |t thyself, thoug| |----yself,-----g| |3333------44444-|
00000150 |h not a Montague| |h---- a Montague| |-5555-----------|
00000160 |.@What's Montagu| |.@What's--------| |--------66666666|
00000170 |e? it is nor han| |-? i-----nor han| |6---77777-------|
00000180 |d, nor foot,@Nor| |d,-----foot,@Nor| |--88888---------|
00000190 | arm, nor face, | | arm-------ace--| |----9999999---00|
000001a0 |nor any other pa| |----any o-----pa| |0000-----11111--|
000001b0 |rt@Belonging to | |rt@Be----ing to | |-----2222-------|
000001c0 |a man. O, be som| |a man. O-----som| |--------33333---|
000001d0 |e other name!@Wh| |e-----------!---| |-44444445555-666|
000001e0 |at's in a name? | |-----in a-----?-| |66666----77777-8|
000001f0 |that which we ca| |-----which we c-| |88888----------9|
00000200 |ll a rose@By any| |---a rose@By----| |999---------0000|
00000210 | other name woul| |----------- woul| |00000001111-----|
00000220 |d smell as sweet| |d smell as sweet| |----------------|
00000230 |;@So Romeo would| |;@So------------| |----222222333333|
00000240 |, were he not Ro| |, were he-------| |---------4444455|
00000250 |meo call'd,@Reta| |--------'d,@Reta| |55556666--------|
00000260 |in that dear per| |in------dear per| |--777777--------|
00000270 |fection which he| |fection-------he| |-------8888888--|
00000280 | owes@Without th| | owes@Witho-----| |-----------99999|
00000290 |at title. Romeo,| |----itle.-------| |0000-----1111111|
000002a0 | doff thy name,@| |-doff-----------| |1----22222222233|
000002b0 |And for that nam| |----for---------| |3333---444444555|
000002c0 |e which is no pa| |----------------| |5556666677777888|
000002d0 |rt of thee@Take | |-- o----ee@Take | |88--9999--------|
000002e0 |all myself.@@ROM| |----m-----------| |0000-11111222222|
000002f0 |EO@I take thee a| |---I t-------- -| |222---33334444-5|
00000300 |t thy word:@Call| |---hy word:@C---| |555----------666|
00000310 | me but love, an| |--e-------------| |66-7777788888999|
00000320 |d I'll be new ba| |-------be new ba| |9900000---------|
00000330 |ptized;@Hencefor| |ptized;@Henc----| |------------1111|
00000340 |th I never will | |th I never will-| |---------------2|
00000350 |be Romeo.@@JULIE| |--------.-------| |22233333-4444444|
00000360 |T@What man art t| |------ man------| |445555----666666|
00000370 |hou that thus be| |---------thus be| |666677777-------|
00000380 |screen'd in nigh| |screen'd----nigh| |--------8888----|
00000390 |t@So stumblest o| |t----stumblest o| |-9999-----------|
000003a0 |n my counsel?@| |n my counsel?@| |--------------|
Literal ops explicitly state what bytes to emit. Zlib and Zstandard differ in how literals are represented. Zlib literal ops emit one byte at a time: emit an 'A' byte. Zstandard literal ops emit one string (multiple bytes) at a time: emit an "Apple" string.
Zstandard literal ops are simply defined by a single Literal Length number, the
length of the string. That length is combined with an offset (the sum of all
previously seen Literal Lengths) to refer to a substring of the overall
Literals string: the original input bytes minus all of the matches. Here's the
551 byte Literals string for romeo.txt.zst
, the "Literals" column above with
the "-" gaps removed:
|Romeo and Juliet@Excerpt from Act 2, Scene 2@@JULIET@O ,! wherefore a|
|rt thou?@Deny thy fatherrefusename;@Or, ifwilt not, be but sworn my l|
|ove,@And I'll no longera Capulet.@@ROMEO@[Aside] Shall I hear more, o|
|r sspeak at this?'Tis that isenemy;@Tyself,gh a Montague.@What's? ino|
|r hand,foot,@Nor armaceany opart@Being to a man. Osome!in a?which we |
|ca rose@By would smell as sweet;@So, were he'd,@Retaindear perfection|
|he [email protected] oee@Take mI t hy word:@Cebe new baptized;@H|
|encth I never will. manthus bescreen'dnightstumblest on my counsel?@|
Those 551 = 138 + 138 + 138 + 137 Literals bytes can be split into four substrings of almost equal length. In the eight row presentation immediately above, that's four strips of two rows each. Call those strips LSTRIP 1, LSTRIP 2, LSTRIP 3 and LSTRIP 4. We'll come back to them in Part 4: Huffman Codes.
Match ops copy from 'history' (previously emitted bytes). An optional
out-of-band dictionary can provide virtual history from before the first
emitted byte, but our romeo.txt.zst
example does not use a dictionary.
Match ops are defined by two numbers: Match Length and Raw Match Offset (a RMO is also called a distance and matches are therefore sometimes defined as a length/distance pair). The length, like a literal op, is the number of bytes that the op emits. The RMO is how far back in the history to start copying from. For example, if "Hamlet: To be or not to be" is all that has been emitted, a (Match Length 5, Raw Match Offset 9) op would emit "not t".
As an interesting aside, LZ77 matches are a superset of run length encoding. Starting again from "Hamlet: To be or not to be", a (Match Length 5, Raw Match Offset 1) op would emit "eeeee". This is a valid match op even though ML is greater than RMO and some of the source bytes to copy from only become known after the op starts executing. In terms of implementation, this is why LZ77 matches aren't always the same as a memcpy call.
"Raw" (and its opposite, "Cooked") are terms just for this blog post series. They're not used in the Zstandard spec and source code, but the distinction is specific to Zstandard (e.g. it's not in Zlib's LZ77 style matches). The Cooked Match Offset usually equals the Raw Match Offset plus 3, but CMO values of 1, 2 or 3 are special cases that instead mean to repeat a recently used RMO (or, at the start of the decoding, one of three "Repeat Offsets"). Look for seq60 and 622 in the table below for an example.
Zstandard represents the LZ77 ops as a series of Sequences and the table below
gives the Sequences for romeo.txt
. Each Sequence has three explicit numbers:
LL (Literal Length), ML (Match Length) and CMO (Cooked Match Offset). For
example, executing seq04 emits 6 literal and 5 match bytes, "refuse thy ",
corresponding to this slice of the "Offset Input Literals Matches" table (bytes
outside the emission have been masked by '+' plus signs):
Offset Input Literals Matches
00000070 |+++refuse thy ++| |+++refuse-----++| |+++------44444++|
Adjacent matches are represented by adjacent Sequences where the second Sequence has a zero Literal Length. The minimum ML and CMO values are 3 and 1.
SeqNum LL ML CMO RMO
seq00 55 5 58 55
seq01 1 6 10 7
seq02 20 6 36 33
seq03 17 5 108 105
seq04 6 5 25 22
seq05 12 6 59 56
seq06 50 4 44 41
seq07 49 7 25 22
seq08 14 9 227 224
seq09 4 5 128 125
seq10 0 8 167 164
seq11 8 4 139 136
seq12 8 4 177 174
seq13 0 6 242 239
seq14 6 5 195 192
seq15 2 4 192 189
seq16 19 9 20 17
seq17 3 5 77 74
seq18 9 5 13 10
seq19 13 7 22 19
seq20 3 6 13 10
seq21 5 5 322 319
seq22 7 4 252 249
seq23 15 5 307 304
seq24 4 7 45 42
seq25 0 4 349 346
seq26 1 8 127 124
seq27 4 5 21 18
seq28 1 6 204 201
seq29 10 4 288 285
seq30 9 11 108 105
seq31 0 4 66 63
seq32 25 6 480 477
seq33 0 6 34 31
seq34 9 5 251 248
seq35 0 6 28 25
seq36 0 4 89 86
seq37 10 6 118 115
seq38 15 7 134 131
seq39 13 5 371 368
seq40 0 4 399 396
seq41 5 8 614 611
seq42 4 9 559 556
seq43 0 6 515 512
seq44 3 6 45 42
seq45 0 6 169 166
seq46 0 5 77 74
seq47 0 5 341 338
seq48 0 5 291 288
seq49 2 4 51 48
seq50 8 4 228 225
seq51 1 5 420 417
seq52 0 9 542 539
seq53 3 4 29 26
seq54 0 4 39 36
seq55 1 4 114 111
seq56 10 5 48 45
seq57 1 5 509 506
seq58 0 5 625 622
seq59 0 5 690 687
seq60 0 5 1 622 LL == 0 and CMO == 1: use RMO from 2 rows up.
seq61 21 4 758 755
seq62 15 4 656 653
seq63 0 5 264 261
seq64 1 9 592 589
seq65 0 4 515 512
seq66 4 10 799 796
seq67 0 5 592 589
seq68 15 4 423 420
seq69 5 4 355 352
The fifth column provides the implicit Raw Match Offset for the explicit Cooked Match Offset. The first column provides Sequence Numbers for cross-referencing a Sequence with the "Offset Input Literals Matches" table further above. For example, seq23's match corresponds to:
Offset Input Literals Matches
000001c0 |a man. O, be som| |a man. O-----som| |--------33333---|
The Sequences' LL values here sum to 526, but the Literals string is 551 bytes long. The final 25 bytes have no explicit Sequence. They're just emitted after all the Sequences are processed.
A major part of Zstandard involves codes or machines (in the "finite state machine" sense) that read from a stream of bits and write to a stream of symbols when decoding (encoding does the opposite: reading symbols and writing bits). Each symbol is an integer in 0 .. NumSyms (the "LOW .. HIGH" range notation instead of "LOW ..= HIGH" means that the low bound is inclusive but the high bound is exclusive), for some value of NumSyms. If NumSyms is 256 then symbols are equivalent to bytes, but NumSyms does not have to be a power of two.
Zstandard uses two types of these machines: Huffman codes and FSE (Finite State Entropy) codes. They do similar jobs but both are useful. According to Zstandard's primary inventor Yann Collet, Huffman is faster but FSE is smaller (better compression ratio), generally speaking.
Huffman codes are conceptually trees but efficient decoders are implemented by tables. Efficient FSE decoders are also table based. This blog post series will refer to both types of decoder machines as tables.
Decoding FSE symbols (remember that S stands for State) also involves updating an AL-bit unsigned integer, for some small value of AL (also called its Accuracy Log, where Log means a base 2 logarithm). Each bitstream starts with the AL-bit initial state.
Within a given Zstandard file, any one table can be applied independently to multiple bitstreams to produce multiple symbol streams. Tables can also be shared across multiple Zstandard files, since Zstandard dictionaries contain their own re-usable tables.
Next: Part 2: Structure.
Published: 2022-05-11