-
Notifications
You must be signed in to change notification settings - Fork 6.4k
Dictionary Compression
Dynamic dictionary compression algorithms have trouble compressing small data. With default usage, the compression dictionary starts off empty and is constructed during a single pass over the input data. Thus small input leads to a small dictionary, which cannot achieve good compression ratios.
In RocksDB, each block in a block-based table (SST file) is compressed individually. Block size defaults to 4KB, from which the compression algorithm cannot build a sizable dictionary. The dictionary compression feature presets the dictionary with data sampled from multiple blocks, which improves compression ratio when there are repetitions across blocks.
Set rocksdb::CompressionOptions::max_dict_bytes
(in include/rocksdb/options.h
) to a nonzero value indicating the maximum per-file dictionary size.
Also rocksdb::CompressionOptions::zstd_max_train_bytes
may be used to generate a training dictionary of max bytes for ZStd compression. Using ZStd's dictionary trainer can achieve even better compression ratio improvements than using max_dict_bytes
alone. The training data will be used to generate a dictionary of max_dict_bytes
.
The dictionary will be constructed for each SST file in a subcompaction by sampling entire data blocks in the file. When dictionary compression is enabled, the uncompressed data blocks in the file being generated will be buffered in memory, upto target_file_size
bytes. Once the limit is reached, or the file is finished, data blocks are taken uniformly/randomly from the buffered data blocks and used to train the ZStd dictionary trainer.
The dictionary is stored in the file's meta-block in order for it to be known when uncompressing. During reads, if BlockBasedTableOptions::cache_index_and_filter_blocks
is false
, the dictionary meta-block is read and pinned in memory but not charged to the block cache. If it is true
, the dictionary meta-block is cached in the block cache, with high priority if cache_index_and_filter_blocks_with_high_priority
is true
, and with low priority otherwise. As for prefetching and pinning the dictionary meta-block in the cache, the behavior depends on the RocksDB version. In RocksDB versions lower than 6.4, the dictionary meta-block is not prefetched or pinned in the cache, unlike index and filter blocks. Starting RocksDB version 6.4, the dictionary meta-block is prefetched and pinned in the block cache the same way as index/filter blocks (for instance, setting pin_l0_filter_and_index_blocks_in_cache
to true
results in the dictionary meta-block getting pinned as well for L0 files).
The in-memory uncompression dictionary is a digested form of the raw dictionary stored on disk, and is larger in size. The digested form makes uncompression faster, but does consume more memory.
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc