Commit
For https://linear.app/xet/issue/XET-246/fragmentation-prevention

We use the average number of chunks per range as a fragmentation estimator, targeting an average of 16 chunks per range, which roughly equates to 1MB per range. This is computed over a window of the last 32 ranges. If the average drops below the target, dedupe is disabled until the average is above the target again.

Running on the first 1GB of a *highly* fragmented file (consisting of a few hundred KB of an existing file, followed by a hundred KB of zeros, repeated), we see the following:

- Baseline: 1000000001 bytes -> 726845953 bytes, 2975 ranges, 336134 average bytes per range
- 512KB target (anti-fragmentation goal of 8 chunks per range): 1000000001 bytes -> 873515521 bytes, 1465 ranges, 682594 average bytes per range
- 1MB target (anti-fragmentation goal of 16 chunks per range): 1000000001 bytes -> 932235777 bytes, 829 ranges, 1206273 average bytes per range

This also includes a hysteresis implementation:

- 512KB target (anti-fragmentation goal of 8 chunks per range): 1000000001 bytes -> 873515521 bytes, 1657 ranges, 603500 average bytes per range

The hysteresis turned out to be pretty important for deduping a content-defined-chunked variant of Parquet.

Without hysteresis (the only concern is how v2 dedupes against v1):
```
parquet file v1: 5728317968 bytes -> 5728137283 bytes
parquet file v2: 5726717793 bytes -> 4544391399 bytes (11.14 chunks per range)
```

With hysteresis:
```
parquet file v1: 5728317968 bytes -> 5728137283 bytes
parquet file v2: 5726717793 bytes -> 3568275084 bytes (8.11 chunks per range)
```

So with the hysteresis implementation we are closer to the target chunks per range and are still able to dedupe pretty well.

For comparison, *without* any fragmentation prevention:
```
parquet file v1: 5728317968 bytes -> 5728137283 bytes
parquet file v2: 5726717793 bytes -> 3402767500 bytes (6.89 chunks per segment)
```
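The estimator described above (average chunks per range over the last 32 ranges, with dedupe gated on a target of 16) can be illustrated with a minimal Python sketch. This is not the code from this commit: the class name `FragmentationEstimator` and its methods are hypothetical, the behaviour with an empty window and at exact equality with the target are assumptions, and the hysteresis variant is left out because the message does not describe how it works.

```python
from collections import deque


class FragmentationEstimator:
    """Hypothetical sketch: sliding-window average of chunks per range."""

    def __init__(self, target_chunks_per_range=16.0, window=32):
        self.target = target_chunks_per_range
        # Chunk counts of the most recent `window` ranges; older entries
        # are dropped automatically once the deque is full.
        self.recent = deque(maxlen=window)

    def record_range(self, num_chunks):
        """Record a completed range made up of `num_chunks` chunks."""
        self.recent.append(num_chunks)

    def average(self):
        """Average chunks per range over the window, or None if empty."""
        if not self.recent:
            return None
        return sum(self.recent) / len(self.recent)

    def dedupe_allowed(self):
        """Dedupe stays enabled while the average is at or above target."""
        avg = self.average()
        # Assumption: with no history yet, dedupe is allowed.
        return avg is None or avg >= self.target


est = FragmentationEstimator(target_chunks_per_range=16, window=32)
for _ in range(32):
    est.record_range(20)   # long ranges: average is 20, dedupe stays on
for _ in range(16):
    est.record_range(4)    # short ranges pull the window average to 12
print(est.average(), est.dedupe_allowed())
```

While dedupe is disabled, new data is stored as fresh chunks, which lengthens the ranges being recorded and eventually lifts the window average back above the target.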
Showing 2 changed files with 98 additions and 4 deletions.