Skip to content

Commit

Permalink
Add fix for sv extension (#591)
Browse files Browse the repository at this point in the history
## What changes are proposed in this pull request?
This is a PR fixes the behaviour for CDF on tables with deletion
vectors.

When computing selection vectors from add/remove dv pairs, the selection
vector must be extended with false.

When computing selection vectors from a normal dv, we extend the
selection vector with true.
## How was this change tested?
I create a table which adds and removes deletion vectors whose lengths
are shorter than the total number of rows. The test checks:
- deleting rows when none were deleted. 
- deleting rows when some were deleted
- restoring some of the rows that have been deleted
- restoring some rows and removing another
- restoring all rows.
All of these rows are less than the size of the table, so they exercise
the extension behaviour for selection vectors.
  • Loading branch information
OussamaSaoudi-db authored Dec 12, 2024
1 parent be1453f commit 702e12f
Show file tree
Hide file tree
Showing 3 changed files with 46 additions and 2 deletions.
28 changes: 27 additions & 1 deletion kernel/src/table_changes/scan.rs
Original file line number Diff line number Diff line change
Expand Up @@ -290,6 +290,8 @@ fn read_scan_file(
physical_to_logical_expr,
global_state.logical_schema.clone().into(),
);
// Determine if the scan file was derived from a deletion vector pair
let is_dv_resolved_pair = scan_file.remove_dv.is_some();

let table_root = Url::parse(&global_state.table_root)?;
let location = table_root.join(&scan_file.path)?;
Expand All @@ -314,7 +316,31 @@ fn read_scan_file(
// trying to return a captured variable. We're going to reassign `selection_vector`
// to `rest` in a moment anyway
let mut sv = selection_vector.take();
let rest = split_vector(sv.as_mut(), len, None);

// Gets the selection vector for a data batch with length `len`. There are three cases to
// consider:
// 1. A scan file derived from a deletion vector pair getting resolved.
// 2. A scan file that was not the result of a resolved pair, and has a deletion vector.
// 3. A scan file that was not the result of a resolved pair, and has no deletion vector.
//
// # Case 1
// If the scan file is derived from a deletion vector pair, its selection vector should be
// extended with `false`. Consider a resolved selection vector `[0, 1]`. Only row 1 has
// changed. If there were more rows (for example 4 total), then none of them have changed.
// Hence, the selection vector is extended to become `[0, 1, 0, 0]`.
//
// # Case 2
// If the scan file has a deletion vector but is unpaired, its selection vector should be
// extended with `true`. Consider a deletion vector with row 1 deleted. This generates a
// selection vector `[1, 0, 1]`. Only row 1 is deleted. Rows 0 and 2 are selected. If there
// are more rows (for example 4), then all the extra rows should be selected. The selection
// vector becomes `[1, 0, 1, 1]`.
//
// # Case 3
// These scan files are either simple adds, removes, or cdc files. This case is a noop because
// the selection vector is `None`.
let extend = Some(!is_dv_resolved_pair);
let rest = split_vector(sv.as_mut(), len, extend);
let result = ScanResult {
raw_data: logical,
raw_mask: sv,
Expand Down
20 changes: 19 additions & 1 deletion kernel/tests/cdf.rs
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,14 @@ fn read_cdf_for_table(
#[test]
fn cdf_with_deletion_vector() -> Result<(), Box<dyn error::Error>> {
let batches = read_cdf_for_table("cdf-table-with-dv", 0, None)?;
// Each commit performs the following:
// 0. Insert 0..=9
// 1. Remove [0, 9]
// 2. Restore [0, 9]
// 3. Remove [0, 1, 4, 5]
// 4. Restore [1, 4]
// 5. Restore [0, 5] and Remove [3]
// 6. Restore 3
let mut expected = vec![
"+-------+--------------+-----------------+",
"| value | _change_type | _commit_version |",
Expand All @@ -65,13 +73,23 @@ fn cdf_with_deletion_vector() -> Result<(), Box<dyn error::Error>> {
"| 4 | insert | 0 |",
"| 5 | insert | 0 |",
"| 6 | insert | 0 |",
"| 8 | insert | 0 |",
"| 7 | insert | 0 |",
"| 8 | insert | 0 |",
"| 9 | insert | 0 |",
"| 0 | delete | 1 |",
"| 9 | delete | 1 |",
"| 0 | insert | 2 |",
"| 9 | insert | 2 |",
"| 0 | delete | 3 |",
"| 1 | delete | 3 |",
"| 4 | delete | 3 |",
"| 5 | delete | 3 |",
"| 1 | insert | 4 |",
"| 4 | insert | 4 |",
"| 3 | delete | 5 |",
"| 0 | insert | 5 |",
"| 5 | insert | 5 |",
"| 3 | insert | 6 |",
"+-------+--------------+-----------------+",
];
sort_lines!(expected);
Expand Down
Binary file modified kernel/tests/data/cdf-table-with-dv.tar.zst
Binary file not shown.

0 comments on commit 702e12f

Please sign in to comment.