Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fix for sv extension #591

Merged
merged 4 commits into from
Dec 12, 2024

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 11, 2024

What changes are proposed in this pull request?

This is a PR fixes the behaviour for CDF on tables with deletion vectors.

When computing selection vectors from add/remove dv pairs, the selection vector must be extended with false.

When computing selection vectors from a normal dv, we extend the selection vector with true.

How was this change tested?

I create a table which adds and removes deletion vectors whose lengths are shorter than the total number of rows. The test checks:

  • Deleting rows when none were deleted.
  • deleting rows when some were deleted
  • restoring some of the rows that have been deleted
  • restoring some rows and removing another
  • restoring all rows.
    All of these rows are less than the size of the table, so they exercise the extension behaviour for selection vectors.

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.48%. Comparing base (be1453f) to head (26e4cdf).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #591      +/-   ##
==========================================
+ Coverage   83.46%   83.48%   +0.01%     
==========================================
  Files          74       74              
  Lines       16858    16884      +26     
  Branches    16858    16884      +26     
==========================================
+ Hits        14071    14096      +25     
- Misses       2129     2131       +2     
+ Partials      658      657       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch

@@ -314,7 +315,14 @@ fn read_scan_file(
// trying to return a captured variable. We're going to reassign `selection_vector`
// to `rest` in a moment anyway
let mut sv = selection_vector.take();
let rest = split_vector(sv.as_mut(), len, None);

// If this is a resolve_dv_pair, we extend with false because we only want to select rows
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like resolve_dv_pair is determining if this is a selection vs deletion vector. That's confusing. Can we just make it a selection vector always?

I get that up-front we might not be able to, but could ResolvedCdfScanFile make selection_vector private (and maybe call it something else since it can be both), and then have a selection_vector() method that "does the right thing".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like resolve_dv_pair is determining if this is a selection vs deletion vector.

Not quite. Both are selection vectors, its just that the way they're extended is different.

Consider data with 5 rows. Suppose in case A we have a deletion treemap with 1 set Treemap_d(1). This is converted into a selection vector [1, 0, 1]. (this is what happens in selection_treemap_to_bools).
For 5 rows this really translates into a selection vector: [1, 0, 1, 1, 1], since nothing else was deleted, we select all the rows that are greater than sv.length.

Consider the same 5 data rows, and we resolved the selection map to be Treemap_s(1). This is converted into a selection vector [0, 1]. But this selection vector only wants to select element 1. Nothing else. So this really translates to selection vector [0, 1, 0, 0, 0].

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, okay. I'd still vote to hide this inside an extend or similar method on ResolvedCdfScanFile so others don't have to write this code.

@OussamaSaoudi-db OussamaSaoudi-db merged commit 702e12f into delta-io:main Dec 12, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants