-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fix for sv extension #591
Add fix for sv extension #591
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #591 +/- ##
==========================================
+ Coverage 83.46% 83.48% +0.01%
==========================================
Files 74 74
Lines 16858 16884 +26
Branches 16858 16884 +26
==========================================
+ Hits 14071 14096 +25
- Misses 2129 2131 +2
+ Partials 658 657 -1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch
kernel/src/table_changes/scan.rs
Outdated
@@ -314,7 +315,14 @@ fn read_scan_file( | |||
// trying to return a captured variable. We're going to reassign `selection_vector` | |||
// to `rest` in a moment anyway | |||
let mut sv = selection_vector.take(); | |||
let rest = split_vector(sv.as_mut(), len, None); | |||
|
|||
// If this is a resolve_dv_pair, we extend with false because we only want to select rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like resolve_dv_pair
is determining if this is a selection vs deletion vector. That's confusing. Can we just make it a selection vector always?
I get that up-front we might not be able to, but could ResolvedCdfScanFile
make selection_vector
private (and maybe call it something else since it can be both), and then have a selection_vector()
method that "does the right thing".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like resolve_dv_pair is determining if this is a selection vs deletion vector.
Not quite. Both are selection vectors, its just that the way they're extended is different.
Consider data with 5 rows. Suppose in case A we have a deletion treemap with 1 set Treemap_d(1)
. This is converted into a selection vector [1, 0, 1]
. (this is what happens in selection_treemap_to_bools
).
For 5 rows this really translates into a selection vector: [1, 0, 1, 1, 1]
, since nothing else was deleted, we select all the rows that are greater than sv.length.
Consider the same 5 data rows, and we resolved the selection map to be Treemap_s(1). This is converted into a selection vector [0, 1]
. But this selection vector only wants to select element 1. Nothing else. So this really translates to selection vector [0, 1, 0, 0, 0]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, okay. I'd still vote to hide this inside an extend
or similar method on ResolvedCdfScanFile
so others don't have to write this code.
What changes are proposed in this pull request?
This is a PR fixes the behaviour for CDF on tables with deletion vectors.
When computing selection vectors from add/remove dv pairs, the selection vector must be extended with false.
When computing selection vectors from a normal dv, we extend the selection vector with true.
How was this change tested?
I create a table which adds and removes deletion vectors whose lengths are shorter than the total number of rows. The test checks:
All of these rows are less than the size of the table, so they exercise the extension behaviour for selection vectors.