-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve deletion vectors to find inserted and removed rows for CDF #568
Resolve deletion vectors to find inserted and removed rows for CDF #568
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #568 +/- ##
==========================================
+ Coverage 82.45% 82.85% +0.39%
==========================================
Files 72 73 +1
Lines 16054 16426 +372
Branches 16054 16426 +372
==========================================
+ Hits 13238 13609 +371
+ Misses 2173 2172 -1
- Partials 643 645 +2 ☔ View full report in Codecov by Sentry. |
170268b
to
5def6bd
Compare
5def6bd
to
6b133d4
Compare
let dv_treemap = self | ||
.deletion_vector | ||
) -> DeltaResult<Option<RoaringTreemap>> { | ||
self.deletion_vector | ||
.as_ref() | ||
.map(|dv_descriptor| { | ||
let fs_client = engine.get_file_system_client(); | ||
dv_descriptor.read(fs_client, table_root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double checking: dv_descriptor.read
takes care of slicing the DV when offset and size cover a subset of the DV file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe it does. It just handles reading the on-disk file and turning it into a roaring treemap. Something at a higher level needs to slice it up if needed.
/// must be [`CDFScanFileType::Add`]. In this case, both the add and remove deletion vectors are | ||
/// read if they exist. Then, we find the set of rows in the scan file that have been added, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we find the add DV? I thought a previous PR in the stack only kept the remove DV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: the add dv is directly attached to the CDFScanFile
But see also #546 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove_dv is now part of CDFScanFile, and I made the changes to the visitor that you recommended.
.into_iter() | ||
.map(Not::not) | ||
.collect(); | ||
println!("got dv: {:?} for type {:?}", added_dv, out_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: remove or turn to e.g. debug!
(_, Some(_)) => Err(Error::generic( | ||
"Remove DV should only match to an add action!", | ||
)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great reason to embed the remove dv info in the scan file alongside the add dv info...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you're saying is instead of having UnresolvedCdfScanFile
, we move remove_dvs
directly into CdfScanFile
?
I'll go ahead and implement that, and you lmk
} = scan_file; | ||
let paired_rm_dv = remove_dvs.get(&scan_file.path); | ||
match (&scan_file.scan_type, paired_rm_dv) { | ||
(CDFScanFileType::Add, Some(rm_dv)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we embedded the rm_dv in the scan_file itself, then this whole match can go away.
We just need the validation code that asserts:
- cdc cannot have a remove
- remove cannot have an add
(add can have one or both)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented as part of the simplification.
println!("got dv: {:?} for type {:?}", added_dv, out_type); | ||
scan_file.scan_type = out_type; | ||
|
||
Either::Right(iter::once(ResolvedCDFScanFile { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Either<iter::Empty<T>, iter::Once<T>>
is just a complicated way to emulate Option<T>
...
if selection_treemap.is_empty() { | ||
// Nothing has been selected, we do not read this data file | ||
Either::Left(iter::empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem: The DV for a file can be shorter than the file's row count, with missing entries treated as if they were true
(= keep). See doc comments for ScanResult::full_mask. My tired Friday pm brain is struggling to grok whether computing a symmetric difference for a remove DV breaks that assumption? But it's definitely not safe to skip file just because its raw physical DV is empty (current code in PR), or even just because all DV entries seem to be false (because every DV has an implied padding of true entries past-end, out to the data length which we won't know until we have an EngineData
to work with).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I may be skipping something, but here's what I know:
- This skipping here only happens for add/remove pairs of deletion vectors.
- the treemap here represents the selected rows based on the dv. So far in our code we've treated treemaps exclusively as deletion bitmaps. Our dv resolving results in us generating a selection treemap. Hence if the selection treemap no bit set, everything would be false => not selected/skipped
I still think should be safe to skip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment explaining why the treemap is treated as a selection vector .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is, let's say your data is:
['a', 'b', 'c']
and your DV is:
[0] // row 0 is deleted
the implied DV is [false, true, true]
.
This matters when you're just using DVs "as is". However, here you're only look for rows that changed, so if there are a bunch of indices not mentioned in either treemap, they did not change, so should not be returned here.
I think your comment below should work through with the indices first, since that's what the treemap has, and the set difference operations to show the resulting indicies, and then the conversion to boolean selection vectors, to ensure things work as we expect.
Re it being definitely wrong if the vector is empty, that's certainly true for a standard scan, but I think here it's potentially correct if it means that nothing changed between the two vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe to add on. All the important operations here occur where things are just treemaps. Only once we've settled on the final indicies which have changed do we convert to a boolean vector, which is the same as in our scan code. But, because we're using selection vectors here, we don't need to worry about extending, because the implied extra stuff is masked out.
We should probably put a bit // ===== IMPORTANT =====
comment somewhere that calls out that a CDF scan uses a selection vector while a "normal" scan uses a deletion one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hopefully explained it more thoroughly this time. Added some notation to be clear at each point what type of treemap we're working with. Also split up selection vector generation into its own section.
// Helper function to convert a treemap to a [`ResolvedCdfScanFile`]. The `scan_type` | ||
// of the [`ResolvedCdfScanFile`] is set to `out_type` This returns an empty iterator | ||
// if nothing is selected. | ||
fn treemap_to_iter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We map scan file to iter because we could have 0 (filtered out) 1 (normal) or 2 (adds+removes) results? Which in turn requires two Either
so we can capture four possible cases? I wonder if it would be simpler to always produce the iterators and let the caller flatten() them?
Something like:
let add_dv = scan_file.add_dv.dv_info.get_treemap(...)?:
let rm_dv = scan_file.rm_dv.dv_info.get_treemap(...)?:
let (add_dv, rm_dv) = match (scan_file.add_dv, scan_file.rm_dv, scan_file.scan_type) {
(Some(_), _, CDFScanFileType::Remove) => {
return Err(... remove cannot have add DV ...);
}
(_, Some(_), CDFScanFileType::Cdc) => {
return Err(... cdc cannot have remove DV ...);
}
(Some(add_dv), Some(rm_dv), _) => {
// Take the symmetric difference so we don't double count rows
(Some(&add_dv - &rm_dv), Some(&rm_dv - &add_dv))
}
(add_dv, None, CDFScanFileType::Add | CDFScanFileType::Cdc) => {
(Some(add_dv.get_or_else(|| Default::default())), None)
}
(None, rm_dv, CDFScanFileType::Remove) => {
(None, Some(rm_dv.get_or_else(|| Default::default())))
}
};
let adds = add_dv.map(treemap_to_bools)
.map(|sel| ResolvedCDFScanFile { ... });
let removes = rm_dv.map(treemap_to_bools)
.map(|sel| sel.into_iter().map(Not::not).collect())
.map(|sel| ResolvedCDFScanFile { ... });
[adds, removes].into_iter().flatten()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... except I think I got the selection vector inversion bit wrong; it should only be inverted if it was a remove DV on an Add action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding treemap_to_bools
: I noticed that it initializes the vec with all true
be default, then sets elements to false
. I observed that when I initialized the vec with all false
and only selected elements to true
, it's significantly faster. I found a post explaining why this may be. Zeroed out pages seem to work well with the kernel :)
Given that the selection vector is O(#rows)
it may worth avoiding the overhead of iterating over it again with Not::not
and take advantage of the zeroed out allocation.
I'm assuming here #rows
could be very large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came across a weird case 🤔. The way you have it, we seem to have
CdfScanFile {
add_dv: DvInfo,
rm_dv: DvInfo
...
}
where the deletion vector in add_dv/rm_dv could be None
.
The problem comes in when differentiating the following cases:
[Remove(path=A, dv=None), Add(path=A, dv=X)]
[Add(path=A, dv=X)]
.
This has got me thinking whether case 2 would ever happen. Suppose we want to handle it:
I think both of these would fall into this case:
(add_dv, None, CDFScanFileType::Add | CDFScanFileType::Cdc) => {
(Some(add_dv.get_or_else(|| Default::default())), None)
}
So I'm instead electing to go with something like this:
CdfScanFile {
deletion_vector: DvInfo,
remove_dv: Option<DvInfo>
...
}
so that the cases become clear:
remove_dv = Some(DvInfo( deletion_vector: None))
remove_dv = None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can totally imagine zero-initialized is more efficient for many reasons (tho correctness trumps performance, if there is friction between the two).
Your newer approach to handling add vs. rm DV makes a lot of sense, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ryan's proposed simplification makes sense. otherwise looks pretty good. just re-request when that's resolved.
22196d7
to
027a73a
Compare
720171b
to
fa13ade
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few nits, but looks great
kernel/src/scan/state.rs
Outdated
DvInfo { | ||
deletion_vector: Some(deletion_vector), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clippy wrangling nit:
DvInfo { | |
deletion_vector: Some(deletion_vector), | |
} | |
let deletion_vector = Some(deletion_vector); | |
DvInfo { deletion_vector } |
// | ||
// The result of this commit is: | ||
// - row 0 is restored | ||
// - row 1 is unchanged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// - row 1 is unchanged | |
// - row 1 is unchanged (previously deleted) |
let rm_scan_file = CdfScanFile { | ||
scan_type: CdfScanFileType::Remove, | ||
..scan_file.clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This only gets used if we have Some rm_dv
? Maybe rust is smart enough to delay creating it until we actually use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the add/rm paths below are the ~same code duplicated. Maybe we could solve both issues in one swoop with a helper method?
let resolve = |dv, scan_file_fn| dv.map(treemap_to_bools).map(|sv| ResolvedCdfScanFile {
scan_file: scan_file_fn(),
sselection_vector: (!sv.is_emtpy()).then_some(sv),
});
let removes = resolve(rm_dv, || CdfScanFile { ... });
let adds = resolve(add_dv, || scan_file);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm rust has a tricky type system. One of them is Fn() -> CdfScanFile
while the other one is FnOnce. So I have to do a clone()
in the ||scan_file.clone()
closure to make it Fn. Also the clone may be optimized by the compiler, but I'm not sure.
I implemented something that I think lazily constructs the rm scan file and helps a bit with the duplication.
If you have tips on viewing optimizations for rust, please let me know! I'm familiar with GodBolt, but I feel like assembly is too low level for the optimizations to be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you probably just needed to do move || scan_file
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the problem is that move || scan_file
is FnOnce
, while the other one is Fn
. move | | scan_file
is FnOnce
because we can only return the captured owned value once. Meanwhile || CdfScanFile { ... }
clones its reference to scan_file
so it can be called repeatedly. Thus it's Fn
We could make the closure take scan_file_fn: &dyn Fn() -> CdfScanFile
, but then we'd have to use | | scan_file.clone()
for the add case.
let resolve = |dv: Option<RoaringTreemap>, scan_file_fn: &dyn Fn() -> CdfScanFile| {
dv.map(treemap_to_bools).map(|sv| ResolvedCdfScanFile {
scan_file: scan_file_fn(),
selection_vector: (!sv.is_empty()).then_some(sv),
})
};
let removes = resolve(rm_dv, &|| CdfScanFile {
scan_type: CdfScanFileType::Remove,
..scan_file.clone()
});
let adds = resolve(add_dv, &|| scan_file.clone());
I tried scan_file_fn: &dyn FnOnce() -> CdfScanFile
but the compiler isn't happy because you need to consume ownership of the function to call a FnOnce
let resolve = |dv: Option<RoaringTreemap>, scan_file_fn: &dyn FnOnce() -> CdfScanFile| {
dv.map(treemap_to_bools).map(|sv| ResolvedCdfScanFile {
scan_file: scan_file_fn(),
selection_vector: (!sv.is_empty()).then_some(sv),
})
};
let removes = resolve(rm_dv, &|| CdfScanFile {
scan_type: CdfScanFileType::Remove,
..scan_file.clone()
});
let adds = resolve(add_dv, &|| scan_file);
Ok([removes, adds].into_iter().flatten())
Moreover generics don't work with closures, so we'd need a proper function fn resolve<T>(dv: ..., scan_file_fn: T)
,
Perhaps there's a way to specify generic trait impls so that we monomorphize into two resolves: resolve::<impl FnOnce>
, and resolve::<impl Fn>
but by this point I felt like this added more lines and complexity than it was worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! 🥳
/// read the deletion vector (if present), and each is converted into a [`ResolvedCdfScanFile`]. | ||
/// No changes are made to the `scan_type`. | ||
#[allow(unused)] | ||
fn resolve_scan_file_dv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's nice that this actually ended up pretty short :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few comments but let's iterate on this!
@@ -402,6 +414,26 @@ mod tests { | |||
assert_eq!(bools, expected); | |||
} | |||
|
|||
#[test] | |||
fn test_sv_to_bools() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how expensive is this to run? if cheap, remove #[ignore]
from the one above? if expensive should we ignore this one too? or actually if this is relatively expensive but we decide to keep it, maybe we should integrate it with the test above? (just check both apis in the same test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that it ran just fine on my machine, whereas the other one took a minute.
Giving an algorithmic complexity is hard to say tbh. I go into it in this comment, but the TLDR is that the zero-initialized large sv vector is fast to allocate. Just get a bunch of pages from the OS. The other one is slow because rust has to actually read the pages and set it to true.
I can disable it for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhh right i recall. how about we leave it enabled then and just leave that comment here in the code with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is pretty staggering lol. 0.85s vs 40s on my machine!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea let's not keep the slow one. kinda nice we got some coverage through the other path now tho :)
/// types of `CdfScanFile`s: | ||
/// 1. The first case is a [`CdfScanFile`] paired with a remove deletion vector. The `scan_type` | ||
/// must be [`CdfScanFileType::Add`]. In this case, both the add and remove deletion vectors are | ||
/// read if they exist. Then, we find the set of rows in the that have been added, and rows that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? "we find the set of rows in the that have been added"
// vector pairs, we generate selection treemaps. | ||
// | ||
// We use a motivating example to explain the deletion vector resolution. We read | ||
// `rm_dv` and `add_dv`, and they are initialized to the empty map by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// `rm_dv` and `add_dv`, and they are initialized to the empty map by default. | |
// `rm_dv` and `add_dv`, and they are initialized to the empty map by default |
let adds = &rm_dv - &add_dv; | ||
let removes = add_dv - rm_dv; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, do we know the time complexity of these operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let n be size of LHS map, m be the size of RHS map. This is O(m * log(n) + n)
. This is a set difference, so it iterates over everything in the RHS (O(m)), and for each such element, removes it from the LHS O(log(n)). The adds case has to do a clone of the LHS, so we also add an O(n).
let treemap_to_bools = if scan_file.remove_dv.is_some() { | ||
selection_treemap_to_bools | ||
} else { | ||
deletion_treemap_to_bools | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we normalize above (i.e. everything to SV) and avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not. Treemaps are sets, so constructing the inverse would require that we create a new "full" treemap containing every index (ie the number of rows in the data file). Then we take the full treemap and subtract out the selection treemap.
I think in most cases the set of deleted rows is much smaller than the set of all rows. So constructing such a treemap would be a lot more expensive.
let adds = add_dv | ||
.map(treemap_to_bools) | ||
.map(|sv| resolve(scan_file, sv)); | ||
Ok([removes, adds].into_iter().flatten()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonder if the iterator is the right approach? we only ever return one or two items, correct? does the (future) callsite necessitate the iterator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that we only return one or two items.
The callsite doesn't necessitate it, but it works nicely with flattening:
.map(move |scan_file| {
resolve_scan_file_dv(dv_engine_ref.as_ref(), &table_root, scan_file?)
}) // Iterator-Result-Iterator
If you know of a better way tho, lmk 👀
(and i probably need to take a closer look at tests) |
What changes are proposed in this pull request?
This adds a function
resolve_scan_file_dv
which convertsCdfScanFile
toResolvedCdfScanFile
by reading deletion vectors, and reconciling paired add/remove deletion vectors.ResolvedCdfScanFile
is a struct that holds aCdfScanFile
with its selection vector. It is guaranteed that thescan_type
ofCdfScanFile
will match the_change_type
of all the selected rows in itspath
if it is type add or remove.treemap_to_bools
is split intodeletion_treemap_to_bools
andselection_treemap_to_bools
, which generate selection vectors from either a deletion treemap or selection treemap respectively.get_selection_vector
now depends on a new helper functionget_treemap
.How was this change tested?
The following are tested:
None
selection vector.selection_treemap_to_bool