Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve deletion vectors to find inserted and removed rows for CDF #568

Merged
merged 34 commits into from
Dec 10, 2024

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 6, 2024

What changes are proposed in this pull request?

This adds a function resolve_scan_file_dv which converts CdfScanFile to ResolvedCdfScanFile by reading deletion vectors, and reconciling paired add/remove deletion vectors.

ResolvedCdfScanFile is a struct that holds a CdfScanFile with its selection vector. It is guaranteed that the scan_type of CdfScanFile will match the _change_type of all the selected rows in its path if it is type add or remove.

treemap_to_bools is split into deletion_treemap_to_bools and selection_treemap_to_bools, which generate selection vectors from either a deletion treemap or selection treemap respectively.

get_selection_vector now depends on a new helper function get_treemap.

How was this change tested?

The following are tested:

  • adding a deletion vector to a file that did not have one
  • Removing a deletion vector from a file that had one
  • restoring a subset of deleted rows
  • deleting a subset of remaining rows after a deletion already happened
  • Deleting some rows and restoring others
  • Remove_dv on type cdc and remove fails
  • Add, cdc, and rm with no deletion vector return a None selection vector.
  • ensure that the correct selection vector is generated for selection_treemap_to_bool

Copy link

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 98.69452% with 5 lines in your changes missing coverage. Please review.

Project coverage is 82.85%. Comparing base (ed714c5) to head (5767bf1).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_changes/resolve_dvs.rs 99.40% 0 Missing and 2 partials ⚠️
kernel/src/actions/deletion_vector.rs 96.42% 1 Missing ⚠️
kernel/src/scan/mod.rs 0.00% 1 Missing ⚠️
kernel/src/scan/state.rs 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #568      +/-   ##
==========================================
+ Coverage   82.45%   82.85%   +0.39%     
==========================================
  Files          72       73       +1     
  Lines       16054    16426     +372     
  Branches    16054    16426     +372     
==========================================
+ Hits        13238    13609     +371     
+ Misses       2173     2172       -1     
- Partials      643      645       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Dec 6, 2024
let dv_treemap = self
.deletion_vector
) -> DeltaResult<Option<RoaringTreemap>> {
self.deletion_vector
.as_ref()
.map(|dv_descriptor| {
let fs_client = engine.get_file_system_client();
dv_descriptor.read(fs_client, table_root)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking: dv_descriptor.read takes care of slicing the DV when offset and size cover a subset of the DV file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it does. It just handles reading the on-disk file and turning it into a roaring treemap. Something at a higher level needs to slice it up if needed.

Comment on lines 16 to 18
/// must be [`CDFScanFileType::Add`]. In this case, both the add and remove deletion vectors are
/// read if they exist. Then, we find the set of rows in the scan file that have been added, and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we find the add DV? I thought a previous PR in the stack only kept the remove DV?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: the add dv is directly attached to the CDFScanFile

But see also #546 (comment)

Copy link
Collaborator Author

@OussamaSaoudi-db OussamaSaoudi-db Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove_dv is now part of CDFScanFile, and I made the changes to the visitor that you recommended.

.into_iter()
.map(Not::not)
.collect();
println!("got dv: {:?} for type {:?}", added_dv, out_type);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: remove or turn to e.g. debug!

Comment on lines 109 to 97
(_, Some(_)) => Err(Error::generic(
"Remove DV should only match to an add action!",
)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great reason to embed the remove dv info in the scan file alongside the add dv info...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you're saying is instead of having UnresolvedCdfScanFile, we move remove_dvs directly into CdfScanFile?

I'll go ahead and implement that, and you lmk

} = scan_file;
let paired_rm_dv = remove_dvs.get(&scan_file.path);
match (&scan_file.scan_type, paired_rm_dv) {
(CDFScanFileType::Add, Some(rm_dv)) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we embedded the rm_dv in the scan_file itself, then this whole match can go away.
We just need the validation code that asserts:

  • cdc cannot have a remove
  • remove cannot have an add

(add can have one or both)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented as part of the simplification.

println!("got dv: {:?} for type {:?}", added_dv, out_type);
scan_file.scan_type = out_type;

Either::Right(iter::once(ResolvedCDFScanFile {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: Either<iter::Empty<T>, iter::Once<T>> is just a complicated way to emulate Option<T>...

Comment on lines 48 to 50
if selection_treemap.is_empty() {
// Nothing has been selected, we do not read this data file
Either::Left(iter::empty())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem: The DV for a file can be shorter than the file's row count, with missing entries treated as if they were true (= keep). See doc comments for ScanResult::full_mask. My tired Friday pm brain is struggling to grok whether computing a symmetric difference for a remove DV breaks that assumption? But it's definitely not safe to skip file just because its raw physical DV is empty (current code in PR), or even just because all DV entries seem to be false (because every DV has an implied padding of true entries past-end, out to the data length which we won't know until we have an EngineData to work with).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I may be skipping something, but here's what I know:

  • This skipping here only happens for add/remove pairs of deletion vectors.
  • the treemap here represents the selected rows based on the dv. So far in our code we've treated treemaps exclusively as deletion bitmaps. Our dv resolving results in us generating a selection treemap. Hence if the selection treemap no bit set, everything would be false => not selected/skipped

I still think should be safe to skip?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment explaining why the treemap is treated as a selection vector .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is, let's say your data is:

['a', 'b', 'c'] and your DV is:
[0] // row 0 is deleted

the implied DV is [false, true, true].

This matters when you're just using DVs "as is". However, here you're only look for rows that changed, so if there are a bunch of indices not mentioned in either treemap, they did not change, so should not be returned here.

I think your comment below should work through with the indices first, since that's what the treemap has, and the set difference operations to show the resulting indicies, and then the conversion to boolean selection vectors, to ensure things work as we expect.

Re it being definitely wrong if the vector is empty, that's certainly true for a standard scan, but I think here it's potentially correct if it means that nothing changed between the two vectors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe to add on. All the important operations here occur where things are just treemaps. Only once we've settled on the final indicies which have changed do we convert to a boolean vector, which is the same as in our scan code. But, because we're using selection vectors here, we don't need to worry about extending, because the implied extra stuff is masked out.

We should probably put a bit // ===== IMPORTANT ===== comment somewhere that calls out that a CDF scan uses a selection vector while a "normal" scan uses a deletion one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hopefully explained it more thoroughly this time. Added some notation to be clear at each point what type of treemap we're working with. Also split up selection vector generation into its own section.

// Helper function to convert a treemap to a [`ResolvedCdfScanFile`]. The `scan_type`
// of the [`ResolvedCdfScanFile`] is set to `out_type` This returns an empty iterator
// if nothing is selected.
fn treemap_to_iter(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We map scan file to iter because we could have 0 (filtered out) 1 (normal) or 2 (adds+removes) results? Which in turn requires two Either so we can capture four possible cases? I wonder if it would be simpler to always produce the iterators and let the caller flatten() them?

Something like:

let add_dv = scan_file.add_dv.dv_info.get_treemap(...)?:
let rm_dv = scan_file.rm_dv.dv_info.get_treemap(...)?:
let (add_dv, rm_dv) = match (scan_file.add_dv, scan_file.rm_dv, scan_file.scan_type) {
    (Some(_), _, CDFScanFileType::Remove) => {
        return Err(... remove cannot have add DV ...);
    }
    (_, Some(_), CDFScanFileType::Cdc) => {
        return Err(... cdc cannot have remove DV ...);
    }
    (Some(add_dv), Some(rm_dv), _) => {
        // Take the symmetric difference so we don't double count rows
        (Some(&add_dv - &rm_dv),  Some(&rm_dv - &add_dv))
    }
    (add_dv, None, CDFScanFileType::Add | CDFScanFileType::Cdc) => {
        (Some(add_dv.get_or_else(|| Default::default())), None)
    }
    (None, rm_dv, CDFScanFileType::Remove) => {
        (None, Some(rm_dv.get_or_else(|| Default::default())))
    }
};

let adds = add_dv.map(treemap_to_bools)
    .map(|sel| ResolvedCDFScanFile { ... });
let removes = rm_dv.map(treemap_to_bools)
    .map(|sel| sel.into_iter().map(Not::not).collect())
    .map(|sel| ResolvedCDFScanFile { ... });
[adds, removes].into_iter().flatten()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... except I think I got the selection vector inversion bit wrong; it should only be inverted if it was a remove DV on an Add action.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding treemap_to_bools: I noticed that it initializes the vec with all true be default, then sets elements to false. I observed that when I initialized the vec with all false and only selected elements to true, it's significantly faster. I found a post explaining why this may be. Zeroed out pages seem to work well with the kernel :)

Given that the selection vector is O(#rows) it may worth avoiding the overhead of iterating over it again with Not::not and take advantage of the zeroed out allocation.

I'm assuming here #rows could be very large.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came across a weird case 🤔. The way you have it, we seem to have

CdfScanFile {
    add_dv: DvInfo,
    rm_dv: DvInfo
    ...
}

where the deletion vector in add_dv/rm_dv could be None.

The problem comes in when differentiating the following cases:

  1. [Remove(path=A, dv=None), Add(path=A, dv=X)]
  2. [Add(path=A, dv=X)].

This has got me thinking whether case 2 would ever happen. Suppose we want to handle it:

I think both of these would fall into this case:

(add_dv, None, CDFScanFileType::Add | CDFScanFileType::Cdc) => {
        (Some(add_dv.get_or_else(|| Default::default())), None)
}

So I'm instead electing to go with something like this:

CdfScanFile {
    deletion_vector: DvInfo,
    remove_dv: Option<DvInfo>
    ...
}

so that the cases become clear:

  1. remove_dv = Some(DvInfo( deletion_vector: None))
  2. remove_dv = None

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can totally imagine zero-initialized is more efficient for many reasons (tho correctness trumps performance, if there is friction between the two).

Your newer approach to handling add vs. rm DV makes a lot of sense, thanks!

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ryan's proposed simplification makes sense. otherwise looks pretty good. just re-request when that's resolved.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits, but looks great

Comment on lines 39 to 41
DvInfo {
deletion_vector: Some(deletion_vector),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clippy wrangling nit:

Suggested change
DvInfo {
deletion_vector: Some(deletion_vector),
}
let deletion_vector = Some(deletion_vector);
DvInfo { deletion_vector }

//
// The result of this commit is:
// - row 0 is restored
// - row 1 is unchanged
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// - row 1 is unchanged
// - row 1 is unchanged (previously deleted)

Comment on lines 139 to 141
let rm_scan_file = CdfScanFile {
scan_type: CdfScanFileType::Remove,
..scan_file.clone()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This only gets used if we have Some rm_dv? Maybe rust is smart enough to delay creating it until we actually use it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the add/rm paths below are the ~same code duplicated. Maybe we could solve both issues in one swoop with a helper method?

let resolve = |dv, scan_file_fn| dv.map(treemap_to_bools).map(|sv| ResolvedCdfScanFile {
    scan_file: scan_file_fn(),
    sselection_vector: (!sv.is_emtpy()).then_some(sv),
});
let removes = resolve(rm_dv, || CdfScanFile { ... });
let adds = resolve(add_dv, || scan_file);

Copy link
Collaborator Author

@OussamaSaoudi-db OussamaSaoudi-db Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm rust has a tricky type system. One of them is Fn() -> CdfScanFile while the other one is FnOnce. So I have to do a clone() in the ||scan_file.clone() closure to make it Fn. Also the clone may be optimized by the compiler, but I'm not sure.

I implemented something that I think lazily constructs the rm scan file and helps a bit with the duplication.

If you have tips on viewing optimizations for rust, please let me know! I'm familiar with GodBolt, but I feel like assembly is too low level for the optimizations to be clear.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you probably just needed to do move || scan_file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the problem is that move || scan_file is FnOnce, while the other one is Fn. move | | scan_file is FnOnce because we can only return the captured owned value once. Meanwhile || CdfScanFile { ... } clones its reference to scan_file so it can be called repeatedly. Thus it's Fn

We could make the closure take scan_file_fn: &dyn Fn() -> CdfScanFile, but then we'd have to use | | scan_file.clone() for the add case.

    let resolve = |dv: Option<RoaringTreemap>, scan_file_fn: &dyn Fn() -> CdfScanFile| {
        dv.map(treemap_to_bools).map(|sv| ResolvedCdfScanFile {
            scan_file: scan_file_fn(),
            selection_vector: (!sv.is_empty()).then_some(sv),
        })
    };
    let removes = resolve(rm_dv, &|| CdfScanFile {
        scan_type: CdfScanFileType::Remove,
        ..scan_file.clone()
    });
    let adds = resolve(add_dv, &|| scan_file.clone());

I tried scan_file_fn: &dyn FnOnce() -> CdfScanFile but the compiler isn't happy because you need to consume ownership of the function to call a FnOnce

    let resolve = |dv: Option<RoaringTreemap>, scan_file_fn: &dyn FnOnce() -> CdfScanFile| {
        dv.map(treemap_to_bools).map(|sv| ResolvedCdfScanFile {
            scan_file: scan_file_fn(),
            selection_vector: (!sv.is_empty()).then_some(sv),
        })
    };

    let removes = resolve(rm_dv, &|| CdfScanFile {
        scan_type: CdfScanFileType::Remove,
        ..scan_file.clone()
    });
    let adds = resolve(add_dv, &|| scan_file);
    Ok([removes, adds].into_iter().flatten())

Moreover generics don't work with closures, so we'd need a proper function fn resolve<T>(dv: ..., scan_file_fn: T),

Perhaps there's a way to specify generic trait impls so that we monomorphize into two resolves: resolve::<impl FnOnce>, and resolve::<impl Fn> but by this point I felt like this added more lines and complexity than it was worth.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! 🥳

/// read the deletion vector (if present), and each is converted into a [`ResolvedCdfScanFile`].
/// No changes are made to the `scan_type`.
#[allow(unused)]
fn resolve_scan_file_dv(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's nice that this actually ended up pretty short :)

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments but let's iterate on this!

@@ -402,6 +414,26 @@ mod tests {
assert_eq!(bools, expected);
}

#[test]
fn test_sv_to_bools() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how expensive is this to run? if cheap, remove #[ignore] from the one above? if expensive should we ignore this one too? or actually if this is relatively expensive but we decide to keep it, maybe we should integrate it with the test above? (just check both apis in the same test?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that it ran just fine on my machine, whereas the other one took a minute.

Giving an algorithmic complexity is hard to say tbh. I go into it in this comment, but the TLDR is that the zero-initialized large sv vector is fast to allocate. Just get a bunch of pages from the OS. The other one is slow because rust has to actually read the pages and set it to true.

I can disable it for consistency.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhh right i recall. how about we leave it enabled then and just leave that comment here in the code with it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is pretty staggering lol. 0.85s vs 40s on my machine!!!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea let's not keep the slow one. kinda nice we got some coverage through the other path now tho :)

/// types of `CdfScanFile`s:
/// 1. The first case is a [`CdfScanFile`] paired with a remove deletion vector. The `scan_type`
/// must be [`CdfScanFileType::Add`]. In this case, both the add and remove deletion vectors are
/// read if they exist. Then, we find the set of rows in the that have been added, and rows that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? "we find the set of rows in the that have been added"

// vector pairs, we generate selection treemaps.
//
// We use a motivating example to explain the deletion vector resolution. We read
// `rm_dv` and `add_dv`, and they are initialized to the empty map by default.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// `rm_dv` and `add_dv`, and they are initialized to the empty map by default.
// `rm_dv` and `add_dv`, and they are initialized to the empty map by default

Comment on lines +118 to +119
let adds = &rm_dv - &add_dv;
let removes = add_dv - rm_dv;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, do we know the time complexity of these operations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let n be size of LHS map, m be the size of RHS map. This is O(m * log(n) + n). This is a set difference, so it iterates over everything in the RHS (O(m)), and for each such element, removes it from the LHS O(log(n)). The adds case has to do a clone of the LHS, so we also add an O(n).

source

Comment on lines +132 to +136
let treemap_to_bools = if scan_file.remove_dv.is_some() {
selection_treemap_to_bools
} else {
deletion_treemap_to_bools
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we normalize above (i.e. everything to SV) and avoid this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. Treemaps are sets, so constructing the inverse would require that we create a new "full" treemap containing every index (ie the number of rows in the data file). Then we take the full treemap and subtract out the selection treemap.

I think in most cases the set of deleted rows is much smaller than the set of all rows. So constructing such a treemap would be a lot more expensive.

let adds = add_dv
.map(treemap_to_bools)
.map(|sv| resolve(scan_file, sv));
Ok([removes, adds].into_iter().flatten())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder if the iterator is the right approach? we only ever return one or two items, correct? does the (future) callsite necessitate the iterator?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that we only return one or two items.

The callsite doesn't necessitate it, but it works nicely with flattening:

            .map(move |scan_file| {
                resolve_scan_file_dv(dv_engine_ref.as_ref(), &table_root, scan_file?)
            }) // Iterator-Result-Iterator

If you know of a better way tho, lmk 👀

@zachschuermann
Copy link
Collaborator

(and i probably need to take a closer look at tests)

@OussamaSaoudi-db OussamaSaoudi-db merged commit 5816620 into delta-io:main Dec 10, 2024
20 checks passed
@zachschuermann zachschuermann removed the breaking-change Change that will require a version bump label Dec 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants