-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve deletion vectors to find inserted and removed rows for CDF #568
Changes from 30 commits
bd9585d
e2c2d90
3db127f
ec6e629
74eed10
9fe16db
8635e38
a17f4b0
5fab00c
21553a4
08867f8
1228107
97a5790
4969e44
dcb17fa
ebaf225
bd49142
6287e6e
7fe4f4d
d4f95d5
8a8f6bf
eeaabb0
fa13ade
4dabdaf
7fcb531
577b424
a970df1
5b70aab
3e0ead6
0b207c4
68a4fb1
3259769
2e9c29e
5767bf1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,17 +3,16 @@ | |
use std::collections::HashMap; | ||
use std::sync::LazyLock; | ||
|
||
use crate::actions::deletion_vector::deletion_treemap_to_bools; | ||
use crate::utils::require; | ||
use crate::{ | ||
actions::{ | ||
deletion_vector::{treemap_to_bools, DeletionVectorDescriptor}, | ||
visitors::visit_deletion_vector_at, | ||
}, | ||
actions::{deletion_vector::DeletionVectorDescriptor, visitors::visit_deletion_vector_at}, | ||
engine_data::{GetData, RowVisitor, TypedGetData as _}, | ||
schema::{ColumnName, ColumnNamesAndTypes, DataType, SchemaRef}, | ||
table_features::ColumnMappingMode, | ||
DeltaResult, Engine, EngineData, Error, | ||
}; | ||
use roaring::RoaringTreemap; | ||
use serde::{Deserialize, Serialize}; | ||
use tracing::warn; | ||
|
||
|
@@ -30,11 +29,18 @@ pub struct GlobalScanState { | |
} | ||
|
||
/// this struct can be used by an engine to materialize a selection vector | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
#[derive(Default, Debug, Clone, PartialEq, Eq)] | ||
pub struct DvInfo { | ||
pub(crate) deletion_vector: Option<DeletionVectorDescriptor>, | ||
} | ||
|
||
impl From<DeletionVectorDescriptor> for DvInfo { | ||
fn from(deletion_vector: DeletionVectorDescriptor) -> Self { | ||
let deletion_vector = Some(deletion_vector); | ||
DvInfo { deletion_vector } | ||
} | ||
} | ||
|
||
/// Give engines an easy way to consume stats | ||
#[derive(Debug, Clone, PartialEq, Eq, Deserialize)] | ||
#[serde(rename_all = "camelCase")] | ||
|
@@ -53,20 +59,27 @@ impl DvInfo { | |
self.deletion_vector.is_some() | ||
} | ||
|
||
pub fn get_selection_vector( | ||
pub(crate) fn get_treemap( | ||
&self, | ||
engine: &dyn Engine, | ||
table_root: &url::Url, | ||
) -> DeltaResult<Option<Vec<bool>>> { | ||
let dv_treemap = self | ||
.deletion_vector | ||
) -> DeltaResult<Option<RoaringTreemap>> { | ||
self.deletion_vector | ||
.as_ref() | ||
.map(|dv_descriptor| { | ||
let fs_client = engine.get_file_system_client(); | ||
dv_descriptor.read(fs_client, table_root) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just double checking: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't believe it does. It just handles reading the on-disk file and turning it into a roaring treemap. Something at a higher level needs to slice it up if needed. |
||
}) | ||
.transpose()?; | ||
Ok(dv_treemap.map(treemap_to_bools)) | ||
.transpose() | ||
} | ||
|
||
pub fn get_selection_vector( | ||
&self, | ||
engine: &dyn Engine, | ||
table_root: &url::Url, | ||
) -> DeltaResult<Option<Vec<bool>>> { | ||
let dv_treemap = self.get_treemap(engine, table_root)?; | ||
Ok(dv_treemap.map(deletion_treemap_to_bools)) | ||
} | ||
|
||
/// Returns a vector of row indexes that should be *removed* from the result set | ||
|
@@ -112,11 +125,11 @@ pub type ScanCallback<T> = fn( | |
/// ## Example | ||
/// ```ignore | ||
/// let mut context = [my context]; | ||
/// for res in scan_data { // scan data from scan.get_scan_data() | ||
/// for res in scan_data { // scan data from scan.scan_data() | ||
/// let (data, vector) = res?; | ||
/// context = delta_kernel::scan::state::visit_scan_files( | ||
/// data.as_ref(), | ||
/// vector, | ||
/// selection_vector, | ||
/// context, | ||
/// my_callback, | ||
/// )?; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how expensive is this to run? if cheap, remove
#[ignore]
from the one above? if expensive should we ignore this one too? or actually if this is relatively expensive but we decide to keep it, maybe we should integrate it with the test above? (just check both apis in the same test?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that it ran just fine on my machine, whereas the other one took a minute.
Giving an algorithmic complexity is hard to say tbh. I go into it in this comment, but the TLDR is that the zero-initialized large sv vector is fast to allocate. Just get a bunch of pages from the OS. The other one is slow because rust has to actually read the pages and set it to true.
I can disable it for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhh right i recall. how about we leave it enabled then and just leave that comment here in the code with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is pretty staggering lol. 0.85s vs 40s on my machine!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea let's not keep the slow one. kinda nice we got some coverage through the other path now tho :)