Set projection before configuring the source #14685
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -345,6 +345,32 @@ impl FileScanConfig { | |
/// Set the projection of the files | ||
pub fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self { | ||
self.projection = projection; | ||
self.with_updated_statistics() | ||
} | ||
|
||
// Update source statistics with the current projection data | ||
fn with_updated_statistics(mut self) -> Self { | ||
let max_projection_column = *self | ||
.projection | ||
.as_ref() | ||
.and_then(|proj| proj.iter().max()) | ||
.unwrap_or(&0); | ||
|
||
if max_projection_column | ||
>= self.file_schema.fields().len() + self.table_partition_cols.len() | ||
{ | ||
// we don't yet have enough information (file schema info or partition column info) to perform projection | ||
return self; | ||
} | ||
|
||
let ( | ||
_projected_schema, | ||
_constraints, | ||
projected_statistics, | ||
_projected_output_ordering, | ||
) = self.project(); | ||
|
||
self.source = self.source.with_statistics(projected_statistics); | ||
I don't fully understand why the source would need projected statistics. I am testing whether the issue is that the FileScanConfig is providing the wrong statistics (like maybe this line should be self.statistics rather than self.source.statistics).
That's a great idea! We can't use This made the PR a bit messier, and I had to comment out several test lines. LMK if you prefer the old version. |
||
self | ||
How about: let source = self.source.clone(); self.with_source(source) |
||
} | ||
|
||
|
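To make the guard in `with_updated_statistics` easier to follow, here is a minimal standalone sketch of the same idea. It is not DataFusion code: `ColumnStats` and `try_project_stats` are hypothetical names, and the choice to give partition columns unknown statistics is an assumption made for illustration. The sketch skips the projection while any projected index points past the columns known so far, and projects once enough columns exist.

```rust
/// Hypothetical, simplified per-column statistics (not DataFusion's type).
#[derive(Debug, Clone, PartialEq)]
enum ColumnStats {
    /// Statistics read from file metadata; holds a null count.
    Known { null_count: usize },
    /// Nothing is known about this column.
    Unknown,
}

/// Project column statistics, or return `None` while the projection refers to
/// columns that are not known yet (the same idea as the early return in the diff).
/// Indices below `file_stats.len()` are file columns; higher indices are
/// partition columns, which get `Unknown` statistics in this sketch.
fn try_project_stats(
    file_stats: &[ColumnStats],
    partition_cols: usize,
    projection: Option<&[usize]>,
) -> Option<Vec<ColumnStats>> {
    let total = file_stats.len() + partition_cols;
    // No projection means "keep every column".
    let indices: Vec<usize> = match projection {
        Some(p) => p.to_vec(),
        None => (0..total).collect(),
    };
    // Not enough schema or partition information yet: do not project.
    if indices.iter().any(|&i| i >= total) {
        return None;
    }
    Some(
        indices
            .iter()
            .map(|&i| file_stats.get(i).cloned().unwrap_or(ColumnStats::Unknown))
            .collect(),
    )
}

fn main() {
    let file_stats = vec![
        ColumnStats::Known { null_count: 0 },
        ColumnStats::Known { null_count: 1 },
        ColumnStats::Known { null_count: 2 },
    ];
    // Keep all three file columns plus two of the partition columns.
    let projection = [0, 1, 2, 3, 5];

    // Partition columns not set yet: index 3 is out of range, so nothing is projected.
    assert_eq!(try_project_stats(&file_stats, 0, Some(&projection[..])), None);

    // Once three partition columns are known, the projection succeeds.
    let projected = try_project_stats(&file_stats, 3, Some(&projection[..])).unwrap();
    assert_eq!(projected.len(), 5);
    assert_eq!(projected[3], ColumnStats::Unknown);
}
```

The second call mirrors the shape of the test in this PR, where three file columns plus two partition columns yield five column statistics.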
@@ -383,7 +409,7 @@ impl FileScanConfig { | |
/// Set the partitioning columns of the files | ||
pub fn with_table_partition_cols(mut self, table_partition_cols: Vec<Field>) -> Self { | ||
self.table_partition_cols = table_partition_cols; | ||
self | ||
self.with_updated_statistics() | ||
} | ||
|
||
/// Set the output ordering of the files | ||
|
@@ -737,6 +763,13 @@ mod tests { | |
), | ||
]; | ||
// create a projected schema | ||
|
||
let statistics = Statistics { | ||
num_rows: Precision::Inexact(3), | ||
total_byte_size: Precision::Absent, | ||
column_statistics: Statistics::unknown_column(&file_batch.schema()), | ||
}; | ||
|
||
let conf = config_for_projection( | ||
file_batch.schema(), | ||
// keep all cols from file and 2 from partitioning | ||
|
@@ -747,10 +780,20 @@ mod tests { | |
file_batch.schema().fields().len(), | ||
file_batch.schema().fields().len() + 2, | ||
]), | ||
Statistics::new_unknown(&file_batch.schema()), | ||
statistics.clone(), | ||
to_partition_cols(partition_cols.clone()), | ||
); | ||
|
||
let source_statistics = conf.source.statistics().unwrap(); | ||
|
||
// statistics should be preserved and passed into the source | ||
assert_eq!(source_statistics.num_rows, Precision::Inexact(3)); | ||
|
||
// 3 original statistics + 2 partition statistics | ||
assert_eq!(source_statistics.column_statistics.len(), 5); | ||
|
||
let (proj_schema, ..) = conf.project(); | ||
|
||
// created a projector for that projected schema | ||
let mut proj = PartitionColumnProjector::new( | ||
proj_schema, | ||
|
How is this case possible? it seems not obvious to me
Removed this now. But to answer the question: this could happen if projection is set but table_partition_cols isn't yet set (or vice versa). This whole logic should be much cleaner when we switch to the builder approach (I want to do a PR on top after this one is merged).
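For illustration, a builder along these lines would remove the ordering problem, because the projection is only checked once every field is known. This is a rough sketch under stated assumptions, not the planned DataFusion API: `ScanConfig`, `ScanConfigBuilder`, and all of their methods are hypothetical names, and columns are reduced to plain counts.

```rust
/// Hypothetical, simplified scan configuration (not DataFusion's FileScanConfig).
#[derive(Debug)]
struct ScanConfig {
    file_columns: usize,
    partition_columns: usize,
    projection: Option<Vec<usize>>,
}

/// Hypothetical builder: setters only record values, nothing is validated early.
#[derive(Debug, Default)]
struct ScanConfigBuilder {
    file_columns: usize,
    partition_columns: usize,
    projection: Option<Vec<usize>>,
}

impl ScanConfigBuilder {
    fn with_file_columns(mut self, n: usize) -> Self {
        self.file_columns = n;
        self
    }

    fn with_partition_columns(mut self, n: usize) -> Self {
        self.partition_columns = n;
        self
    }

    fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
        self.projection = projection;
        self
    }

    /// Validate the projection once, when all columns are known.
    fn build(self) -> Result<ScanConfig, String> {
        let total = self.file_columns + self.partition_columns;
        if let Some(bad) = self
            .projection
            .as_ref()
            .and_then(|p| p.iter().find(|&&i| i >= total))
        {
            return Err(format!(
                "projection index {bad} is out of range for {total} columns"
            ));
        }
        Ok(ScanConfig {
            file_columns: self.file_columns,
            partition_columns: self.partition_columns,
            projection: self.projection,
        })
    }
}

fn main() {
    // The projection is set before the partition columns; no intermediate
    // step has to guess whether enough information is available.
    let config = ScanConfigBuilder::default()
        .with_file_columns(3)
        .with_projection(Some(vec![0, 1, 2, 3, 5]))
        .with_partition_columns(3)
        .build()
        .expect("projection is valid once all columns are known");

    assert_eq!(config.file_columns + config.partition_columns, 6);
    assert_eq!(config.projection, Some(vec![0, 1, 2, 3, 5]));
}
```

With this shape, the "not enough information yet" early return is no longer needed in the setters; an out-of-range projection becomes an error reported by `build()`.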