-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Statistics::is_exact
semantics
#5613
Comments
I think this is the best that we can get, for the reason you cite.
Specifically, I think "inexact" means "best effort" but can not be relied on |
@alamb I think "overestimate" would be possible to achieve in many cases and quite helpful. |
🤔 I see -- I can imagine the ranges being an over estimate (being at least as large as the actual range). I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate in num_rows, wouldn't we have to assume no rows were filtered ? |
I guess if the ranges (min/max) are overestimated / too wide, then the number of rows is likely an overestimate as well (upper bound). Thinking about that more since this is getting really confusing with min/max/row_count/n_bytes because "overestimate" for "min" is the lower bound while of "max" it's the upper bound. So #997 already suggest to rework this attribute to be field-specific. I would propose to extend the interface even further: struct Boundary<T: PartialOrd> {
pub val: T,
pub is_lower_bound: bool,
pub is_upper_bound: bool,
}
impl<T: PartialOrd> Boundary<T> {
pub fn is_exact(&self) -> bool {
self.is_lower_bound && self.is_upper_bound
}
}
pub struct Statistics {
/// The number of table rows
pub num_rows: Option<Boundary<usize>>,
/// total bytes of the table rows
pub total_byte_size: Option<Boundary<usize>>,
/// Statistics on a column level
pub column_statistics: Option<Vec<ColumnStatistics>>,
}
pub struct ColumnStatistics {
/// Number of null values on column
pub null_count: Option<Boundary<usize>>,
/// Maximum value of column
pub max_value: Option<Boundary<ScalarValue>>,
/// Minimum value of column
pub min_value: Option<Boundary<ScalarValue>>,
/// Number of distinct values
pub distinct_count: Option<Boundary<usize>>,
}
impl ColumnStatistics {
pub fn min_max_exact(&self) -> bool {
self.min_value.map(|b| b.is_exact()).unwrap_or_default()
&& self.max_value.map(|b| b.is_exact()).unwrap_or_default()
}
/// Does the range described by min-max contain ALL values?
///
/// Note that the range might be too large. Some filters may not
/// have be considered when this range was determined.
pub fn min_max_countains_all(&self) -> bool {
self.min_value.map(|b| b.is_lower_bound).unwrap_or_default()
&& self.max_value.map(|b| b.is_upper_bound).unwrap_or_default()
}
/// Does the range described by min-max contain actual data?
///
/// Note that there might be values outside of this range, esp. when the
/// statistics were constructed using sampling.
pub fn min_max_guaranteed_to_contain_value(&self) -> bool {
self.min_value.map(|b| b.is_upper_bound).unwrap_or_default()
&& self.max_value.map(|b| b.is_lower_bound).unwrap_or_default()
}
} Note that the exact interface and names are TBD, but it's a rough idea. Also there might be similar interfaces in the pruning predicates and analysis passes, so maybe the |
I agree this seems a better fit than trying to extend the |
Describe the bug
It is unclear what
Statistics::is_exact
=false
means. The docs are here:https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37
These state for this case:
What does "inexact" mean? Some potential definitions (we only consider
Some(...)
fields here!):I think there is a pretty important difference between "overestimate" and "both", because the former allows you to prune execution branches or entire operations (e.g. sorts in some cases) while the latter can only be used to re-order operations (e.g. joins) or select a concrete operation from a pool (e.g. type of join).
Side note: Due to predicate pushdown it will be pretty unlikely that there will be exact statistics for any realistic data sources.
Expected behavior
Clarify behavior.
Additional context
Cross-ref #997.
The text was updated successfully, but these errors were encountered: