
Statistics::is_exact semantics #5613

Open
crepererum opened this issue Mar 15, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@crepererum
Contributor

Describe the bug
It is unclear what Statistics::is_exact = false means. The docs are here:

https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37

These state for this case:

may contain an inexact estimate and may not be the actual value

What does "inexact" mean? Some potential definitions (we only consider Some(...) fields here!):

  • underestimate: There are values within the data source that are NOT included within the statistics, i.e. the statistics do NOT cover the whole range. This could happen when you sample statistics from a larger data source.
  • overestimate: All values from the data stream are covered by the statistics, but the range might be too large. This can happen when some source doesn't fold predicates into the statistics (which in general is pretty hard to do).
  • both: The statistics are only a rough guide.
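To make these concrete, here is a minimal sketch with hypothetical numbers (not from the issue), assuming a single integer column whose actual values span 10..=90:

fn main() {
    // Actual (unknown to the optimizer) value range of the column.
    let actual = (10i64, 90i64);

    // "underestimate": statistics sampled from part of the data miss the extremes.
    let under = (20i64, 80i64);
    assert!(under.0 >= actual.0 && under.1 <= actual.1); // too narrow

    // "overestimate": a pushed-down predicate was not folded into the statistics,
    // so the stated range is at least as wide as the actual one.
    let over = (0i64, 100i64);
    assert!(over.0 <= actual.0 && over.1 >= actual.1); // too wide, but covers everything

    // "both": no guarantee in either direction; only usable as a rough guide.
    let _both = (20i64, 100i64);
}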

I think there is a pretty important difference between "overestimate" and "both", because the former allows you to prune execution branches or entire operations (e.g. sorts in some cases) while the latter can only be used to re-order operations (e.g. joins) or select a concrete operation from a pool (e.g. type of join).

Side note: Due to predicate pushdown it will be pretty unlikely that there will be exact statistics for any realistic data sources.
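As an illustration of why the "overestimate" guarantee is what makes pruning sound, here is a hedged sketch (the predicate and bounds are made up for illustration):

// Assuming the stated (min, max) range is an overestimate, i.e. guaranteed to
// cover every value in the partition, a non-matching range lets us skip it.
fn can_prune(stated_max: i64, filter_gt: i64) -> bool {
    // Predicate: col > filter_gt. If even the covering max does not exceed the
    // threshold, no row can match.
    stated_max <= filter_gt
}

fn main() {
    assert!(can_prune(100, 100));  // safe to skip this partition
    assert!(!can_prune(100, 50));  // might contain matching rows
    // With an underestimated or "rough guide" max, the first conclusion would be unsound.
}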

Expected behavior
Clarify behavior.

Additional context
Cross-ref #997.

@crepererum crepererum added the bug Something isn't working label Mar 15, 2023
@alamb
Contributor

alamb commented Mar 15, 2023

both: The statistics are only a rough guide.

I think this is the best that we can get, for the reason you cite.

Side note: Due to predicate pushdown it will be pretty unlikely that there will be exact statistics for any realistic data sources.

Specifically, I think "inexact" means "best effort" but cannot be relied on.

cc @Dandandan @isidentical @metesynnada

@crepererum
Contributor Author

@alamb I think "overestimate" would be possible to achieve in many cases and would be quite helpful.

@alamb
Contributor

alamb commented Mar 15, 2023

🤔 I see -- I can imagine the ranges being an overestimate (being at least as large as the actual range).

I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate in num_rows, wouldn't we have to assume no rows were filtered?
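A tiny worked example of that concern (hypothetical numbers, not from the thread): a scan reports 1,000 rows and a filter with unknown selectivity runs on top of it.

// Sketch: propagating a num_rows overestimate through a filter whose
// selectivity is unknown.
fn filter_rows_upper_bound(input_rows_upper_bound: usize) -> usize {
    // Without knowing the selectivity, the only bound that is still guaranteed
    // to be an overestimate is the unfiltered input count, i.e. assume the
    // filter removes nothing.
    input_rows_upper_bound
}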

@crepererum
Contributor Author

I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate in num_rows, wouldn't we have to assume no rows were filtered ?

I guess if the ranges (min/max) are overestimated / too wide, then the number of rows is likely an overestimate as well (upper bound).

Thinking about this more, it gets really confusing with min/max/row_count/n_bytes because an "overestimate" for "min" is a lower bound while for "max" it is an upper bound. #997 already suggests reworking this attribute to be field-specific. I would propose to extend the interface even further:

use crate::ScalarValue; // assuming this lives alongside datafusion/common/src/stats.rs

pub struct Boundary<T: PartialOrd> {
    pub val: T,
    pub is_lower_bound: bool,
    pub is_upper_bound: bool,
}

impl<T: PartialOrd> Boundary<T> {
    pub fn is_exact(&self) -> bool {
        self.is_lower_bound && self.is_upper_bound
    }
}

pub struct Statistics {
    /// The number of table rows
    pub num_rows: Option<Boundary<usize>>,
    /// Total bytes of the table rows
    pub total_byte_size: Option<Boundary<usize>>,
    /// Statistics on a column level
    pub column_statistics: Option<Vec<ColumnStatistics>>,
}

pub struct ColumnStatistics {
    /// Number of null values on column
    pub null_count: Option<Boundary<usize>>,
    /// Maximum value of column
    pub max_value: Option<Boundary<ScalarValue>>,
    /// Minimum value of column
    pub min_value: Option<Boundary<ScalarValue>>,
    /// Number of distinct values
    pub distinct_count: Option<Boundary<usize>>,
}

impl ColumnStatistics {
    pub fn min_max_exact(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
    }

    /// Does the range described by min-max contain ALL values?
    ///
    /// Note that the range might be too large. Some filters may not
    /// have been considered when this range was determined.
    pub fn min_max_contains_all(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
    }

    /// Does the range described by min-max contain actual data?
    ///
    /// Note that there might be values outside of this range, esp. when the
    /// statistics were constructed using sampling.
    pub fn min_max_guaranteed_to_contain_value(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
    }
}

Note that the exact interface and names are TBD; this is just a rough idea. There might also be similar interfaces in the pruning predicates and analysis passes, so maybe the Boundary struct can be reused.
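To illustrate how an optimizer might consume this, here is a hedged usage sketch on top of the proposed types (the helper name and predicate are made up for illustration, not part of the proposal):

// Hypothetical consumer: decide whether a predicate `col > threshold` can prune
// a partition. Only sound when the stated range is guaranteed to cover ALL values.
fn can_prune_gt(stats: &ColumnStatistics, threshold: &ScalarValue) -> bool {
    if !stats.min_max_contains_all() {
        // Underestimate or rough guide: pruning would be unsound.
        return false;
    }
    match &stats.max_value {
        // Even the covering max does not exceed the threshold => no row can match.
        Some(max) => max.val <= *threshold,
        None => false,
    }
}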

@alamb
Contributor

alamb commented Mar 17, 2023

so maybe the Boundary struct can be reused.

I agree this seems a better fit than trying to extend the Statistics interface. It turns out there are three different ways to do boundary analysis that I know of -- see #5535; perhaps one of them is sufficient for your use case (I am not 100% clear what that is) 🤔
