
Statistics::is_exact semantics #5613

Open
crepererum opened this issue Mar 15, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@crepererum
Contributor

Describe the bug
It is unclear what Statistics::is_exact = false means. The docs are here:

https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37

These state for this case:

may contain an inexact estimate and may not be the actual value

What does "inexact" mean? Some potential definitions (we only consider Some(...) fields here!):

  • underestimate: There are values within the data source that are NOT included within the statistics, i.e. the statistics do NOT cover the whole range. This could happen when you sample statistics from a larger data source.
  • overestimate: All values from the data stream are covered by the statistics, but the range might be too large. This can happen when some source doesn't fold predicates into the statistics (which in general is pretty hard to do).
  • both: The statistics are only a rough guide.
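To make these concrete, here is a minimal sketch with hypothetical numbers (not from the issue), assuming a single integer column whose actual values span 10..=90:

fn main() {
    // Actual (unknown to the optimizer) value range of the column.
    let actual = (10i64, 90i64);

    // "underestimate": statistics sampled from part of the data miss the extremes.
    let under = (20i64, 80i64);
    assert!(under.0 >= actual.0 && under.1 <= actual.1); // too narrow

    // "overestimate": a pushed-down predicate was not folded into the statistics,
    // so the stated range is at least as wide as the actual one.
    let over = (0i64, 100i64);
    assert!(over.0 <= actual.0 && over.1 >= actual.1); // too wide, but covers everything

    // "both": no guarantee in either direction; only usable as a rough guide.
    let _both = (20i64, 100i64);
}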

I think there is a pretty important difference between "overestimate" and "both", because the former allows you to prune execution branches or entire operations (e.g. sorts in some cases) while the latter can only be used to re-order operations (e.g. joins) or select a concrete operation from a pool (e.g. type of join).

Side note: Due to predicate pushdown it will be pretty unlikely that there will be exact statistics for any realistic data sources.
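As an illustration of why the "overestimate" guarantee is what makes pruning sound, here is a hedged sketch (the predicate and bounds are made up for illustration):

// Assuming the stated (min, max) range is an overestimate, i.e. guaranteed to
// cover every value in the partition, a non-matching range lets us skip it.
fn can_prune(stated_max: i64, filter_gt: i64) -> bool {
    // Predicate: col > filter_gt. If even the covering max does not exceed the
    // threshold, no row can match.
    stated_max <= filter_gt
}

fn main() {
    assert!(can_prune(100, 100));  // safe to skip this partition
    assert!(!can_prune(100, 50));  // might contain matching rows
    // With an underestimated or "rough guide" max, the first conclusion would be unsound.
}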

Expected behavior
Clarify behavior.

Additional context
Cross-ref #997.

@crepererum crepererum added the bug Something isn't working label Mar 15, 2023
@alamb
Contributor

alamb commented Mar 15, 2023

both: The statistics are only a rough guide.

I think this is the best that we can get, for the reason you cite.

Side note: Due to predicate pushdown it will be pretty unlikely that there will be exact statistics for any realistic data sources.

Specifically, I think "inexact" means "best effort" but cannot be relied on.

cc @Dandandan @isidentical @metesynnada

@crepererum
Contributor Author

@alamb I think "overestimate" would be possible to achieve in many cases and would be quite helpful.

@alamb
Contributor

alamb commented Mar 15, 2023

🤔 I see -- I can imagine the ranges being an overestimate (being at least as large as the actual range).

I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate in num_rows, wouldn't we have to assume no rows were filtered?
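A tiny worked example of that concern (hypothetical numbers, not from the thread): a scan reports 1,000 rows and a filter with unknown selectivity runs on top of it.

// Sketch: propagating a num_rows overestimate through a filter whose
// selectivity is unknown.
fn filter_rows_upper_bound(input_rows_upper_bound: usize) -> usize {
    // Without knowing the selectivity, the only bound that is still guaranteed
    // to be an overestimate is the unfiltered input count, i.e. assume the
    // filter removes nothing.
    input_rows_upper_bound
}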

@crepererum
Contributor Author

I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate in num_rows, wouldn't we have to assume no rows were filtered ?

I guess if the ranges (min/max) are overestimated / too wide, then the number of rows is likely an overestimate as well (upper bound).

Thinking about this more, it gets really confusing with min/max/row_count/n_bytes because an "overestimate" for "min" is a lower bound while for "max" it is an upper bound. #997 already suggests reworking this attribute to be field-specific. I would propose to extend the interface even further:

use crate::ScalarValue; // assuming this lives alongside datafusion/common/src/stats.rs

pub struct Boundary<T: PartialOrd> {
    pub val: T,
    pub is_lower_bound: bool,
    pub is_upper_bound: bool,
}

impl<T: PartialOrd> Boundary<T> {
    pub fn is_exact(&self) -> bool {
        self.is_lower_bound && self.is_upper_bound
    }
}

pub struct Statistics {
    /// The number of table rows
    pub num_rows: Option<Boundary<usize>>,
    /// Total bytes of the table rows
    pub total_byte_size: Option<Boundary<usize>>,
    /// Statistics on a column level
    pub column_statistics: Option<Vec<ColumnStatistics>>,
}

pub struct ColumnStatistics {
    /// Number of null values on column
    pub null_count: Option<Boundary<usize>>,
    /// Maximum value of column
    pub max_value: Option<Boundary<ScalarValue>>,
    /// Minimum value of column
    pub min_value: Option<Boundary<ScalarValue>>,
    /// Number of distinct values
    pub distinct_count: Option<Boundary<usize>>,
}

impl ColumnStatistics {
    pub fn min_max_exact(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
    }

    /// Does the range described by min-max contain ALL values?
    ///
    /// Note that the range might be too large. Some filters may not
    /// have been considered when this range was determined.
    pub fn min_max_contains_all(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
    }

    /// Does the range described by min-max contain actual data?
    ///
    /// Note that there might be values outside of this range, esp. when the
    /// statistics were constructed using sampling.
    pub fn min_max_guaranteed_to_contain_value(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
    }
}

Note that the exact interface and names are TBD; this is just a rough idea. There might also be similar interfaces in the pruning predicates and analysis passes, so maybe the Boundary struct can be reused.
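To illustrate how an optimizer might consume this, here is a hedged usage sketch on top of the proposed types (the helper name and predicate are made up for illustration, not part of the proposal):

// Hypothetical consumer: decide whether a predicate `col > threshold` can prune
// a partition. Only sound when the stated range is guaranteed to cover ALL values.
fn can_prune_gt(stats: &ColumnStatistics, threshold: &ScalarValue) -> bool {
    if !stats.min_max_contains_all() {
        // Underestimate or rough guide: pruning would be unsound.
        return false;
    }
    match &stats.max_value {
        // Even the covering max does not exceed the threshold => no row can match.
        Some(max) => max.val <= *threshold,
        None => false,
    }
}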

@alamb
Contributor

alamb commented Mar 17, 2023

so maybe the Boundary struct can be reused.

I agree this seems a better fit than trying to extend the Statistics interface. It turns out there are three different ways to do boundary analysis that I know of -- see #5535; perhaps one of them is sufficient for your use case (I am not 100% clear what that is) 🤔
