Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Never fallback to cartesian product for join estimation when we know the min/max values for columns #3813

Open
isidentical opened this issue Oct 12, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@isidentical
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
distinct_count is usually expensive to compute, so some platforms which save parquet files abstain from injecting it at the metadata section. We should be able to estimate the join cardinality without it before falling back to cartesian product.

Describe the solution you'd like
Since we already require min/max values to be present, we should be able to just do min(num_left_rows - num_nulls or 0, scalar_range(left_stats.min, left_stats.max)) to determine an alternative distinct count.

Describe alternatives you've considered
None.

Additional context
Original discussion was here #3787 (comment)

@Dandandan
Copy link
Contributor

Dandandan commented Oct 12, 2022

There is also this presentation about optimizing the order of joins without statistics available (which also seems to do fine for DuckDB). We could also see if we can reuse some of these ideas:

https://www.youtube.com/watch?v=aNRoR0Z3SzU

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants