Propagate table constraints through physical plans to optimize sort operations #14111

Merged

@gokselk gokselk commented Jan 13, 2025

Which issue does this PR close?

Closes #14110.

Rationale for this change

This PR extends the physical planner to propagate table constraints (PRIMARY KEY and UNIQUE) through the query plan. This allows us to optimize sort operations by recognizing when ordering requirements are already satisfied by existing constraints.

What changes are included in this PR?

  • Added Constraints propagation through physical plans including:
    • File scan executors (CSV, Parquet, Arrow, Avro, JSON)
    • Memory table executor
    • Aggregate executor
  • Added constraint projection logic
  • Updated EquivalenceProperties to consider constraints when evaluating sort requirements
  • Added tests for constraint propagation and sort optimization
  • Updated protobuf definitions to include constraints in physical plan serialization
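To illustrate the idea behind the constraint-aware sort check, here is a minimal sketch. It is not the code from this PR: `KeyConstraint` and `ordering_satisfied` are hypothetical names, orderings are modelled as plain column indices, and sort direction and NULL handling for UNIQUE columns are ignored. The point is only that once the existing ordering covers every column of a key, no two rows tie, so any longer ordering that starts with it is already satisfied.

```rust
/// Hypothetical, simplified stand-in for a table constraint: the column
/// indices that together form a PRIMARY KEY or UNIQUE key.
#[derive(Debug, Clone, PartialEq)]
pub enum KeyConstraint {
    PrimaryKey(Vec<usize>),
    Unique(Vec<usize>),
}

impl KeyConstraint {
    /// Column indices covered by this key.
    pub fn indices(&self) -> &[usize] {
        match self {
            KeyConstraint::PrimaryKey(idx) | KeyConstraint::Unique(idx) => idx,
        }
    }
}

/// Returns true if data already sorted by `existing` also satisfies the
/// `required` ordering (both given as column indices, ascending assumed).
pub fn ordering_satisfied(
    existing: &[usize],
    required: &[usize],
    constraints: &[KeyConstraint],
) -> bool {
    // Classic case: the requirement is a prefix of the existing ordering.
    if existing.starts_with(required) {
        return true;
    }
    // Otherwise the requirement must at least extend the existing ordering.
    if !required.starts_with(existing) {
        return false;
    }
    // If the existing ordering contains every column of some key, rows are
    // unique under it, so the extra trailing sort columns never break a tie.
    constraints.iter().any(|c| {
        let key = c.indices();
        !key.is_empty() && key.iter().all(|k| existing.contains(k))
    })
}
```

With a primary key on column 0, `ordering_satisfied(&[0], &[0, 1], ...)` holds, so a sort on columns `[0, 1]` over data ordered by column 0 would be redundant in this simplified model.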

Are these changes tested?

Yes, the changes include:

  • Tests for constraint projection and validation
  • Tests for sort optimization with constraints
  • sqllogictests verifying correct plan optimization

Are there any user-facing changes?

The changes are mostly internal optimizations, but users will see:

  • Improved query plans that eliminate redundant sorts
  • Updated EXPLAIN output that shows constraints in physical plans
  • More efficient execution of queries with ORDER BY on primary key columns

@github-actions bot added the logical-expr (Logical plan and expressions) label on Jan 13, 2025
@github-actions bot added the physical-expr (Physical Expressions), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), and proto (Related to proto crate) labels on Jan 13, 2025

gokselk commented Jan 13, 2025

cc: @berkaysynnada @ozankabak

@ozankabak

I left my reviews here: synnada-ai#53 (review)

@ozankabak ozankabak left a comment

We are very close to the finish line. Let's iterate over my comments.

datafusion/physical-expr/src/equivalence/properties.rs (two outdated review threads, resolved)
match constraint {
    Constraint::PrimaryKey(indices) => {
        let new_indices =
            update_elements_with_matching_indices(indices, proj_indices);

Contributor

If you refactor the update_elements_with_matching_indices function to take two impl Iterator arguments (you will probably need to swap the loop order to do that), this function can also accept proj_indices as an impl Iterator.

Contributor Author

The update_elements_with_matching_indices function calls .position() on proj_indices, so we would have to clone proj_indices if we took it as an impl Iterator. I think that defeats the whole purpose of the refactoring.

Contributor

What I had in mind was to swap the loop order (iterate over proj_indices in the outer loop). That may enable us to use an impl Iterator for proj_indices. We will probably need to keep entries as a slice because it does not have an ordering (though we can enforce that in a future PR). Had entries been ordered, I think we could have also taken it in as an impl Iterator, but let's leave the latter for a future PR.

Contributor

We looked into this with @berkaysynnada and it seems to have some intricacies. Let's leave this to another PR.
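For readers following the thread, the two loop orders can be sketched as below. This is an illustration under assumptions, not the code in the PR: the function names `remap_entries_outer` and `remap_proj_outer` are hypothetical, and the first function only approximates the shape described above (an outer loop over entries with `.position()` on `proj_indices`).

```rust
/// Entries in the outer loop: for each entry, find its position in
/// `proj_indices`. The inner `.position()` call rescans `proj_indices`
/// for every entry, which is why a re-iterable slice is needed here.
pub fn remap_entries_outer(entries: &[usize], proj_indices: &[usize]) -> Vec<usize> {
    entries
        .iter()
        .filter_map(|e| proj_indices.iter().position(|p| p == e))
        .collect()
}

/// Swapped loop order: `proj_indices` is consumed exactly once in the
/// outer loop, so a plain iterator suffices; `entries` stays a slice
/// because it is searched repeatedly.
pub fn remap_proj_outer(
    entries: &[usize],
    proj_indices: impl Iterator<Item = usize>,
) -> Vec<usize> {
    proj_indices
        .enumerate()
        .filter(|(_, p)| entries.contains(p))
        .map(|(pos, _)| pos)
        .collect()
}
```

Note that the two sketches are not equivalent: the first returns positions in the order of `entries` and only the first match per entry, while the second returns them in projection order and reports every match. Whether this is among the intricacies mentioned above is a guess.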

    .map(|col| col.index())
    .collect::<Vec<_>>();
debug_assert_eq!(mapping.map.len(), indices.len());
self.constraints.project(&indices)

Contributor

You will not need to collect and materialize indices if you refactor project to accept an iterator. See my comment in functional_dependencies.rs.
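A rough sketch of what an iterator-accepting projection could look like for a single key is shown below. `project_key` is a hypothetical helper, not the `project` method from the PR, and the rule that a key survives only when all of its columns are projected is an assumption made for this example.

```rust
/// Projects one key (a list of column indices) through a projection given
/// as an iterator of source column indices. Returns the key expressed in
/// output positions, or `None` if any key column is not projected.
pub fn project_key(
    key: &[usize],
    proj_indices: impl IntoIterator<Item = usize>,
) -> Option<Vec<usize>> {
    // One slot per key column, filled in as the projection is walked once.
    let mut projected: Vec<Option<usize>> = vec![None; key.len()];
    for (pos, col) in proj_indices.into_iter().enumerate() {
        if let Some(slot) = key.iter().position(|k| *k == col) {
            // Keep the first output position if a column is projected twice.
            if projected[slot].is_none() {
                projected[slot] = Some(pos);
            }
        }
    }
    // Collapses to `None` as soon as one key column has no output position.
    projected.into_iter().collect()
}
```

A caller could then pass something like `columns.iter().map(|col| col.index())` directly, without first collecting into a `Vec`.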

datafusion/physical-plan/src/memory.rs (outdated review thread, resolved)
@gokselk force-pushed the feature/physical-planner-functional-dependence branch from a6c83a7 to 726737b on January 15, 2025 14:18

@ozankabak ozankabak left a comment

I went through this carefully (twice) and it LGTM

@berkaysynnada

I'll merge this PR once the main branch is all green

@berkaysynnada berkaysynnada merged commit 3cd31af into apache:main Jan 16, 2025
25 checks passed
@berkaysynnada

Great efforts! Thank you @gokselk

Labels
common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Leverage common prefix ordering constraints in physical planner
3 participants