Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: fix OuterReferenceColumns not being rewritten correctly (take 2) #39

Merged
merged 4 commits into from
Jan 13, 2025

Conversation

phillipleblanc
Copy link

🗣 Description

Attempt number 2 at fixing the bug, original PR was #38. Attached is the context from that PR.

The missing piece was that I needed to add a pass to collect any subquery table scans from expressions, since they were being collecting during the original logic and I broke it when I only considered the table scans from the logical plan nodes.

Original Description

An OuterReferenceColumn (aka correlated subquery) is a column in a subquery that references a column from the parent query.

Consider the following query:

SELECT employee_name,
       (SELECT AVG(salary)
        FROM salaries s2
        WHERE s2.department = e.department) as dept_avg_salary
FROM employees e;

In this query, e.department in the subquery refers to the department from the outer query's employees table. For each employee in the outer query, the subquery calculates the average salary for their department.
This is a correlated subquery because the inner query references the outer table e. The subquery runs once for each row in the outer query (logically, in practice this almost never happens since it would be so inefficient).

When federating a LogicalPlan to a remote database, we rewrite all table references from referencing their original name that they are registered in DataFusion as, into their remote name as they are registered in their remote database.

This PR fixes a bug in how that rewrite happened. Previously we would DFS search the LogicalPlan to find all of the TableScans, which allows us to build up a map of rewrites from the DataFusion table name to the remote table name. This usually works well, because a higher-level plan or expression can't reference a table that hasn't been scanned - except in the case of correlated subqueries. It is possible that we can come across an expression that has a correlated subquery which references a table that we haven't come across the TableScan for yet. When this happened, we were incorrectly skipping the rewrite - leading to an incorrect final query.

The solution is to first do a pass to find all of the TableScans to build up the map, and then do the rewrites once we have collected all of them.

🔨 Related Issues

🤔 Concerns

@phillipleblanc phillipleblanc added the bug Something isn't working label Jan 13, 2025
@phillipleblanc phillipleblanc requested a review from a team January 13, 2025 07:32
@phillipleblanc phillipleblanc self-assigned this Jan 13, 2025
@phillipleblanc phillipleblanc merged commit d869741 into spiceai-43 Jan 13, 2025
17 checks passed
@phillipleblanc phillipleblanc deleted the phillip/250113-fix-federation-rewrite-bug branch January 13, 2025 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants