feat: fix OuterReferenceColumns not being rewritten correctly (take 2) #39
🗣 Description
This is attempt number 2 at fixing the bug; the original PR was #38. The context from that PR is included below.
The missing piece was a pass that collects subquery table scans from expressions. The original logic collected them, and I broke that when I only considered the table scans from the logical plan nodes.
Original Description
An OuterReferenceColumn is a column in a subquery that references a column from the parent query. A subquery that contains one is a correlated subquery.
Consider the following query:
In this query, e.department in the subquery refers to the department from the outer query's employees table. For each employee in the outer query, the subquery calculates the average salary for their department. This is a correlated subquery because the inner query references the outer table e. Logically, the subquery runs once for each row in the outer query; in practice this almost never happens because it would be so inefficient.
When federating a LogicalPlan to a remote database, we rewrite every table reference from the name the table is registered under in DataFusion to the name it is registered under in the remote database.
This PR fixes a bug in how that rewrite happened. Previously we would do a depth-first search of the LogicalPlan to find all of the TableScans, building up a map of rewrites from the DataFusion table name to the remote table name as we went. This usually works well, because a higher-level plan or expression can't reference a table that hasn't been scanned - except in the case of correlated subqueries. An expression can contain a correlated subquery that references a table whose TableScan we haven't reached yet. When this happened, we incorrectly skipped the rewrite, which produced an incorrect final query.
The solution is to first do a pass to find all of the TableScans to build up the map, and then do the rewrites once we have collected all of them.
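The collect-then-rewrite idea can be sketched on a toy plan tree. This is only an illustration under stated assumptions: the Plan and Expr enums, the remote field on Scan, and the functions collect_scans, rewrite_plan, rewrite_expr, rename and federate are all hypothetical stand-ins, not DataFusion's LogicalPlan/Expr types and not the code in this PR.

```rust
use std::collections::HashMap;

/// Toy expression type (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Column of a table scanned in the current query.
    Column { table: String, name: String },
    /// Column of a table scanned in an enclosing (outer) query.
    OuterRef { table: String, name: String },
    /// Subquery embedded in an expression.
    Subquery(Box<Plan>),
}

/// Toy plan type (hypothetical, not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Scan of a table known locally as `table` and remotely as `remote`.
    Scan { table: String, remote: String },
    Filter { predicate: Expr, input: Box<Plan> },
}

/// Pass 1: record local -> remote names for every scan, including scans
/// that only appear inside subquery expressions.
fn collect_scans(plan: &Plan, map: &mut HashMap<String, String>) {
    match plan {
        Plan::Scan { table, remote } => {
            map.insert(table.clone(), remote.clone());
        }
        Plan::Filter { predicate, input } => {
            if let Expr::Subquery(sub) = predicate {
                collect_scans(sub, map);
            }
            collect_scans(input, map);
        }
    }
}

/// Look up a table name in the map; unknown names are left unchanged.
fn rename(table: &str, map: &HashMap<String, String>) -> String {
    map.get(table).cloned().unwrap_or_else(|| table.to_string())
}

/// Pass 2 (expressions): rewrite table references using the completed map.
fn rewrite_expr(expr: &Expr, map: &HashMap<String, String>) -> Expr {
    match expr {
        Expr::Column { table, name } => Expr::Column {
            table: rename(table, map),
            name: name.clone(),
        },
        Expr::OuterRef { table, name } => Expr::OuterRef {
            table: rename(table, map),
            name: name.clone(),
        },
        Expr::Subquery(sub) => Expr::Subquery(Box::new(rewrite_plan(sub, map))),
    }
}

/// Pass 2 (plan nodes): rewrite table references using the completed map.
fn rewrite_plan(plan: &Plan, map: &HashMap<String, String>) -> Plan {
    match plan {
        Plan::Scan { table, remote } => Plan::Scan {
            table: rename(table, map),
            remote: remote.clone(),
        },
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate: rewrite_expr(predicate, map),
            input: Box::new(rewrite_plan(input, map)),
        },
    }
}

/// Run both passes: collect every scan first, then rewrite.
fn federate(plan: &Plan) -> Plan {
    let mut map = HashMap::new();
    collect_scans(plan, &mut map);
    rewrite_plan(plan, &map)
}
```

In this sketch, a Filter whose predicate is a subquery containing an OuterRef to table e, and whose input is the Scan of e, is the problematic shape: a single traversal that visits the predicate before the input reaches the outer reference before the map knows about e. Because federate finishes collect_scans before calling rewrite_plan, the outer reference is renamed regardless of visiting order.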
🔨 Related Issues
🤔 Concerns