Fix: JOIN should require ON condition #1552

Closed
wants to merge 9 commits into from

Conversation

demetribu
Contributor

Closes: #1550

@demetribu changed the title from "on condition requirement for join" to "Fix: JOIN should require ON condition" on Nov 25, 2024
Contributor

@iffyio iffyio left a comment

LGTM! cc @alamb

@mvzink
Contributor

mvzink commented Nov 27, 2024

MySQL actually does allow joins with no ON/USING, so this could break some (admittedly weird) workflows.

@demetribu
Contributor Author

demetribu commented Nov 27, 2024

What I found

In MySQL, JOIN, CROSS JOIN, and INNER JOIN are syntactic equivalents (they can replace each other). In standard SQL, they are not equivalent. INNER JOIN is used with an ON clause, CROSS JOIN is used otherwise.
https://dev.mysql.com/doc/refman/8.4/en/join.html

In Snowflake, if you use INNER JOIN without the ON clause (or if you use a comma without a WHERE clause), the result is the same as using CROSS JOIN.
https://docs.snowflake.com/en/sql-reference/constructs/join

In Oracle, if two tables in a join query have no join condition, then Oracle Database returns their Cartesian product.
https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Joins.html#GUID-568EC26F-199A-4339-BFD9-C4A0B9588937
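To make the semantics described in these docs concrete, here is a minimal sketch of what "JOIN without a condition behaves like CROSS JOIN" means. The `join` function and its signature are hypothetical and exist only for illustration; they are not taken from any of the engines or crates discussed here.

```rust
/// Joins two row sets. `on` is the optional join condition; `None` models
/// `l JOIN r` written without ON/USING. (Hypothetical helper.)
fn join<'a, L, R>(
    left: &'a [L],
    right: &'a [R],
    on: Option<&dyn Fn(&L, &R) -> bool>,
) -> Vec<(&'a L, &'a R)> {
    let mut out = Vec::new();
    for l in left {
        for r in right {
            // With no condition every pair is kept: the Cartesian product.
            if on.map_or(true, |pred| pred(l, r)) {
                out.push((l, r));
            }
        }
    }
    out
}
```

Calling `join(&l, &r, None)` on inputs of 2 and 3 rows yields all 6 pairs, while passing an equality predicate keeps only the matching rows.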

cc @findepi

@findepi
Member

findepi commented Nov 27, 2024

So maybe we should fix this in DataFusion (DF) only?

@demetribu
Contributor Author

As far as I know, in DF we don't parse SQL directly; we only work with the parsed statements.

@Dandandan
Contributor

It could be implemented per dialect, erroring only for the dialects that don't support joins without ON (e.g. PostgreSQL, ANSI, ...)?

@Dandandan
Contributor

So we could:

@iffyio
Contributor

iffyio commented Nov 28, 2024

Ah yeah using a dialect method for this makes sense in that case
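As a rough illustration of the dialect-method idea, the sketch below uses a trait hook with a permissive default that stricter dialects override. Everything here (`Dialect`, `allows_join_without_constraint`, `MySqlLike`, `PostgresLike`, `JoinKind`, `check_join`) is a hypothetical, simplified stand-in and not the sqlparser-rs API or the code in this PR.

```rust
/// Hypothetical, simplified stand-in for a parsed join kind.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinKind {
    Inner,
    Cross,
}

/// Hypothetical dialect hook: may `INNER JOIN` omit ON/USING?
trait Dialect {
    /// Permissive default, so dialects that do not override it keep
    /// accepting `l JOIN r` without a condition.
    fn allows_join_without_constraint(&self) -> bool {
        true
    }
}

struct MySqlLike;
struct PostgresLike;

impl Dialect for MySqlLike {}

impl Dialect for PostgresLike {
    // Assumption from the thread: PostgreSQL/ANSI-style dialects reject
    // an INNER JOIN that has no join condition.
    fn allows_join_without_constraint(&self) -> bool {
        false
    }
}

/// Accepts or rejects a join depending on the dialect hook.
fn check_join(dialect: &dyn Dialect, kind: JoinKind, has_constraint: bool) -> Result<(), String> {
    match kind {
        JoinKind::Cross => Ok(()),
        JoinKind::Inner if has_constraint || dialect.allows_join_without_constraint() => Ok(()),
        JoinKind::Inner => {
            Err("INNER JOIN requires an ON or USING clause in this dialect".to_string())
        }
    }
}
```

With this shape, `check_join(&MySqlLike, JoinKind::Inner, false)` succeeds while the same call with `&PostgresLike` returns an error, so the strictness lives in one overridable method per dialect.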

@demetribu demetribu requested a review from iffyio November 29, 2024 14:53
Comment on lines +691 to +699
/// Verifies whether the provided `JoinOperator` is supported by this SQL dialect.
/// Returns `true` if the `JoinOperator` is supported, otherwise `false`.
fn verify_join_operator(&self, _join_operator: &JoinOperator) -> bool {
    true
}

/// Verifies if the given `JoinOperator`'s constraint is valid for this SQL dialect.
/// Returns `true` if the join constraint is valid, otherwise `false`.
fn verify_join_constraint(&self, join_operator: &JoinOperator) -> bool {
Contributor

Ah, so since the behavior seems to vary a bit across dialects, I'm thinking it could make sense after all to let the parser continue to be permissive in syntax, and have downstream crates perform the additional checks in the cases where specific combinations need to be enforced?

Contributor Author

@demetribu demetribu Nov 30, 2024

I understand your idea, and it makes sense. But the "issue" is based on the idea that the parser should restrict cases like “INNER JOIN” without an ON condition.

For me, as a developer, it would feel strange if I pick a parser, for example, the PostgreSQL dialect, and it allows “ANTI JOIN” without any error. This would mean I have to check all the statements again to find mistakes. It seems like a trade-off, and we need to decide which approach works best.

Or are you suggesting moving this specific implementation directly to GenericDialect?

Contributor

Yeah it is indeed a tradeoff, the expectation currently is that downstream crates further validate the output of the parser against any dialect specific requirements/invariants, see note here for example
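A downstream check of this kind could look roughly like the sketch below: the parser stays permissive, and the consuming crate scans the parsed joins for combinations it wants to reject. The types are a hypothetical, simplified AST that only loosely resembles a parser's join nodes; they are not sqlparser's actual definitions, and `unconstrained_inner_joins` is an illustrative name.

```rust
/// Hypothetical, simplified AST for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum JoinConstraint {
    On(String),
    Using(Vec<String>),
    None,
}

#[derive(Debug, Clone, PartialEq)]
enum JoinOperator {
    Inner(JoinConstraint),
    Cross,
}

#[derive(Debug, Clone, PartialEq)]
struct Join {
    relation: String,
    operator: JoinOperator,
}

/// Returns the names of relations whose INNER JOIN has no ON/USING
/// constraint. A strict downstream crate could turn a non-empty result
/// into an error; a permissive one could ignore it.
fn unconstrained_inner_joins(joins: &[Join]) -> Vec<&str> {
    joins
        .iter()
        .filter(|j| matches!(j.operator, JoinOperator::Inner(JoinConstraint::None)))
        .map(|j| j.relation.as_str())
        .collect()
}
```

Keeping the check as a separate pass means each consumer decides for itself whether an unconstrained INNER JOIN is an error, without changing what the parser accepts.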

Contributor Author

@alamb, could you determine whether the issue outlined in #13486 is valid?

Contributor

In my opinion:

  1. if there are any dialects that support <left> JOIN <right> without an ON clause then so should sqlparser
  2. If there are no dialects that support such syntax, then erroring is a good idea

I have not done the research to know if there are any dialects that support such syntax

Contributor Author

Currently, DataFusion processes SELECT ... FROM l JOIN r as a CROSS JOIN for all dialects, including DuckDB, where an INNER JOIN explicitly requires an ON clause.
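The behavior described here can be pictured with the following sketch of a permissive lowering step. It is not DataFusion's planner code; `ParsedJoin`, `PlannedJoin`, and `lower_join` are hypothetical names used only to show how a join without a condition ends up planned the same way as an explicit CROSS JOIN.

```rust
/// Hypothetical parsed form: an inner join may or may not carry a condition.
#[derive(Debug, PartialEq)]
enum ParsedJoin {
    Inner { on: Option<String> },
    Cross,
}

/// Hypothetical planned form: an inner join always has a condition.
#[derive(Debug, PartialEq)]
enum PlannedJoin {
    Inner { on: String },
    Cross,
}

/// Permissive lowering, independent of the SQL dialect in use.
fn lower_join(parsed: ParsedJoin) -> PlannedJoin {
    match parsed {
        ParsedJoin::Inner { on: Some(cond) } => PlannedJoin::Inner { on: cond },
        // No condition: planned exactly like an explicit CROSS JOIN.
        ParsedJoin::Inner { on: None } | ParsedJoin::Cross => PlannedJoin::Cross,
    }
}
```

A dialect-aware variant would need extra input (for example the active dialect) to reject the `on: None` case instead of silently lowering it.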

@Dandandan proposed addressing this issue in sqlparser-rs, which seems reasonable to me. However, after discussing it with @iffyio, I see

indeed a tradeoff, the expectation currently is that downstream crates further validate the output of the parser

My questions are:

  1. Should this behavior be considered a bug?
  2. If yes, at which level should it be addressed: datafusion or sqlparser-rs?

Contributor

  • Should this behavior be considered a bug?

I don't have a strong opinion -- is it causing anyone problems? It seems the largest implication of treating JOIN ... without ON as a CROSS JOIN is that DataFusion now has, in effect, its own dialect.

If it isn't causing problems, my suggestion is to do nothing until someone has a concrete use case.

@alamb
Contributor

alamb commented Dec 4, 2024

Converting to draft to show we aren't necessarily planning to merge this.

@demetribu demetribu closed this Dec 4, 2024
@demetribu demetribu deleted the fix-join-on-required branch December 4, 2024 18:10
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JOIN should require ON condition
6 participants