Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add hint for missing fields #14521

Merged
merged 10 commits into from
Feb 10, 2025
Merged

feat: add hint for missing fields #14521

merged 10 commits into from
Feb 10, 2025

Conversation

Lordworms
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Feb 6, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Lordworms -- this looks really nice ❤️

I had one question, but otherwise this looks ready to me

cc @adriangb

statement ok
create table a(timestamp int, birthday int);

query error DataFusion error: Schema error: No field named timetamp\. Did you mean 'a\.timestamp'\?\.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@@ -90,16 +90,16 @@ drop table case_insensitive_test
statement ok
CREATE TABLE test("Column1" string) AS VALUES ('content1');

statement error DataFusion error: Schema error: No field named column1. Valid fields are test\."Column1"\.
statement error DataFusion error: Schema error: No field named column1\. Valid fields are test\."Column1"\.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should't these errors also contain the nice new message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we have to use full name to do match and the threshold is 0.5 now but in this case it is 'test.Column1' and column1 so the score is happen to be 0.5

@adriangb
Copy link
Contributor

adriangb commented Feb 6, 2025

Amazing work!

It seems like these will just be part of the existing error message? Wouldn't it make sense to integrate with the new APIs in #13664 while we're at it?

I'd also suggest a test showing that a threshold of 0.5 is reasonable. Maybe a table with multiple columns that are increasingly further apart by eyeball (timesamp, timeamp, timp, ts, tokens, amp, foo?) and see which ones get returned and which don't.

query error DataFusion error: Schema error: No field named timetamp\. Did you mean 'a\.timestamp'\?\.
select timetamp from a;

query error DataFusion error: Schema error: No field named dadsada\. Valid fields are a\.timestamp, a\.birthday\.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still old message 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently we would only do 'Did you mean' with a match over 0.5

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right so in this case there were no matches over 0.5 and thus all of the fields were shown

@alamb
Copy link
Contributor

alamb commented Feb 6, 2025

It seems like these will just be part of the existing error message? Wouldn't it make sense to integrate with the new APIs in #13664 while we're at it?

FYI @eliaperantoni for your comments / suggestions

@Lordworms
Copy link
Contributor Author

threshold of 0.5 is reasonable. Maybe a table with multiple columns that are increasingly further apart by eyeball (timesamp, timeamp, timp, ts, tokens, amp, foo?) and see which ones get returned and which don't.

Amazing work!

It seems like these will just be part of the existing error message? Wouldn't it make sense to integrate with the new APIs in #13664 while we're at it?

I can refactor that

I'd also suggest a test showing that a threshold of 0.5 is reasonable. Maybe a table with multiple columns that are increasingly further apart by eyeball (timesamp, timeamp, timp, ts, tokens, amp, foo?) and see which ones get returned and which don't.

@github-actions github-actions bot added the sql SQL Planner label Feb 7, 2025
Copy link
Contributor

@eliaperantoni eliaperantoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing! I love that Diagnostic is gaining so much traction ❤️. I have one comment, but otherwise looks perfect to me.

datafusion/common/src/column.rs Outdated Show resolved Hide resolved
datafusion/common/src/error.rs Outdated Show resolved Hide resolved
Comment on lines -204 to +205
assert_eq!(
diag.notes[0].message,
"possible reference to 'first_name' in table 'a'"
);
assert_eq!(
diag.notes[1].message,
"possible reference to 'first_name' in table 'b'"
);
assert_eq!(diag.notes[0].message, "possible column a.first_name");
assert_eq!(diag.notes[1].message, "possible column b.first_name");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happened if a very similar column (e.g. first_neme) was in scope. I guess they would see "possible column a.first_neme" even though there wasn't really ambiguity there because the planner didn't match that column.

I don't think this is desirable. What do you think @comphead @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling ambiguity might be okay, the final choice is on the user but what is definitely needs to avoid ig leventshein gets false positives and get 0.5 match. in this case the closest column name will be proposed(which can be totally different from what user typed) but others also hidden. If that happen user will have to do more action like EXPLAIN to find the schema. Not sure how to test such scenario though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok if you say non-exact matches should be shown too, I'm on board.

Comment on lines 304 to 320
.map_err(|e| match &e {
DataFusionError::SchemaError(
SchemaError::FieldNotFound {
field,
valid_fields,
},
_,
) => {
let mut diagnostic = Diagnostic::new_error(
format!("column '{}' not found", &field.name()),
field.spans().first(),
);
add_possible_columns_to_diag(&mut diagnostic, field, valid_fields);
e.with_diagnostic(diagnostic)
}
_ => e,
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ❤️

@alamb
Copy link
Contributor

alamb commented Feb 9, 2025

I merged up from main to resolve a conflict with this branch

@alamb alamb merged commit 17396a5 into apache:main Feb 10, 2025
24 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

Thanks everyone! This is epic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add hint to errors for missing fields based on Levenshtein distance
5 participants