Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Literal parsing bug #24784

Open
mosabua opened this issue Jan 23, 2025 · 6 comments
Open

Literal parsing bug #24784

mosabua opened this issue Jan 23, 2025 · 6 comments
Assignees

Comments

@mosabua
Copy link
Member

mosabua commented Jan 23, 2025

When testing for #24774 I found what I think is a bug.

Normally a underscore at the comma fails as expected .. but maybe with a wrong error message.

select 123456_.789;
Query 20250123_164910_00014_hkn3p failed: line 1:8: identifiers must not start with a digit; surround the identifier with double quotes
select 123456_.789

However if the underscore is after the comma .. it uses it as a name for the column

select 123_456._789;
  _789
--------
 123456
(1 row)

Query 20250123_164146_00012_hkn3p, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.13 [0 rows, 0B] [0 rows/s, 0B/s]

Similar error with double underscore:

select 123__456.789;
Query 20250123_165250_00021_hkn3p failed: line 1:8: identifiers must not start with a digit; surround the identifier with double quotes
select 123__456.789

I believe these should all fail .. but with a better error message.

@brunokim
Copy link
Contributor

brunokim commented Jan 26, 2025

The 1st and 3rd error messages are thrown here in SqlParser.java in the rule DIGIT_IDENTIFIER, that exists solely to catch this error.

I believe there are two ways to catch this error and give it a meaningful message:

  1. create a similar INVALID_NUMBER rule that catches malformed numbers, and throw an error;
  2. be more lenient on the number grammar allowing any number of digits and underscores, and validate them properly in Java.

I'll go with option #1 because it leaves the grammar tidier, but I can do #2 if you prefer.

Either way, a query like SELECT 1_ now shows a DIGIT_IDENTIFIER error and would then show an INVALID_NUMBER error, because number has larger precedence over identifier in primaryExpression. Is that ok? A query like SELECT 42 AS 1_ will still fail with a DIGIT_IDENTIFIER error, since it's in a position expecting an identifier.

I still didn't investigate the cause for the 2nd bug that @mosabua found, but I believe it won't happen after we introduce the INVALID_NUMBER rule.

@brunokim
Copy link
Contributor

BTW, is it intentional that doubles cannot contain underscores?

DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;

@kasiafi
Copy link
Member

kasiafi commented Jan 27, 2025

select 123_456._789;
However if the underscore is after the comma .. it uses it as a name for the column

@mosabua this is actually correct:

  • 123_456. is a valid decimal
  • _789 is a valid identifier for a column alias.

It is confusing because these two are not separated by a space.

@brunokim the change you proposed doesn't affect this case.

@kasiafi
Copy link
Member

kasiafi commented Jan 27, 2025

BTW, is it intentional that doubles cannot contain underscores?

@brunokim initially Trino did not support underscores at all. At some point we added support for underscore for exact numeric types, but we never did it for double. According to the spec 5.3 <literal>, underscore should be supported.

@martint was there anything that stopped us from supporting underscore in doubles?

@mosabua
Copy link
Member Author

mosabua commented Jan 27, 2025

Could we change this to force the requirement for a space .. the way it works now is super counterintuitive .. or maybe even just not allow it and throw an error.

@kasiafi
Copy link
Member

kasiafi commented Jan 30, 2025

Could we change this to force the requirement for a space .. the way it works now is super counterintuitive .. or maybe even just not allow it and throw an error.

@mosabua I don't think we can enforce a whitespace in a particular context. We would end up multiplying lexer rules. I agree this is not intuitive but this is a result of having numbers ending in a '.' and identifiers starting with a '_'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants