Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

formatting the AST while preserving the source location information from the original query #1634

Open
lovasoa opened this issue Jan 2, 2025 · 8 comments

Comments

@lovasoa
Copy link
Contributor

lovasoa commented Jan 2, 2025

In the current sqlparser, parsing

select
  a,
  b
from
  t

and then fromatting it back to a string results in

select a,b from t

Since sqlparser v0.53, we have the source location information inside the AST, which theoretically makes it possible to reconstruct the original query formatting (or at least getting very close to it).

I know this is a huge undertaking, but it would be really useful. For instance, in sqlpage, I would love being able to respect the initial user formatting of queries, in order for database errors that contain source position information to be meaningful and not confusing.

I'm opening this issue to start discussing this with the community (and especially with you @iffyio and @alamb 😉 ).

Am I the only one interested in this ? What would be the best approach to implementing this ?

lovasoa added a commit to lovasoa/sqlparser-rs that referenced this issue Jan 2, 2025
…implementation)

this implements (a tiny portion of) apache#1634

pros: really useful when passing formatted queries to a real database, in order for database error message locations to match the original user's source locations

cons: if we want to do it well, we need to track source locations better, and this adds a complexity to the Display imlementations
@freshtonic
Copy link

@lovasoa 👋 I'm interested in this

@alamb
Copy link
Contributor

alamb commented Jan 4, 2025

I don't understand the usecase for preserving the original formatting after Display the AST.

Specifically, to preserve the original SQL for error messages you could also keep the actual original SQL string somewhere and on error construct the error message using the orignal SQL (and the locations in that text that are now in Spans)

I think you could take that approach today without any additional work.

As you hint in this description, trying to preserve the original formatting would likely be a very large undertaking (and thus I suspect practically infeasible given the amount of engineering we have available)

@lovasoa
Copy link
Contributor Author

lovasoa commented Jan 4, 2025

Specifically, to preserve the original SQL for error messages you could also keep the actual original SQL string somewhere and on error construct the error message using the orignal SQL (and the locations in that text that are now in Spans)

I think there is a misunderstanding. My use case is: the user enters a query. I need to parse it, potentially edit the ast, and then stringify it back before feeding it to the database. When the database returns an error, the error contains position information. This position information references the stringified AST we fed to the database, not the original string entered by the user. Currently, there is no way to map that position in the output of Display back to a position in the original string.

As you hint in this description, trying to preserve the original formatting would likely be a very large undertaking (and thus I suspect practically infeasible given the amount of engineering we have available)

That's true. If I'm the only one with that need, then it's probably not worth it. But maybe I'm not ? And the work does not have to be exhaustive to be useful. I've already done it just for respecting newlines and indentation in UPDATE statements (because the syntax is simpler than SELECT) in #1636. I think improving upon it and doing something similar at least for newlines around keywords and identifier lists in SELECT .. FROM ... WHERE would already be quite useful.

Regarding resources: @freshtonic , if this is useful to you, do you think this is something cipherstash could dedicate a few engineering hours to ?

@alamb
Copy link
Contributor

alamb commented Jan 4, 2025

I think there is a misunderstanding. My use case is: the user enters a query. I need to parse it, potentially edit the ast, and then stringify it back before feeding it to the database. When the database returns an error, the error contains position information. This position information references the stringified AST we fed to the database, not the original string entered by the user. Currently, there is no way to map that position in the output of Display back to a position in the original string

I see the challenge -- you have to potentially edit the AST

I wonder if you could do something (crazy) like

  1. re-parse the stringified AST to get span information of the Display format
  2. Walk the original AST and the re-parsed AST using the spans to create a mapping back to the original query

@lovasoa
Copy link
Contributor Author

lovasoa commented Jan 4, 2025

That could work, but it's not necessarily a lot less work than a Display implementation that respects the Span information, and the result would be wasteful and less generally useful to other users of sqlparser...

@graup
Copy link
Contributor

graup commented Jan 10, 2025

I'm playing around with a different approach for editing parts of the AST that works more like tools like eslint work. The idea is that given accurate source spans, why don't we just edit the original query directly without round-tripping the whole AST?

The steps would be: 1. parse query into AST. 2. Visit the AST, looking for nodes we want to replace. 3. Construct and render a new AST node for local replacement. 4. Use the old node's span() to replace the new text right in the source.

This way we preserve all the other formatting outside of the node we're replacing, without any work on the AST displaying logic.

Of course, this needs accurate and complete spans which we don't have yet, but sounds feasible to me.

@lovasoa
Copy link
Contributor Author

lovasoa commented Jan 10, 2025

That would be great ! But with the current (incorrect) span information, we would end up with syntactically invalid sql.

@alamb
Copy link
Contributor

alamb commented Jan 11, 2025

That would be great ! But with the current (incorrect) span information, we would end up with syntactically invalid sql.

This seems like it is a reason to focus on improving the span accuracy (which is valuable for other reasons too) first before trying to also carry along the original tokens

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants