Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to print better parse errors #3700

Merged
merged 18 commits into from
Oct 20, 2023
Merged

Conversation

Baltoli
Copy link
Contributor

@Baltoli Baltoli commented Oct 11, 2023

This PR fixes #3672 by providing a best-effort rendering of the partial state of the parser when a parse error happens. To do this, we grab the Earley state corresponding to the span [0, token_index_before_error] and inspect the partial parse trees available, as well as the production being attempted at this point.

Consider as a simple example the following standard K definition for arithmetic expressions:

module ARITHMETIC-SYNTAX
  imports UNSIGNED-INT-SYNTAX
  imports ID-SYNTAX

  syntax Exp ::= Int
               | "-" Exp      [group(neg), strict]
               | Exp "+" Exp  [left, group(add), strict]
               | Exp "-" Exp  [left, group(sub), strict]
               | Exp "*" Exp  [left, group(mul), strict]
               | Exp "/" Exp  [left, group(div), strict]
               | "(" Exp ")"  [bracket]

  syntax priorities neg > mul div > add sub
endmodule

module ARITHMETIC
  imports ARITHMETIC-SYNTAX
endmodule

On trying to kast an invalid term (for example, 1 * 2 + a * 4 - a lexes because we've imported ID-SYNTAX, but terms of sort Id can't appear in a valid derivation of Exp), we see the following message from the new component:

$ kast <(echo '1 * 2 + a * 4') --no-exc-wrap
[Error] Inner Parser: Parse error: unexpected token 'a' following token '+'. Additional parsing diagnostic information:
  Attempting to apply production:
    syntax Exp ::= Exp "+" Exp
    produced partial term:
      `_+__ARITHMETIC-SYNTAX_Exp_Exp_Exp`(`_*__ARITHMETIC-SYNTAX_Exp_Exp_Exp`(#token("1","Int"),#token("2","Int")),#token("<error>","<invalid>"))

	Source(/dev/fd/11)
	Location(1,9,1,10)
	1 |
	  .	        ^

we make a best-effort attempt to construct what the top-level _+_ term might have looked like, placing an <error> token where we expected to see a valid term of sort Exp.

The changes made in this PR are as follows:

  • Update the implementation of EarleyParser to conditionally perform the rendering as above.
  • Add boilerplate code to thread a new option --debug-parse through to the parser when appropriate.
  • Add a test case based on the above arithmetic module.

I have additionally verified with @dkcumming that the implementation as reviewed produces useful information that would have been useful when addressing the original problem; these cases come from the MIR semantics and are therefore too large to include here as tests.

@rv-jenkins rv-jenkins changed the base branch from master to develop October 11, 2023 14:37
@Baltoli Baltoli force-pushed the better-parse-errors branch from bc52ee6 to 2c1546b Compare October 11, 2023 16:45
@radumereuta
Copy link
Contributor

@Baltoli this is the discussion that started this work: https://runtimeverification.slack.com/archives/C7E701MFG/p1694474725891869
This is an issue that keeps track of the way to reproduce the bugs: runtimeverification/mir-semantics#209

@dkcumming
Copy link

@Baltoli I don't have a minimal example, but I can provide real examples from KMIR parsing failures, trivial_casts_rpass (mir, rust). Not chosen for any particular reason, and the example isn't particularly hard to see the problem in the productions but let's look at the output anyway:

k v6.0.133 error message

$ kast ../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir 
[Error] Inner Parser: Parse error: unexpected token ':' following token
'ZeroSized'.
	Source(/home/daniel/Applications/mir-semantics/kmir/k-src/./../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir)
	Location(65,127,65,128)
	65 |	                                                                         
                          debug x => const ZeroSized:
[closure@ui/trivial_casts-rpass.rs:53:13: 53:22];
	   .	                                                                         
                                                    ^

branch better-parse-errors error message

$ ~/temp/k/result/bin/kast ../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir 
[Error] Inner Parser: Parse error: unexpected token ':' following token
'ZeroSized'. Additional parsing diagnostic information:
No top-level production could apply to this input.
	Source(/home/daniel/Applications/mir-semantics/kmir/k-src/./../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir)
	Location(65,127,65,128)
	65 |	                                                                         
                          debug x => const ZeroSized:
[closure@ui/trivial_casts-rpass.rs:53:13: 53:22];
	   .	                                                                         
                                                    ^

This example didn't work very well I guess as it couldn't provide more information. I will try another one that successfully returns information. I'll output --debug-tokens since it is relevant.

k v6.0.133 --debug tokens snippet

$ kast --debug-tokens ../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir
//-snip-
|"debug"                            | (65,101,65,106) | "debug"             |
|"x"                                | (65,107,65,108) | r"[_a-zA-Z][_a-zA-Z0-9]*"|
|"=>"                               | (65,109,65,111) | "=>"                |
|"const"                            | (65,112,65,117) | "const"             |
|"ZeroSized"                        | (65,118,65,127) | r"[_a-zA-Z][_a-zA-Z0-9]*"|
|":"                                | (65,127,65,128) | ":"                 |
|"["                                | (65,129,65,130) | "["                 |
|"closure"                          | (65,130,65,137) | "closure"           |
|"@"                                | (65,137,65,138) | "@"                 |
|"ui/trivial_casts-rpass.rs:53:13:" | (65,138,65,170) | r"[a-zA-Z0-9_/\\-\\.]*\\.rs:[0-9]+:[0-9]+:"|
|"53:22"                            | (65,171,65,176) | r"[0-9]+:[0-9]+"    |
|"]"                                | (65,176,65,177) | "]"                 |
|";"                                | (65,177,65,178) | ";"                 |
//-snip-

This output is useful because I can confirm what is actually matching to tokens, like I expect keywords such as closure matches the token closure, and it didn't match Identifier which is r"[_a-zA-Z][_a-zA-Z0-9]*". I don't think that in particular is something that has ever happened but sometimes things unexpected matches occur when changing the syntax, and in my experience it means that a different non-terminal is matching all-together. This output has already been extremely useful to me, my workflow is that I generate this output before the change and after and do a vimdiff. The only extra thing that I mentioned when talking with Dwight and Radu here that I thought could help, would be being able to see the ast/non-terminals that arrived at this point. This is for two reasons, something simple like Identifier:r"[_a-zA-Z][_a-zA-Z0-9]*" is a little more legible because you can tell what the regex is from as you read instead of cross referencing. But also say there was spurious matching, it would be good to be able to be able to trace the non-terminals that matched, because these erroneous token matches are downstream of the problem. I don't know how realistic gathering that information and displaying it to the user is, but that's what I meant in the discussion you linked.

branch better-parse-errors error message SECOND TEST

I captured the errors for running kast on each test in compiletest-parse-fail.tsv, and only one test had the intended output with Attempting to apply production:

$ ~/temp/k/result/bin/kast ../src/tests/integration/test-data/compiletest-rs/ui/associated-types/associated-types-normalize-unifield-struct.mir 
[Error] Inner Parser: Parse error: unexpected token 'Offset' following token
'fn'. Additional parsing diagnostic information:
  Attempting to apply production:
    syntax FunctionSignature ::= "fn" FunctionPath "(" ParameterList ")" "->"
Type
    produced partial term:
      `fn_(_)->__MIR-SYNTAX_FunctionSignature_FunctionPath_ParameterList_Type`(#token("<error>","<invalid>"),#token("<incomplete>","<invalid>"),#token("<incomplete>","<invalid>"))

	Source(/home/daniel/Applications/mir-semantics/kmir/k-src/./../src/tests/integration/test-data/compiletest-rs/ui/associated-types/associated-types-normalize-unifield-struct.mir)
	Location(3,4,3,10)
	3 |	fn Offset::dummy(_1: &Self) -> () {
	  .	   ^~~~~~

I think this is great feedback, it orients me straight away with what production I should go to figure out the problem. So yes, this is very helpful. Not entirely sure why it didn't work for the other tests in the file, most of the failures are coming from a handful or so of problems also.

Really big response, I wasn't intending for it to be that way but hopefully more is better than less.

@Baltoli
Copy link
Contributor Author

Baltoli commented Oct 19, 2023

I have improved the output from this diagnostic; now instead of using only the immediate predecessor parse state, we use the most recent spanning intermediate state. As an example, consider the following production being added to the ARITHMETIC module above:

syntax Exp ::= "AlignOf" "(" Exp ")"

In the initial version Daniel tries above, the following parse error would be produced:

$ kast <(echo -n 'AlignOf(3+)') --no-exc-wrap
[Error] Inner Parser: Parse error: unexpected token ')' following token '+'. Additional parsing diagnostic information:
No top-level production could apply to this input.
        Source(/dev/fd/11)
        Location(1,11,1,12)
        1 |     
          .               ^

because no production includes the terminal sequence + ), and so the immediate-predecessor parser state does not include any candidate spanning parses. However, if we rewind slightly further, we can get this message instead:

$ kast <(echo -n 'AlignOf(3+)') --no-exc-wrap
[Error] Inner Parser: Parse error: unexpected token ')' following token '+'. Additional parsing diagnostic information:
  Attempting to apply production:
    syntax Exp ::= "AlignOf" "(" Exp ")"
    produced partial term:
      `AlignOf(_)_ARITHMETIC-SYNTAX_Exp_Exp`(#token("<error>","<invalid>"))

        Source(/dev/fd/11)
        Location(1,11,1,12)
        1 |     
          .               ^

In the interests of getting this change merged, and keeping the diff small, this is probably as involved of a heuristic as I'm prepared to implement for the time being. If in the future we identify a case where this heuristic is deficient, we can open a new PR to extend the existing diagnostic implementation.

@Baltoli Baltoli marked this pull request as ready for review October 20, 2023 10:59
Copy link
Contributor

@radumereuta radumereuta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java part lgtm.
I would like it if Daniel also approves this since he's the one who reported the issue in the first place.
Quick question. What happens if the term is very large? Is it going to pretty print everything?

@Baltoli
Copy link
Contributor Author

Baltoli commented Oct 20, 2023

Daniel has tested the latest version of the output and confirmed to me that it's useful via DM; if the term is large then the AST printed will indeed also be large.

@rv-jenkins rv-jenkins merged commit cc3d089 into develop Oct 20, 2023
@rv-jenkins rv-jenkins deleted the better-parse-errors branch October 20, 2023 22:24
@Baltoli Baltoli mentioned this pull request Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option to provide partial parse tree in parse error message
5 participants