Add option to print better parse errors #3700

Baltoli · 2023-10-11T14:37:28Z

This PR fixes #3672 by providing a best-effort rendering of the partial state of the parser when a parse error happens. To do this, we grab the Earley state corresponding to the span [0, token_index_before_error] and inspect the partial parse trees available, as well as the production being attempted at this point.

Consider as a simple example the following standard K definition for arithmetic expressions:

module ARITHMETIC-SYNTAX
  imports UNSIGNED-INT-SYNTAX
  imports ID-SYNTAX

  syntax Exp ::= Int
               | "-" Exp      [group(neg), strict]
               | Exp "+" Exp  [left, group(add), strict]
               | Exp "-" Exp  [left, group(sub), strict]
               | Exp "*" Exp  [left, group(mul), strict]
               | Exp "/" Exp  [left, group(div), strict]
               | "(" Exp ")"  [bracket]

  syntax priorities neg > mul div > add sub
endmodule

module ARITHMETIC
  imports ARITHMETIC-SYNTAX
endmodule

On trying to kast an invalid term (for example, 1 * 2 + a * 4 - a lexes because we've imported ID-SYNTAX, but terms of sort Id can't appear in a valid derivation of Exp), we see the following message from the new component:

$ kast <(echo '1 * 2 + a * 4') --no-exc-wrap
[Error] Inner Parser: Parse error: unexpected token 'a' following token '+'. Additional parsing diagnostic information:
  Attempting to apply production:
    syntax Exp ::= Exp "+" Exp
    produced partial term:
      `_+__ARITHMETIC-SYNTAX_Exp_Exp_Exp`(`_*__ARITHMETIC-SYNTAX_Exp_Exp_Exp`(#token("1","Int"),#token("2","Int")),#token("<error>","<invalid>"))

	Source(/dev/fd/11)
	Location(1,9,1,10)
	1 |
	  .	        ^

we make a best-effort attempt to construct what the top-level _+_ term might have looked like, placing an <error> token where we expected to see a valid term of sort Exp.

The changes made in this PR are as follows:

Update the implementation of EarleyParser to conditionally perform the rendering as above.
Add boilerplate code to thread a new option --debug-parse through to the parser when appropriate.
Add a test case based on the above arithmetic module.

I have additionally verified with @dkcumming that the implementation as reviewed produces useful information that would have been useful when addressing the original problem; these cases come from the MIR semantics and are therefore too large to include here as tests.

radumereuta · 2023-10-12T15:42:04Z

@Baltoli this is the discussion that started this work: https://runtimeverification.slack.com/archives/C7E701MFG/p1694474725891869
This is an issue that keeps track of the way to reproduce the bugs: runtimeverification/mir-semantics#209

dkcumming · 2023-10-13T00:58:12Z

@Baltoli I don't have a minimal example, but I can provide real examples from KMIR parsing failures, trivial_casts_rpass (mir, rust). Not chosen for any particular reason, and the example isn't particularly hard to see the problem in the productions but let's look at the output anyway:

k v6.0.133 error message

$ kast ../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir 
[Error] Inner Parser: Parse error: unexpected token ':' following token
'ZeroSized'.
	Source(/home/daniel/Applications/mir-semantics/kmir/k-src/./../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir)
	Location(65,127,65,128)
	65 |	                                                                         
                          debug x => const ZeroSized:
[closure@ui/trivial_casts-rpass.rs:53:13: 53:22];
	   .	                                                                         
                                                    ^

branch `better-parse-errors` error message

$ ~/temp/k/result/bin/kast ../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir 
[Error] Inner Parser: Parse error: unexpected token ':' following token
'ZeroSized'. Additional parsing diagnostic information:
No top-level production could apply to this input.
	Source(/home/daniel/Applications/mir-semantics/kmir/k-src/./../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir)
	Location(65,127,65,128)
	65 |	                                                                         
                          debug x => const ZeroSized:
[closure@ui/trivial_casts-rpass.rs:53:13: 53:22];
	   .	                                                                         
                                                    ^

This example didn't work very well I guess as it couldn't provide more information. I will try another one that successfully returns information. I'll output --debug-tokens since it is relevant.

k v6.0.133 --debug tokens snippet

$ kast --debug-tokens ../src/tests/integration/test-data/compiletest-rs/ui/trivial_casts-rpass.mir
//-snip-
|"debug"                            | (65,101,65,106) | "debug"             |
|"x"                                | (65,107,65,108) | r"[_a-zA-Z][_a-zA-Z0-9]*"|
|"=>"                               | (65,109,65,111) | "=>"                |
|"const"                            | (65,112,65,117) | "const"             |
|"ZeroSized"                        | (65,118,65,127) | r"[_a-zA-Z][_a-zA-Z0-9]*"|
|":"                                | (65,127,65,128) | ":"                 |
|"["                                | (65,129,65,130) | "["                 |
|"closure"                          | (65,130,65,137) | "closure"           |
|"@"                                | (65,137,65,138) | "@"                 |
|"ui/trivial_casts-rpass.rs:53:13:" | (65,138,65,170) | r"[a-zA-Z0-9_/\\-\\.]*\\.rs:[0-9]+:[0-9]+:"|
|"53:22"                            | (65,171,65,176) | r"[0-9]+:[0-9]+"    |
|"]"                                | (65,176,65,177) | "]"                 |
|";"                                | (65,177,65,178) | ";"                 |
//-snip-

This output is useful because I can confirm what is actually matching to tokens, like I expect keywords such as closure matches the token closure, and it didn't match Identifier which is r"[_a-zA-Z][_a-zA-Z0-9]*". I don't think that in particular is something that has ever happened but sometimes things unexpected matches occur when changing the syntax, and in my experience it means that a different non-terminal is matching all-together. This output has already been extremely useful to me, my workflow is that I generate this output before the change and after and do a vimdiff. The only extra thing that I mentioned when talking with Dwight and Radu here that I thought could help, would be being able to see the ast/non-terminals that arrived at this point. This is for two reasons, something simple like Identifier:r"[_a-zA-Z][_a-zA-Z0-9]*" is a little more legible because you can tell what the regex is from as you read instead of cross referencing. But also say there was spurious matching, it would be good to be able to be able to trace the non-terminals that matched, because these erroneous token matches are downstream of the problem. I don't know how realistic gathering that information and displaying it to the user is, but that's what I meant in the discussion you linked.

branch `better-parse-errors` error message SECOND TEST

I captured the errors for running kast on each test in compiletest-parse-fail.tsv, and only one test had the intended output with Attempting to apply production:

$ ~/temp/k/result/bin/kast ../src/tests/integration/test-data/compiletest-rs/ui/associated-types/associated-types-normalize-unifield-struct.mir 
[Error] Inner Parser: Parse error: unexpected token 'Offset' following token
'fn'. Additional parsing diagnostic information:
  Attempting to apply production:
    syntax FunctionSignature ::= "fn" FunctionPath "(" ParameterList ")" "->"
Type
    produced partial term:
      `fn_(_)->__MIR-SYNTAX_FunctionSignature_FunctionPath_ParameterList_Type`(#token("<error>","<invalid>"),#token("<incomplete>","<invalid>"),#token("<incomplete>","<invalid>"))

	Source(/home/daniel/Applications/mir-semantics/kmir/k-src/./../src/tests/integration/test-data/compiletest-rs/ui/associated-types/associated-types-normalize-unifield-struct.mir)
	Location(3,4,3,10)
	3 |	fn Offset::dummy(_1: &Self) -> () {
	  .	   ^~~~~~

I think this is great feedback, it orients me straight away with what production I should go to figure out the problem. So yes, this is very helpful. Not entirely sure why it didn't work for the other tests in the file, most of the failures are coming from a handful or so of problems also.

Really big response, I wasn't intending for it to be that way but hopefully more is better than less.

Baltoli · 2023-10-19T13:45:30Z

I have improved the output from this diagnostic; now instead of using only the immediate predecessor parse state, we use the most recent spanning intermediate state. As an example, consider the following production being added to the ARITHMETIC module above:

syntax Exp ::= "AlignOf" "(" Exp ")"

In the initial version Daniel tries above, the following parse error would be produced:

$ kast <(echo -n 'AlignOf(3+)') --no-exc-wrap
[Error] Inner Parser: Parse error: unexpected token ')' following token '+'. Additional parsing diagnostic information:
No top-level production could apply to this input.
        Source(/dev/fd/11)
        Location(1,11,1,12)
        1 |     
          .               ^

because no production includes the terminal sequence + ), and so the immediate-predecessor parser state does not include any candidate spanning parses. However, if we rewind slightly further, we can get this message instead:

$ kast <(echo -n 'AlignOf(3+)') --no-exc-wrap
[Error] Inner Parser: Parse error: unexpected token ')' following token '+'. Additional parsing diagnostic information:
  Attempting to apply production:
    syntax Exp ::= "AlignOf" "(" Exp ")"
    produced partial term:
      `AlignOf(_)_ARITHMETIC-SYNTAX_Exp_Exp`(#token("<error>","<invalid>"))

        Source(/dev/fd/11)
        Location(1,11,1,12)
        1 |     
          .               ^

In the interests of getting this change merged, and keeping the diff small, this is probably as involved of a heuristic as I'm prepared to implement for the time being. If in the future we identify a case where this heuristic is deficient, we can open a new PR to extend the existing diagnostic implementation.

radumereuta

Java part lgtm.
I would like it if Daniel also approves this since he's the one who reported the issue in the first place.
Quick question. What happens if the term is very large? Is it going to pretty print everything?

Baltoli · 2023-10-20T14:04:19Z

Daniel has tested the latest version of the output and confirmed to me that it's useful via DM; if the term is large then the AST printed will indeed also be large.

kernel/src/main/java/org/kframework/parser/inner/kernel/EarleyParser.java

…k into better-parse-errors

Thread state information into parseError

b84a314

rv-jenkins changed the base branch from master to develop October 11, 2023 14:37

Baltoli added 2 commits October 11, 2023 15:47

Start to thread option through

752ea57

Construct nicer partial ASTs

2c1546b

Baltoli force-pushed the better-parse-errors branch from bc52ee6 to 2c1546b Compare October 11, 2023 16:45

Baltoli added 2 commits October 12, 2023 11:13

More error messages

0c2a87f

Improve formatting of error messages

f016901

Baltoli added 2 commits October 19, 2023 12:41

Merge branch 'develop' into better-parse-errors

7141ec2

Improve some diagnostics

5e0f148

Baltoli added 4 commits October 19, 2023 14:45

Merge branch 'develop' into better-parse-errors

70fc350

Thread option correctly

d3e676e

Correctly thread option

5a7a5bd

Add test case

581f68c

Baltoli requested review from dwightguth and radumereuta October 20, 2023 10:59

Baltoli marked this pull request as ready for review October 20, 2023 10:59

Baltoli added 2 commits October 20, 2023 11:59

Merge branch 'develop' into better-parse-errors

dcc28d7

Fix paths

9ed0c96

radumereuta approved these changes Oct 20, 2023

View reviewed changes

Merge branch 'develop' into better-parse-errors

53bf359

dwightguth reviewed Oct 20, 2023

View reviewed changes

kernel/src/main/java/org/kframework/parser/inner/kernel/EarleyParser.java Outdated Show resolved Hide resolved

Baltoli added 3 commits October 20, 2023 17:14

Don't try to produce error children

9aa8c47

Merge branch 'better-parse-errors' of github.com:runtimeverification/…

88581b3

…k into better-parse-errors

Merge branch 'develop' into better-parse-errors

4014833

dwightguth approved these changes Oct 20, 2023

View reviewed changes

Baltoli added the automerge label Oct 20, 2023

Merge branch 'develop' into better-parse-errors

08ec6db

rv-jenkins merged commit cc3d089 into develop Oct 20, 2023

rv-jenkins deleted the better-parse-errors branch October 20, 2023 22:24

radumereuta mentioned this pull request Oct 23, 2023

K version bump feature list (6.1.0) #3706

Closed

5 tasks

Baltoli mentioned this pull request Dec 12, 2023

2023 Goals #3098

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to print better parse errors #3700

Add option to print better parse errors #3700

Baltoli commented Oct 11, 2023 •

edited

Loading

radumereuta commented Oct 12, 2023

dkcumming commented Oct 13, 2023

Baltoli commented Oct 19, 2023

radumereuta left a comment

Baltoli commented Oct 20, 2023

Add option to print better parse errors #3700

Add option to print better parse errors #3700

Conversation

Baltoli commented Oct 11, 2023 • edited Loading

radumereuta commented Oct 12, 2023

dkcumming commented Oct 13, 2023

k v6.0.133 error message

branch better-parse-errors error message

k v6.0.133 --debug tokens snippet

branch better-parse-errors error message SECOND TEST

Baltoli commented Oct 19, 2023

radumereuta left a comment

Choose a reason for hiding this comment

Baltoli commented Oct 20, 2023

Baltoli commented Oct 11, 2023 •

edited

Loading

branch `better-parse-errors` error message

branch `better-parse-errors` error message SECOND TEST