Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undocumented/Unspecified Behavior #67

Open
AlexanderGH opened this issue Jan 22, 2025 · 5 comments
Open

Undocumented/Unspecified Behavior #67

AlexanderGH opened this issue Jan 22, 2025 · 5 comments

Comments

@AlexanderGH
Copy link

I hacked together a quick greenfield + generalized implementation of keep-sorted in another language based on the README documentation and wanted to verify compatibility with this official go implementation. I found some interesting cases where the go implementation has some unspecified/inconsistent outputs/parsing from the goldens:

  • by_regex.in: keep-sorted-test start block=yes newline_separated=yes by_regex=(\w+)\(\)\s+{

    • The regex seems to be missing an excape for the repitition group character required by some regex parsers (one exasmple is Java's)
    • java.util.regex.PatternSyntaxException: Illegal repetition near index 13 (\w+)\(\)\s+{
    • This just needs a simple one-char fix (add a \ before the {)
  • groups.in: Nested keep-sorted without indentation

    • This example has a trailing new line which is inconsistent with all the other goldens which only use it as an "inner" separator.
    • This is the only undocumented behavior from the goldens I wasn't able to easily emulate because it goes directly against the documentation.
    • In my implementation there is no trailing new line after the last block.
  • ignore_prefixes.in: Additional prefixes aren't counted as part of the comment:

    • Unlike the readme, the first prefix doesn't contain a space, but yet it's removed, resulting in a different sort order than i'd expect

I also hit a few other "weird" examples, e.g. by_regex.in: Prefix order seems to have ambiguity in terms of the regex anchoring \w+_(\w+) could match either DO_MORE or MORE_STUFF in DO_MORE_STUFF (looks like go matches the former, and java the latter), But in those examples, there's a second undocumented behavior: in the case of ties (e.g. multiple init/final) after the regex substitution, the original string seems to be used as a tie-breaker (based on the goldens. i didn't look at the source) rather than using a stable sort and retaning the original order (foo ends up after bar even though in the original foo comes before bar).

  • The trailing comma handling seems to be an undocumented feature.
  • String handling seems to be an undocumented feature.
  • Nested keep-sorted behavior

I suspect some of these are just remnants of one-off google3 bug-fixes/feature additions to make it work in a lot of real-woorld cases (which sadly seems to add more language syntax/lexing knowledge than i expected) but just wanted to see if they could be documented (or fixed to aslign with the docs).

@JeffFaer
Copy link
Collaborator

Awesome!

What was the goal of putting together a new implementation of keep-sorted? In what ways did you generalize it?


What happens if you use https://github.com/google/re2j for regular expressions in Java? I suspect that would make some of those differences go away.

This example has a trailing new line which is inconsistent with all the other goldens which only use it as an "inner" separator.

This is a remnant of a one-off google3 bug fix :)

// Keep any blank lines leading up to the end tag by simply excluding
// them from being sorted (any at the beginning should already be sorted
// at the top).
// The original justification for this was better handling of markdown
// lists (cr/423863898), but the markdown formatter doesn't seem to care
// about the newlines anymore.
// It's nice to keep this around so that users can add a little extra
// formatting to their keep-sorted blocks.

ignore_prefixes.in: Additional prefixes aren't counted as part of the comment:

Uhhh, I see that snippet in comments.in, not ignore_prefixes.in:

Additional prefixes aren't counted as part of the comment:

the first prefix doesn't contain a space, but yet it's removed, resulting in a different sort order than i'd expect

That sounds like a recent bug fix: #55

As a human, I'm not sure I would actually want ' foo' to sort before 'bar'. Humans don't tend to sort based on whitespace

The trailing comma handling seems to be an undocumented feature.

Indeed, it's not documented. There's also an outstanding issue to make some improvements in that area: #33

String handling seems to be an undocumented feature.

Assuming the string handling you're talking about is how group=yes doesn't count braces inside string literals, this could perhaps be documented more clearly: https://github.com/google/keep-sorted?tab=readme-ov-file#blocks

Nested keep-sorted behavior

https://github.com/google/keep-sorted?tab=readme-ov-file#usage mentions that nested keep-sorted blocks are supported. What else would you like to see documented about them?


Taking a step back, I agree that there are some spots where the documentation could be better, but we do need to take the intended audience of the document into account. For instance, I don't think documenting the differences between re2 and other regex engines would be a good use of the README for its intended reader. Perhaps we could have a separate document tailored to an audience that's more interested in implementation details like that, but I would argue that the golden files themselves might be a reasonable enough document to capture nuances like that

@AlexanderGH
Copy link
Author

What was the goal of putting together a new implementation of keep-sorted? In what ways did you generalize it?

Goal was mostly to avoid waiting for #37 and to implement support for a template=xxx option to allow having re-usable sets of options, and allow it to run as part of a broader set of contexts, e.g. ide plugins, linters, suggested fixes in git, etc. My super hacky impl is here.

By generalization i meant something along the lines of going from the generic option descriptions and not hardcoding a lot of the language construct logic.

What happens if you use https://github.com/google/re2j for regular expressions in Java? I suspect that would make some of those differences go away.

The goal was less so to match exactly on the regexp front. I figured escaping { should work in both impls so was hoping it would be a quick fix given the feature is new. For now i'll just do some pre-processing on some of the goldens to escape it before i run the testcase against it. It's basically just that and the new line one so not a huge difference.

This is a remnant of a one-off google3 bug fix :)

Thanks, I'll do that too then i guess. Reverse engineering whitespace/blank line handling from the test goldens was like 50% of the effort and this was the only one i couldn't figure out so figured it must be a google3 one-off and i intentionally did not look at the source when building my solution.

As a human, I'm not sure I would actually want ' foo' to sort before 'bar'. Humans don't tend to sort based on whitespace

I agree, though some tl;dr of how keep-sorted treeats whitespace might be useful to document.

Indeed, it's not documented. There's also an outstanding issue to make some improvements in that area: #33

Funnily enough my impl will have the same behavior even as noted ni that issue though there were no tests or docs covering those comment edge-cases. I think i might just drop trailing comments from the sort key and use the sort key to track commas instead of the original source 🤔

Assuming the string handling you're talking about is how group=yes doesn't count braces inside string literals, this could perhaps be documented more clearly: https://github.com/google/keep-sorted?tab=readme-ov-file#blocks

I saw that but was confused because:

Indeed, it's not documented. There's also an outstanding issue to make some improvements in that area: #33

seems to go against what the tests do (in the tests the \n is ignored i think).

But also, i assumed that comment implied it would break if i had unbalanced braces inside string literals (and was preparing for any backslash about people not being able to use brackets in comments or strings i was going to force sort some files), but only realized it handled them when i ran against the goldens. So it's a good thing, just unexpected.

https://github.com/google/keep-sorted?tab=readme-ov-file#usage mentions that nested keep-sorted blocks are supported. What else would you like to see documented about them?

Sorry, i missed it. I read all the options, but not the intro sections.

For instance, I don't think documenting the differences between re2 and other regex engines would be a good use of the README for its intended reader.

Agreed, my goal was more to just add a \ to the golden file and docs so it should (in theory) work with all parsers rather than an undocumented subset.

I would argue that the golden files themselves might be a reasonable enough document to capture nuances like that.

I agree for this case but i think a more general section with quick FYIs around things like whitespace handling and trailing commas and trailing new lines, and what special language constructs are one-offed might be useful. Basically my goal is that given a golden in file and the docs, one should be able to know what the sorted output should look like (i.e. make it predictable for a human).

@AlexanderGH
Copy link
Author

re2j did work fwiw and all the golden tests now pass in my implementation 👍

@JeffFaer
Copy link
Collaborator

Goal was mostly to avoid waiting for #37 and to implement support for a template=xxx option to allow having re-usable sets of options

Both of those are things I'd like to have in this version of keep-sorted too. If you have the bandwidth I'd love to have contributions for either or both of those :)

work with all [regex] parsers rather than an undocumented subset.

https://github.com/google/keep-sorted?tab=readme-ov-file#regular-expressions does link out to http://godoc/pkg/regexp/syntax/, which actually doesn't mention re2. http://godoc/pkg/regexp/ mentions re2, and perhaps we could just document explicitly that we're using re2


If I'm hearing you correctly, it sounds like these would be the relevant pieces of documentation to add:

  • Clarify that we're using re2
  • some tl;dr of how keep-sorted treats whitespace
  • For trailing commas, tests covering those comment edge-cases mentioned in keep-sorted could handle trailing commas better #33
  • Document string handling behavior when block=yes
  • Document trailing newline behavior (or perhaps refactor it to be an option like skip_lines or drop it entirely now that the internal need for it is gone)

@AlexanderGH
Copy link
Author

Yep, with the final checkbox item being how it treats identical items post processing. It seems like it uses the original unprocessed line as the sort tie-breaker (i thought i saw some example using stable story but i can no longer find it so i was likely mistaken earlier):

by_regex (uses original string to tie-break):
FOO_ before BAR_:

Prefix order

FOO_ after BAR_:
Prefix order

FYI: godoc is a google-internal DNS entry (go.dev is the canonical external redirect domain I think).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants