Document unsupported syntax #36

Open
4 tasks
alysbrooks opened this issue Oct 5, 2021 · 4 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@alysbrooks
Member

Most of the syntax we don't support is pretty rare (especially once lazy quantifiers are added), but it would be nice to document what we don't support. I think we can do Java 8 & 9, ECMA, and RE2 in that order.

  • Java 8
  • Java 9
  • ECMA
  • RE2
@alysbrooks alysbrooks added documentation Improvements or additions to documentation good first issue Good for newcomers labels Oct 5, 2021
@cons-dev
cons-dev commented Dec 20, 2021

Hi! So I initially looked at what was counted as unsupported and it seems like there is some stuff to just automatically flag it, though it doesn't always work, so I went through this to find things that weren't implemented or didn't appear to be implemented in an expected way.

So to get the list below I went through the various examples (obviously with other stuff added if needed) and parsed it to see if the parse function would work.

So far I got this from Java8, just as a sort of preliminary thing:

  • The \B and \b anchors are currently not supported in any version.
  • The union, intersection, and difference operations on character classes (e.g. [0-3[4-5]], [0-9&&[345]], and [0-9&&[^6-7]] respectively) are not supported in Java8 or Java9. I don't know if ECMA has them. These seem to be "silently" unsupported despite being valid syntax. (Should I make an issue for this?)
  • Octal digits are not supported for either Java8 or Java9.
  • POSIX character classes are not implemented in any version.
  • java.lang.Character classes are not implemented.
  • Classes for Unicode scripts are not implemented.
  • The horizontal whitespace class is not currently implemented.
  • The \Z anchor is not supported in any version
  • The \G anchor is also not supported in any version
  • The named capturing group is not implemented
  • The (?idmsux-idmsux:X) and other similar flags do not work on ECMAScript (obviously) and don't seem to be implemented for Java8 or Java9 (or at least they don't seem to produce code that would ignore flags, unless this is a default behavior).

So I guess the question is what I should do from here. I think implementing a function to automatically check functionality by checking that the parsed regex behaves as expected when compared to the initial regex (and maybe generate the documentation for unsupported features) might be a good idea, but I would like some input.

(the specific reasoning behind the automatic documentation generation / testing is because some people presumably want to implement these things, so the docs might become out of date relative to the implementations)
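As a rough illustration of the "compare behavior" idea above, here is a minimal sketch in plain java.util.regex. It does not use Regal at all: `BehaviorCheck` and `disagreements` are hypothetical names, and in a real check the second pattern would be whatever comes out of a parse and regenerate round trip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Regal: compares two patterns on sample inputs.
final class BehaviorCheck {
    // Returns the samples on which the two patterns disagree (full-match semantics).
    static List<String> disagreements(Pattern original, Pattern generated, List<String> samples) {
        List<String> result = new ArrayList<>();
        for (String sample : samples) {
            boolean expected = original.matcher(sample).matches();
            boolean actual = generated.matcher(sample).matches();
            if (expected != actual) {
                result.add(sample);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tab, space, no-break space, and a letter.
        List<String> samples = Arrays.asList("\t", " ", "\u00A0", "x");
        Pattern original = Pattern.compile("\\h");
        // Stand-in for a regenerated pattern: an ASCII-only approximation of \h.
        Pattern generated = Pattern.compile("[ \\t]");
        System.out.println(disagreements(original, generated, samples).size());
    }
}
```

A non-empty result would flag a feature as not faithfully supported. The quality of such a check depends entirely on how well the samples are chosen.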

@plexus
Member

plexus commented Dec 21, 2021

Hey, I commented on the PR before seeing this. Let me try to add a bit more context.

The \B and \b anchors are currently not supported in any version.

If these exist on all targets then they would be great to add.
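For reference, a small sketch of what the two anchors do on the Java side, using plain java.util.regex rather than Regal. The class and method names are only illustrative.

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: shows what the word-boundary anchors match.
final class WordBoundaryDemo {
    // True if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "cat" as a whole word.
        System.out.println(finds("\\bcat\\b", "a cat sat"));
        // "cat" only appears inside a longer word here, so the boundary form fails.
        System.out.println(finds("\\bcat\\b", "concatenate"));
        // The non-boundary form requires word characters on both sides.
        System.out.println(finds("\\Bcat\\B", "concatenate"));
    }
}
```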

The union, intersection, and difference operations on character classes (e.g. [0-3[4-5]], [0-9&&[345]], and [0-9&&[^6-7]] respectively) are not supported in Java8 or Java9. I don't know if ECMA has them. These seem to be "silently" unsupported despite being valid syntax. (Should I make an issue for this?)

These are already fairly exotic, as I mentioned on the PR we are more focused on solidifying what we have and maybe adding very frequently used features. I think a lot of devs don't even know these are a thing, so low priority.
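For readers who have not met these operations, a minimal sketch of how the three quoted examples behave in plain java.util.regex (this is Java's behavior, not a claim about what Regal generates):

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: the three character class operations from the quote.
final class CharClassOpsDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union: 0-3 or 4-5.
        System.out.println(matches("[0-3[4-5]]", "4"));
        // Intersection: digits that are also one of 3, 4, 5.
        System.out.println(matches("[0-9&&[345]]", "2"));
        // Difference: digits except 6 and 7.
        System.out.println(matches("[0-9&&[^6-7]]", "6"));
    }
}
```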

Octal digits are not supported for either Java8 or Java9.

This is one of the things where you need to look at it from the Regal syntax perspective. It doesn't make sense to support these because they're just another way to write the same thing. If we wanted to "support" this would we take an integer but treat it as an octal? You can use [:char 8r123] instead.
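To make the "just another way to write the same thing" point concrete, here is a small sketch in plain java.util.regex. Octal 123 is decimal 83, which is the letter S, so the octal escape and the literal match the same input. The class name is illustrative only.

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: an octal escape is an alias for a literal character.
final class OctalEscapeDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Octal 123 parsed as a number is decimal 83.
        System.out.println(Integer.parseInt("123", 8));
        // Java's regex octal escape needs a leading zero.
        System.out.println(matches("\\0123", "S"));
        // The literal form matches exactly the same input.
        System.out.println(matches("S", "S"));
    }
}
```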

POSIX character classes are not implemented in any version.

Similar reasoning as above. They are aliases. What's also problematic is that these can be subtly different between platforms that have them, or even between different Java versions. What I could imagine happening is that we add constants for the equivalent Regal forms, so people get the same convenience, but in a cross-platform, Regal-first kind of way.
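One concrete instance of that kind of subtle difference, sketched in plain java.util.regex: the same POSIX class matches a different set of characters depending on whether the UNICODE_CHARACTER_CLASS flag is set. This only illustrates the Java side and says nothing about other platforms.

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: \p{Alpha} is ASCII-only unless a flag is set.
final class PosixClassDemo {
    // True if the whole input is a single character in \p{Alpha} under the given flags.
    static boolean isAlpha(String input, int flags) {
        return Pattern.compile("\\p{Alpha}", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println(isAlpha("a", 0));
        // By default the class only covers ASCII letters.
        System.out.println(isAlpha(eAcute, 0));
        // With the flag, it follows the Unicode Alphabetic property.
        System.out.println(isAlpha(eAcute, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```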

java.lang.Character classes are not implemented.

I'm not sure what these are, but probably a similar reasoning as with POSIX.
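For context, these are the `\p{javaXxx}` classes in java.util.regex, each of which is documented as equivalent to the corresponding `java.lang.Character.isXxx` method. A minimal sketch with illustrative names:

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: \p{javaLowerCase} mirrors Character.isLowerCase.
final class JavaCharacterClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        // Follows Character.isLowerCase, so non-ASCII lowercase letters are included.
        System.out.println(matches("\\p{javaLowerCase}", eAcute));
        System.out.println(Character.isLowerCase(eAcute.charAt(0)));
        // The POSIX-style \p{Lower} is ASCII-only by default, so it differs here.
        System.out.println(matches("\\p{Lower}", eAcute));
    }
}
```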

Classes for Unicode scripts are not implemented.

Same
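For context, script classes in java.util.regex are written `\p{IsGreek}` or `\p{script=Greek}`. A minimal sketch in plain Java, with illustrative names:

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: matching characters by Unicode script.
final class UnicodeScriptDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha
        System.out.println(matches("\\p{IsGreek}", alpha));
        // The keyword form names the same script.
        System.out.println(matches("\\p{script=Greek}", alpha));
        // A Greek letter is not in the Latin script.
        System.out.println(matches("\\p{IsLatin}", alpha));
    }
}
```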

The horizontal whitespace class is not currently implemented.

I didn't even realize this was a separate thing from just "whitespace". Could be worth adding. If all platforms implement it the same way then it's easy to do, otherwise it gets more elaborate because then we want to settle on the semantics that Regal will adopt, and output an equivalent regex on platforms that don't match that (e.g. [\t\p{Zs}]), while also maintaining parsing for \h and round-trip. At least that's what we do for instance for \v. You could also skip all that and just only deal with the equivalent form, but in that case it could just be a constant.
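A quick sketch of how `\h` compares with the `[\t\p{Zs}]` form mentioned above, in plain java.util.regex. It only checks a handful of sample characters, so it shows agreement on those samples and nothing more. The two forms may still differ on other code points or across Java versions.

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: \h versus an explicit character class on a few samples.
final class HorizontalWhitespaceDemo {
    static final Pattern H = Pattern.compile("\\h");
    static final Pattern EXPLICIT = Pattern.compile("[\\t\\p{Zs}]");

    // True if the whole input is a single horizontal whitespace character per \h.
    static boolean isHorizontal(String input) {
        return H.matcher(input).matches();
    }

    // True if both forms give the same answer for this input.
    static boolean agree(String input) {
        return H.matcher(input).matches() == EXPLICIT.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Tab, space, no-break space, newline, and a letter.
        String[] samples = {"\t", " ", "\u00A0", "\n", "a"};
        for (String sample : samples) {
            System.out.println(isHorizontal(sample) + " " + agree(sample));
        }
    }
}
```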

The \Z anchor is not supported in any version

This one's tricky because it depends on whether the regex is multiline or not. We currently don't support any flags like that, so for us \Z and $ are equivalent.
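The dependence on the multiline flag can be seen in a small sketch with plain java.util.regex. Without flags the two anchors behave alike on these inputs. With MULTILINE only `$` changes.

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: $ depends on MULTILINE, \Z does not.
final class EndAnchorDemo {
    // True if the regex, compiled with the given flags, matches anywhere in the input.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\nb\n";
        // Without flags, both anchors match just before the final line terminator.
        System.out.println(finds("b$", 0, input));
        System.out.println(finds("b\\Z", 0, input));
        // With MULTILINE, $ also matches at the end of the first line.
        System.out.println(finds("a$", Pattern.MULTILINE, input));
        // \Z still only matches at the end of the input.
        System.out.println(finds("a\\Z", Pattern.MULTILINE, input));
    }
}
```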

The \G anchor is also not supported in any version

Today I learned! Seems low priority, would need research into cross-platform support.
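For anyone else meeting `\G` for the first time: in java.util.regex it anchors a match to the end of the previous match, so repeated `find()` calls only succeed on back-to-back matches. A minimal sketch with illustrative names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: \G restricts find() to contiguous matches.
final class ContiguousMatchDemo {
    // Counts how many times find() succeeds on the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ababxab";
        // Unanchored: every occurrence of "ab" is found.
        System.out.println(countMatches("ab", input));
        // Anchored with \G: matching stops at the "x" that breaks the run.
        System.out.println(countMatches("\\Gab", input));
    }
}
```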

The named capturing group is not implemented

Deliberately not implemented due to lack of cross-platform support, although that could change: some JS implementations do have it now.
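For reference, this is what the feature looks like on the Java side, sketched with plain java.util.regex. It is not a proposal for how Regal would represent it, and the group names are just examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: named capturing groups.
final class NamedGroupDemo {
    static final Pattern YEAR_MONTH = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // Returns the text captured by the named group, or null if the input does not match.
    static String capture(String input, String groupName) {
        Matcher matcher = YEAR_MONTH.matcher(input);
        return matcher.matches() ? matcher.group(groupName) : null;
    }

    public static void main(String[] args) {
        System.out.println(capture("2021-10", "year"));
        System.out.println(capture("2021-10", "month"));
    }
}
```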

The (?idmsux-idmsux:X) and other similar flags do not work on ECMAScript (obviously) and don't seem to be implemented for Java8 or Java9 (or at least they don't seem to produce code that would ignore flags, unless this is a default behavior).

Deliberately not supported at the moment because we don't have a good cross-platform story, and I feel like it would generally complicate things for us since it influences the semantics of certain forms. On the other hand some of these are very commonly used and useful, so I do feel we need to find a good story for at least the common ones at some point.
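To show how a flag group changes the semantics of the forms inside it, here is a small sketch in plain java.util.regex. The case-insensitive flag applies only within the group, so the same literal means different things depending on where it sits.

```java
import java.util.regex.Pattern;

// Plain java.util.regex, not Regal: a scoped inline flag only affects its own group.
final class InlineFlagDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Only the "a" is case-insensitive.
        System.out.println(matches("(?i:a)b", "Ab"));
        // The "b" sits outside the group, so it stays case-sensitive.
        System.out.println(matches("(?i:a)b", "AB"));
        // The unscoped form applies to everything after it.
        System.out.println(matches("(?i)ab", "AB"));
    }
}
```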

So I guess the question is what I should do from here. I think implementing a function to automatically check functionality by checking that the parsed regex behaves as expected when compared to the initial regex (and maybe generate the documentation for unsupported features) might be a good idea, but I would like some input.

As I pointed out on the PR we do have a bunch of tests that are driven by data, which try to validate the individual parts of the library based off of the same data. But these are formulated starting from the Regal form, which is by design. Still I could imagine adding cases in there for future extension. Something like

 :horizontal-whitespace
 ^:not-implemented [_ ["\\h" "\t"] ["\\h" " "]]

Here I've omitted the Regal form (_) but added some examples and marked the case as :not-implemented, so we could then actually have a test that fails if this becomes implemented. Maybe we should also stick some docstrings in there, or other info, to help build up a table of features... just some ideas.

@alysbrooks
Member Author

I think this makes sense to document, so I think it's reasonable to keep open, especially since we have notes from Arne.

@marksto
marksto commented Dec 7, 2024

@plexus Hi Arne! I wonder why named capturing groups are not implemented? Why is cross-platform support missing?

It looks like:

Am I missing something obvious? Thank you!

Projects
Status: Candidate