Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the Unicode for String Processing proposal #1631

Merged
merged 28 commits into from
Jun 27, 2022

Conversation

natecook1000
Copy link
Member

This proposal is another component in the larger regex-powered string processing project.

Co-authored-by: Remy Demarest <[email protected]>
@natecook1000 natecook1000 marked this pull request as ready for review May 31, 2022 19:32
let regex5 = /(?i)ba(?-i:na)na/
```

All option APIs are provided on `RegexComponent`, so they can be called on a `Regex` instance, or on any component that you would use inside a `RegexBuilder` block when the `RegexBuilder` module is imported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1. This clearly depends on the option. String shouldn't accumulate senseless API that sounds useful but is a nop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options:

  1. Put on Regex and future work is a protocol
  2. Do the protocol now

str.firstMatch(of: /CAFÉ/) // nil
str.firstMatch(of: /(?i)CAFÉ/) // "Café"
str.firstMatch(of: /(?i)cAfÉ/) // "Café"
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too interior-syntax heavy and we don't get to see the API name in use. What about having an additional line that uses .ignoresCase() after the literal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll update to include API usage throughout this section. The goal is here is dual — we're explaining the semantics of the option and the two ways of applying it.

* `D`: Match only ASCII members for `\d`, `\p{Digit}`, `\p{HexDigit}`, `[:digit:]`, and `CharacterClass.digit`.
* `S`: Match only ASCII members for `\s`, `\p{Space}`, `[:space:]`, and any of the whitespace-representing `CharacterClass` members.
* `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, and `CharacterClass.word`. Also only considers ASCII characters for `\b`, `\B`, and `Anchor.wordBoundary`.
* `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these letters? Again, no mention or showing of the actual API being proposed.


#### Unicode word boundaries

By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode _default word boundaries,_ specified as [Unicode level 2 regular expression support][level2-word-boundaries].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to point out that Swift is deviating from the norm by setting this option by default. The default in the table doesn't make it clear that we're following Unicode's "default" instead of regular expression "default". Might be better to present it as Swift is enabling this advanced Unicode option by default, you can disable it for compatibility.

// Prints "false"
```

With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a `String` instance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence packs a lot. It seems like its own paragraph if not section to unpack what it implies. Also, what's special about captures that isn't special about anything else in a regex? Why isn't there a boundary around an alternation, character class, etc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a sub-section about the outputs of a regex and how those indices are aligned.

/// A character class that matches any element that is classified as
/// whitespace.
///
/// This character class is equivalent to `\s` in regex syntax.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this affected by ascii only? Yes/no? Should there be an additional one that is or isn't affected?

/// Calling this method with a group of Unicode scalars is equivalent to
/// listing them in a custom character class in regex syntax.
public static func anyOf<S: Sequence>(_ s: S) -> CharacterClass
where S.Element == UnicodeScalar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between anyOf(str) and anyOf(str.unicodeScalars)?


An earlier draft of this proposal included a metacharacter and `CharacterClass` API for matching an individual Unicode scalar value, regardless of the current matching level, as a counterpart to `\X`/`.anyGraphemeCluster`. The behavior of this character class, particularly when matching with grapheme cluster semantics, is still unclear at this time, however. For example, when matching the expression `\O*`, does the implict grapheme boundary assertion apply between the `\O` and the quantification operator, or should we treat the two as a single unit and apply the assertion after the `*`?

At the present time, we prefer to allow authors to write regexes that explicitly shift into and out of Unicode scalar mode, where those kinds of decisions are handled by the explicit scope of the setting. If common patterns emerge that indicate some version of `\O` would be useful, we can add it in the future.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but what is the behavior of \O if we're not erroring out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be erroring out on \O if it isn't supported.


### Additional protocol to limit option methods

The option-setting methods, like `ignoresCase()`, are implemented as extensions of the `RegexComponent` protocol instead of only on the `Regex` type. This provides convenience when working with `RegexBuilder` syntax, as you don't need to add an additional `Regex { ... }` block around a quantifier or other grouping scope that you want to have a particular behavior. However, it means that the option methods are also available on some types for which their meaning is unclear. In particular, with the `RegexBuilder` module imported, `String` has `RegexComponent` conformance, meaning someone can write nonsensical code like `"literal string".defaultRepetitionBehavior(.possessive)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning isn't unclear, it's clearly meaningless.

str.firstMatch(of: /(?i)cAfÉ/) // "Café"
```

Case insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there's some assumptions hidden in this sentence. Can you expand on that? What if multiple scalars are introduced? What if that happens inside a custom character class? Which notion of case we using? etc.

@airspeedswift airspeedswift merged commit 1b219be into swiftlang:main Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants