-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the Unicode for String Processing proposal #1631
Add the Unicode for String Processing proposal #1631
Conversation
Co-authored-by: Remy Demarest <[email protected]>
Remove "ASCII mode" language, since we don't really have a single ASCII mode, and centralize discussion in the `asciiOnlyClasses(_:)` option section.
In particular, range behavior has a correct description here.
Remove description of \O
This table, and the accompanying prose, describe the way each Unicode scalar property is extended to characters.
let regex5 = /(?i)ba(?-i:na)na/ | ||
``` | ||
|
||
All option APIs are provided on `RegexComponent`, so they can be called on a `Regex` instance, or on any component that you would use inside a `RegexBuilder` block when the `RegexBuilder` module is imported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1. This clearly depends on the option. String shouldn't accumulate senseless API that sounds useful but is a nop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options:
- Put on
Regex
and future work is a protocol - Do the protocol now
str.firstMatch(of: /CAFÉ/) // nil | ||
str.firstMatch(of: /(?i)CAFÉ/) // "Café" | ||
str.firstMatch(of: /(?i)cAfÉ/) // "Café" | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too interior-syntax heavy and we don't get to see the API name in use. What about having an additional line that uses .ignoresCase()
after the literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll update to include API usage throughout this section. The goal is here is dual — we're explaining the semantics of the option and the two ways of applying it.
* `D`: Match only ASCII members for `\d`, `\p{Digit}`, `\p{HexDigit}`, `[:digit:]`, and `CharacterClass.digit`. | ||
* `S`: Match only ASCII members for `\s`, `\p{Space}`, `[:space:]`, and any of the whitespace-representing `CharacterClass` members. | ||
* `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, and `CharacterClass.word`. Also only considers ASCII characters for `\b`, `\B`, and `Anchor.wordBoundary`. | ||
* `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these letters? Again, no mention or showing of the actual API being proposed.
|
||
#### Unicode word boundaries | ||
|
||
By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode _default word boundaries,_ specified as [Unicode level 2 regular expression support][level2-word-boundaries]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to point out that Swift is deviating from the norm by setting this option by default. The default
in the table doesn't make it clear that we're following Unicode's "default" instead of regular expression "default". Might be better to present it as Swift is enabling this advanced Unicode option by default, you can disable it for compatibility.
// Prints "false" | ||
``` | ||
|
||
With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a `String` instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first sentence packs a lot. It seems like its own paragraph if not section to unpack what it implies. Also, what's special about captures that isn't special about anything else in a regex? Why isn't there a boundary around an alternation, character class, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a sub-section about the outputs of a regex and how those indices are aligned.
/// A character class that matches any element that is classified as | ||
/// whitespace. | ||
/// | ||
/// This character class is equivalent to `\s` in regex syntax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this affected by ascii only? Yes/no? Should there be an additional one that is or isn't affected?
/// Calling this method with a group of Unicode scalars is equivalent to | ||
/// listing them in a custom character class in regex syntax. | ||
public static func anyOf<S: Sequence>(_ s: S) -> CharacterClass | ||
where S.Element == UnicodeScalar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between anyOf(str)
and anyOf(str.unicodeScalars)
?
|
||
An earlier draft of this proposal included a metacharacter and `CharacterClass` API for matching an individual Unicode scalar value, regardless of the current matching level, as a counterpart to `\X`/`.anyGraphemeCluster`. The behavior of this character class, particularly when matching with grapheme cluster semantics, is still unclear at this time, however. For example, when matching the expression `\O*`, does the implict grapheme boundary assertion apply between the `\O` and the quantification operator, or should we treat the two as a single unit and apply the assertion after the `*`? | ||
|
||
At the present time, we prefer to allow authors to write regexes that explicitly shift into and out of Unicode scalar mode, where those kinds of decisions are handled by the explicit scope of the setting. If common patterns emerge that indicate some version of `\O` would be useful, we can add it in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but what is the behavior of \O
if we're not erroring out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be erroring out on \O
if it isn't supported.
|
||
### Additional protocol to limit option methods | ||
|
||
The option-setting methods, like `ignoresCase()`, are implemented as extensions of the `RegexComponent` protocol instead of only on the `Regex` type. This provides convenience when working with `RegexBuilder` syntax, as you don't need to add an additional `Regex { ... }` block around a quantifier or other grouping scope that you want to have a particular behavior. However, it means that the option methods are also available on some types for which their meaning is unclear. In particular, with the `RegexBuilder` module imported, `String` has `RegexComponent` conformance, meaning someone can write nonsensical code like `"literal string".defaultRepetitionBehavior(.possessive)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning isn't unclear, it's clearly meaningless.
str.firstMatch(of: /(?i)cAfÉ/) // "Café" | ||
``` | ||
|
||
Case insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like there's some assumptions hidden in this sentence. Can you expand on that? What if multiple scalars are introduced? What if that happens inside a custom character class? Which notion of case we using? etc.
This proposal is another component in the larger regex-powered string processing project.