Add the Unicode for String Processing proposal #1631

natecook1000 · 2022-04-22T19:15:11Z

This proposal is another component in the larger regex-powered string processing project.


milseman · 2022-06-02T14:09:28Z

proposals/0000-unicode-for-string-processing.md

+let regex5 = /(?i)ba(?-i:na)na/
+```
+
+All option APIs are provided on `RegexComponent`, so they can be called on a `Regex` instance, or on any component that you would use inside a `RegexBuilder` block when the `RegexBuilder` module is imported.


-1. This clearly depends on the option. String shouldn't accumulate senseless API that sounds useful but is a nop.

Options:

Put the methods on `Regex` now, with a protocol as future work

Do the protocol now

milseman · 2022-06-02T14:13:09Z

proposals/0000-unicode-for-string-processing.md

+str.firstMatch(of: /CAFÉ/)          // nil
+str.firstMatch(of: /(?i)CAFÉ/)      // "Café"
+str.firstMatch(of: /(?i)cAfÉ/)      // "Café"
+```


This seems too interior-syntax heavy and we don't get to see the API name in use. What about having an additional line that uses .ignoresCase() after the literal?

Sure, I'll update to include API usage throughout this section. The goal here is dual: we're explaining the semantics of the option and the two ways of applying it.
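A minimal sketch of what such a paired example could look like, showing the `(?i)` syntax next to the `ignoresCase()` method named above. It uses the extended `#/…/#` delimiters only so that it compiles without the bare-slash regex flag, and the precomposed `"Café"` input is an assumption, since the hunk does not show how `str` is defined:

```swift
// Assumed input: the hunk above does not show the definition of `str`.
let str = "Café"

// Matching is case-sensitive by default, so the uppercase pattern fails.
let caseSensitive = str.firstMatch(of: #/CAFÉ/#)

// The option applied with regex syntax inside the literal.
let viaSyntax = str.firstMatch(of: #/(?i)CAFÉ/#)

// The same option applied through the API, after the literal.
let viaMethod = str.firstMatch(of: #/CAFÉ/#.ignoresCase())

print(caseSensitive == nil)
print(viaSyntax?.output ?? "no match")
print(viaMethod?.output ?? "no match")
```

Showing both spellings side by side keeps the option's semantics and its two entry points in one place.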

milseman · 2022-06-02T14:14:07Z

proposals/0000-unicode-for-string-processing.md

+* `D`: Match only ASCII members for `\d`, `\p{Digit}`, `\p{HexDigit}`, `[:digit:]`, and `CharacterClass.digit`.
+* `S`: Match only ASCII members for `\s`, `\p{Space}`, `[:space:]`, and any of the whitespace-representing `CharacterClass` members.
+* `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, and `CharacterClass.word`. Also only considers ASCII characters for `\b`, `\B`, and `Anchor.wordBoundary`.
+* `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`).


What are these letters? Again, no mention or showing of the actual API being proposed.
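For context, a hedged sketch of how the `D` letter lines up with an API call. It assumes the `asciiOnlyDigits()` method spelling from the shipped standard library, which is not the `asciiOnlyClasses(_:)` spelling used in this draft, and the input string is hypothetical:

```swift
// Hypothetical input: Arabic-Indic digits (U+0661...U+0663), then ASCII digits.
let label = "Room \u{661}\u{662}\u{663} / 456"

// By default, `\d` matches any Unicode decimal digit.
let anyDigits = label.firstMatch(of: #/\d+/#)

// `(?D)` restricts `\d` to ASCII digits using regex syntax.
let asciiViaSyntax = label.firstMatch(of: #/(?D)\d+/#)

// The same restriction via the method (assumed spelling, see the note above).
let asciiViaMethod = label.firstMatch(of: #/\d+/#.asciiOnlyDigits())

print(anyDigits?.output ?? "no match")
print(asciiViaSyntax?.output ?? "no match")
print(asciiViaMethod?.output ?? "no match")
```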

milseman · 2022-06-02T14:16:54Z

proposals/0000-unicode-for-string-processing.md

+
+#### Unicode word boundaries
+
+By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode _default word boundaries,_ specified as [Unicode level 2 regular expression support][level2-word-boundaries]. 


I think it would be useful to point out that Swift is deviating from the norm by setting this option by default. The default in the table doesn't make it clear that we're following Unicode's "default" instead of regular expression "default". Might be better to present it as Swift is enabling this advanced Unicode option by default, you can disable it for compatibility.
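A small sketch of the difference being discussed, assuming the `wordBoundaryKind(_:)` method and its `.simple` value as the way to opt back into the traditional regex behavior. The `#/…/#` delimiters are used only so the snippet compiles without the bare-slash regex flag:

```swift
let str = "Don't look down!"

// Default: Unicode default word boundaries, which do not break
// at the apostrophe inside "Don't".
let unicodeBoundary = str.firstMatch(of: #/\b.+?\b/#)

// Simple boundaries: a boundary falls wherever a `\w` character
// meets a non-`\w` character, so the match stops before the apostrophe.
let simpleBoundary = str.firstMatch(
    of: #/\b.+?\b/#.wordBoundaryKind(.simple))

print(unicodeBoundary?.output ?? "no match")
print(simpleBoundary?.output ?? "no match")
```

Under these assumptions the first match is `Don't` with the default and `Don` with simple boundaries, which is the compatibility difference the comment asks the text to call out.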

milseman · 2022-06-02T14:19:18Z

proposals/0000-unicode-for-string-processing.md

+// Prints "false"
+```
+
+With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a `String` instance.


The first sentence packs a lot. It seems like its own paragraph if not section to unpack what it implies. Also, what's special about captures that isn't special about anything else in a regex? Why isn't there a boundary around an alternation, character class, etc?

Maybe a sub-section about the outputs of a regex and how those indices are aligned.
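To make the index-alignment concern concrete, here is a hedged sketch comparing the end of a match under the two semantic levels. It assumes the `matchingSemantics(_:)` method with the `.unicodeScalar` level, and the decomposed input string is hypothetical:

```swift
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
let decomposed = "cafe\u{301}"

// Grapheme cluster semantics: `.` consumes the whole "e + accent" character.
let grapheme = decomposed.firstMatch(of: #/caf./#)

// Unicode scalar semantics: `.` consumes only the "e" scalar, so the
// match ends between the base letter and its combining mark.
let scalar = decomposed.firstMatch(
    of: #/caf./#.matchingSemantics(.unicodeScalar))

// Does each match end at the end of the string?
print(grapheme?.range.upperBound == decomposed.endIndex)
print(scalar?.range.upperBound == decomposed.endIndex)
```

The second match ends at an index that sits inside a `Character`, which is the kind of output a dedicated sub-section on index alignment would need to explain.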

milseman · 2022-06-02T14:54:02Z

proposals/0000-unicode-for-string-processing.md

+  /// A character class that matches any element that is classified as
+  /// whitespace.
+  ///
+  /// This character class is equivalent to `\s` in regex syntax.


Is this affected by ascii only? Yes/no? Should there be an additional one that is or isn't affected?

milseman · 2022-06-02T14:55:04Z

proposals/0000-unicode-for-string-processing.md

+  /// Calling this method with a group of Unicode scalars is equivalent to
+  /// listing them in a custom character class in regex syntax.
+  public static func anyOf<S: Sequence>(_ s: S) -> CharacterClass
+    where S.Element == UnicodeScalar


What's the difference between anyOf(str) and anyOf(str.unicodeScalars)?

milseman · 2022-06-02T14:56:16Z

proposals/0000-unicode-for-string-processing.md

+
+An earlier draft of this proposal included a metacharacter and `CharacterClass` API for matching an individual Unicode scalar value, regardless of the current matching level, as a counterpart to `\X`/`.anyGraphemeCluster`. The behavior of this character class, particularly when matching with grapheme cluster semantics, is still unclear at this time, however. For example, when matching the expression `\O*`, does the implicit grapheme boundary assertion apply between the `\O` and the quantification operator, or should we treat the two as a single unit and apply the assertion after the `*`?
+
+At the present time, we prefer to allow authors to write regexes that explicitly shift into and out of Unicode scalar mode, where those kinds of decisions are handled by the explicit scope of the setting. If common patterns emerge that indicate some version of `\O` would be useful, we can add it in the future.


Ok, but what is the behavior of \O if we're not erroring out?

We should be erroring out on \O if it isn't supported.

milseman · 2022-06-02T14:56:46Z

proposals/0000-unicode-for-string-processing.md

+
+### Additional protocol to limit option methods
+
+The option-setting methods, like `ignoresCase()`, are implemented as extensions of the `RegexComponent` protocol instead of only on the `Regex` type. This provides convenience when working with `RegexBuilder` syntax, as you don't need to add an additional `Regex { ... }` block around a quantifier or other grouping scope that you want to have a particular behavior. However, it means that the option methods are also available on some types for which their meaning is unclear. In particular, with the `RegexBuilder` module imported, `String` has `RegexComponent` conformance, meaning someone can write nonsensical code like `"literal string".defaultRepetitionBehavior(.possessive)`.


The meaning isn't unclear, it's clearly meaningless.

milseman · 2022-06-02T14:59:37Z

proposals/0000-unicode-for-string-processing.md

+str.firstMatch(of: /(?i)cAfÉ/)      // "Café"
+```
+
+Case insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected.


Seems like there are some assumptions hidden in this sentence. Can you expand on that? What if multiple scalars are introduced? What if that happens inside a custom character class? Which notion of case are we using? etc.


natecook1000 added 2 commits April 22, 2022 14:14

Add the Unicode for String Processing proposal

112e6f7

Add a TOC

8b3c6e7

PsychoH13 reviewed Apr 23, 2022


Apply suggestions from review

b4952ce

Co-authored-by: Remy Demarest <[email protected]>

benrimmington reviewed Apr 28, 2022


natecook1000 added 19 commits May 24, 2022 11:07

Update with corrections / proposed API

bf7a630

Add information about default options

93863c0

Option setting clarifications.

e433920

Clean up ASCII mode language

180c39d

Remove "ASCII mode" language, since we don't really have a single ASCII mode, and centralize discussion in the `asciiOnlyClasses(_:)` option section.

Add alternative about default options

06e28f8

Add more context about semantic modes

6fd0a4e

More detailed/correct explanation of CCCs

fa10ed7

In particular, range behavior has a correct description here.

Update to defaultRepetitionBehavior, other revisions

9dc7151

Update CharacterClass docs

52106e8

Minor edits

5687131

Reorganize options summary table

fc8c9cf

Correct / improve matching semantics discussion

f476cda

Move regex syntax-specific options lower

ac197b8

Improve/clarify .any/.anyGraphemeCluster

9e10160

Remove description of \O

Improve alternatives

4146a44

Add table showing Unicode property behavior

2f00fc1

This table, and the accompanying prose, describe the way each Unicode scalar property is extended to characters.

Add additional alternatives and future work

72b9391

Update errant Unicode property strategies

1f5d296

Add version history

c282108

natecook1000 marked this pull request as ready for review May 31, 2022 19:32

Merge branch 'main' into unicode_string_processing

726c4f1

milseman suggested changes Jun 2, 2022


natecook1000 and others added 3 commits June 14, 2022 14:43

Revisions based on feedback

d4e2aa4

Add a note about Swift regex differences

8aeeb6f

Update and rename 0000-unicode-for-string-processing.md to 0363-unico…

e12e766

…de-for-string-processing.md

natecook1000 and others added 2 commits June 25, 2022 13:09

Finish sentence about future APIs

42ba5b4

Update 0363-unicode-for-string-processing.md

77e533c

airspeedswift merged commit 1b219be into swiftlang:main Jun 27, 2022




