Link words from Chinese Japanese and Korean #13

Closed

Conversation

trinity-1686a

fix #12

@withoutboats
Owner

Thanks for the PR! I'm a bit ill at ease about doing this, but I'm open to being convinced.

Here are some concerns that come to mind:

  1. Maintainability. I'm worried about the long-term implications of continuing to add our own special cases to this library. We currently follow Unicode word segmentation rules, with some intra-"word" segmentations based on common casing conventions. Here, we would begin not following Unicode word segmentation rules in some contexts. As we continue to add these ad hoc rules, and as Unicode continues to assign more characters, the complexity of the algorithm will just keep escalating.
  2. Correctness. It's clear that in your use case (URL normalization), avoiding the hyphens would be good. But it's not clear to me that in all uses, and in all casing forms, a sequence of words in these character sets should be treated as a single word.
  3. Comprehensibility. Along the same lines as maintainability, we want the behavior of the crate to avoid getting too unpredictable for users. The more special cases we have, the harder it will be for users to understand what this crate actually does.

One immediate improvement I can think of would be to base this distinction on characters belonging to specific character sets, rather than on raw Unicode ranges. Is there a Unicode crate we could depend on that determines which character set a character belongs to, and then is there a clear set of character sets we could apply this rule in?
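A minimal sketch of what that could look like, assuming the `unicode-script` crate (which maps a `char` to its Unicode script); the crate choice and the helper name are illustrative, not something this PR uses:

```rust
// Hypothetical helper: decide whether a character belongs to one of the
// scripts this PR special-cases (Han, Hiragana, Katakana, Hangul) by
// looking up its Unicode script instead of hard-coded code-point ranges.
use unicode_script::{Script, UnicodeScript};

fn is_cjk_script(c: char) -> bool {
    matches!(
        c.script(),
        Script::Han | Script::Hiragana | Script::Katakana | Script::Hangul
    )
}

fn main() {
    assert!(is_cjk_script('漢'));
    assert!(is_cjk_script('か'));
    assert!(!is_cjk_script('a'));
}
```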

But I also have these larger concerns, and so I'd like to talk this out more to try to figure out what the best approach is. I'm still on the fence about merging this, so please feel free to make the argument.

@trinity-1686a
Author

I don't think the argument will be too convincing.

First, about maintainability and comprehensibility, I totally agree with you: having exceptions to handle makes things both harder to maintain (for you) and harder to grasp (for newcomers).
Then correctness: I don't speak any of the languages concerned, but from what I understand, there are no spaces between words in these languages, unlike most Western languages, so it might actually be more correct not to put hyphens between them.

I'm not used to working much with Unicode, but after a little searching, I think that if there is such a crate, it's ucd. However, I'm not sure it even does what we would need.

I totally understand your position. If this is rejected (and maybe it should be), I think I'll be using the Iterator API you mentioned in #7, and I'm happy to help on that side if needed.

@pickfire

pickfire commented Jan 5, 2021

Oh, I didn't notice this patch when working on #28.

@pickfire

pickfire commented Jan 5, 2021

For Chinese words, the original case conversion takes each character as a word, so each Chinese character ends up getting a hyphen, like abcde becoming a-b-c-d-e. This would not take care of characters that cross language boundaries, but it should be way more than enough for general use.
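A minimal sketch of the behaviour described above, assuming the `unicode-segmentation` crate's UAX #29 word iterator (not code from this PR):

```rust
// Under default UAX #29 word segmentation, each Han character is its own
// "word", so a conversion that joins words with hyphens inserts one hyphen
// per character.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let words: Vec<&str> = "你好世界".unicode_words().collect();
    assert_eq!(words, vec!["你", "好", "世", "界"]);
    // Reproduces the a-b-c-d-e pattern described above.
    assert_eq!(words.join("-"), "你-好-世-界");
}
```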

@trinity-1686a
Author

Closing as this is superseded by #28.

Linked issue: Handle Japanese character more correctly