Link words from Chinese Japanese and Korean #13

Closed

Conversation

trinity-1686a

fix #12

@withoutboats
Owner

Thanks for the PR! I'm a bit ill at ease about doing this, but I'm open to being convinced.

Here are some concerns that come to mind:

  1. Maintainability. I'm worried about the long-term implications of continuing to add our own special cases to this library. We currently follow Unicode word segmentation rules, with some intra-"word" segmentations based on common casing conventions. Here, we would begin not following Unicode word segmentation rules in some contexts. As we continue to add these ad hoc rules, and as Unicode continues to assign more characters, the complexity of the algorithm will just keep escalating.
  2. Correctness. It's clear that in your use case (URL normalization), avoiding the hyphens would be good. But it's not clear to me that in all uses, and in all casing forms, a sequence of words in these character sets should be treated as a single word.
  3. Comprehensibility. Along the same lines as maintainability, we want the behavior of the crate to avoid getting too unpredictable for users. The more special cases we have, the harder it will be for users to understand what this crate actually does.

One immediate improvement I can think of would be to base this distinction on characters belonging to specific character sets, rather than on raw Unicode ranges. Is there a Unicode crate we could depend on that determines which character set a character belongs to, and then is there a clear set of character sets we could apply this rule in?
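A minimal sketch of what that could look like, assuming the `unicode-script` crate (which maps a `char` to its Unicode script); the crate choice and the helper name are illustrative, not something this PR uses:

```rust
// Hypothetical helper: decide whether a character belongs to one of the
// scripts this PR special-cases (Han, Hiragana, Katakana, Hangul) by
// looking up its Unicode script instead of hard-coded code-point ranges.
use unicode_script::{Script, UnicodeScript};

fn is_cjk_script(c: char) -> bool {
    matches!(
        c.script(),
        Script::Han | Script::Hiragana | Script::Katakana | Script::Hangul
    )
}

fn main() {
    assert!(is_cjk_script('漢'));
    assert!(is_cjk_script('か'));
    assert!(!is_cjk_script('a'));
}
```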

But I also have these larger concerns, and so I'd like to talk this out more to try to figure out what the best approach is. I'm still on the fence about merging this, so please feel free to make the argument.

@trinity-1686a
Author

I don't think the argument will be too convincing.

First, about maintainability and comprehensibility, I totally agree with you: having exceptions to handle makes things both harder to maintain (for you) and harder to grasp (for newcomers).
Then correctness: I don't speak any of the languages concerned, but from what I understand, there are no spaces between words in these languages, unlike most Western languages, so it might actually be more correct not to put hyphens between them.

I'm not used to working much with Unicode, but after a little searching, I think that if there is such a crate, it's ucd. However, I'm not sure it even does what we would need.

I totally understand your position. If this is rejected (and maybe it should be), I think I'll be using the Iterator API you mentioned in #7, and I'm happy to help on that side if needed.

@pickfire

pickfire commented Jan 5, 2021

Oh, I didn't notice this patch when working on #28.

@pickfire

pickfire commented Jan 5, 2021

For Chinese words, the original case conversion takes each character as a word, so each Chinese character ends up getting a hyphen, like abcde becoming a-b-c-d-e. This would not take care of characters that cross language boundaries, but it should be way more than enough for general use.
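A minimal sketch of the behaviour described above, assuming the `unicode-segmentation` crate's UAX #29 word iterator (not code from this PR):

```rust
// Under default UAX #29 word segmentation, each Han character is its own
// "word", so a conversion that joins words with hyphens inserts one hyphen
// per character.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let words: Vec<&str> = "你好世界".unicode_words().collect();
    assert_eq!(words, vec!["你", "好", "世", "界"]);
    // Reproduces the a-b-c-d-e pattern described above.
    assert_eq!(words.join("-"), "你-好-世-界");
}
```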

@trinity-1686a
Author

Closing as this is superseded by #28.

Linked issue: Handle Japanese character more correctly