[Draft][Require RFC] mb_levenshtein function #16043

youkidearitai · 2024-09-25T07:13:37Z

RFC: https://wiki.php.net/rfc/mb_levenshtein

I would like want to multibyte of levenshtein function. So I implemented to mb_levenshtein function. This PR require an RFC.

youkidearitai · 2024-09-25T07:14:25Z

ref: #10180

youkidearitai · 2024-09-25T08:16:04Z

CI is failed. I'll fix as soon as possible.

ext/mbstring/mbstring.c

alexdowad · 2024-09-25T11:46:20Z

@youkidearitai, thanks very much (as always) for your work on mbstring. It looks like the RFC for mb_levenshtein is not yet approved; is that right? If so, I guess this code is a POC showing that the RFC is practical to implement?

The code generally looks good to me. I do think you need more tests, though.

Since your code uses buffers of 128 codepoints for decoding chunks of the input strings, you should test with strings with codepoint length <128, =128, and >128 (for both string1 and string2, with all combinations).
I see that you have tests using both UTF-8 and SJIS, which is great. But I would like to encourage you to test with more text encodings.
You also mentioned that there are userland implementations already of this algorithm? If so, I recommend you cut-and-paste an existing (pure PHP) implementation into your test file (make sure the license is compatible and credit the author). Then generate 10,000 or so random strings (if this does not make the test too slow), pass them to both the userland implementation and the new mb_levenshtein, and assert that the return value is the same.

☝🏻 These would some good ways to start testing this code more thoroughly, but they are not the end. I think there is still more that could be done. For example:

Is it true that a lower bound on Levenshtein distance is abs(len1 - len2) * cost_ins? If so, maybe you could add a debug assertion at the end of the function like ZEND_ASSERT(c0 >= lower_bound);.
I guess an upper bound would be either (min(len1, len2) * cost_rep) + (abs(len1 - len2) * cost_ins), or else (min(len1, len2) * cost_del) + (max(len1, len2) * cost_ins), depending on the cost values involved?

Maybe you could add debug assertions at the end of the function saying that c0 is <= both of those quantities?

It is also important to fuzz your code. Let me know if you need help on how to do this.

youkidearitai · 2024-09-25T13:26:11Z

@alexdowad Thank you very much for your review! I would like to reflect on the points raised.

It is also important to fuzz your code. Let me know if you need help on how to do this.

I'm interested. Could you tell me fuzz code?

ext/mbstring/mbstring.c

Co-authored-by: Niels Dossche <[email protected]>

cmb69 · 2024-09-25T16:33:45Z

I have a general concern about this, namely because MBString works on codepoints but not graphemes, to my knowledge. However, I would expect a levenshtein distance to be calculated on graphemes. For instance, I would expect the two string of https://3v4l.org/iZonm to have distance 0. If the distance is calculated based on codepoints, users would need to apply Unicode normalization, which is not supported by MBString at all (or has that changed in the meantime?)

youkidearitai · 2024-09-25T16:51:26Z

@cmb69 Indeed. We need support to grapheme cluster in grapheme_levenshtein . However, I think need support other character encoding (not Unicode) for levenshtein function. In conclusion, I think need also grapheme_levenshtein and mb_levenshtein.

youkidearitai · 2024-09-25T16:54:55Z

Surely, I'm confused that should to implement to mbstring function or grapheme function. (ex: Is strrev should implement to mbstring or grapheme?)

ext/mbstring/mbstring.c

alexdowad · 2024-09-26T11:19:58Z

I'm interested. Could you tell me fuzz code?

Dear @youkidearitai, I would like to kindly ask that after working on points raised by @nielsdos, you then work on adding more tests to the test suite, as per my suggestions. After that, if CI is passing and there are no more comments from reviewers, I can work with you to fuzz the code.

I can either just write a fuzzer and give it to you, or work together with you to write it.

youkidearitai · 2024-09-27T12:34:12Z

Hmm... CI is not passed on Windows. My environment is passed. Why is it?

ext/mbstring/tests/mb_levenshtein.phpt

cmb69 · 2024-09-27T12:43:47Z

The soap failures are unrelated (see #16084).

Co-authored-by: Christoph M. Becker <[email protected]>

youkidearitai · 2024-09-27T12:45:28Z

@cmb69 Thank you very much!

A userland function to compare multiple code points. We uses that code for test code.

Add watchstate's test code

youkidearitai · 2024-10-03T16:47:15Z

Processed for each code point. it might be slow.
And I used code from https://github.com/arabcoders/watchstate for test code . That code is MIT License.

youkidearitai · 2024-10-08T17:00:58Z

I added test case for variable selector. This seems not intuitive, because mbstring is based on codepoint unit.
I'll try to implement for grapheme_levenshtein function.

I think one of usecase is compare per codepoint emoji.

youkidearitai · 2024-10-14T07:28:59Z

I added test case for Emoji. I think one of use case mb_levenshtein that compare emoji per code point.

alexdowad · 2024-10-15T12:06:51Z

ext/mbstring/tests/mb_levenshtein.phpt

+echo '--- Usecase of userland code ---' . \PHP_EOL;
+
+$bytes = "";
+for ($i = 0; $i < 100; $i++) {


I see you are using 100 random test cases here. Maybe you are afraid of making the tests too slow? But unless the Levenshtein algorithm is very expensive, I think increasing this to 1000 or 10,000 shouldn't affect runtime of the test much. Am I wrong?

See https://en.wikipedia.org/wiki/Levenshtein_distance#Computational_complexity. So yes, the algorithm is rather expensive, but unless the test is slower than ~1sec, we can increase the number of iterations. And we might increase even further when marking the test as SLOW_TEST.

alexdowad · 2024-10-15T12:13:45Z

@youkidearitai, it's good that you are gradually adding more tests here.

Just one thought came to my mind... if the user passes huge values for cost_ins, cost_rep, and cost_del, could there be integer overflow? Is that an issue which we need to consider? (If "no", then I apologize for raising the issue unnecessarily.)

I also see that in the LINUX_X64_RELEASE_NTS test run (https://github.com/php/php-src/actions/runs/11322861154/job/31484377091?pr=16043), it seems the new C code returned different results from the PHP userland implementation. That needs to be investigated.

Which reminds me... when you include randomly generated test cases, it is always good to explicitly seed the RNG, so that you get the same randomly generated test cases every time, and then the test results are consistent. In some other .phpt test files for mbstring, you can see that I use srand for this purpose. (Just pick some arbitrary integer as the parameter to srand.) I'm not sure if the RNG function you are using requires a different way of setting the seed.

zonuexe · 2024-10-16T07:21:04Z

@cmb69

I have a general concern about this, namely because MBString works on codepoints but not graphemes, to my knowledge. However, I would expect a levenshtein distance to be calculated on graphemes. For instance, I would expect the two string of https://3v4l.org/iZonm to have distance 0. If the distance is calculated based on codepoints, users would need to apply Unicode normalization, which is not supported by MBString at all (or has that changed in the meantime?)

You are correct that using graphemes rather than codepoints is more appropriate for calculating the Levenshtein distance. However, your example is more an issue of Unicode equivalence and normalization than graphemes. If the Levenshtein function implicitly performs normalization, it would introduce unavoidable and complex behaviors. It would be better to have users perform normalization themselves, if necessary, before passing data to the Levenshtein function.

In my opinion, the result of grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}") should be 1.

youkidearitai · 2024-10-16T13:40:30Z

Thanks @zonuexe. We suggest use normalization when use mb_levenshtein too.

In my opinion, the result of grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}") should be 1.

Please watch below:
https://github.com/php/php-src/pull/16428/files#diff-f70cf2c7e6e1cd41b406d63139571a962537a9b40fa5a14b2d14afb026dbe404R62

ICU(usearch.h) match algorithm is Compatibility Equivalent. Therefore, result is 0.

youkidearitai · 2024-10-16T13:41:11Z

Please take a time. I'll reply later.

youkidearitai · 2025-02-05T15:07:46Z

I would like resume this PR. Just a moment, please.

[Draft][Require RFC] mb_levenshtein function

af72b0b

github-actions bot added the Extension: mbstring label Sep 25, 2024

youkidearitai mentioned this pull request Sep 25, 2024

Feature request: levenshtein with multibyte support #10180

Open

Delete unused variable

aa2b209

alexdowad reviewed Sep 25, 2024

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

Fix out of memory

4d8aa99

Fix overflow (maybe)

4b4f8a0

nielsdos reviewed Sep 25, 2024

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

Update ext/mbstring/mbstring.c

952af91

Co-authored-by: Niels Dossche <[email protected]>

nielsdos reviewed Sep 25, 2024

View reviewed changes

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved

Fix zero byte to zero codepoint

f764cbf

youkidearitai force-pushed the mb_levenshtein branch from e6d777b to f764cbf Compare September 26, 2024 01:28

Fix asan memory error

8108bc2

youkidearitai added 2 commits September 27, 2024 00:52

Add test code and remove unnecessary code

3bedd87

Add ISO-8859-1 pattern

9233ecc

cmb69 reviewed Sep 27, 2024

View reviewed changes

ext/mbstring/tests/mb_levenshtein.phpt Show resolved Hide resolved

Update ext/mbstring/tests/mb_levenshtein.phpt

cf56777

Co-authored-by: Christoph M. Becker <[email protected]>

youkidearitai added 2 commits September 30, 2024 00:25

Update test code using from KEINOS/mb_levenshtein

d27cfa0

A userland function to compare multiple code points. We uses that code for test code.

Implove create random strings

4fbb4d4

youkidearitai force-pushed the mb_levenshtein branch from 13ba048 to 4fbb4d4 Compare October 1, 2024 07:58

Fix mb_levenshtein to per codepoint

4f255f7

Add watchstate's test code

Add test case for variable selector

e4de70f

youkidearitai mentioned this pull request Oct 14, 2024

[Draft][Require RFC] Add grapheme_levenshtein function. #16428

Draft

Add test case of emoji.

283512b

I think one of usecase is compare per codepoint emoji.

alexdowad reviewed Oct 15, 2024

View reviewed changes

If testcase if failed, output to UTF-16.

916887e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Draft][Require RFC] mb_levenshtein function #16043

[Draft][Require RFC] mb_levenshtein function #16043

youkidearitai commented Sep 25, 2024

youkidearitai commented Sep 25, 2024

youkidearitai commented Sep 25, 2024 •

edited

Loading

alexdowad commented Sep 25, 2024 •

edited

Loading

youkidearitai commented Sep 25, 2024 •

edited

Loading

cmb69 commented Sep 25, 2024

youkidearitai commented Sep 25, 2024

youkidearitai commented Sep 25, 2024 •

edited

Loading

alexdowad commented Sep 26, 2024

youkidearitai commented Sep 27, 2024

cmb69 commented Sep 27, 2024

youkidearitai commented Sep 27, 2024

youkidearitai commented Oct 3, 2024

youkidearitai commented Oct 8, 2024

youkidearitai commented Oct 14, 2024

alexdowad Oct 15, 2024

cmb69 Oct 15, 2024

alexdowad commented Oct 15, 2024

zonuexe commented Oct 16, 2024

youkidearitai commented Oct 16, 2024

youkidearitai commented Oct 16, 2024

youkidearitai commented Feb 5, 2025

[Draft][Require RFC] mb_levenshtein function #16043

Are you sure you want to change the base?

[Draft][Require RFC] mb_levenshtein function #16043

Conversation

youkidearitai commented Sep 25, 2024

youkidearitai commented Sep 25, 2024

youkidearitai commented Sep 25, 2024 • edited Loading

alexdowad commented Sep 25, 2024 • edited Loading

youkidearitai commented Sep 25, 2024 • edited Loading

cmb69 commented Sep 25, 2024

youkidearitai commented Sep 25, 2024

youkidearitai commented Sep 25, 2024 • edited Loading

alexdowad commented Sep 26, 2024

youkidearitai commented Sep 27, 2024

cmb69 commented Sep 27, 2024

youkidearitai commented Sep 27, 2024

youkidearitai commented Oct 3, 2024

youkidearitai commented Oct 8, 2024

youkidearitai commented Oct 14, 2024

alexdowad Oct 15, 2024

Choose a reason for hiding this comment

cmb69 Oct 15, 2024

Choose a reason for hiding this comment

alexdowad commented Oct 15, 2024

zonuexe commented Oct 16, 2024

youkidearitai commented Oct 16, 2024

youkidearitai commented Oct 16, 2024

youkidearitai commented Feb 5, 2025

youkidearitai commented Sep 25, 2024 •

edited

Loading

alexdowad commented Sep 25, 2024 •

edited

Loading

youkidearitai commented Sep 25, 2024 •

edited

Loading

youkidearitai commented Sep 25, 2024 •

edited

Loading