Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft][Require RFC] mb_levenshtein function #16043

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

youkidearitai
Copy link
Contributor

RFC: https://wiki.php.net/rfc/mb_levenshtein

I would like want to multibyte of levenshtein function. So I implemented to mb_levenshtein function. This PR require an RFC.

@youkidearitai
Copy link
Contributor Author

ref: #10180

@youkidearitai
Copy link
Contributor Author

youkidearitai commented Sep 25, 2024

CI is failed. I'll fix as soon as possible.

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
@alexdowad
Copy link
Contributor

alexdowad commented Sep 25, 2024

@youkidearitai, thanks very much (as always) for your work on mbstring. It looks like the RFC for mb_levenshtein is not yet approved; is that right? If so, I guess this code is a POC showing that the RFC is practical to implement?

The code generally looks good to me. I do think you need more tests, though.

  1. Since your code uses buffers of 128 codepoints for decoding chunks of the input strings, you should test with strings with codepoint length <128, =128, and >128 (for both string1 and string2, with all combinations).

  2. I see that you have tests using both UTF-8 and SJIS, which is great. But I would like to encourage you to test with more text encodings.

  3. You also mentioned that there are userland implementations already of this algorithm? If so, I recommend you cut-and-paste an existing (pure PHP) implementation into your test file (make sure the license is compatible and credit the author). Then generate 10,000 or so random strings (if this does not make the test too slow), pass them to both the userland implementation and the new mb_levenshtein, and assert that the return value is the same.

☝🏻 These would some good ways to start testing this code more thoroughly, but they are not the end. I think there is still more that could be done. For example:

  1. Is it true that a lower bound on Levenshtein distance is abs(len1 - len2) * cost_ins? If so, maybe you could add a debug assertion at the end of the function like ZEND_ASSERT(c0 >= lower_bound);.

  2. I guess an upper bound would be either (min(len1, len2) * cost_rep) + (abs(len1 - len2) * cost_ins), or else (min(len1, len2) * cost_del) + (max(len1, len2) * cost_ins), depending on the cost values involved?

Maybe you could add debug assertions at the end of the function saying that c0 is <= both of those quantities?

  1. It is also important to fuzz your code. Let me know if you need help on how to do this.

@youkidearitai
Copy link
Contributor Author

youkidearitai commented Sep 25, 2024

@alexdowad Thank you very much for your review! I would like to reflect on the points raised.

It is also important to fuzz your code. Let me know if you need help on how to do this.

I'm interested. Could you tell me fuzz code?

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
Co-authored-by: Niels Dossche <[email protected]>
@cmb69
Copy link
Member

cmb69 commented Sep 25, 2024

I have a general concern about this, namely because MBString works on codepoints but not graphemes, to my knowledge. However, I would expect a levenshtein distance to be calculated on graphemes. For instance, I would expect the two string of https://3v4l.org/iZonm to have distance 0. If the distance is calculated based on codepoints, users would need to apply Unicode normalization, which is not supported by MBString at all (or has that changed in the meantime?)

@youkidearitai
Copy link
Contributor Author

@cmb69 Indeed. We need support to grapheme cluster in grapheme_levenshtein . However, I think need support other character encoding (not Unicode) for levenshtein function. In conclusion, I think need also grapheme_levenshtein and mb_levenshtein.

@youkidearitai
Copy link
Contributor Author

youkidearitai commented Sep 25, 2024

Surely, I'm confused that should to implement to mbstring function or grapheme function. (ex: Is strrev should implement to mbstring or grapheme?)

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
@alexdowad
Copy link
Contributor

I'm interested. Could you tell me fuzz code?

Dear @youkidearitai, I would like to kindly ask that after working on points raised by @nielsdos, you then work on adding more tests to the test suite, as per my suggestions. After that, if CI is passing and there are no more comments from reviewers, I can work with you to fuzz the code.

I can either just write a fuzzer and give it to you, or work together with you to write it.

@youkidearitai
Copy link
Contributor Author

Hmm... CI is not passed on Windows. My environment is passed. Why is it?

@cmb69
Copy link
Member

cmb69 commented Sep 27, 2024

The soap failures are unrelated (see #16084).

@youkidearitai
Copy link
Contributor Author

@cmb69 Thank you very much!

A userland function to compare multiple code points.
We uses that code for test code.
Add watchstate's test code
@youkidearitai
Copy link
Contributor Author

Processed for each code point. it might be slow.
And I used code from https://github.com/arabcoders/watchstate for test code . That code is MIT License.

@youkidearitai
Copy link
Contributor Author

I added test case for variable selector. This seems not intuitive, because mbstring is based on codepoint unit.
I'll try to implement for grapheme_levenshtein function.

I think one of usecase is compare per codepoint emoji.
@youkidearitai
Copy link
Contributor Author

I added test case for Emoji. I think one of use case mb_levenshtein that compare emoji per code point.

echo '--- Usecase of userland code ---' . \PHP_EOL;

$bytes = "";
for ($i = 0; $i < 100; $i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you are using 100 random test cases here. Maybe you are afraid of making the tests too slow? But unless the Levenshtein algorithm is very expensive, I think increasing this to 1000 or 10,000 shouldn't affect runtime of the test much. Am I wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://en.wikipedia.org/wiki/Levenshtein_distance#Computational_complexity. So yes, the algorithm is rather expensive, but unless the test is slower than ~1sec, we can increase the number of iterations. And we might increase even further when marking the test as SLOW_TEST.

@alexdowad
Copy link
Contributor

@youkidearitai, it's good that you are gradually adding more tests here.

Just one thought came to my mind... if the user passes huge values for cost_ins, cost_rep, and cost_del, could there be integer overflow? Is that an issue which we need to consider? (If "no", then I apologize for raising the issue unnecessarily.)

I also see that in the LINUX_X64_RELEASE_NTS test run (https://github.com/php/php-src/actions/runs/11322861154/job/31484377091?pr=16043), it seems the new C code returned different results from the PHP userland implementation. That needs to be investigated.

Which reminds me... when you include randomly generated test cases, it is always good to explicitly seed the RNG, so that you get the same randomly generated test cases every time, and then the test results are consistent. In some other .phpt test files for mbstring, you can see that I use srand for this purpose. (Just pick some arbitrary integer as the parameter to srand.) I'm not sure if the RNG function you are using requires a different way of setting the seed.

@zonuexe
Copy link

zonuexe commented Oct 16, 2024

@cmb69

I have a general concern about this, namely because MBString works on codepoints but not graphemes, to my knowledge. However, I would expect a levenshtein distance to be calculated on graphemes. For instance, I would expect the two string of https://3v4l.org/iZonm to have distance 0. If the distance is calculated based on codepoints, users would need to apply Unicode normalization, which is not supported by MBString at all (or has that changed in the meantime?)

You are correct that using graphemes rather than codepoints is more appropriate for calculating the Levenshtein distance. However, your example is more an issue of Unicode equivalence and normalization than graphemes. If the Levenshtein function implicitly performs normalization, it would introduce unavoidable and complex behaviors. It would be better to have users perform normalization themselves, if necessary, before passing data to the Levenshtein function.

In my opinion, the result of grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}") should be 1.

@youkidearitai
Copy link
Contributor Author

Thanks @zonuexe. We suggest use normalization when use mb_levenshtein too.

In my opinion, the result of grapheme_levenshtein("\u{0065}\u{0301}", "\u{00e9}") should be 1.

Please watch below:
https://github.com/php/php-src/pull/16428/files#diff-f70cf2c7e6e1cd41b406d63139571a962537a9b40fa5a14b2d14afb026dbe404R62

ICU(usearch.h) match algorithm is Compatibility Equivalent. Therefore, result is 0.

@youkidearitai
Copy link
Contributor Author

Please take a time. I'll reply later.

@youkidearitai
Copy link
Contributor Author

I would like resume this PR. Just a moment, please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants