[Draft][Require RFC] mb_levenshtein function #16043
ref: #10180
CI is failing. I'll fix it as soon as possible.
@youkidearitai, thanks very much (as always) for your work on mbstring. It looks like the RFC for […] The code generally looks good to me. I do think you need more tests, though.
☝🏻 These would be some good ways to start testing this code more thoroughly, but they are not the end. I think there is still more that could be done. For example:
Maybe you could add debug assertions at the end of the function saying that
@alexdowad Thank you very much for your review! I would like to address the points raised.
I'm interested. Could you share the fuzzing code with me?
Co-authored-by: Niels Dossche
I have a general concern about this, namely because MBString works on code points rather than graphemes, to my knowledge. However, I would expect a Levenshtein distance to be calculated on graphemes. For instance, I would expect the two strings at https://3v4l.org/iZonm to have distance 0. If the distance is calculated based on code points, users would need to apply Unicode normalization, which is not supported by MBString at all (or has that changed in the meantime?)
@cmb69 Indeed. We need support for grapheme clusters in
Honestly, I'm unsure whether this should be implemented as an mbstring function or as a grapheme function. (For example: should strrev be implemented in mbstring or in grapheme?)
e6d777b to f764cbf
Dear @youkidearitai, I would like to kindly ask that after working on points raised by @nielsdos, you then work on adding more tests to the test suite, as per my suggestions. After that, if CI is passing and there are no more comments from reviewers, I can work with you to fuzz the code. I can either just write a fuzzer and give it to you, or work together with you to write it.
Hmm... CI is failing on Windows, but it passes in my environment. Why is that?
The soap failures are unrelated (see #16084).
Co-authored-by: Christoph M. Becker
@cmb69 Thank you very much!
A userland function to compare multiple code points. We use that code in the tests.
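For readers following along, a userland reference of this kind might look roughly like the sketch below. This is not the code from the PR: the names `codepoints()` and `userland_mb_levenshtein()` are hypothetical, the input is assumed to be valid UTF-8, and every insertion, deletion, and substitution is assumed to cost 1.

```php
<?php
// Hypothetical helper (not from this PR): split a UTF-8 string into
// code points using PCRE's /u mode, which needs no extra extension.
function codepoints(string $s): array
{
    $parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $parts;
}

// Hypothetical userland Levenshtein distance over code points,
// with a cost of 1 for each insertion, deletion, and substitution.
function userland_mb_levenshtein(string $a, string $b): int
{
    $x = codepoints($a);
    $y = codepoints($b);
    $n = count($y);

    // $prev[$j] is the distance between the processed prefix of $x
    // and the first $j code points of $y.
    $prev = range(0, $n);
    foreach ($x as $i => $cx) {
        $curr = [$i + 1];
        for ($j = 0; $j < $n; $j++) {
            $cost = ($cx === $y[$j]) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1, // delete a code point from $x
                $curr[$j] + 1,     // insert a code point from $y
                $prev[$j] + $cost  // substitute, or keep if equal
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

var_dump(userland_mb_levenshtein('kitten', 'sitting'));   // int(3)
var_dump(userland_mb_levenshtein("\u{00E9}", 'e'));       // int(1): one code point differs
var_dump(levenshtein("\u{00E9}", 'e'));                   // int(2): the built-in compares bytes
```

Keeping only two rows of the table keeps memory proportional to the length of the second string, which matters once the number of random test cases grows.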
13ba048 to 4fbb4d4
Add watchstate's test code
It processes each code point individually, so it might be slow.
I added a test case for variation selectors. The result may seem unintuitive, because mbstring operates on code points.
I think one use case is comparing emoji code point by code point.
I added a test case for emoji. I think one use case
echo '--- Usecase of userland code ---' . \PHP_EOL;

$bytes = "";
for ($i = 0; $i < 100; $i++) {
I see you are using 100 random test cases here. Maybe you are afraid of making the tests too slow? But unless the Levenshtein algorithm is very expensive, I think increasing this to 1000 or 10,000 shouldn't affect runtime of the test much. Am I wrong?
See https://en.wikipedia.org/wiki/Levenshtein_distance#Computational_complexity. So yes, the algorithm is rather expensive, but unless the test is slower than ~1sec, we can increase the number of iterations. And we might increase even further when marking the test as SLOW_TEST.
@youkidearitai, it's good that you are gradually adding more tests here. Just one thought came to my mind... if the user passes huge values for […]

I also see that in the LINUX_X64_RELEASE_NTS test run (https://github.com/php/php-src/actions/runs/11322861154/job/31484377091?pr=16043), it seems the new C code returned different results from the PHP userland implementation. That needs to be investigated.

Which reminds me... when you include randomly generated test cases, it is always good to explicitly seed the RNG, so that you get the same randomly generated test cases every time, and then the test results are consistent. In some other
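As an illustration of the seeding advice, here is a minimal sketch of a reproducible test-case generator. The function name `random_utf8_strings()`, the seed value, and the small alphabet are all assumptions made for this example; the PR's actual test may generate its inputs differently.

```php
<?php
// Hypothetical generator (not from this PR): builds random UTF-8 strings
// from a fixed alphabet after seeding the Mersenne Twister explicitly.
function random_utf8_strings(int $seed, int $count, int $maxLen): array
{
    mt_srand($seed); // fixed seed, so mt_rand() yields the same sequence each run

    // Assumed alphabet mixing 1-, 2-, 3-, and 4-byte UTF-8 code points.
    $alphabet = ['a', 'b', "\u{00E9}", "\u{3042}", "\u{1F600}"];
    $last = count($alphabet) - 1;

    $out = [];
    for ($i = 0; $i < $count; $i++) {
        $len = mt_rand(0, $maxLen);
        $s = '';
        for ($j = 0; $j < $len; $j++) {
            $s .= $alphabet[mt_rand(0, $last)];
        }
        $out[] = $s;
    }
    return $out;
}

$first  = random_utf8_strings(12345, 100, 8);
$second = random_utf8_strings(12345, 100, 8);
var_dump($first === $second); // bool(true): both runs produce identical cases
```

With a fixed seed, a mismatch between the C code and the userland implementation can be reproduced from the failing iteration alone.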
You are correct that using graphemes rather than codepoints is more appropriate for calculating the Levenshtein distance. However, your example is more an issue of Unicode equivalence and normalization than graphemes. If the Levenshtein function implicitly performs normalization, it would introduce unavoidable and complex behaviors. It would be better to have users perform normalization themselves, if necessary, before passing data to the Levenshtein function. In my opinion, the result of
Thanks @zonuexe. We also suggest applying normalization before using mb_levenshtein.
Please see below: the ICU (usearch.h) matching algorithm uses compatibility equivalence. Therefore, the result is 0.
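To make the normalization point concrete, the sketch below shows how a caller could normalize both inputs to NFC before any code-point-based comparison. It assumes the intl extension is loaded (it provides the `Normalizer` class); the helper name `to_nfc()` is hypothetical, and the example does not call `mb_levenshtein` itself.

```php
<?php
// Hypothetical helper: normalize a string to NFC, assuming ext/intl is available.
function to_nfc(string $s): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required for normalization');
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalized');
    }
    return $normalized;
}

$composed   = "\u{00E9}";   // "é" as a single code point
$decomposed = "e\u{0301}";  // "e" followed by COMBINING ACUTE ACCENT

// The raw strings differ, so a code-point-based distance would be nonzero.
var_dump($composed === $decomposed); // bool(false)

if (class_exists('Normalizer')) {
    // After NFC normalization both strings are identical,
    // so any edit distance between them is 0.
    var_dump(to_nfc($composed) === to_nfc($decomposed)); // bool(true)
}
```

Leaving this step to the caller keeps the distance function itself predictable, which is the design preference expressed in the comments above.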
Please give me some time. I'll reply later.
I would like to resume this PR. Just a moment, please.
RFC: https://wiki.php.net/rfc/mb_levenshtein
I would like a multibyte version of the levenshtein function, so I implemented mb_levenshtein. This PR requires an RFC.