
Fix surrogate pairs splitting in toText()/fromText() #2

Merged
merged 1 commit into master from fix-surrogate-pairs-in-text-fns on Oct 5, 2022

Conversation

michal-kurz
Collaborator

No description provided.

@dmsnell

dmsnell commented Oct 4, 2022

@michal-kurz thanks for the work on this. I like your approach of pulling the fixup into a separate function, and I'd like to incorporate that into toDelta as well.

It's quite clear that dpm prioritizes performance above all else…is any value in not creating a separate newDiffs array…Likewise…it would require an additional iteration of all diffs for each patch

It's doubtful to me that writing into a new array is worth it. By the way, I wouldn't call it a micro-optimization, and I wouldn't apply "premature optimization" here. If you can run some benchmarks that would be worthwhile (I have run some myself while working on diff-match-patch to test changes), but we know that cloning an array carries memory requirements that we don't need. Instead of premature optimization I'd suggest that array cloning falls into the category of needless inefficiency, in the sense that it doesn't bring us any particular benefit. Even from the perspective of not wanting to modify the arguments passed in, this algorithm should be idempotent, and it's fixing input that was never valid to begin with, so modifying it is fine (the two are semantically equal, encoding issues aside).
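
To make the trade-off concrete, here is a minimal sketch of the two shapes under discussion, assuming diffs are [op, text] tuples as in the JavaScript diff-match-patch; the function names are hypothetical and this is not the code in this PR:

```js
// Hypothetical sketch only, not the actual code in this PR.

// Cloning: allocates a second array (plus tuple copies) the size of `diffs`.
function fixupCloned(diffs) {
  const newDiffs = diffs.map(([op, text]) => [op, text]);
  // ...apply the surrogate fixups to newDiffs...
  return newDiffs;
}

// In place: rewrites the tuples where they sit. Input containing split
// surrogates was never a valid encoding to begin with, so the mutated result
// is semantically equal to what was passed in, with no extra allocation.
function fixupInPlace(diffs) {
  for (let i = 0; i < diffs.length; i++) {
    // ...shift dangling surrogates between diffs[i] and its neighbours...
  }
  return diffs;
}
```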

have you maybe considered fixing at patch creation, so this problem would never occur in the first place? Maybe inside patch_make, or something. Although it may very well be that other functions like diff_cleanupSemantic could re-break the patch... I haven't tried this, just wondering if there could be a central way to fix this once and for all :)

the dissimilar library takes an approach that in some ways I like better, though I don't know if it's totally better or just different. it includes a Unicode cleanup pass during the other cleanup stages.

I'm fairly confident in the approach from google#80 based on the work I did figuring out how to resolve it (we had had some flakey patches for years based on more tactical mixups in the routine). the issue is that the middle-snake algorithm expects to work on symbols that are all unitary in length. when we deal with variable-length encodings, such as UTF-16, the algorithm can place a split inside a variable-length character and doesn't have the tools to count that character as one unit. I tried, and while workable, it's much more complicated and requires storing additional data to track how far the string offset has drifted from where it would have been with a fixed-length encoding.
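
A concrete illustration of that failure mode in JavaScript (the commented diff shape is indicative of the problem, not copied from the library's output):

```js
// Two emoji that share their high surrogate: a code-unit prefix scan
// "matches" the first half of the pair and splits it.
const a = '😀'; // '\uD83D\uDE00'
const b = '😄'; // '\uD83D\uDE04'

console.log(a.charCodeAt(0) === b.charCodeAt(0)); // true:  shared high surrogate 0xD83D
console.log(a.charCodeAt(1) === b.charCodeAt(1)); // false: the low surrogates differ

// A code-unit diff of a -> b therefore comes out something like:
//   [[0, '\uD83D'], [-1, '\uDE00'], [1, '\uDE04']]
// Every fragment is a lone surrogate, which is legal as a JavaScript string
// but has no UTF-8 encoding, so it cannot survive URI encoding later on.
```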

shifting the edits remains semantically neutral and benign. while I don't have a proof that it's sound, it seems to be supported by ample real-world testing (we've had zero reported issues over millions of documents).

to clear some of this up I've tossed around the idea of prefixing a diff stream with empty diff operations to indicate if we expect counts in the resulting diff objects to refer to bytes, code points, or the native code units according to the library and platform. this is unrelated to the issue here.

@michal-kurz
Collaborator Author

michal-kurz commented Oct 4, 2022

@dmsnell Thank you for sharing your view!

I've been working almost exclusively within a pure-functional paradigm, with a strong emphasis on clean code, separation of concerns, and readability above all else, including performance - that is why overwriting the diffs in place felt inappropriate to me. But I will trust your judgement on this - if you think that within this library (a very performance-sensitive context) it's rather a "needless inefficiency", then it probably is :) I've introduced a new commit which gets rid of the array clone, then.

the dissimilar library takes an approach that in some ways I like better, though I don't know if it's totally better or just different. it includes a Unicode cleanup pass during the other cleanup stages.

My whole point was finding out whether there is a singular logical "choke point" where we could fix this issue and prevent it from propagating to all the subsequent branches of logic - since your process for joining surrogate pairs can be applied at pretty much any level. But I gather from your take that you don't know of such a singular point - neither do I. In which case it does make sense to simply solve it right before the problem manifests (right before calling encodeURI) and be 100% sure we're in the clear :)

while I don't have a proof that it's sound, it seems to be supported by ample real-world testing

I was actually wondering about this - whether you have a mental proof for your solution, as it's absolutely clear to me why it would work, but in no way clear why it must work. But yes, you mentioning that this fix has run in production in a note app for multiple years sounds more than convincing :)

@dmsnell

dmsnell commented Oct 4, 2022

I've been working almost exclusively within a pure-functional paradigm, with a strong emphasis on clean code, separation of concerns, and readability above all else, including performance

I get this, I really do. But even in all those contexts we should hopefully always ask ourselves what it's giving us. Given that the difference is a couple of lines in a single function I think the only issue at hand is purity, which again, I don't think matters since we're fixing something that was broken.

If you look at the rest of the library you'll see it's highly imperative anyway. diff-match-patch doesn't "[prioritize] performance above all else," but it's up there in the priority list. Without numbers we're just guessing, but I am confident that if it matters at all, not cloning will be faster with less overhead. 🤷‍♂️

a singular logical "choke point" where we could fix this issue…you don't know of such a singular point

to be clear the choke-point is encodeURIComponent (and the associated functions on the other platforms). this issue isn't a problem from diff-match-patch's standpoint, particularly since the library doesn't specify any atomic unit of comparison. technically I don't think patch_toText() even requires the fix because it's valid to have split surrogates in JavaScript strings; but it's good to "fix" it here if other libraries are making assumptions about the output that diff-match-patch itself never guarantees.

URI encoding is the point where we cannot allow split surrogates, and so by cleaning up the list before then we can be sure we won't crash or break anything. using it elsewhere might just be a nice consideration to help everyone out.
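
For reference, the failure at that choke point is easy to reproduce:

```js
// encodeURI() percent-encodes via UTF-8, and a lone surrogate has no UTF-8
// representation, so it throws instead of producing output.
try {
  encodeURI('\uD83D');                 // unpaired high surrogate
} catch (e) {
  console.log(e instanceof URIError);  // true: "URI malformed"
}

// A complete pair encodes fine:
console.log(encodeURI('\uD83D\uDE00')); // "%F0%9F%98%80"
```

encodeURIComponent behaves the same way for lone surrogates, so cleaning up the diff list before either call avoids the crash.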

otherwise the fix you're after IMO isn't tenable in JavaScript. that requires iterating a string via code points and referencing those via code point index, for which there is no great solution in JS. Python3 gets around this natively but ironically, by assuming its strings are UTF8-encoded, it makes itself open to failures that other diff-match-patch platforms won't experience.

using full UTF-32 gets the job done, but at a very high cost.
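
A small illustration of why code-point-indexed access is awkward in JavaScript: the iteration protocols walk code points, but indexing and lengths stay in code units, so there is no constant-time lookup by code point index:

```js
const s = 'a😀b';

// Indexing and .length count UTF-16 code units; the emoji occupies two slots.
console.log(s.length);                      // 4
console.log(s.charCodeAt(1).toString(16));  // "d83d", i.e. half a character

// for...of and spread walk code points instead...
console.log([...s].length);                 // 3
for (const ch of s) {
  console.log(ch.codePointAt(0).toString(16)); // 61, 1f600, 62
}

// ...but getting "the nth code point" still means scanning from the start
// (e.g. [...s][n]), so code-point indexing costs extra work or extra
// bookkeeping, which is the overhead described above.
```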

we might be able to move this cleanup into the other semantic cleanup passes, as dissimilar does. that would be a worthwhile exploration. other than that I think it's fine deferring until we generate the output because we don't need to run this multiple times, and I think we risk doing that if we put it in the normal cleanup pipeline.

whether you have a mental proof for your solution

the model turns out to be trivial. we already have known patch-rewriting rules and so if we come across patches that split surrogates, we can rewrite them to avoid doing that by shifting the first or last characters out of the active patch operation.
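
Picking up the earlier emoji example, that rewrite can play out roughly like this (a sketch of the idea, not necessarily the exact transformation the library performs):

```js
// Before: the code-unit diff of '😀' -> '😄' splits the shared high surrogate.
const broken = [
  [0, '\uD83D'],   // EQUAL:  lone high surrogate
  [-1, '\uDE00'],  // DELETE: lone low surrogate
  [1, '\uDE04'],   // INSERT: lone low surrogate
];

// After: the dangling high surrogate is shifted out of the EQUAL operation and
// onto the delete and insert texts, both of which contained it in the original
// strings. The edit described is unchanged: remove '😀', insert '😄'.
const fixed = [
  [0, ''],                // the empty operation mentioned below; safe to drop
  [-1, '\uD83D\uDE00'],   // delete '😀'
  [1, '\uD83D\uDE04'],    // insert '😄'
];
```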

the primary risk is the empty operations, and I wonder whether it would ever be possible to double-split something by accident, though I'm guessing we could resolve this once and for all by writing up the table of possible rewrites and confirming what could happen. UTF-16 characters (the unit of JavaScript's strings) can only be one or two code units long, so we don't have a large number of possible situations to cover.

it helps that the diff operations include the equal substrings for DIFF_EQUALS: "hello". it would be another level of complexity if we had only DIFF_EQUALS: 5 as a count of the same characters.

hope this is helpful!

@michal-kurz michal-kurz force-pushed the fix-surrogate-pairs-in-text-fns branch from cab92fc to 08e5792 on October 4, 2022 at 23:53
@michal-kurz michal-kurz merged commit 41b7996 into master Oct 5, 2022