Fix surrogate pairs splitting in toText()/fromText() #2
Conversation
@michal-kurz thanks for the work on this. I like your approach of splitting the fixup out into its own function and I'd like to incorporate that into
I doubt that writing into a new array is worth it. By the way, I wouldn't call it a micro-optimization, and I wouldn't apply "premature optimization" here either. If you can run some benchmarks that would be worthwhile (I have run some myself while working on diff-match-patch to test changes), but we know that cloning an array carries memory costs we don't need to pay. Rather than premature optimization, I'd put array cloning in the category of needless inefficiency, in the sense that it doesn't bring us any particular benefit. Even from the perspective of not wanting to modify the input passed in, this algorithm should be idempotent and it's fixing input that was never valid to begin with, so modifying it is fine (the before and after are semantically equal, encoding issues aside).
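To make the trade-off concrete, here is a minimal, hypothetical sketch of the two styles under discussion; `joinPairs()` is only a placeholder name for whatever per-entry repair is applied, not code from this PR or the library.

```js
// Hypothetical illustration of clone-vs-mutate; joinPairs() is a stand-in
// placeholder for the actual surrogate repair, not part of the library.
const joinPairs = (text) => text;

// Pure style: allocates a fresh array and fresh tuples on every call.
function fixupCloned(diffs) {
  return diffs.map(([op, text]) => [op, joinPairs(text)]);
}

// In-place style: no extra allocation. Acceptable here because the repair
// is idempotent and only touches input that was never valid UTF-16 anyway.
function fixupInPlace(diffs) {
  for (const diff of diffs) {
    diff[1] = joinPairs(diff[1]);
  }
  return diffs;
}
```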
the dissimilar library takes an approach that in some ways I like better, though I don't know if it's totally better or just different: it includes a Unicode cleanup pass during the other cleanup stages. I'm fairly confident in the approach from google#80 based on the work I did figuring out how to resolve it (we had had flakey patches for years because of more tactical mix-ups in the routine).

the issue is that the middle-snake algorithm expects to work on symbols that are all unitary in length. when we deal with variable-length encodings, such as UTF-16, the algorithm finds the split inside the variable-length boundary and doesn't have the tools to count it as one symbol. I tried that, and while workable, it's much more complicated and requires storing additional data to track how far the string offset has drifted from where it would be in a fixed-length encoding. shifting the edits, on the other hand, remains semantically neutral and benign. while I don't have a proof that it's sound, it seems to be supported by ample real-world testing (we've had zero reported issues over millions of documents).

to clear some of this up I've tossed around the idea of prefixing a diff stream with empty diff operations to indicate whether we expect counts in the resulting diff objects to refer to bytes, code points, or the native code units of the library and platform, but that is unrelated to the issue here.
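To illustrate the shifting idea, here is a hedged sketch (not the code from google#80 or from this PR) that handles only the common case: an equality ending in an unpaired high surrogate, followed by delete/insert runs that begin with the matching low halves. The tuples follow the `[op, text]` shape used by the JavaScript diff-match-patch port.

```js
const isHighSurrogate = (ch) => ch >= '\uD800' && ch <= '\uDBFF';
const isLowSurrogate = (ch) => ch >= '\uDC00' && ch <= '\uDFFF';

// Hypothetical boundary-shifting pass: push a dangling high surrogate at
// the end of an equality down into the following insert/delete runs so no
// operation starts or ends mid-pair. Semantically neutral: rebuilding
// either text from the diff yields the same strings as before.
function shiftSplitSurrogates(diffs) {
  for (let i = 0; i < diffs.length; i++) {
    const [op, text] = diffs[i];
    if (op !== 0 || text === '' || !isHighSurrogate(text[text.length - 1])) {
      continue;
    }
    const high = text[text.length - 1];
    let j = i + 1;
    let shifted = false;
    while (j < diffs.length && diffs[j][0] !== 0 && isLowSurrogate(diffs[j][1][0])) {
      diffs[j][1] = high + diffs[j][1]; // re-unite the pair inside the edit
      shifted = true;
      j++;
    }
    if (shifted) {
      diffs[i][1] = text.slice(0, -1); // may leave an empty equality behind
    }
  }
  return diffs;
}

// Example: diffing "\uD83C\uDD70" (🅰) against "\uD83C\uDD71" (🅱) can factor
// the shared high surrogate into the equality, splitting both pairs:
const diffs = [[0, '\uD83C'], [-1, '\uDD70'], [1, '\uDD71']];
shiftSplitSurrogates(diffs);
// -> [[0, ''], [-1, '\uD83C\uDD70'], [1, '\uD83C\uDD71']]
```

The empty equality that can be left behind is exactly the "empty operations" risk mentioned later in this thread.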
@dmsnell Thank you for sharing your view! I've been working almost exclusively within a pure-functional paradigm, with a strong emphasis on clean code, separation of concerns, and readability above all else, including performance; that is why overwriting the diffs in place felt inappropriate to me. But I will trust your judgement on this: if you think that within this library (a very performance-sensitive context) it's rather a "needless inefficiency", then it probably is :) I've added a new commit that gets rid of the array clone, then.
My whole point was finding out whether there is a single logical "choke point" where we could fix this issue and prevent it from propagating to all the subsequent branches of logic, since your process for joining surrogate pairs can be applied at pretty much any level. But I gather from your take that you don't know of such a single point, and neither do I. In that case it does make sense to simply solve it right before the problem manifests (right before calling
I was actually wondering about this, namely whether you have a mental proof for your solution, as it's absolutely clear to me why it would work, but in no way clear why it must work. But yes, your mention of using this fix in production in a note app for multiple years sounds more than convincing :)
I get this, I really do. But even in all those contexts we should hopefully always ask ourselves what it's giving us. Given that the difference is a couple of lines in a single function, I think the only issue at hand is purity, which, again, I don't think matters since we're fixing something that was broken. If you look at the rest of the library you'll see it's highly imperative anyway.
to be clear, the choke point is URI encoding: that is the point where we cannot allow split surrogates, so by cleaning up the list before then we can be sure we won't crash or break anything. using it elsewhere might just be a nice consideration to help everyone out. otherwise the fix you're after IMO isn't tenable in JavaScript: it requires iterating a string by code points and referencing those by code point index, for which there is no great solution in JS. Python 3 gets around this natively but ironically, by assuming its strings are UTF-8-encoded, it makes itself open to failures; others using full UTF-32 get the job done, but at a very high cost. we might be able to move this cleanup into the other semantic cleanup passes, as
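For concreteness, this is what makes URI encoding the hard boundary: JavaScript's `encodeURI()` can only emit well-formed UTF-8 percent-escapes, so a complete pair encodes fine while a lone surrogate throws.

```js
// A complete surrogate pair (U+1F600) percent-encodes without trouble:
encodeURI('\uD83D\uDE00'); // "%F0%9F%98%80"

// A lone half of the pair has no valid UTF-8 form, so encodeURI throws:
try {
  encodeURI('\uD83D');
} catch (e) {
  console.log(e instanceof URIError); // true ("URI malformed" in most engines)
}
```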
the model turns out to be trivial. we already have known patch-rewriting rules, and so if we come across patches that split surrogates, we can rewrite them to avoid doing that by shifting the first or last characters out of the active patch operation. the primary risk is the empty operations, and I wonder if it would ever be possible to double-split something by accident, though I'm guessing we could resolve this once and for all by writing up the table of possible rewrites and confirming what could happen. UTF-16 characters (the unit of JavaScript's strings) can only be one or two code units long, so we don't have a large number of possible situations to cover. it helps that the diff operations include the equal substrings for context.

hope this is helpful!
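As a hedged sketch of what that enumeration would need to guarantee (not code from the library or this PR): after any rewrite pass, no operation should start with a lone low surrogate or end with a lone high surrogate, assuming the input texts themselves were well-formed UTF-16.

```js
// Hypothetical invariant check backing the "table of rewrites" argument:
// in a diff built from well-formed texts, a pair can only be broken across
// an operation boundary, so it suffices to inspect the first and last code
// units of every [op, text] tuple.
function hasSplitSurrogate(diffs) {
  return diffs.some(([, text]) => {
    const first = text.charCodeAt(0);
    const last = text.charCodeAt(text.length - 1);
    const startsWithLoneLow = first >= 0xdc00 && first <= 0xdfff;
    const endsWithLoneHigh = last >= 0xd800 && last <= 0xdbff;
    return startsWithLoneLow || endsWithLoneHigh;
  });
}

// e.g. hasSplitSurrogate([[0, '\uD83C'], [-1, '\uDD70'], [1, '\uDD71']]) === true
//      hasSplitSurrogate([[-1, '\uD83C\uDD70'], [1, '\uD83C\uDD71']]) === false
```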
254688b to cab92fc (Compare)
cab92fc to 08e5792 (Compare)