Merge branch 'issues/69-broken-surrogate-pairs'
Stop breaking surrogate pairs in toDelta()/fromDelta()

Resolves google/diff-match-patch#69 for the following languages:

 - Objective-C
 - Java
 - JavaScript
 - Python2
 - Python3

Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.

This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.
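
To make the failure mode concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). It is not code from this patch; the helper names `common_prefix` and `prefix_splits_pair` are hypothetical. It compares two strings one UTF-16 code unit at a time, the way the diff does, and checks whether the shared prefix stops right after a high surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates occupy 0xD800-0xDBFF in UTF-16. */
static int is_high_surrogate(uint16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Length of the common prefix, counted in UTF-16 code units.
 * (Hypothetical helper, not the library's diff_commonPrefix.) */
static size_t common_prefix(const uint16_t *a, size_t alen,
                            const uint16_t *b, size_t blen) {
    size_t n = 0;
    while (n < alen && n < blen && a[n] == b[n]) {
        n++;
    }
    return n;
}

/* Nonzero when cutting `text` after its first `n` code units
 * would leave a high surrogate dangling at the end of the prefix. */
static int prefix_splits_pair(const uint16_t *text, size_t n) {
    return n > 0 && is_high_surrogate(text[n - 1]);
}
```

For `🅰` (U+1F170, code units `D83C DD70`) and `🅱` (U+1F171, code units `D83C DD71`), `common_prefix` returns 1 and `prefix_splits_pair` reports a split: the two characters share only their high surrogate.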

When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.

The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should not
change that much - only by a single character at surrogate boundaries.
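
The post-processing can be sketched in plain C as follows. This is an illustration in the spirit of the `diff_toDelta` change below, not a copy of it: the `Seg` type, the `SEG_CAP` capacity, and `unsplit_surrogates` are hypothetical names. It also assumes the groups reconstruct valid UTF-16 when applied, and it re-attaches the carried high surrogate before trimming a new one, which is a slightly different order from the Objective-C code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_CAP 16 /* hypothetical fixed capacity, enough for a sketch */

/* One diff group: an operation ('=', '-', '+') and its UTF-16 code units. */
typedef struct {
    char op;
    uint16_t text[SEG_CAP];
    size_t len;
} Seg;

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Re-join surrogate pairs that were split across group boundaries, in place.
 * Groups that become empty keep len == 0 and should be skipped when encoding.
 * Returns 0 on success, -1 if a group has no room for the re-attached unit.
 */
static int unsplit_surrogates(Seg *segs, size_t count) {
    uint16_t last_end = 0; /* most recent high surrogate trimmed from a group */

    for (size_t i = 0; i < count; i++) {
        Seg *s = &segs[i];
        if (s->len == 0) {
            continue;
        }

        /* A leading low surrogate lost its partner: put it back in front. */
        if (is_high(last_end) && is_low(s->text[0])) {
            if (s->len >= SEG_CAP) {
                return -1;
            }
            memmove(s->text + 1, s->text, s->len * sizeof s->text[0]);
            s->text[0] = last_end;
            s->len++;
        }

        /* A trailing high surrogate belongs with whatever follows: trim it.
         * last_end is deliberately not reset, so a delete and an insert that
         * both follow the same trimmed equality each get the high surrogate. */
        if (is_high(s->text[s->len - 1])) {
            last_end = s->text[s->len - 1];
            s->len--;
        }
    }
    return 0;
}
```

Given the groups `= D83C`, `- DD70`, `+ DD71`, the equality becomes empty and the other two become `- D83C DD70` and `+ D83C DD71`, that is, delete `🅰` and insert `🅱`.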

Alternative approaches:
=========

 - The [`dissimilar`](https://docs.rs/dissimilar/latest/dissimilar/)
   library in Rust takes a more comprehensive approach with its
   `cleanup_char_boundary()` method. Since that approach resolves the
   issue everywhere and not just in to/from Delta, it's worth
   exploring as a replacement for this patch.

Remaining work to do:
========

 -[ ] Fix CPP or verify not a problem
 -[ ] Fix CSharp or verify not a problem
 -[ ] Fix Dart or verify not a problem
 -[ ] Fix Lua or verify not a problem
 -[x] Refactor to use cleanupSplitSurrogates in JavaScript
 -[x] Refactor to use cleanupSplitSurrogates in Java
 -[ ] Refactor to use cleanupSplitSurrogates in Objective C
 -[ ] Refactor to use cleanupSplitSurrogates in Python2
 -[ ] Refactor to use cleanupSplitSurrogates in Python3
 -[ ] Refactor to use cleanupSplitSurrogates in CPP
 -[ ] Refactor to use cleanupSplitSurrogates in CSharp
 -[ ] Refactor to use cleanupSplitSurrogates in Dart
 -[ ] Refactor to use cleanupSplitSurrogates in Lua
 -[x] Fix patch_toText in JavaScript
 -[ ] Fix patch_toText in Java
 -[ ] Fix patch_toText in Objective C
 -[ ] Fix patch_toText in Python2
 -[ ] Fix patch_toText in Python3
 -[ ] Fix patch_toText in CPP
 -[ ] Fix patch_toText in CSharp
 -[ ] Fix patch_toText in Dart
 -[ ] Fix patch_toText in Lua
 -[ ] Figure out a "minimal" set of unit tests so we can get rid of the big
      chunk currently in the PR, then carry it around to all the libraries.
      The triggers are well understood, so we can write targeted tests
      instead of broad ones.
dmsnell committed Jan 31, 2024
2 parents 8a0d0d8 + a990941 commit 634e10f
Showing 2 changed files with 259 additions and 1 deletion.
193 changes: 192 additions & 1 deletion objectivec/DiffMatchPatch.m
@@ -1299,7 +1299,28 @@ - (NSString *)diff_text2:(NSMutableArray *)diffs;
- (NSString *)diff_toDelta:(NSMutableArray *)diffs;
{
NSMutableString *delta = [NSMutableString string];
UniChar lastEnd = 0;
for (Diff *aDiff in diffs) {
if (0 == [aDiff.text length]) {
continue;
}

UniChar thisTop = [aDiff.text characterAtIndex:0];
UniChar thisEnd = [aDiff.text characterAtIndex:([aDiff.text length]-1)];

if (CFStringIsSurrogateHighCharacter(thisEnd)) {
lastEnd = thisEnd;
aDiff.text = [aDiff.text substringToIndex:([aDiff.text length] - 1)];
}

if (0 != lastEnd && CFStringIsSurrogateHighCharacter(lastEnd) && CFStringIsSurrogateLowCharacter(thisTop)) {
aDiff.text = [NSString stringWithFormat:@"%C%@", lastEnd, aDiff.text];
}

if (0 == [aDiff.text length]) {
continue;
}

switch (aDiff.operation) {
case DIFF_INSERT:
[delta appendFormat:@"+%@\t", [[aDiff.text diff_stringByAddingPercentEscapesForEncodeUriCompatibility]
@@ -1321,6 +1342,176 @@ - (NSString *)diff_toDelta:(NSMutableArray *)diffs;
return delta;
}

- (NSUInteger)diff_digit16:(unichar)c
{
switch (c) {
case '0': return 0;
case '1': return 1;
case '2': return 2;
case '3': return 3;
case '4': return 4;
case '5': return 5;
case '6': return 6;
case '7': return 7;
case '8': return 8;
case '9': return 9;
case 'A': case 'a': return 10;
case 'B': case 'b': return 11;
case 'C': case 'c': return 12;
case 'D': case 'd': return 13;
case 'E': case 'e': return 14;
case 'F': case 'f': return 15;
default:
[NSException raise:@"Invalid percent-encoded string" format:@"%c is not a hex digit", c];
}
}

/**
* Decode a percent-encoded UTF-8 string into a string of UTF-16 code units
* This is more permissive than `stringByRemovingPercentEncoding` because
* that fails if the input represents invalid Unicode characters. However, different
* diff-match-patch libraries may encode surrogate halves as if they were valid
* Unicode code points. Therefore, instead of failing or corrupting the output, which
* `stringByRemovingPercentEncoding` does when it inserts "(null)" in these places
* we can decode it anyway and then once the string is reconstructed from the diffs
* we'll end up with valid Unicode again, after the surrogate halves are re-joined
*/
- (NSString *)diff_decodeURIWithText:(NSString *)percentEncoded
{
unichar decoded[[percentEncoded length]];
NSInteger input = 0;
NSInteger output = 0;

@try {
while (input < [percentEncoded length]) {
unichar c = [percentEncoded characterAtIndex:input];

// not special, so just return it
if ('%' != c) {
decoded[output++] = c;
input += 1;
continue;
}

NSUInteger byte1 = ([self diff_digit16:[percentEncoded characterAtIndex:(input+1)]] << 4) +
[self diff_digit16:[percentEncoded characterAtIndex:(input+2)]];

// single-byte UTF-8 first byte has bitmask 0xxx xxxx
if ((byte1 & 0x80) == 0) {
decoded[output++] = byte1;
input += 3;
continue;
}

// at least one continuation byte
if ('%' != [percentEncoded characterAtIndex:(input + 3)]) {
return nil;
}

NSUInteger byte2 = ([self diff_digit16:[percentEncoded characterAtIndex:(input+4)]] << 4) +
[self diff_digit16:[percentEncoded characterAtIndex:(input+5)]];

// continuation bytes have bitmask 10xx xxxx
if ((byte2 & 0xC0) != 0x80) {
return nil;
}

// continuation bytes thus only contribute six bits each
// these data bits are found with the bit mask xx11 1111
byte2 = byte2 & 0x3F;

// in two-byte sequences the first byte has bitmask 110x xxxx
if ((byte1 & 0xE0) == 0xC0) {
// byte1 ___x xxxx << 6
// byte2 __yy yyyy
// value x xxxxyy yyyy -> 11 bits
decoded[output++] = ((byte1 & 0x1F) << 6) | byte2;
input += 6;
continue;
}

// at least two continuation bytes
if ('%' != [percentEncoded characterAtIndex:(input + 6)]) {
return nil;
}

NSUInteger byte3 = ([self diff_digit16:[percentEncoded characterAtIndex:(input+7)]] << 4) +
[self diff_digit16:[percentEncoded characterAtIndex:(input+8)]];

if ((byte3 & 0xC0) != 0x80) {
return nil;
}

byte3 = byte3 & 0x3F;

// in three-byte sequences the first byte has bitmask 1110 xxxx
if ((byte1 & 0xF0) == 0xE0) {
// byte1 ____ xxxx << 12
// byte2 __yy yyyy << 6
// byte3 __zz zzzz
// value xxxxyy yyyyzz zzzz -> 16 bits
decoded[output++] = ((byte1 & 0x0F) << 12) | (byte2 << 6) | byte3;
input += 9;
continue;
}

// three continuation bytes
if ('%' != [percentEncoded characterAtIndex:(input + 9)]) {
return nil;
}

NSUInteger byte4 = ([self diff_digit16:[percentEncoded characterAtIndex:(input+10)]] << 4) +
[self diff_digit16:[percentEncoded characterAtIndex:(input+11)]];

if ((byte4 & 0xC0) != 0x80) {
return nil;
}

byte4 = byte4 & 0x3F;

// in four-byte sequences the first byte has bitmask 1111 0xxx
if ((byte1 & 0xF8) == 0xF0) {
// byte1 ____ _xxx << 18
// byte2 __yy yyyy << 12
// byte3 __zz zzzz << 6
// byte4 __tt tttt
// value xxxyy yyyyzz zzzztt tttt -> 21 bits
NSUInteger codePoint = ((byte1 & 0x07) << 0x12) | (byte2 << 0x0C) | (byte3 << 0x06) | byte4;
if (codePoint >= 0x010000 && codePoint <= 0x10FFFF) {
codePoint -= 0x010000;
decoded[output++] = ((codePoint >> 10) & 0x3FF) | 0xD800;
decoded[output++] = 0xDC00 | (codePoint & 0x3FF);
input += 12;
continue;
}
}

return nil;
}
}
@catch (NSException *e) {
return nil;
}

// some objective-c versions of the library produced patches with
// (null) in the place where surrogates were split across diff
// boundaries. if we leave those in we'll be stuck with a
// high-surrogate (null) low-surrogate pattern that will break
// deeper in the library or consuming application. we'll "fix"
// these by dropping the (null) and re-joining the surrogate halves
NSString *result = [NSString stringWithCharacters:decoded length:output];
NSRegularExpression *replacer = [NSRegularExpression
regularExpressionWithPattern:@"([\\x{D800}-\\x{DBFF}])\\(null\\)([\\x{DC00}-\\x{DFFF}])"
options:0
error:nil];

return [replacer
stringByReplacingMatchesInString:result
options:0
range:NSMakeRange(0, [result length])
withTemplate:@"$1$2"];
}

/**
* Given the original text1, and an encoded NSString which describes the
* operations required to transform text1 into text2, compute the full diff.
@@ -1348,7 +1539,7 @@ - (NSMutableArray *)diff_fromDeltaWithText:(NSString *)text1
NSString *param = [token substringFromIndex:1];
switch ([token characterAtIndex:0]) {
case '+':
-param = [param diff_stringByReplacingPercentEscapesForEncodeUriCompatibility];
+param = [self diff_decodeURIWithText:param];
if (param == nil) {
if (error != NULL) {
errorDetail = [NSDictionary dictionaryWithObjectsAndKeys:
67 changes: 67 additions & 0 deletions objectivec/Tests/DiffMatchPatchTest.m
@@ -752,6 +752,68 @@ - (void)test_diff_deltaTest {

XCTAssertEqualObjects(diffs, [dmp diff_fromDeltaWithText:text1 andDelta:delta error:NULL], @"diff_fromDelta: Unicode 2.");

diffs = [dmp diff_mainOfOldString:@"☺️🖖🏿" andNewString:@"☺️😃🖖🏿"];
delta = [dmp diff_toDelta:diffs];

XCTAssertEqualObjects(delta, @"=2\t+%F0%9F%98%83\t=4", @"Delta should match the expected string");

diffs = [dmp diff_mainOfOldString:@"☺️🖖🏿" andNewString:@"☺️😃🖖🏿"];
NSArray *patches = [dmp patch_makeFromDiffs:diffs];
NSArray *patchResult = [dmp patch_apply:patches toString:@"☺️🖖🏿"];

expectedString = [patchResult firstObject];
XCTAssertEqualObjects(@"☺️😃🖖🏿", expectedString, @"Output String should match the Edited one!");

// Unicode - splitting surrogates

// Inserting similar surrogate pair at beginning
diffs = [NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_INSERT andText:@"🅱"],
[Diff diffWithOperation:DIFF_EQUAL andText:@"🅰🅱"],
nil];
XCTAssertEqualObjects( [dmp diff_toDelta:diffs], [dmp diff_toDelta:[dmp diff_mainOfOldString:@"🅰🅱" andNewString:@"🅱🅰🅱"]]);

// Inserting similar surrogate pair in the middle
diffs = [NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_EQUAL andText:@"🅰"],
[Diff diffWithOperation:DIFF_INSERT andText:@"🅰"],
[Diff diffWithOperation:DIFF_EQUAL andText:@"🅱"],
nil];
XCTAssertEqualObjects( [dmp diff_toDelta:diffs], [dmp diff_toDelta:[dmp diff_mainOfOldString:@"🅰🅱" andNewString:@"🅰🅰🅱"]]);

// Deleting similar surrogate pair at the beginning
diffs = [NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_DELETE andText:@"🅱"],
[Diff diffWithOperation:DIFF_EQUAL andText:@"🅰🅱"],
nil];
XCTAssertEqualObjects( [dmp diff_toDelta:diffs], [dmp diff_toDelta:[dmp diff_mainOfOldString:@"🅱🅰🅱" andNewString:@"🅰🅱"]]);

// Deleting similar surrogate pair in the middle
diffs = [NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_EQUAL andText:@"🅰"],
[Diff diffWithOperation:DIFF_DELETE andText:@"🅲"],
[Diff diffWithOperation:DIFF_EQUAL andText:@"🅱"],
nil];
XCTAssertEqualObjects( [dmp diff_toDelta:diffs], [dmp diff_toDelta:[dmp diff_mainOfOldString:@"🅰🅲🅱" andNewString:@"🅰🅱"]]);

// Swapping surrogate pairs
diffs = [NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_DELETE andText:@"🅰"],
[Diff diffWithOperation:DIFF_INSERT andText:@"🅱"],
nil];
XCTAssertEqualObjects( [dmp diff_toDelta:diffs], [dmp diff_toDelta:[dmp diff_mainOfOldString:@"🅰" andNewString:@"🅱"]]);

// Swapping surrogate pairs when the shared high surrogate was split into its own equality
XCTAssertEqualObjects( [dmp diff_toDelta:([NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_DELETE andText:@"🅰"],
[Diff diffWithOperation:DIFF_INSERT andText:@"🅱"],
nil])],
[dmp diff_toDelta:([NSMutableArray arrayWithObjects:
[Diff diffWithOperation:DIFF_EQUAL andText:[NSString stringWithFormat:@"%C", 0xd83c]],
[Diff diffWithOperation:DIFF_DELETE andText:[NSString stringWithFormat:@"%C", 0xdd70]],
[Diff diffWithOperation:DIFF_INSERT andText:[NSString stringWithFormat:@"%C", 0xdd71]],
nil])]);

// Verify pool of unchanged characters.
diffs = [NSMutableArray arrayWithObject:
[Diff diffWithOperation:DIFF_INSERT andText:@"A-Z a-z 0-9 - _ . ! ~ * ' ( ) ; / ? : @ & = + $ , # "]];
@@ -781,6 +843,11 @@ - (void)test_diff_deltaTest {
expectedResult = [dmp diff_fromDeltaWithText:@"" andDelta:delta error:NULL];
XCTAssertEqualObjects(diffs, expectedResult, @"diff_fromDelta: 160kb string. Convert delta string into a diff.");

// Different versions of the library may have created deltas with
// half of a surrogate pair encoded as if it were valid UTF-8
XCTAssertEqualObjects([dmp diff_toDelta:([dmp diff_fromDeltaWithText:@"🅰" andDelta:@"-2\t+%F0%9F%85%B1" error:NULL])],
[dmp diff_toDelta:([dmp diff_fromDeltaWithText:@"🅰" andDelta:@"=1\t-1\t+%ED%B5%B1" error:NULL])]);

[dmp release];
}
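
The permissive decoding that `diff_decodeURIWithText:` performs, and that the last test above relies on, can be sketched in plain C. This is an illustration under assumptions rather than a port: `hex_value`, `read_byte`, and `decode_permissive` are hypothetical names, the input is assumed to be ASCII, and the `(null)` cleanup done by the Objective-C method is left out. A 3-byte sequence that decodes to a lone surrogate half is stored as a single UTF-16 code unit instead of being rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Read one byte at `s`: a literal character or a %XX escape.
 * Returns the number of input characters consumed, or 0 on error. */
static size_t read_byte(const char *s, uint8_t *byte) {
    if (s[0] != '%') {
        *byte = (uint8_t)s[0];
        return 1;
    }
    int hi = hex_value(s[1]);
    if (hi < 0) return 0;
    int lo = hex_value(s[2]);
    if (lo < 0) return 0;
    *byte = (uint8_t)((hi << 4) | lo);
    return 3;
}

/* Decode percent-encoded UTF-8 into UTF-16 code units.
 * `out` needs room for strlen(in) units. Returns the number of units
 * written, or -1 on malformed input. */
static long decode_permissive(const char *in, uint16_t *out) {
    size_t o = 0;
    while (*in) {
        uint8_t b;
        size_t used = read_byte(in, &b);
        if (used == 0) return -1;
        in += used;

        /* Lead byte patterns: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. */
        int extra;
        uint32_t cp;
        if      ((b & 0x80) == 0x00) { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return -1;

        /* Continuation bytes must be escapes of the form 10xxxxxx,
         * each contributing six data bits. */
        for (int i = 0; i < extra; i++) {
            used = read_byte(in, &b);
            if (used != 3 || (b & 0xC0) != 0x80) return -1;
            in += used;
            cp = (cp << 6) | (b & 0x3F);
        }

        if (extra < 3) {
            /* Fits in 16 bits: keep it, even if it is a lone surrogate half. */
            out[o++] = (uint16_t)cp;
        } else {
            /* Supplementary plane: emit a surrogate pair. */
            if (cp < 0x10000 || cp > 0x10FFFF) return -1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)o;
}
```

With this sketch, `%ED%B5%B1` decodes to the single unit `DD71` (the low half of `🅱`), while `%F0%9F%85%B1` decodes to the full pair `D83C DD71`. Once the diffs are re-joined, both deltas in the test describe the same edit.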
