Debugged this a little bit. The problem in my case is that the data is auto-recognised as UTF-8 based on the first 512 bytes, while later on it contains unexpected unicode characters due to corrupted input. I have no control over the input data; it's generated by an external closed system.
The problem seems to be that CHCSVParser is pretty much built on top of NSString methods, which do not like unexpected u'\U0000fffd' characters while expecting UTF-8. I tried to skip over the bad data with [self _advance], but that calls NSString methods instead of actually skipping over the raw data.
Btw, according to the NSString documentation, "If the length of the byte string is greater than the specified length a nil value is returned", so I don't really understand what readLength--; is supposed to do. It should cause a failure immediately, and in my case it did. About 15000 times at some point.
- (void)_loadMoreIfNecessary {
    ...
    // try to turn the next portion of the buffer into a string
    NSUInteger readLength = [_stringBuffer length];
    while (readLength > 0) {
        NSString *readString = [[NSString alloc] initWithBytes:[_stringBuffer bytes] length:readLength encoding:_streamEncoding];
        if (readString == nil) {
            readLength--;
        } else {
            [_string appendString:readString];
            break;
        }
    }
    ...
}
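One possible reading of the loop, which is my assumption and not confirmed by the library author: the decrement is there for a multi-byte character that got cut in half at the end of a chunk, where backing off by at most 3 bytes is enough. With a genuinely invalid byte in the middle of the buffer, every length that includes that byte makes initWithBytes return nil, and once the bad byte sits at the start of the buffer the loop counts all the way down to zero and nothing more gets appended. A minimal sketch in plain C (it compiles as-is inside an Objective-C file) that tells the two cases apart in one pass; the function names are made up for illustration:

```c
#include <stddef.h>

/* Length (1-4) of the well-formed UTF-8 sequence at p,
 * 0 if the bytes at p are invalid,
 * -1 if the sequence is valid so far but cut off by the end of the buffer.
 * Requires avail >= 1. */
static int utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char b0 = p[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the next byte */
    int need;                           /* continuation bytes expected */

    if (b0 < 0x80) {
        return 1;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2;
        if (b0 == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3;
        if (b0 == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else {
        return 0;                       /* stray continuation or illegal lead byte */
    }

    for (int i = 1; i <= need; i++) {
        if ((size_t)i >= avail) return -1;      /* truncated at buffer end */
        if (p[i] < lo || p[i] > hi) return 0;   /* bad continuation byte */
        lo = 0x80;
        hi = 0xBF;
    }
    return need + 1;
}

/* Number of leading bytes that form complete, well-formed characters.
 * *hit_invalid is set to 1 if scanning stopped at an invalid byte,
 * and to 0 if it stopped at the end or at a merely truncated character. */
size_t utf8_valid_prefix_len(const unsigned char *buf, size_t len, int *hit_invalid)
{
    size_t i = 0;
    *hit_invalid = 0;
    while (i < len) {
        int n = utf8_seq_len(buf + i, len - i);
        if (n <= 0) {
            *hit_invalid = (n == 0);
            break;
        }
        i += (size_t)n;
    }
    return i;
}
```

If the prefix is shorter than the buffer and hit_invalid is 0, the leftover bytes are just an incomplete character and can wait for the next chunk. If hit_invalid is 1, the byte at that offset has to be skipped or replaced, otherwise the decoder never gets past it.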
No easy fixes: either I have to parse and fix the input before the real parsing, or modify CHCSVParser to drop NSString and work with raw data. Pre-parsing should be easier.
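A rough sketch of what such a pre-parsing pass could look like, assuming the failures really come from byte sequences that are not well-formed UTF-8 (I have not verified that against the actual file). It is plain C, so it can live in the same Objective-C project. Every offending byte is overwritten with '?', so the buffer length and all row and column offsets stay the same. The function names are hypothetical:

```c
#include <stddef.h>

/* Length (1-4) of the well-formed UTF-8 sequence at p, or 0 if the bytes
 * at p are invalid or cut off by the end of the buffer. Requires avail >= 1. */
static int utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char b0 = p[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the next byte */
    int need;                           /* continuation bytes expected */

    if (b0 < 0x80) {
        return 1;
    } else if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2;
        if (b0 == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3;
        if (b0 == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else {
        return 0;                       /* stray continuation or illegal lead byte */
    }

    for (int i = 1; i <= need; i++) {
        if ((size_t)i >= avail) return 0;       /* truncated: treat as invalid */
        if (p[i] < lo || p[i] > hi) return 0;   /* bad continuation byte */
        lo = 0x80;
        hi = 0xBF;
    }
    return need + 1;
}

/* Overwrites every byte that is not part of a well-formed UTF-8 sequence
 * with '?' and returns how many bytes were replaced. The length of the
 * buffer does not change. */
size_t utf8_sanitize_in_place(unsigned char *buf, size_t len)
{
    size_t replaced = 0;
    size_t i = 0;
    while (i < len) {
        int n = utf8_seq_len(buf + i, len - i);
        if (n > 0) {
            i += (size_t)n;     /* keep the valid character untouched */
        } else {
            buf[i] = '?';       /* neutralise one bad byte and move on */
            replaced++;
            i++;
        }
    }
    return replaced;
}
```

On the Objective-C side this could be run over the mutableBytes of an NSMutableData holding the whole file, before the data is handed to the parser. Replacing with '?' loses the original bytes, but those were undecodable anyway; the return value at least tells you whether the file was touched.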
I'm trying to parse external TSV data, which I cannot fix before parsing. The weird problem is that the data contains "bad" values, e.g.
and if such a string is early in the file, everything is OK. If it's "further away", CHCSVParser's _loadMoreIfNecessary fails to get over the "�". I guess it's interpreted as two separate halves of a real character and the first half is 0x00?
_stringBuffer could contain 15000 bytes, but [NSString initWithBytes] just returns nil until the 15k+ length has been decremented to zero one step at a time. Parsing ends at row 20 of the input file, although the file contains 1200 rows.
Forcing the encoding to UTF-8 helped a little bit, but not with this character.
Any ideas why the first buffer[CHUNK_SIZE] from _sniffEncoding would be handled differently than the rest from _loadMoreIfNecessary? I can't see much difference. The stream encoding is always NSUTF8StringEncoding. Is there any way to fix the input stream data before trying to parse it?
Bad data in first buffer