The various decode operations in AbstractNLPDecoder and its underlying tokenizer use String.getBytes(), which converts the String to bytes using the platform's default character set. This can corrupt the text whenever that default differs from the character set the data is expected to be in. On Windows this happens for any UTF-8 data beyond the ASCII range, since the default character set there is CP-1252.
Using operations that take an explicit character set, such as InputStreamReader, will avoid this.
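To make the failure mode concrete, here is a minimal, self-contained sketch that does not use the decoder at all. It encodes a String with a given character set and decodes the bytes as UTF-8, which is roughly what happens when one side uses the platform default and the other side assumes UTF-8. The class and method names (CharsetRoundTrip, encodeThenDecodeAsUtf8) are made up for illustration, and it assumes the JVM ships the windows-1252 charset, which common JDKs do. Unicode escapes are used for the Arabic word so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the decoder library.
final class CharsetRoundTrip {

    // Encode with the given charset, then decode the bytes as UTF-8.
    // Passing windows-1252 imitates getBytes() on a JVM whose default is CP-1252.
    static String encodeThenDecodeAsUtf8(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u0642\u0627\u0631\u0629" is the first Arabic word from the test sentence.
        String text = "We live in Europe (\u0642\u0627\u0631\u0629).";
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available

        String viaCp1252 = encodeThenDecodeAsUtf8(text, cp1252);
        String viaUtf8 = encodeThenDecodeAsUtf8(text, StandardCharsets.UTF_8);

        // CP-1252 cannot represent Arabic letters, so each one is replaced by '?'.
        System.out.println("windows-1252 intact: " + text.equals(viaCp1252));
        System.out.println("UTF-8 intact:        " + text.equals(viaUtf8));
    }
}
```

The ASCII part of the sentence survives either way, which is why the problem is easy to miss with English-only test data.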
Thanks for the comment. Could you please give an example of where you think this needs to be fixed using InputStreamReader? We'll evaluate it and apply the update. Thanks.
The simple test below produces the expected results if run on a platform whose default character set is UTF-8, or if the character set is set explicitly (-Dfile.encoding=UTF-8). Running it on Windows without explicitly setting the character set uses the OS default (equivalent to -Dfile.encoding=windows-1252) and garbles the non-Latin characters.
Note that running this in a development environment like Eclipse may not show the error, since Eclipse automatically adds the -Dfile.encoding property to the invocation.
The reference to InputStreamReader was just a suggestion; you could also do something like someString.getBytes(someCharSet), as long as the Strings/Streams/Files are read with an explicit character set. It would be nice if this character set were a parser/tokenizer config option. If it must be hardcoded, then UTF-8 would likely be the best guess.
thanks
    public static void main(String[] args) throws IOException {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        String configFile = "src/main/resources/org/opensextant/relish/config-decode-en.xml";
        NLPDecoder parser = new NLPDecoder(IOUtils.createFileInputStream(configFile));
        String text = "We live in Europe (قارة اوروبة).";
        List<NLPNode[]> sentences = parser.decodeDocument(text);
        for (NLPNode[] sentence : sentences) {
            for (NLPNode node : sentence) {
                System.out.println(node);
            }
        }
    }
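As a rough sketch of the suggested fix, independent of the decoder's actual internals: pass an explicit character set both when turning a String into bytes and when reading those bytes back through InputStreamReader. Everything named here (ExplicitCharsetIO, DEFAULT_CHARSET, toStream, readAll) is hypothetical and only illustrates the pattern; DEFAULT_CHARSET stands in for the proposed parser/tokenizer config option, with UTF-8 as the fallback.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the decoder library.
final class ExplicitCharsetIO {

    // Stand-in for a configurable option; UTF-8 is the suggested fallback.
    static final Charset DEFAULT_CHARSET = StandardCharsets.UTF_8;

    // String -> bytes with an explicit charset instead of the no-arg getBytes().
    static InputStream toStream(String text, Charset charset) {
        return new ByteArrayInputStream(text.getBytes(charset));
    }

    // bytes -> String with the same explicit charset via InputStreamReader.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "We live in Europe (\u0642\u0627\u0631\u0629).";
        String roundTripped = readAll(toStream(text, DEFAULT_CHARSET), DEFAULT_CHARSET);
        // Independent of -Dfile.encoding, because no default charset is consulted.
        System.out.println("intact: " + text.equals(roundTripped));
    }
}
```

Because neither call falls back to the platform default, the result should be the same on Windows and on a UTF-8 platform without any -Dfile.encoding setting.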