According to the documentation, the Str module, which is used heavily for NLP, does not work with UTF-8 at all, and for this reason it does not seem to fit the general task of text processing.
For example, Owl_nlp_utils.regexp_split is defined as Str.regexp "[ \t;,.'!?()’“”\\/&—\\-]+". Note that the character class contains special quotation marks (and an em dash) that are each 3 bytes long, and Str adds their bytes to the class one by one.
This brings us to the problem: the second byte of each of these characters is \128 (0x80), a UTF-8 continuation byte that also occurs inside many other characters, so splitting on these loose bytes cuts unrelated multi-byte characters apart.
To solve this I propose switching to Pcre or a similar library that accepts Unicode regular expressions. What do you think?
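To make the failure mode concrete, here is a minimal sketch of mine (only the regexp itself comes from Owl; the input string and expected result are my own illustration). It splits a string containing an ellipsis, whose UTF-8 encoding shares its first two bytes with the curly quotes:

```ocaml
(* Requires the str library. Str is byte-oriented, so the curly quotes
   and the em dash contribute their individual bytes (0xE2, 0x80, 0x99,
   0x9C, 0x9D, 0x94) to the character class instead of whole characters. *)
let regexp_split = Str.regexp "[ \t;,.'!?()’“”\\/&—\\-]+"

let () =
  (* The ellipsis … is 0xE2 0x80 0xA6: its first two bytes are in the
     class, so the split cuts through the middle of the character. *)
  Str.split regexp_split "naïve… text"
  |> List.iter (fun tok -> Printf.printf "%S\n" tok)
  (* Expected: a stray continuation byte surfaces as its own "token",
     e.g. "na\195\175ve", "\166", "text" -- i.e. invalid UTF-8. *)
```

And a hedged sketch of what the proposed Pcre-based splitter could look like, assuming the ocaml-pcre bindings and their UTF8 compile flag; the tokenize helper and the exact pattern are only illustrative, not an agreed API:

```ocaml
(* Sketch only: with the `UTF8 flag the character class is interpreted
   over code points, so ’ “ ” — match as whole characters. *)
let regexp_split =
  Pcre.regexp ~flags:[ `UTF8 ] "[ \t;,.'!?()’“”\\\\/&—-]+"

let tokenize s = Pcre.split ~rex:regexp_split s
```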
This is just my personal opinion, but we should move away from Str and use re, which is the recommended library and the fastest, if it supports UTF-8 in some way.
Otherwise I agree with you that pcre could be a better fit.
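If we go the re route, here is a rough sketch (again mine, not existing Owl code) of how the same splitter could be expressed with re's combinators: since Re is also byte-oriented, the multi-byte separators are listed as literal strings rather than put into a character class, which at least keeps the split aligned with whole UTF-8 sequences:

```ocaml
(* Sketch only: ASCII separators go in a byte set, multi-byte ones are
   matched as literal strings, so no UTF-8 sequence is split apart. *)
let regexp_split =
  let ascii = Re.set " \t;,.'!?()\\/&-" in
  (* U+2019 ’, U+201C “, U+201D ”, U+2014 — *)
  let multibyte = List.map Re.str [ "’"; "“"; "”"; "—" ] in
  Re.compile (Re.rep1 (Re.alt (ascii :: multibyte)))

let tokenize s = Re.split regexp_split s
```

This does not give true Unicode character classes, only safe handling of the specific separators above, so it is more of a workaround than full UTF-8 support.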