According to the documentation, the Str module, which is used heavily for NLP, does not work with UTF-8 at all, and for this reason it does not seem to fit the general task of text processing.
For example, Owl_nlp_utils.regexp_split is defined as Str.regexp "[ \t;,.'!?()’“”\\/&—\\-]+". Note that the character class contains special quotation marks (and an em dash) that are each 3 bytes long, and Str adds their bytes to the class one by one.
This brings us to the problem: the second byte of each of these characters is \128 (0x80), a UTF-8 continuation byte that also occurs inside many other characters, so splitting on these loose bytes cuts unrelated multi-byte characters apart.
To solve this I propose switching to Pcre or a similar library that accepts Unicode regular expressions. What do you think?
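To make the failure mode concrete, here is a minimal sketch of mine (only the regexp itself comes from Owl; the input string and expected result are my own illustration). It splits a string containing an ellipsis, whose UTF-8 encoding shares its first two bytes with the curly quotes:

```ocaml
(* Requires the str library. Str is byte-oriented, so the curly quotes
   and the em dash contribute their individual bytes (0xE2, 0x80, 0x99,
   0x9C, 0x9D, 0x94) to the character class instead of whole characters. *)
let regexp_split = Str.regexp "[ \t;,.'!?()’“”\\/&—\\-]+"

let () =
  (* The ellipsis … is 0xE2 0x80 0xA6: its first two bytes are in the
     class, so the split cuts through the middle of the character. *)
  Str.split regexp_split "naïve… text"
  |> List.iter (fun tok -> Printf.printf "%S\n" tok)
  (* Expected: a stray continuation byte surfaces as its own "token",
     e.g. "na\195\175ve", "\166", "text" -- i.e. invalid UTF-8. *)
```

And a hedged sketch of what the proposed Pcre-based splitter could look like, assuming the ocaml-pcre bindings and their UTF8 compile flag; the tokenize helper and the exact pattern are only illustrative, not an agreed API:

```ocaml
(* Sketch only: with the `UTF8 flag the character class is interpreted
   over code points, so ’ “ ” — match as whole characters. *)
let regexp_split =
  Pcre.regexp ~flags:[ `UTF8 ] "[ \t;,.'!?()’“”\\\\/&—-]+"

let tokenize s = Pcre.split ~rex:regexp_split s
```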
This is just my personal opinion, but we should move away from Str and use re, which is the recommended library and the fastest, if it supports UTF-8 in some way.
Otherwise I agree with you that pcre could be a better fit.
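If we go the re route, here is a rough sketch (again mine, not existing Owl code) of how the same splitter could be expressed with re's combinators: since Re is also byte-oriented, the multi-byte separators are listed as literal strings rather than put into a character class, which at least keeps the split aligned with whole UTF-8 sequences:

```ocaml
(* Sketch only: ASCII separators go in a byte set, multi-byte ones are
   matched as literal strings, so no UTF-8 sequence is split apart. *)
let regexp_split =
  let ascii = Re.set " \t;,.'!?()\\/&-" in
  (* U+2019 ’, U+201C “, U+201D ”, U+2014 — *)
  let multibyte = List.map Re.str [ "’"; "“"; "”"; "—" ] in
  Re.compile (Re.rep1 (Re.alt (ascii :: multibyte)))

let tokenize s = Re.split regexp_split s
```

This does not give true Unicode character classes, only safe handling of the specific separators above, so it is more of a workaround than full UTF-8 support.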