
Tokenizer text recovery problem #20

Open
shabie opened this issue Mar 23, 2021 · 1 comment


shabie commented Mar 23, 2021

I am trying to recover the original text, but this is not possible because token.original_spelling for a token like :( does not preserve the original number of spaces.

Here is a motivating example:

import somajo
tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
paragraph = ["Angebotener Hersteller/Typ:   (vom Bieter einzutragen)  Im \
              Einheitspreis sind alle erforderlichen \
              Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)

This prints

Angebotener  -->  None
Hersteller  -->  None
/  -->  None
Typ  -->  None
:(  -->  : (
vom  -->  None
Bieter  -->  None
einzutragen  -->  None
)  -->  None
Im  -->  None
Einheitspreis  -->  None
sind  -->  None
alle  -->  None
erforderlichen  -->  None
Schutzmaßnahmen  -->  None
bei  -->  None
Errichtung  -->  None
des  -->  None
Brandschutzes  -->  None
einzukalkulieren  -->  None
.  -->  None

It would be great if this could somehow be resolved. Thanks!

@tsproisl (Owner)

It is currently not possible to perfectly reconstruct the input text from the output tokens as SoMaJo will normalize any whitespace to a single space and will discard things like control characters (see also issue #17).

How to best proceed from here depends on what you want to achieve. Do you want to be able to perfectly detokenize any text or do you want to address the particular tokenization error in your example, i.e. that colon and paren are erroneously merged into a single token? The former would require a lot more work than the latter.

Detokenization, alternative 1: SoMaJo could try to keep all the information that is necessary to reconstruct the original input. This might be feasible for whitespace. However, being able to do the same thing for some of the nasty characters that SoMaJo removes (control characters, soft hyphen, zero-width space, etc.) would require deeper changes.

Detokenization, alternative 2: You could solve the problem externally. The detokenize function from issue #17 almost solves the problem. It should be easy to capture the remaining differences between the detokenized text and the original input with some string alignment algorithm and to add the additional information to the tokens.
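The alignment step in alternative 2 could be sketched with difflib from the standard library. Note that detokenize below is a naive stand-in (joining token strings with single spaces), not the actual function from issue #17 or any SoMaJo API; it is only meant to illustrate how the residual differences (extra whitespace) can be recovered:

```python
import difflib

def detokenize(tokens):
    # Naive stand-in for the detokenize function from issue #17:
    # join token strings with single spaces.
    return " ".join(tokens)

def diff_against_original(original, tokens):
    # Align the detokenized text with the original input and collect
    # the segments that differ (here: runs of extra whitespace).
    detok = detokenize(tokens)
    matcher = difflib.SequenceMatcher(a=detok, b=original, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # (operation, text in detokenized output, text in original)
            diffs.append((tag, detok[i1:i2], original[j1:j2]))
    return diffs

original = "Hersteller/Typ:   (vom Bieter"
tokens = ["Hersteller", "/", "Typ", ":", "(", "vom", "Bieter"]
for diff in diff_against_original(original, tokens):
    print(diff)
```

The diff tuples could then be attached to the corresponding tokens to make the mapping back to the original input explicit. This does not recover characters that SoMaJo discards entirely (control characters, soft hyphens, etc.), which is why alternative 1 would need deeper changes.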

Addressing the tokenization error: Emoticons that contain an erroneous space should be quite rare. If you do not need to recognize them (for example, because regular sequences of colon, space and paren are much more frequent in your data), you could try to deactivate that feature of the tokenizer. Unfortunately, there is no API for doing that, but a small hack can do the trick: you can set the regular expression that recognizes emoticons with a space to something that never matches, e.g. r"$^" (end of string followed by beginning of string). Here is how you could do that:

import somajo
import regex as re

tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
tokenizer._tokenizer.space_emoticon = re.compile(r"$^")
paragraph = ["Angebotener Hersteller/Typ:   (vom Bieter einzutragen)  Im \
              Einheitspreis sind alle erforderlichen \
              Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)

And here is the output:

Angebotener  -->  None
Hersteller  -->  None
/  -->  None
Typ  -->  None
:  -->  None
(  -->  None
vom  -->  None
Bieter  -->  None
einzutragen  -->  None
)  -->  None
Im  -->  None
Einheitspreis  -->  None
sind  -->  None
alle  -->  None
erforderlichen  -->  None
Schutzmaßnahmen  -->  None
bei  -->  None
Errichtung  -->  None
des  -->  None
Brandschutzes  -->  None
einzukalkulieren  -->  None
.  -->  None
