Tokenizer text recovery problem #20
I am trying to recover the text, but it is not possible, since token.original_spelling for a token such as ":(" does not contain the original number of spaces. Here is a motivating example:
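The example itself did not survive in this copy of the issue; the following is a minimal reconstruction, assuming the same input paragraph that is quoted in the reply below (everything except the SoMaJo calls shown in that reply is guesswork):

import somajo

tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
# "Typ: (" contains the colon-space-paren sequence that triggers the problem
paragraph = ["Angebotener Hersteller/Typ: (vom Bieter einzutragen)"]
for sentence in tokenizer.tokenize_text(paragraph):
    for token in sentence:
        print(token, " --> ", token.original_spelling)

This prints each token next to its original_spelling; the colon and the paren come back merged into a single token whose original_spelling contains exactly one space, regardless of the whitespace that actually stood between them in the input.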
It would be great if this could somehow be resolved. Thanks!

Comments

It is currently not possible to perfectly reconstruct the input text from the output tokens, as SoMaJo normalizes any whitespace to a single space and discards things like control characters (see also issue #17). How best to proceed from here depends on what you want to achieve: do you want to be able to perfectly detokenize any text, or do you want to address the particular tokenization error in your example, i.e. that colon and paren are erroneously merged into a single token? The former would require a lot more work than the latter.

Detokenization, alternative 1: SoMaJo could try to keep all the information that is necessary to reconstruct the original input. This might be feasible for whitespace. However, being able to do the same thing for some of the nasty characters that SoMaJo removes (control characters, soft hyphen, zero-width space, etc.) would require deeper changes.

Detokenization, alternative 2: You could solve the problem externally, for example by aligning the output tokens with the untokenized raw string; see the sketch at the end of this comment.

Addressing the tokenization error: Emoticons that contain an erroneous space should be quite rare. If you do not need to recognize them (for example because regular sequences of colon, space and paren are much more frequent in your data), you could try to deactivate that feature of the tokenizer. Unfortunately, there is no API for doing that, but a small hack can do the trick: you can set the regular expression that recognizes emoticons with a space to something that never matches, e.g.:

import somajo
import regex as re
tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
# r"$^" cannot match anywhere in ordinary text, so this effectively
# switches off the recognition of emoticons that contain a space
tokenizer._tokenizer.space_emoticon = re.compile(r"$^")
paragraph = ["Angebotener Hersteller/Typ: (vom Bieter einzutragen) Im \
Einheitspreis sind alle erforderlichen \
Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)

With the hack in place, the colon and the paren now come out as two separate tokens in the output.
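As for detokenization alternative 2, the external route is not spelled out in the thread as preserved here. One plausible reading, sketched below under that assumption: align the token texts with the raw input string and recover the original whitespace from character offsets. The helper align_tokens is hypothetical, not part of SoMaJo:

import somajo

def align_tokens(raw, token_texts):
    # Locate each token text in the raw string, scanning left to right,
    # and record its (start, end) character span; None if not found.
    spans = []
    pos = 0
    for tok in token_texts:
        idx = raw.find(tok, pos)
        if idx == -1:
            spans.append(None)
            continue
        spans.append((idx, idx + len(tok)))
        pos = idx + len(tok)
    return spans

raw = "Angebotener Hersteller/Typ: (vom Bieter einzutragen)"
tokenizer = somajo.SoMaJo("de_CMC")
for sentence in tokenizer.tokenize_text([raw]):
    print(align_tokens(raw, [token.text for token in sentence]))

The whitespace that originally stood between two located neighbours is raw[previous_end:next_start]. A None span marks a token, such as the merged emoticon, whose surface form no longer occurs verbatim in the raw string; those are exactly the cases where external alignment breaks down too, which is why alternative 1 would require changes inside SoMaJo itself.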