Unexpected content length doc_length and duplicate_ngram_chr_fraction #357
-
Additionally the results indicate This is the source text, exactly as used:
I am at a loss to understand why the duplicate flag does not pass, and why doc_length reports just 700+ when the text is really about 3000+ characters.
-
... I can solve the first issue with something like:
Apparently it has some trouble with invisible characters or newlines? However, the second issue persists even after the above cleanse, which results in a single-line string.
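For readers following along, here is a minimal sketch of the kind of cleanse described above. It is not the snippet from the original comment; the helper name `cleanse` and the choice of which Unicode categories to drop are assumptions made for illustration only.

```python
import re
import unicodedata


def cleanse(text: str) -> str:
    """Hypothetical helper: drop invisible characters and collapse whitespace."""
    # Remove control (Cc) and format (Cf) characters such as zero-width
    # spaces or byte-order marks, but keep anything Python considers
    # whitespace so that word boundaries survive.
    visible = "".join(
        ch
        for ch in text
        if ch.isspace() or unicodedata.category(ch) not in {"Cc", "Cf"}
    )
    # Collapse every run of whitespace (newlines included) into one space,
    # which yields a single-line string.
    return re.sub(r"\s+", " ", visible).strip()


sample = "a\u200bb  c\n\nd\ufeff"
print(repr(cleanse(sample)))
```

Whether this exact set of characters is what trips up the metric is a guess; it is worth comparing the metric before and after on your own text.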
-
I believe it is the number of spaCy tokens, so yes, the docs are wrong. As for the other error: how does duplicate_ngram_chr_fraction_5 change when applying the fix? The n-grams used are spaCy token n-grams, and spaCy sometimes treats e.g. double white spaces as a separate token. I could imagine it does something similar with special tokens.
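To make the two points concrete, here is a rough, self-contained sketch. It does not use spaCy and is not the library's implementation: `tokenize` is a regex stand-in for a real tokenizer, and `duplicate_ngram_char_fraction` is a simplified, hypothetical version of a "fraction of characters inside repeated token n-grams" metric. It only shows why a token count is far smaller than a character count, and how repeated token sequences push the fraction up.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer (NOT spaCy): words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def duplicate_ngram_char_fraction(tokens: list[str], n: int) -> float:
    """Simplified sketch: share of token characters that lie inside
    a token n-gram occurring more than once."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(tokens)
    for i, gram in enumerate(ngrams):
        if counts[gram] > 1:
            # Mark every token position that belongs to a repeated n-gram.
            for j in range(i, i + n):
                covered[j] = True
    total = sum(len(t) for t in tokens)
    duplicated = sum(len(t) for t, c in zip(tokens, covered) if c)
    return duplicated / total if total else 0.0


text = "the cat sat. the cat sat. a dog ran"
tokens = tokenize(text)
print(len(text), "characters vs", len(tokens), "tokens")
print(duplicate_ngram_char_fraction(tokens, 2))
```

With a real spaCy pipeline the token boundaries will differ (for instance around extra whitespace), so the numbers above are only indicative.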
-
Yes, I can solve the n-gram issue with the "fix" I mentioned (basically just stripping out whatever is not expected in a "normal" text). "Fix" here means the value becomes much lower, as expected. I guess this is solved then, pending a docs update!