Unexpected content length doc_length and duplicate_ngram_chr_fraction #357

Additionally, the results report a doc_length of 733, which cannot be right (assuming the docs are correct and it truly counts characters and nothing else).

I believe it is the number of spaCy tokens, not characters. So yes, the docs are wrong.
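
To illustrate why a token count comes out much smaller than a character count, here is a minimal sketch. It uses only the standard library, and `simple_tokenize` is a hypothetical regex stand-in for a spaCy tokenizer (spaCy's real tokenization rules differ), so the numbers are only indicative.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in for a spaCy tokenizer (an assumption for illustration).

    Splits the text into word-like tokens and single punctuation marks.
    """
    return re.findall(r"\w+|[^\w\s]", text)


text = "The docs say doc_length counts characters. It does not!"

n_chars = len(text)                     # what the docs describe
n_tokens = len(simple_tokenize(text))   # closer to what is apparently reported

print(f"characters: {n_chars}, tokens: {n_tokens}")
```

If doc_length were a character count, it would match `len(text)` for your input; a value far below that is consistent with a token count.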

For the other error: how does duplicate_ngram_chr_fraction_5 change when you apply the fix? The n-grams used are built from spaCy tokens, and spaCy sometimes treats e.g. a double whitespace as a separate token. I could imagine it doing something similar with special tokens.
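
To show how an extra whitespace token can change the result, here is a simplified sketch. It is not the library's implementation: `tokenize` and `duplicate_ngram_chr_fraction` below are hypothetical helpers written for illustration, and the metric is assumed to be the share of characters covered by token n-grams that occur more than once.

```python
import re
from collections import Counter


def tokenize(text, keep_whitespace_tokens):
    """Hypothetical tokenizer returning (token, start, end) triples.

    With keep_whitespace_tokens=True, runs of two or more whitespace
    characters become their own token, mimicking the behaviour described above.
    """
    pattern = r"\w+|[^\w\s]"
    if keep_whitespace_tokens:
        pattern += r"|\s{2,}"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


def duplicate_ngram_chr_fraction(text, n, keep_whitespace_tokens=False):
    """Fraction of characters covered by token n-grams that occur more than once."""
    tokens = tokenize(text, keep_whitespace_tokens)
    ngrams = [
        tuple(tok for tok, _, _ in tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
    ]
    counts = Counter(ngrams)

    covered = [False] * len(text)
    for i, gram in enumerate(ngrams):
        if counts[gram] > 1:
            start = tokens[i][1]          # first character of the n-gram
            end = tokens[i + n - 1][2]    # one past its last character
            for j in range(start, end):
                covered[j] = True
    return sum(covered) / len(text) if text else 0.0


# Note the double space between "two" and "three" in the second sentence.
text = "one two three four five. one two  three four five."

without_ws = duplicate_ngram_chr_fraction(text, 5, keep_whitespace_tokens=False)
with_ws = duplicate_ngram_chr_fraction(text, 5, keep_whitespace_tokens=True)

print(f"ignoring whitespace tokens: {without_ws:.2f}")
print(f"with whitespace tokens:     {with_ws:.2f}")
```

In this toy input, the whitespace token breaks every repeated 5-gram, so the fraction drops to zero even though the two sentences look identical to a reader. A special token that gets split unexpectedly could shift the value in a similar way.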

Answer selected by smileBeda
@KennethEnevoldsen
Category
Q&A
Labels
None yet
2 participants