`JapaneseWordPieceTokenizer`, which we use to build the vocabulary, recognizes '\n' (or ' ') as a token. `BertSudachipyTokenizer`, however, removes them from the tokenization results.
Currently we simply ignore those tokens (and the problems caused by that, see #54).
We may need some error handling for the resulting vocab file corruption.
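To make the corruption concrete, here is a minimal sketch of what happens when a '\n' token ends up in a one-token-per-line txt vocab file, plus a cheap load-time sanity check we could add. The token list and file name are hypothetical:

```python
# Hypothetical vocab containing a newline token.
tokens = ["[UNK]", "\n", "word"]

with open("vocab.txt", "w", encoding="utf-8") as f:
    for token in tokens:
        f.write(token + "\n")  # the '\n' token writes an extra blank line

with open("vocab.txt", encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]

# 3 tokens written, 4 lines read back: ['[UNK]', '', '', 'word'].
# The '\n' entry collapses into two empty strings, and every token id
# after it is shifted by one.
assert len(loaded) != len(tokens)

# A simple check we could run when loading a vocab file:
if any(tok == "" for tok in loaded):
    raise ValueError("empty vocab entry; file may be corrupted by whitespace tokens")
```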
It may be better to make use of those tokens.
In that case we need to prepare a new vocab file format (the current txt format stores one token per line, so it cannot handle '\n').
We would also need to modify the chiTra tokenizer and reconsider the corpus cleaning processes related to those tokens.
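As one possible direction (not something decided here), a JSON-based vocab file would escape '\n' and ' ' safely, so whitespace tokens survive a save/load round trip:

```python
import json

# Sketch of a whitespace-safe vocab format. JSON escapes '\n' and ' ',
# unlike the line-based txt format. Purely illustrative.
tokens = ["[UNK]", "\n", " ", "word"]

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(tokens, f, ensure_ascii=False)

with open("vocab.json", encoding="utf-8") as f:
    loaded = json.load(f)

assert loaded == tokens  # round trip preserves the whitespace tokens
```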
If we decide not to use those tokens, we should instead remove them during vocab building.
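A sketch of that option: filter whitespace-only tokens before writing the vocab file, so they never reach the txt format at all. `raw_vocab` here is a hypothetical stand-in for whatever `JapaneseWordPieceTokenizer` produces:

```python
# Hypothetical raw vocab from the wordpiece trainer.
raw_vocab = ["[UNK]", "[CLS]", "\n", " ", "word", "##piece"]

# Drop tokens that consist only of whitespace.
clean_vocab = [tok for tok in raw_vocab if tok.strip() != ""]
# -> ['[UNK]', '[CLS]', 'word', '##piece']

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(clean_vocab) + "\n")
```

This keeps the vocab file consistent with what `BertSudachipyTokenizer` can actually emit, at the cost of never representing whitespace.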