Vocabulary file handling #57

Open
mh-northlander opened this issue Mar 10, 2023 · 0 comments
JapaneseWordPieceTokenizer, which we use to build the vocabulary, recognizes '\n' (and ' ') as tokens. BertSudachipyTokenizer, however, removes them from its tokenization results. Currently we simply ignore those tokens (and the problems this causes, see #54).

  1. We may need error handling for vocab file corruption.

  2. It may be better to actually use those tokens.
    In that case we need a new vocab file format (the current txt format cannot represent '\n').
    We would also need to modify the chiTra tokenizer and reconsider the corpus-cleaning processes involving those tokens.

  3. If we decide not to use those tokens, we should remove them during vocab building.
