This repository has been archived by the owner on Jan 27, 2024. It is now read-only.

[SEP] token removed from input #2

Closed
shamilcm opened this issue Sep 12, 2019 · 4 comments

Comments

@shamilcm

The dataset reader seems to trim away the [SEP] token from the input. Should it be removed?

tokens = ' '.join([src_A, src_B]).split()

@goncalomcorreia
Collaborator

Hi! Yes, it is removed since the pytorch-pretrained-bert API takes the tokenized list of strings with segments A and B (without the [SEP] token) and a segment ID vector where 0 denotes segment A and 1 denotes segment B. In my code, the [SEP] token is only there to be able to construct this segment ID vector :)
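The approach described above can be sketched as follows. This is a hypothetical illustration of the idea, not the repository's actual dataset-reader code: the `[SEP]` marker is used only to split the token list into segments A and B, after which it is dropped and the segment-ID vector (0 for segment A, 1 for segment B) is built.

```python
# Hypothetical sketch of the described approach: [SEP] is used only
# to locate the segment boundary, then removed from the input.

def build_segments(tokens):
    """Split on the first '[SEP]'; return tokens without it plus segment IDs."""
    sep = tokens.index('[SEP]')
    segment_a = tokens[:sep]
    segment_b = tokens[sep + 1:]
    merged = segment_a + segment_b
    # 0 marks segment A, 1 marks segment B, as the old API expected.
    segment_ids = [0] * len(segment_a) + [1] * len(segment_b)
    return merged, segment_ids

tokens = ['hello', 'world', '[SEP]', 'foo', 'bar']
merged, seg_ids = build_segments(tokens)
# merged  -> ['hello', 'world', 'foo', 'bar']
# seg_ids -> [0, 0, 1, 1]
```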

@shamilcm
Author

The example in the README for pytorch-pretrained-bert (for v6.1: https://github.com/huggingface/pytorch-transformers/blob/8f46cd105752c1f1218a2716ea423454273ff08b/README.md) also includes the [SEP] tokens when constructing the segment IDs, as in the paper:

assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

Is the removal also done for bert-base-multilingual-cased?
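The newer convention shown in the quoted README can be reproduced programmatically. The helper below is a hypothetical sketch (not library code): the `[SEP]` tokens are kept in the input, and the first `[SEP]` is included in segment A when assigning segment IDs.

```python
# Sketch of the newer convention from the quoted README: [SEP] stays
# in the input, and segment IDs cover it as well.

def build_segments_with_sep(tokens):
    """Assign segment ID 0 up to and including the first [SEP], 1 after."""
    first_sep = tokens.index('[SEP]')
    return [0 if i <= first_sep else 1 for i in range(len(tokens))]

tokenized_text = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
                  'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
segments_ids = build_segments_with_sep(tokenized_text)
# -> [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```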

@goncalomcorreia
Collaborator

Thanks for noticing this! The code was made for an older version of pytorch-pretrained-bert.

It seems the API no longer works this way. This is how it worked before:

https://github.com/huggingface/pytorch-transformers/blob/d821358884e45e92164a7bc773e4bc47eed1b591/README.md

@goncalomcorreia
Collaborator

The version this code works with is 0.5.1.
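To reproduce the environment the code was written for, the old package can be pinned to that release (assuming installation from PyPI):

```shell
# Pin the legacy package to the version the repository targets.
pip install pytorch-pretrained-bert==0.5.1
```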
