This repository has been archived by the owner on Jan 27, 2024. It is now read-only.

[SEP] token removed from input #2

Closed
shamilcm opened this issue Sep 12, 2019 · 4 comments

Comments

@shamilcm

The dataset reader seems to trim away the [SEP] token from the input. Should it be removed?

tokens = ' '.join([src_A, src_B]).split()

@goncalomcorreia
Collaborator

Hi! Yes, it is removed since the pytorch-pretrained-bert API takes the tokenized list of strings with segments A and B (without the [SEP] token) and a segment ID vector where 0 denotes segment A and 1 denotes segment B. In my code, the [SEP] token is only there to be able to construct this segment ID vector :)
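The approach described above can be sketched as follows. This is a hypothetical illustration of the idea, not the repository's actual dataset-reader code: the `[SEP]` marker is used only to split the token list into segments A and B, after which it is dropped and the segment-ID vector (0 for segment A, 1 for segment B) is built.

```python
# Hypothetical sketch of the described approach: [SEP] is used only
# to locate the segment boundary, then removed from the input.

def build_segments(tokens):
    """Split on the first '[SEP]'; return tokens without it plus segment IDs."""
    sep = tokens.index('[SEP]')
    segment_a = tokens[:sep]
    segment_b = tokens[sep + 1:]
    merged = segment_a + segment_b
    # 0 marks segment A, 1 marks segment B, as the old API expected.
    segment_ids = [0] * len(segment_a) + [1] * len(segment_b)
    return merged, segment_ids

tokens = ['hello', 'world', '[SEP]', 'foo', 'bar']
merged, seg_ids = build_segments(tokens)
# merged  -> ['hello', 'world', 'foo', 'bar']
# seg_ids -> [0, 0, 1, 1]
```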

@shamilcm
Author

The example in the README for pytorch-pretrained-bert (for v6.1: https://github.com/huggingface/pytorch-transformers/blob/8f46cd105752c1f1218a2716ea423454273ff08b/README.md) also includes the [SEP] tokens when constructing the segment IDs, as in the paper:

assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

Is the removal also done for bert-base-multilingual-cased?
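The newer convention shown in the quoted README can be reproduced programmatically. The helper below is a hypothetical sketch (not library code): the `[SEP]` tokens are kept in the input, and the first `[SEP]` is included in segment A when assigning segment IDs.

```python
# Sketch of the newer convention from the quoted README: [SEP] stays
# in the input, and segment IDs cover it as well.

def build_segments_with_sep(tokens):
    """Assign segment ID 0 up to and including the first [SEP], 1 after."""
    first_sep = tokens.index('[SEP]')
    return [0 if i <= first_sep else 1 for i in range(len(tokens))]

tokenized_text = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
                  'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
segments_ids = build_segments_with_sep(tokenized_text)
# -> [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```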

@goncalomcorreia
Collaborator

Thanks for noticing this! The code was made for an older version of pytorch-pretrained-bert.

It seems the API no longer works this way. This is how it worked before:

https://github.com/huggingface/pytorch-transformers/blob/d821358884e45e92164a7bc773e4bc47eed1b591/README.md

@goncalomcorreia
Collaborator

The version this code works with is 0.5.1.
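To reproduce the environment the code was written for, the old package can be pinned to that release (assuming installation from PyPI):

```shell
# Pin the legacy package to the version the repository targets.
pip install pytorch-pretrained-bert==0.5.1
```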
