Hi, I'm just wondering whether it's possible to fine-tune the model using target text that differs from SELFIES and would require a different tokenization strategy (with a different vocab). I'm not sure whether this case is already covered by the implementation or whether it would require substantial modification.
def data_init(self):
    self.tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
    # Strip the original BART subword vocabulary, keeping only the core special tokens.
    unwanted_words = [i for i in self.tokenizer.encoder.keys()]
    important_tokens = ['<s>', '<pad>', '</s>', '<unk>']
    unwanted_words = list(set(unwanted_words).difference(set(important_tokens)))
    for word in unwanted_words:
        del self.tokenizer.encoder[word]
    # Register the SELFIES vocabulary plus the mask token, then resize the
    # model's embedding matrix to match the new tokenizer size.
    selfies_tokens = np.load('../moldata/vocab_list/zinc.npy').tolist()
    self.tokenizer.add_tokens(selfies_tokens, special_tokens=False)
    self.tokenizer.add_tokens('<mask>', special_tokens=True)
    self.model.resize_token_embeddings(len(self.tokenizer))
    self.model.load_state_dict(torch.load(self.args.checkpoint_path, map_location='cpu'), strict=False)
I can see here (in finetune.py) that you modify the standard BART tokenizer to handle the SELFIES vocab. Would it be necessary to do something similar for tokenizer.decoder as well when using a different vocabulary?
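For what it's worth, this is the kind of sanity check I had in mind (my own snippet, not from the repo; '[C][C][O]' is just a made-up SELFIES-style string), run right after data_init() inside the same class:

    # Encode a made-up SELFIES-style string and decode it back, to see whether
    # decoding still behaves correctly after the encoder dict was pruned.
    selfies_str = '[C][C][O]'  # hypothetical example string
    ids = self.tokenizer(selfies_str)['input_ids']
    print(ids)
    print(self.tokenizer.decode(ids))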
Apologies if this is trivial.
You are correct that we modified the standard BART tokenizer to include SELFIES tokens during pre-training. To clarify, however, our pre-training was conducted from scratch using only SELFIES data, so the model was trained specifically to understand SELFIES tokens and their representations.
If you now wish to fine-tune the model on target text that requires a different tokenization strategy or vocabulary, the model will not understand those tokens, because no such data was included at the pre-training stage. Supporting that kind of text at this point would likely require substantial modifications and potentially re-training from scratch.
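Mechanically, extending the vocabulary is just a matter of adding tokens and resizing the embeddings, roughly as sketched below (illustrative only; new_vocab.txt is a placeholder for whatever token list you have). The catch is that the newly added embedding rows start out randomly initialized, which is why re-training on data containing those tokens would still be needed:

    from transformers import BartTokenizer, BartForConditionalGeneration

    tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

    # Hypothetical extra target-side tokens; 'new_vocab.txt' is a placeholder path.
    with open('new_vocab.txt') as f:
        new_tokens = [line.strip() for line in f if line.strip()]

    # add_tokens only registers tokens the tokenizer does not already know;
    # resize_token_embeddings then grows the embedding matrix to cover the new ids.
    num_added = tokenizer.add_tokens(new_tokens, special_tokens=False)
    model.resize_token_embeddings(len(tokenizer))

    # The added embedding rows are randomly initialized, so without (pre-)training
    # on data containing these tokens the model has no useful representation for them.
    print(f'added {num_added} new tokens, vocab size is now {len(tokenizer)}')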