Fine-tuning on target text requiring different tokenization/vocabulary #23

fedeotto · 2025-01-17T10:58:04Z

Hi, I'm wondering whether it's possible to fine-tune the model on target text that is not SELFIES and would therefore require a different tokenization strategy (with a different vocab). It's not clear to me whether this case is already covered by the implementation or would require substantial modification.

import numpy as np
import torch
from transformers import BartTokenizer

def data_init(self):
    # Start from the stock BART tokenizer
    self.tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
    unwanted_words = [i for i in self.tokenizer.encoder.keys()]

    # Drop every original vocab entry except the core special tokens
    important_tokens = ['<s>', '<pad>', '</s>', '<unk>']
    unwanted_words = list(set(unwanted_words).difference(set(important_tokens)))
    for word in unwanted_words:
        del self.tokenizer.encoder[word]
    # Add the SELFIES vocabulary and the mask token
    selfies_tokens = np.load('../moldata/vocab_list/zinc.npy').tolist()
    self.tokenizer.add_tokens(selfies_tokens, special_tokens=False)
    self.tokenizer.add_tokens('<mask>', special_tokens=True)
    # Match the embedding table to the new vocab size, then load the checkpoint
    self.model.resize_token_embeddings(len(self.tokenizer))
    self.model.load_state_dict(torch.load(self.args.checkpoint_path, map_location='cpu'), strict=False)

I can see here (in finetune.py) that you modify the standard BART tokenizer to handle the SELFIES vocab. With a different vocabulary, would it be necessary to do something similar for tokenizer.decoder as well?
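To make the question concrete, here is a minimal sketch that uses plain dicts as stand-ins for tokenizer.encoder and tokenizer.decoder. It assumes the decoder is simply the inverse (id to token) map of the encoder. The toy tokens and ids are made up for illustration and are not the real BART vocabulary.

```python
# Toy stand-ins for tokenizer.encoder (token -> id) and tokenizer.decoder (id -> token).
# Assumption: the decoder is the inverse of the encoder; values here are invented.
encoder = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "hello": 4, "world": 5}
decoder = {idx: tok for tok, idx in encoder.items()}

important_tokens = {"<s>", "<pad>", "</s>", "<unk>"}

# Prune the encoder only, mirroring the loop in the snippet above.
for word in set(encoder) - important_tokens:
    del encoder[word]

# The decoder still holds the removed entries until it is pruned as well.
stale_ids = sorted(set(decoder) - set(encoder.values()))

# Rebuilding the inverse map keeps both directions consistent.
decoder = {idx: tok for tok, idx in encoder.items()}
```

In this toy setup, ids 4 and 5 remain in the decoder after pruning the encoder, which is the inconsistency the question is about. Whether it matters in practice depends on how the real tokenizer resolves ids when decoding.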

Apologies if this is trivial.


Fangyinfff · 2025-01-18T05:43:52Z

Thank you for your interest in our work!

You are correct that we modified the standard BART tokenizer to include SELFIES tokens during pre-training. However, to clarify, our pre-training was conducted from scratch using only SELFIES data. As a result, the model was specifically trained to understand SELFIES tokens and their representations.

If you now wish to fine-tune the model on target text that requires a different tokenization strategy or vocabulary, the model will not be able to understand those tokens, because this type of data was not included in the pre-training stage. Incorporating such text now, without having pre-trained the model on it, would likely require substantial modifications and possibly re-training from scratch.
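As a rough illustration of this point, the sketch below uses a hypothetical toy vocabulary in place of the real SELFIES vocab. The `encode` helper and all tokens are assumptions made for the example, not code from this repository. It shows that tokens outside the pre-trained vocab collapse to `<unk>`, and that extending the vocab only assigns new ids whose embeddings the pre-trained model has never learned.

```python
# Hypothetical toy vocabulary standing in for the SELFIES vocab used in pre-training.
pretrained_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
                    "[C]": 4, "[O]": 5, "[=C]": 6}

def encode(tokens, vocab):
    """Map tokens to ids, falling back to <unk> for anything outside the vocab."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# SELFIES-like target: every token has an id seen during pre-training.
selfies_ids = encode(["[C]", "[=C]", "[O]"], pretrained_vocab)

# A different target format (made-up tokens): everything collapses to <unk>.
other_ids = encode(["Na", "Cl", "2"], pretrained_vocab)

# Extending the vocab assigns ids to the new tokens, but those ids lie beyond
# the pre-trained embedding table, so their embeddings would start untrained.
extended_vocab = dict(pretrained_vocab)
for tok in ["Na", "Cl", "2"]:
    extended_vocab.setdefault(tok, len(extended_vocab))
new_ids = encode(["Na", "Cl", "2"], extended_vocab)
untrained_ids = [i for i in new_ids if i >= len(pretrained_vocab)]
```

Here every id produced for the new target format is either `<unk>` or falls outside the range covered by pre-training, which is why fine-tuning alone is unlikely to be enough.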

I hope this helps.
