unexpected similarity based on embedding #34

Open
VedaHung opened this issue Aug 23, 2024 · 0 comments
Dear experts,

Thank you for building the model for Chinese.
I tried to use your model to calculate semantic similarity (see the code below):
import torch
from transformers import BertTokenizerFast, AutoModel
# Note: the original snippet does not show where cosine_similarity comes from;
# torch.nn.functional.cosine_similarity is assumed here.
from torch.nn.functional import cosine_similarity

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
albert_model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese')

def encode_text(text):
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    input_ids = text_code['input_ids']
    attention_mask = text_code['attention_mask']
    token_type_ids = text_code['token_type_ids']
    print('input_ids', input_ids)
    print('attention_mask', attention_mask)
    print('token_type_ids', token_type_ids)
    with torch.no_grad():
        output = albert_model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    embed = output.pooler_output
    return embed

cs1 = cosine_similarity(encode_text('蘋果'), encode_text('鳳梨'))  # apple vs. pineapple
cs2 = cosine_similarity(encode_text('蘋果'), encode_text('塑膠'))  # apple vs. plastic

I expected to see cs1 > cs2, but that is not the case. How do you interpret results where the higher similarity occurs between unrelated words? And what can I do to get better semantic-relatedness results from your model? Thanks!
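For what it's worth, below is a minimal sketch of an alternative I am considering: mean-pooling the last hidden state over non-padding tokens instead of using pooler_output. This is my own assumption about what might help, not something documented for ckiplab/albert-tiny-chinese, and the helper name mean_pool_encode is mine.

# Sketch of an alternative encoder (assumption on my part, not from the repository):
# average last_hidden_state over real (non-padding) tokens instead of using pooler_output.
def mean_pool_encode(text):
    enc = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = albert_model(**enc)
    hidden = out.last_hidden_state                      # shape (1, seq_len, hidden_size)
    mask = enc['attention_mask'].unsqueeze(-1).float()  # shape (1, seq_len, 1)
    # Sum the hidden states of real tokens and divide by the number of real tokens.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

cs1 = cosine_similarity(mean_pool_encode('蘋果'), mean_pool_encode('鳳梨'))  # apple vs. pineapple
cs2 = cosine_similarity(mean_pool_encode('蘋果'), mean_pool_encode('塑膠'))  # apple vs. plastic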

Sincerely,
Veda
