unexpected similarity based on embedding #34

Open
VedaHung opened this issue Aug 23, 2024 · 0 comments
Dear experts,

Thank you for building the model for Chinese.
I tried to use your model to calculate semantic similarity (see the code below):
import torch
from transformers import BertTokenizerFast, AutoModel
# Note: the original snippet does not show where cosine_similarity comes from;
# torch.nn.functional.cosine_similarity is assumed here.
from torch.nn.functional import cosine_similarity

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
albert_model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese')

def encode_text(text):
    text_code = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    input_ids = text_code['input_ids']
    attention_mask = text_code['attention_mask']
    token_type_ids = text_code['token_type_ids']
    print('input_ids', input_ids)
    print('attention_mask', attention_mask)
    print('token_type_ids', token_type_ids)
    with torch.no_grad():
        output = albert_model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    embed = output.pooler_output
    return embed

cs1 = cosine_similarity(encode_text('蘋果'), encode_text('鳳梨'))  # apple vs. pineapple
cs2 = cosine_similarity(encode_text('蘋果'), encode_text('塑膠'))  # apple vs. plastic

I expected to see cs1 > cs2, but that is not the case. How do you interpret results where the higher similarity occurs between unrelated words? And what can I do to get better semantic-relatedness results from your model? Thanks!
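For what it's worth, below is a minimal sketch of an alternative I am considering: mean-pooling the last hidden state over non-padding tokens instead of using pooler_output. This is my own assumption about what might help, not something documented for ckiplab/albert-tiny-chinese, and the helper name mean_pool_encode is mine.

# Sketch of an alternative encoder (assumption on my part, not from the repository):
# average last_hidden_state over real (non-padding) tokens instead of using pooler_output.
def mean_pool_encode(text):
    enc = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = albert_model(**enc)
    hidden = out.last_hidden_state                      # shape (1, seq_len, hidden_size)
    mask = enc['attention_mask'].unsqueeze(-1).float()  # shape (1, seq_len, 1)
    # Sum the hidden states of real tokens and divide by the number of real tokens.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

cs1 = cosine_similarity(mean_pool_encode('蘋果'), mean_pool_encode('鳳梨'))  # apple vs. pineapple
cs2 = cosine_similarity(mean_pool_encode('蘋果'), mean_pool_encode('塑膠'))  # apple vs. plastic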

Sincerely,
Veda
