Results difficult to explain #16

Open
lbonansbrux opened this issue Apr 24, 2020 · 2 comments
@lbonansbrux

Dear Rob,
I do not know whether this is a bug, but I am getting strange results, as shown below.
I compare the embeddings of two words, and the average absolute difference (over the 768 values) is lower for unrelated words than for synonyms.

I would have expected a lower difference for rich and a greater one for poor. Where am I going wrong?
Thank you.

Example 1:

String 1: wealthy
String 2: poor
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.23239951	2.0
-0.0073103756	-0.057594057	5.0
0.09099525	0.11997495	3.0
...
Absolute difference average : 8

Example 2:

String 1: wealthy
String 2: blue
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.29995522	9.0
-0.0073103756	-0.19767939	19.0
...
Absolute difference average : 16

Example 3:

String 1: wealthy
String 2: rich
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.14642045	7.0
-0.0073103756	-0.108990476	10.0
0.09099525	0.25123212	16.0
0.069340415	-0.12602457	20.0
...
Absolute difference average : 11

Example 4:

String 1: wealthy
String 2: black
Embedding 1	Embedding 2	100 * absolute difference
0.21383394	0.22277042	1.0
-0.0073103756	-0.25720397	25.0
0.09099525	0.16640717	8.0
...
Absolute difference average : 11
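
For reference, this is roughly how the comparison is computed; a minimal sketch assuming easy-bert's Bert.load / embedSequence API, where the model path is only an example:

```java
import com.robrua.nlp.bert.Bert;

public class EmbeddingDiff {
    public static void main(String[] args) throws Exception {
        // Model path taken from the easy-bert README; substitute whichever model is actually in use
        try (Bert bert = Bert.load("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12")) {
            float[] a = bert.embedSequence("wealthy");
            float[] b = bert.embedSequence("rich");

            // Average absolute difference over the 768 embedding dimensions, scaled by 100
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += Math.abs(a[i] - b[i]);
            }
            System.out.println("Absolute difference average: " + (100.0 * sum / a.length));
        }
    }
}
```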
@robrua (Owner) commented Apr 25, 2020

Hey,

Are you using the token embeddings or the sequence embeddings in this case?

In my experience, the BERT sequence embeddings in particular (but sometimes also the token embeddings) don't do as good a job in raw distance calculations for semantic similarity as some other models. This is basically just a result of the tasks BERT is trained for and the transformer architecture it uses. Generally you might have better luck with cosine distance, as you won't have to worry about effects from embedding magnitudes.

That said, if you're looking to do this sort of thing (especially with individual words), you might want to look into a different model like Universal Sentence Encoder, ELMo, GloVe, etc. that's designed to better support semantic similarity with simple distance metrics.
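
For what it's worth, a minimal sketch of cosine similarity computed directly on two float[] embeddings of the same length:

```java
public final class CosineSimilarity {
    // Cosine similarity of two equal-length embeddings; a value close to 1 means the vectors
    // point in nearly the same direction, regardless of their magnitudes.
    public static double of(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```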

@lbonansbrux (Author)

Hi Rob!

Thanks for your feedback.

I am using the sequence embedding, which returns a float[]. The token embeddings return a float[][], and I don't know what to do with that to compute a cosine similarity. Any idea?

Following your advice, the cosine similarity does indeed seem more reliable than a Manhattan or Euclidean distance, as shown in the Series 1 examples below. Note, however, that in Series 1 poor still scores a slightly higher similarity with rich than wealthy does.
It also remains unsatisfying in the Series 2 examples, where poor is far closer to wealthy than rich is.
Unless the situation improves with token embeddings, I am not sure that similarity in an n-dimensional space lets me conclude precisely whether two words share the same meaning.

I will try other models (which may be more difficult to use from Java, but that is another story).
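
One thing I might try for the float[][] case is mean-pooling the token embeddings into a single float[] and then applying the same cosine similarity; a rough, untested sketch:

```java
public final class MeanPool {
    // Averages the per-token embeddings (one row per token) into a single vector,
    // which can then be compared with cosine similarity like a sequence embedding.
    public static float[] of(float[][] tokenEmbeddings) {
        int dims = tokenEmbeddings[0].length;
        float[] pooled = new float[dims];
        for (float[] token : tokenEmbeddings) {
            for (int i = 0; i < dims; i++) {
                pooled[i] += token[i];
            }
        }
        for (int i = 0; i < dims; i++) {
            pooled[i] /= tokenEmbeddings.length;
        }
        return pooled;
    }
}
```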

===================================================
EXAMPLES - FIRST SERIES
===================================================
Strings 1 & 2 : rich	wealthy
Embedding 1	Embedding 2	|Difference|
0.14642045	0.21383394	0.06741349399089813
-0.108990476	-0.0073103756	0.10168009996414185
0.25123212	0.09099525	0.16023686528205872
...
Cosine similarity = 0.8456862351607601
Manhattan distance : 87.27405425067991
Euclidean distance : 4.0416941747361665
===================================================
Strings 1 & 2 : rich	poor
Embedding 1	Embedding 2	|Difference|
0.14642045	0.23239951	0.08597905933856964
-0.108990476	-0.057594057	0.05139641836285591
0.25123212	0.11997495	0.13125717639923096
...
Cosine similarity = 0.8495916385797545
Manhattan distance : 87.00668240943924
Euclidean distance : 4.075972602921788
===================================================
Strings 1 & 2 : rich	yellow
Embedding 1	Embedding 2	|Difference|
0.14642045	0.22856697	0.0821465253829956
-0.108990476	-0.30353695	0.19454647600650787
0.25123212	0.27586222	0.024630099534988403
...
Cosine similarity = 0.8746363539860872
Manhattan distance : 83.51396567281336
Euclidean distance : 3.9368254652299233
===================================================
Strings 1 & 2 : rich	blue
Embedding 1	Embedding 2	|Difference|
0.14642045	0.29995522	0.15353477001190186
-0.108990476	-0.19767939	0.0886889100074768
0.25123212	0.30732605	0.05609393119812012
...
Cosine similarity = 0.8479855400362315
Manhattan distance : 97.23932060459629
Euclidean distance : 4.635928430207843
===================================================
Strings 1 & 2 : rich	dumb
Embedding 1	Embedding 2	|Difference|
0.14642045	0.12908244	0.01733800768852234
-0.108990476	-0.031146867	0.07784360647201538
0.25123212	0.24095681	0.010275304317474365
...
Cosine similarity = 0.8809615766086131
Manhattan distance : 77.28109940420836
Euclidean distance : 3.688068948233406
===================================================
EXAMPLES - SECOND SERIES
===================================================
Strings 1 & 2 : wealthy	rich
Embedding 1	Embedding 2	|Difference|
0.21383394	0.14642045	0.06741349399089813
-0.0073103756	-0.108990476	0.10168009996414185
0.09099525	0.25123212	0.16023686528205872
...
Cosine similarity = 0.8456862351607601
Manhattan distance : 87.27405425067991
Euclidean distance : 4.0416941747361665
===================================================
Strings 1 & 2 : wealthy	poor
Embedding 1	Embedding 2	|Difference|
0.21383394	0.23239951	0.01856556534767151
-0.0073103756	-0.057594057	0.050283681601285934
0.09099525	0.11997495	0.028979696333408356
...
Cosine similarity = 0.9146569176233622
Manhattan distance : 64.79049000190571
Euclidean distance : 2.9332529214490783
===================================================
Strings 1 & 2 : wealthy	yellow
Embedding 1	Embedding 2	|Difference|
0.21383394	0.22856697	0.014733031392097473
-0.0073103756	-0.30353695	0.2962265610694885
0.09099525	0.27586222	0.18486696481704712
...
Cosine similarity = 0.7631069329907343
Manhattan distance : 107.96488573867828
Euclidean distance : 5.212087989639763
===================================================
Strings 1 & 2 : wealthy	blue
Embedding 1	Embedding 2	|Difference|
0.21383394	0.29995522	0.08612127602100372
-0.0073103756	-0.19767939	0.19036900997161865
0.09099525	0.30732605	0.21633079648017883
...
Cosine similarity = 0.7371959850353489
Manhattan distance : 124.55361186526716
Euclidean distance : 5.906763527768454
===================================================
Strings 1 & 2 : wealthy	dumb
Embedding 1	Embedding 2	|Difference|
0.21383394	0.12908244	0.08475150167942047
-0.0073103756	-0.031146867	0.023836491629481316
0.09099525	0.24095681	0.14996156096458435
...
Cosine similarity = 0.7449719286008458
Manhattan distance : 101.83741049654782
Euclidean distance : 5.109428840001488
===================================================
