Another practical use case for the algorithm: suppose you have a huge list of customer names and addresses and want to find out whether some customers accidentally got assigned more than one customer id over the years. From my practical experience with exactly such a dataset, only a low single-digit percentage of the customer numbers are indeed potential duplicates, and for each individual customer one can safely assume there are no more than 3 or 4 different numbers.
In those situations one doesn't have to search for potential matches between each element of list A and each element of a different list B, but between all elements of list A and itself. Here one could imho save roughly 50% of the scalar product computations by calculating only the upper or lower triangle of A x A^T and ignoring the main diagonal.
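To illustrate the idea (this is not the library's actual API, just a minimal numpy/scikit-learn sketch): if the rows of A are L2-normalised TF-IDF vectors, then A x A^T is the cosine similarity matrix, and one only needs the dot products of row i against rows i+1..n-1, i.e. the strict upper triangle. The names, threshold, and n-gram settings below are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy customer list; in practice this would be the huge name/address list.
names = [
    "ACME Corp., 1 Main St.",
    "Acme Corporation, 1 Main Street",
    "Globex Inc., 42 Elm Rd.",
]

# Character n-gram TF-IDF; rows are L2-normalised by default,
# so plain dot products between rows are cosine similarities.
A = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(names)

threshold = 0.8  # illustrative cut-off for "potential duplicate"

# Compute only the strict upper triangle of A @ A.T:
# row i is compared against rows i+1..n-1, skipping the diagonal,
# which saves roughly half of the scalar product computations.
for i in range(A.shape[0] - 1):
    sims = (A[i] @ A[i + 1:].T).toarray().ravel()
    for offset in np.flatnonzero(sims >= threshold):
        j = i + 1 + offset
        print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} (cosine={sims[offset]:.2f})")
```

In a real implementation one would of course process row blocks rather than single rows to keep the sparse matrix multiplications efficient, but the triangle-only structure stays the same.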
Maybe you consider this too similar to issue #24, in which case you can just delete it or flag it accordingly. Imho it isn't exactly the same, as at least for the above practical use case one doesn't necessarily need to reconstruct the not-calculated lower triangle from the calculated upper triangle (or vice versa) if one uses a sufficiently high max_fits (which in the end might just turn this into a trade-off between using more memory and being faster).
trichie changed the title from "Idea: Speed-up of awesome_cosine_similarity by hopefully approx. 50% when comparing a list to itself" to "Idea: Potential speed-up of awesome_cosine_similarity when comparing certain lists to themselves" on Jun 24, 2022