Another practical use case for the algorithm: suppose you have a huge list of customer names and addresses and want to find out whether some customers accidentally got assigned more than one customer id over the years. From my practical experience with exactly such a dataset, only a low single-digit percentage of the customer numbers are indeed potential duplicates, and for each individual customer one can safely assume there are no more than 3 or 4 different numbers.
In those situations one doesn't have to search for potential matches between each element of list A and each element of a different list B, but between all elements of list A and itself. Here one could imho save roughly 50% of the scalar product computations by calculating only the upper or lower triangle of A x A^T and ignoring the main diagonal.
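To illustrate the idea (this is not the library's actual API, just a minimal numpy/scikit-learn sketch): if the rows of A are L2-normalised TF-IDF vectors, then A x A^T is the cosine similarity matrix, and one only needs the dot products of row i against rows i+1..n-1, i.e. the strict upper triangle. The names, threshold, and n-gram settings below are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy customer list; in practice this would be the huge name/address list.
names = [
    "ACME Corp., 1 Main St.",
    "Acme Corporation, 1 Main Street",
    "Globex Inc., 42 Elm Rd.",
]

# Character n-gram TF-IDF; rows are L2-normalised by default,
# so plain dot products between rows are cosine similarities.
A = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(names)

threshold = 0.8  # illustrative cut-off for "potential duplicate"

# Compute only the strict upper triangle of A @ A.T:
# row i is compared against rows i+1..n-1, skipping the diagonal,
# which saves roughly half of the scalar product computations.
for i in range(A.shape[0] - 1):
    sims = (A[i] @ A[i + 1:].T).toarray().ravel()
    for offset in np.flatnonzero(sims >= threshold):
        j = i + 1 + offset
        print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} (cosine={sims[offset]:.2f})")
```

In a real implementation one would of course process row blocks rather than single rows to keep the sparse matrix multiplications efficient, but the triangle-only structure stays the same.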
Maybe you consider this too similar to issue #24, in which case you can just delete it or flag it accordingly. Imho it isn't exactly the same, as at least for the above practical use case one doesn't necessarily need to reconstruct the not-calculated lower triangle from the calculated upper triangle (or vice versa) if one uses a sufficiently high max_fits (which in the end might just turn this into a trade-off between using more memory and being faster).
trichie changed the title from "Idea: Speed-up of awesome_cosine_similarity by hopefully approx. 50% when comparing a list to itself" to "Idea: Potential speed-up of awesome_cosine_similarity when comparing certain lists to themselves" on Jun 24, 2022