
Large datasets #5

Open
RoufaidaLaidi opened this issue Sep 15, 2021 · 1 comment

@RoufaidaLaidi

Hi, I am trying to run the code on a large dataset, but I run into a memory issue when generating the similarity matrix. Since the algorithm creates a graph node for each feature vector, the similarity matrix is n×n, where n is the number of feature vectors. How would you suggest overcoming this and running the code on a dataset with millions of samples? Thanks in advance.
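
For scale, assuming the matrix is stored as dense float32: with n = 10^6 samples it already needs about 10^6 × 10^6 × 4 B ≈ 4 TB, far beyond typical RAM.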

@spindro (Owner) commented Sep 17, 2021

Hi, this is one of the downsides of our method, and the codebase is starting to show its age. You could try building the graph from a KNN algorithm instead; many KNN implementations are heavily optimized for exactly this scenario.
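
As a rough sketch of that idea (not part of this repo; it assumes scikit-learn is available, and the `build_sparse_knn_graph` helper and parameter choices are just placeholders), you could replace the dense n×n similarity matrix with a sparse k-NN graph, so memory scales as O(n·k) instead of O(n²). For millions of samples, an approximate nearest-neighbor library such as FAISS or Annoy would scale better than exact search:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph


def build_sparse_knn_graph(features, k=10):
    """Build a sparse, symmetric k-NN adjacency matrix (CSR) from feature vectors."""
    # mode='distance' stores the actual distances in the sparse matrix;
    # memory is O(n * k) rather than O(n^2).
    adj = kneighbors_graph(features, n_neighbors=k, mode="distance",
                           include_self=False)
    # Symmetrize so the resulting graph is undirected.
    adj = adj.maximum(adj.T)
    return adj


if __name__ == "__main__":
    # Stand-in feature matrix; replace with your real feature vectors.
    X = np.random.rand(100_000, 64).astype(np.float32)
    graph = build_sparse_knn_graph(X, k=10)
    print(graph.shape, graph.nnz)  # (100000, 100000) with roughly n * k nonzeros
```

The resulting sparse adjacency matrix can then be handed to whatever graph construction step the pipeline expects; how exactly it plugs into this codebase is left to you.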
