
Large datasets #5

Open
RoufaidaLaidi opened this issue Sep 15, 2021 · 1 comment

@RoufaidaLaidi

Hi, I am trying to run the code on a large dataset, but I run into a memory issue when generating the similarity matrix. Since the algorithm creates a graph node for each feature vector, the similarity matrix is n×n, where n is the number of feature vectors. How would you suggest overcoming this and running the code on a dataset with millions of samples? Thanks in advance.
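
For scale, assuming the matrix is stored as dense float32: with n = 10^6 samples it already needs about 10^6 × 10^6 × 4 B ≈ 4 TB, far beyond typical RAM.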

@spindro (Owner) commented Sep 17, 2021

Hi, this is one of the downsides of our method, and the codebase is starting to show its age. You could try building the graph from a KNN algorithm instead; many KNN implementations are heavily optimized for exactly this scenario.
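
As a rough sketch of that idea (not part of this repo; it assumes scikit-learn is available, and the `build_sparse_knn_graph` helper and parameter choices are just placeholders), you could replace the dense n×n similarity matrix with a sparse k-NN graph, so memory scales as O(n·k) instead of O(n²). For millions of samples, an approximate nearest-neighbor library such as FAISS or Annoy would scale better than exact search:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph


def build_sparse_knn_graph(features, k=10):
    """Build a sparse, symmetric k-NN adjacency matrix (CSR) from feature vectors."""
    # mode='distance' stores the actual distances in the sparse matrix;
    # memory is O(n * k) rather than O(n^2).
    adj = kneighbors_graph(features, n_neighbors=k, mode="distance",
                           include_self=False)
    # Symmetrize so the resulting graph is undirected.
    adj = adj.maximum(adj.T)
    return adj


if __name__ == "__main__":
    # Stand-in feature matrix; replace with your real feature vectors.
    X = np.random.rand(100_000, 64).astype(np.float32)
    graph = build_sparse_knn_graph(X, k=10)
    print(graph.shape, graph.nnz)  # (100000, 100000) with roughly n * k nonzeros
```

The resulting sparse adjacency matrix can then be handed to whatever graph construction step the pipeline expects; how exactly it plugs into this codebase is left to you.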
