About Ward Linkage #2

Open
shawn0wang opened this issue Oct 23, 2023 · 3 comments

Comments

@shawn0wang

I'm opening this issue to ask when Ward linkage will be ready ^-^. I use Ward linkage often because it tends to give better results than the other linkage methods.

By the way, when I choose cosine distance, I'm not sure whether a larger or a smaller max_merge_distance gives better results.
I thought that the larger the cosine distance, the more similar the two sentences are, so setting max_merge_distance higher should produce more clusters, but that is not what happens.

Thanks

@porterehunley
Owner

Ward linkage is our first priority right now. It is a bit more complicated because we need to finish a large refactor before we can add more linkage functions. It will most certainly be ready this week.

By the way, when I choose cosine distance, I'm not sure whether a larger or a smaller max_merge_distance gives better results.
I thought that the larger the cosine distance, the more similar the two sentences are, so setting max_merge_distance higher should produce more clusters, but that is not what happens.

Raising max_merge_distance will decrease the number of clusters, because you are loosening the criterion for a merge, so more merges take place. This is the same as the distance threshold (distance_threshold) in agglomerative clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
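
To make the effect of the threshold concrete, here is a minimal pure-Python sketch. It is not this repository's implementation: the `cluster` function, its average-linkage rule, and the sample points are all hypothetical and chosen only for illustration. It assumes that merging stops once the closest pair of clusters is farther apart than max_merge_distance, which is how a distance threshold usually behaves in agglomerative clustering. Note also that cosine distance is 1 minus cosine similarity, so a larger cosine distance means two vectors are less similar, not more.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster(points, max_merge_distance):
    """Toy average-linkage agglomerative clustering (illustrative only).

    Repeatedly merges the two closest clusters and stops as soon as the
    closest pair is farther apart than max_merge_distance.
    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def average_distance(a, b):
        total = sum(cosine_distance(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        dist, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > max_merge_distance:
            break  # No remaining pair is close enough to merge.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D vectors standing in for sentence embeddings.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]

for threshold in (0.001, 0.01, 1.0, 2.0):
    print(threshold, len(cluster(points, threshold)))
```

With these sample points the cluster count falls from 5 to 3 to 2 to 1 as the threshold grows: a larger max_merge_distance permits more merges and therefore leaves fewer clusters.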

@shawn0wang
Author

shawn0wang commented Oct 24, 2023

I'm looking forward to Ward linkage and the other features.
Thank you for providing this repository; it is very helpful to me!

@shawn0wang
Author

shawn0wang commented Oct 24, 2023

Another question: when I have many CPU cores, should batch_size be set larger or smaller to make the computation faster?

For example, with 100 CPU cores and 200,000 data points, what batch_size would run fastest?
