Skip to content

Latest commit

 

History

History
43 lines (19 loc) · 4.53 KB

README.md

File metadata and controls

43 lines (19 loc) · 4.53 KB

clusteringAlgos

Implementations & testing of K-means, K-medoids, Divisive and Agglomerative clustering algorithms.

K-means

This algorithm is that the points are assigned to the "averages" of the clusters, which are calculated from the average sum of the coordinates of all the points in the cluster. These centers are recalculated until the iteration takes place without changing the assignment of any point from our set of points, which means that the solution converged to the local optimum.

K-medoids

The algorithm is similar to K-means, except that the points are not assigned to the "average" of the clusters but to their most central point. At the beginning, the centers are also randomly initialized and assigned points. However, then the recalculation of the new potential centers is performed, and the point with the smallest distance from all the others in the cluster is selected as the new center. This process is also repeated until the iteration occurs without a new assignment.

Agglomerative clustering

Agglomerative clustering is a type of hierarchical clustering that goes from the bottom up. At the beginning of the algorithm, each point is a separate cluster, and each iteration of the algorithm merges the two nearest clusters. The distance between the two clusters can be determined in different ways, I chose Single-linkage, which means that the distance between the two clusters determines the smallest distance between their points.

Divisive clustering

Divisional clustering works on the opposite principle as agglomerative. This approach initially treats all points as a single cluster, which is then divided into smaller parts. By using the K-means algorithm for cluster division, an efficient solution is achieved. I always share the cluster with the most points.

Comparison

The graphs below show the trend of decreasing / increasing time intensity with the change of the parameter k. Each algorithm was run at 20,000 points, except for the agglomerative, which was run at only 1,000 points. We see that in the agglomerative clustering, parameter k does not play a role at all. This is because the algorithm iterates n minus k times, so there is only a small difference. However, with K-Means and Division, the trend is increasing, which is again understandable, because the algorithms have to iterate several times, but this time the parameter has more weight. With the K-Medoid algorithm, the trend is declining, because when there are few large clusters, it has to swap many times before the solution converges.

By far the best algorithm is K-Means. It is fast and gives relatively good splits. The best distribution was given by the Agglomerative approach, but at a huge time cost, so it cannot be considered effective. Divisional clustering was also nice, but because it always divides clusters, and can not exchange points between clusters does not search a large enough state space. At least not always. K-medoids was the worst, lasting too long for the results.