This project contains a collection of labeled clustering problems that can be found in the literature. Most of the datasets were artificially created.
All datasets can be found in the data folder.

data points | clusters | dimension |
---|---|---|
2990 | 10 | 2 |

J. Handl and J. Knowles, “Multiobjective clustering with automatic determination of the number of clusters,” UMIST, Tech. Rep., 2004.

data points | clusters | dimension |
---|---|---|
788 | 7 | 2 |

Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.

data points | clusters | dimension |
---|---|---|
1000 | 2 | 3 |

Alfred Ultsch, Clustering with SOM: U*C, in Proc. Workshop on Self Organizing Feature Maps, pp. 31-37, Paris, 2005.

data points | clusters | dimension |
---|---|---|
3100 | 31 | 2 |

* ARFF

Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 2002. 24(9): p. 1273-1280.

data points | clusters | dimension |
---|---|---|
577 | 3 | 2 |

M. C. Su, C. H. Chou, and C. C. Hsieh, “Fuzzy C-Means Algorithm with a Point Symmetry Distance,” International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.

data points | clusters | dimension |
---|---|---|
8000 | 7 | 2 |

G. Karypis, “CLUTO: A Clustering Toolkit,” Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto.
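
The three columns reported for each dataset above (data points, clusters, dimension) can be recomputed from a data file. The shell function below is a minimal sketch of such a check, not part of this project. It assumes the file is in ARFF format, that values are comma-separated, and that the last attribute holds the cluster label; `arff_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<data points> <clusters> <dimension>" for an ARFF file.
# Assumptions: comma-separated values, class label stored in the last attribute.
arff_summary() {
  awk -F',' '
    { sub(/\r$/, "") }                              # tolerate Windows line endings
    /^[ \t]*$/ || /^[ \t]*%/ { next }               # skip blank lines and % comments
    tolower($0) ~ /^@attribute/ { attrs++; next }   # count declared attributes
    tolower($0) ~ /^@data/ { in_data = 1; next }    # data rows start after @data
    !in_data { next }                               # ignore other header lines
    { points++; labels[$NF] = 1 }                   # one row = one point; last field = label
    END {
      clusters = 0
      for (l in labels) clusters++
      # dimension excludes the label attribute
      printf "%d %d %d\n", points, clusters, attrs - 1
    }
  ' "$1"
}
```

Calling `arff_summary path/to/file.arff` prints the three numbers in the same order as the table columns, so they can be compared against the values listed for that dataset.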
This project contains a set of benchmarks of clustering methods on various datasets. The project depends on the [Clueminer project](https://github.com/deric/clueminer).
To run a benchmark, first compile the dependencies into a single JAR file:

    mvn assembly:assembly

The consensus command runs the same algorithm repeatedly:

    ./run consensus --dataset "triangle1" --repeat 10

By default, the k-means algorithm is used.
For available datasets, see the [resources folder](https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial).
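
To repeat the consensus run over several datasets, the command shown above can be wrapped in a loop. This is only a sketch: `run_consensus_batch`, `consensus_cmd`, and the `DRY_RUN` switch are hypothetical helpers introduced here, and the only options used are `--dataset` and `--repeat`, exactly as in the example above. Dataset names should be taken from the resources folder.

```shell
# Hypothetical helper: print the consensus command line for one dataset.
consensus_cmd() {
  printf './run consensus --dataset "%s" --repeat %s\n' "$1" "$2"
}

# Hypothetical helper: run (or, with DRY_RUN=1, only print) the consensus
# benchmark for each dataset name given after the repeat count.
run_consensus_batch() {
  repeat=$1
  shift
  for dataset in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      consensus_cmd "$dataset" "$repeat"
    else
      ./run consensus --dataset "$dataset" --repeat "$repeat"
    fi
  done
}
```

For example, `DRY_RUN=1 run_consensus_batch 10 triangle1` prints the command without executing it, which is a cheap way to check the arguments before starting a long run.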