
Clustering datasets

Datasets

This project contains a collection of labeled clustering problems found in the literature. Most of the datasets were created artificially.

All datasets can be found in the data folder.
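As a rough illustration of how one of these labeled datasets might be read, here is a minimal Python sketch. It assumes the files are dense ARFF files (an ARFF link is listed for D31) and that the last attribute holds the cluster label. Both points are assumptions, and `read_arff`, `split_points_labels`, and the sample content are hypothetical, not part of this project.

```python
import io

def read_arff(stream):
    """Parse a dense ARFF stream into (attribute_names, rows of strings)."""
    names, rows, in_data = [], [], False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True
    return names, rows

def split_points_labels(rows):
    """Assumption: the last column is the cluster label, the rest are numeric."""
    points = [[float(value) for value in row[:-1]] for row in rows]
    labels = [row[-1] for row in rows]
    return points, labels

# Hypothetical sample in ARFF syntax, not taken from the data folder.
SAMPLE = """\
% tiny made-up example
@relation demo
@attribute x numeric
@attribute y numeric
@attribute class {0,1}
@data
1.0,2.0,0
1.5,1.8,0
8.0,9.0,1
"""

names, rows = read_arff(io.StringIO(SAMPLE))
points, labels = split_points_labels(rows)
print(names, len(points), sorted(set(labels)))
```

The sketch ignores quoted attribute names and sparse ARFF rows; a full ARFF reader would be the safer choice for anything beyond a quick look at the data.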

2d-10c

|===
| data points | clusters | dimensions

| 2990 | 10 | 2
|===

2d-10c
J. Handl and J. Knowles, “Multiobjective clustering with automatic determination of the number of clusters,” UMIST, Tech. Rep., 2004.

atom

|===
| data points | clusters | dimensions

| 800 | 2 | 3
|===

atom

aggregation

|===
| data points | clusters | dimensions

| 788 | 7 | 2
|===

aggregation

Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.

chainlink

|===
| data points | clusters | dimensions

| 1000 | 2 | 3
|===

chainlink

Alfred Ultsch, Clustering with SOM: U*C, in Proc. Workshop on Self Organizing Feature Maps, pp. 31-37, Paris, 2005.

D31

|===
| data points | clusters | dimensions

| 3100 | 31 | 2
|===

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/D31.png["D31",400,float="left"]

* ARFF

Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence, 2002. 24(9): p. 1273-1280.

3MC

|===
| data points | clusters | dimensions

| 400 | 3 | 2
|===

3MC

DS577

|===
| data points | clusters | dimensions

| 577 | 3 | 2
|===

DS577

M. C. Su, C. H. Chou, and C. C. Hsieh, “Fuzzy C-Means Algorithm with a Point Symmetry Distance,” International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.

cluto-t4_8k

|===
| data points | clusters | dimensions

| 8000 | 7 | 2
|===

cluto-t4_8k
G. Karypis, “CLUTO A Clustering Toolkit,” Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto.

Experiments

This project contains a set of benchmarks of clustering methods on various datasets. The project depends on the [Clueminer project](https://github.com/deric/clueminer).

To run a benchmark, first compile the dependencies into a single JAR file:

mvn assembly:assembly

Consensus experiment

The consensus experiment runs the same algorithm repeatedly on one dataset:

./run consensus --dataset "triangle1" --repeat 10

By default, the k-means algorithm is used.
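To make the idea of repeated runs concrete, the following standalone Python sketch runs a plain k-means (Lloyd's algorithm) several times with different random initializations and records the sum of squared errors of each run. It is only an illustration under stated assumptions: it is not Clueminer's implementation, and `kmeans`, `repeat_kmeans`, the seeding scheme, and the sample points are all hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iterations=100):
    """Plain Lloyd's algorithm; returns (assignment, centroids)."""
    # Initial centroids are k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def repeat_kmeans(points, k, repeat, seed=0):
    """Run k-means `repeat` times, each with its own random generator."""
    results = []
    for run in range(repeat):
        rng = random.Random(seed + run)
        assignment, centroids = kmeans(points, k, rng)
        sse = sum(squared_distance(p, centroids[a])
                  for p, a in zip(points, assignment))
        results.append((assignment, sse))
    return results

# Made-up toy data: two small groups of 2D points.
points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
          [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]
runs = repeat_kmeans(points, k=2, repeat=10)
for i, (assignment, sse) in enumerate(runs):
    print(i, assignment, round(sse, 3))
```

Because k-means depends on its random initialization, individual runs can differ in cluster numbering and, on harder datasets, in the partition itself, which is why comparing repeated runs is informative.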