Clustering datasets

Datasets

This project contains a collection of labeled clustering problems that can be found in the literature. Most of the datasets were artificially created.

All datasets can be found in the data folder.
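The datasets listed below are distributed as ARFF files. As a rough illustration of how such a file could be read, here is a minimal sketch of a dense ARFF parser in Python using only the standard library. The function name `read_arff`, the inline sample data, and the assumption that the cluster label is the last attribute are all hypothetical and not taken from this project; check the header of each file before relying on them.

```python
import io

def read_arff(lines):
    """Parse a dense ARFF stream into (attribute_names, rows).

    Simplified sketch: does not handle quoted attribute names
    containing spaces, sparse rows, or quoted values with commas.
    """
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        rows.append([field.strip() for field in line.split(",")])
    return names, rows

# Tiny inline sample, made up for illustration (not a dataset from this project).
sample = io.StringIO("""\
% toy example
@relation toy
@attribute x numeric
@attribute y numeric
@attribute class {0,1}
@data
0.1,0.2,0
0.9,0.8,1
""")

names, rows = read_arff(sample)
# Assumption: the last attribute holds the cluster label.
points = [[float(v) for v in row[:-1]] for row in rows]
labels = [row[-1] for row in rows]
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `read_arff(open(path))` with `path` pointing at one of the ARFF files.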

2d-10c

data points	clusters	dimension
2990	10	2

ARFF
generator

J. Handl and J. Knowles, “Multiobjective clustering with automatic determination of the number of clusters,” UMIST, Tech. Rep., 2004.

atom

data points	clusters	dimension
800	2	3

source: FCPS
ARFF

aggregation

data points	clusters	dimension
788	7	2

ARFF
original source

Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.

chainlink

data points	clusters	dimension
1000	2	3

source: FCPS
ARFF

Alfred Ultsch, Clustering with SOM: U*C, in Proc. Workshop on Self Organizing Feature Maps, pp. 31-37, Paris, 2005.

D31

data points	clusters	dimension
3100	31	2

image::https://github.com/deric/clustering-benchmark/blob/images/fig/artificial/D31.png["D31",400,float="left"]

ARFF

Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 2002. 24(9): p. 1273-1280.

3MC

data points	clusters	dimension
400	3	2

DS577

data points	clusters	dimension
577	3	2

ARFF

C. Su, C. H. Chou, and C. C. Hsieh, “Fuzzy C-Means Algorithm with a Point Symmetry Distance,” International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.

cluto-t4_8k

data points	clusters	dimension
8000	7	2

ARFF

G. Karypis, “CLUTO A Clustering Toolkit,” Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto.

Experiments

This project contains a set of benchmarks of clustering methods on various datasets. The project depends on the [Clueminer project](https://github.com/deric/clueminer).

To run the benchmarks, compile the dependencies into a single JAR file:

mvn assembly:assembly

Consensus experiment

The consensus experiment runs the same algorithm repeatedly:

./run consensus --dataset "triangle1" --repeat 10

By default, the k-means algorithm is used.

For available datasets see [resources folder](https://github.com/deric/clustering-benchmark/tree/master/src/main/resources/datasets/artificial).
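To make the idea of repeated runs concrete, the following is a minimal Python sketch of running plain k-means several times with different random seeds on the same data. It is not this project's implementation (which builds on Clueminer and Java). The function names `kmeans` and `repeated_runs`, the toy points, and the choice of using the run index as the seed are assumptions made purely for illustration. The `repeat` parameter plays the role of the `--repeat` option shown above.

```python
import random

def kmeans(points, k, seed, iterations=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def repeated_runs(points, k, repeat):
    """Run k-means `repeat` times, using the run index as the random seed."""
    return [kmeans(points, k, seed) for seed in range(repeat)]

# Toy data, made up for illustration: two small groups of 2D points.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
runs = repeated_runs(points, k=2, repeat=10)
```

Each entry of `runs` is one labeling of the points. Because cluster indices are arbitrary, two runs can describe the same partition with swapped labels, so comparing runs requires a label-invariant measure rather than direct equality.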
