I've got a k-means implementation, and I'm wondering how I can measure its effectiveness on my data set.
The details:
My sample data comes from a 12- to 96-dimensional space, with about 1,000 to 100,000 samples per run. The data is not well separated, and outliers can occur.
I use the following initialization: I run k-means N times on different subsamples, then cluster the resulting k*N centers with single linkage. I use the means of these clusters as my initial seeds. (N is about 10, and each subsample is the size of my original sample set divided by N. I use a different subset every time.) This is the initialization method suggested by Arai and Barakbah: https://portal.dl.saga-u.ac.jp:8443/bitstream/123456789/54922/1/ZR00005460.pdf with the modification that I don't use the whole sample set for the initialization. (If you have another recommendation, please let me know. I use this one because of its low memory consumption and because it gives me good control over the calculation time.)
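To make the initialization concrete, here is a rough Python sketch of what I described. It is not my actual code: `lloyd_kmeans` and `subsample_seeds` are made-up names, the plain Lloyd loop is just a stand-in for my own k-means, and splitting the data into N disjoint random parts is an assumption about how the "different subsets" are drawn. It uses NumPy and SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist


def lloyd_kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (stand-in for the real implementation)."""
    # Start from k distinct points picked at random from X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = cdist(X, centers, "sqeuclidean").argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers


def subsample_seeds(X, k, n_runs=10, seed=0):
    """Seeds from k-means on n_runs subsamples, merged by single linkage."""
    rng = np.random.default_rng(seed)
    # Assumption: the subsamples are disjoint, each about len(X) / n_runs.
    parts = np.array_split(rng.permutation(len(X)), n_runs)
    # k centers per run, stacked into a (k * n_runs, d) array.
    candidates = np.vstack([lloyd_kmeans(X[idx], k, rng) for idx in parts])
    # Single-linkage clustering of the candidate centers, cut into
    # at most k groups (fcluster may return fewer if distances tie).
    Z = linkage(candidates, method="single")
    groups = fcluster(Z, t=k, criterion="maxclust")
    # The mean of each group becomes one initial seed.
    return np.vstack(
        [candidates[groups == g].mean(axis=0) for g in np.unique(groups)]
    )
```

Calling `subsample_seeds(X, k)` returns an array with up to k rows, which would then be used as the starting centers for the final k-means run on the full data. Only one subsample is clustered at a time, which is where the low memory use comes from.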