I'm working on an implementation of k-means++. For the clusters that end up with about 30 datapoints assigned to them, it seems to work pretty well. But with k = 100 I consistently get one or two clusters of about 700 datapoints each (out of ~2000!), which turn out to be pretty useless. I also get a bunch of clusters of only one or two datapoints, which look pretty similar to other clusters of that size and should be grouped together.
Are there methods to ensure that clusters come out roughly the same size as each other? I thought it was a problem with my cluster initialization, but I tried random datapoints as centers, then furthest-distance initialization, and now k-means++, and all of these methods give me the same problem with relative cluster size.
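For concreteness, here is a minimal sketch of the kind of setup described above: k-means++ seeding followed by a count of how many points land on each center. It is not the actual implementation in question. It assumes Euclidean distance and points given as tuples of floats, it only does the seeding and a single assignment pass (no Lloyd iterations), and the names `sq_dist`, `kmeanspp_seeds`, and `cluster_sizes` are made up for illustration.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Pick k initial centers with k-means++ (D^2-weighted) sampling."""
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Sample index i with probability d2[i] / total.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # Update nearest-center distances with the newly added center.
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


def cluster_sizes(points, centers):
    """Assign each point to its nearest center and count points per center."""
    labels = [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]
    counts = Counter(labels)
    return [counts.get(j, 0) for j in range(len(centers))]


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): 2000 uniform 2-D points, k = 100.
    rng = random.Random(0)
    data = [(rng.random(), rng.random()) for _ in range(2000)]
    sizes = cluster_sizes(data, kmeanspp_seeds(data, 100, rng))
    print(sorted(sizes, reverse=True)[:5], sorted(sizes)[:5])
```

Printing the largest and smallest entries of `cluster_sizes` is a quick way to see the imbalance I'm talking about: a few huge clusters at one end and a tail of one- or two-point clusters at the other.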