Hi r/ML,
First, I'm not sure if this subreddit accepts questions like this. Please redirect me if it's better posted elsewhere.
The presenting problem I have is that running k-means clustering on my data set puts almost everything into a single cluster. I've tried varying the number of clusters, but it keeps happening. I'd like to understand why, and what my options are for getting a more even, or at least more informative, distribution across clusters.
Some background: I've gathered a list of friends and their likes from Facebook, making a sparse matrix like this:
```python
data = {
    'friend1': ['like1', 'like2', 'like3'],
    'friend2': ['like4'],
    'friend3': ['like1', 'like4'],
}
```
There are about 100 friends, each with between 1 and 300 likes. I create k centroids, give each a random set of likes, and use the count of shared likes for my distance function (dist = 1 - total_shared_likes / 1000).
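To make that concrete, here's roughly what I mean by the shared-likes distance (function and variable names are just illustrative, and 1000 is the arbitrary normaliser from above):

```python
def shared_likes_distance(friend_likes, centroid_likes):
    """Distance based on overlap of likes: more shared likes = closer.

    Both arguments are collections of like IDs. Dividing by 1000 keeps
    the distance roughly in [0, 1] for my data sizes.
    """
    total_shared = len(set(friend_likes) & set(centroid_likes))
    return 1 - total_shared / 1000
```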
Each iteration, I calculate the distance from each friend to each centroid and assign them to the nearest; almost all of them end up at the same centroid. To move a centroid to the average position of its members, I total all the members' and the centroid's likes, then divide by the number of members + 1 (the +1 accounting for the centroid's own likes). Each user's likes count as 1, but the centroid's likes are floats.
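The assignment step I'm describing is basically this (a minimal sketch, assuming `friends` maps names to like collections and `distance` is any callable like the one above):

```python
def assign_to_nearest(friends, centroids, distance):
    """Assign each friend to its nearest centroid.

    Returns {centroid_index: [friend_name, ...]}; empty clusters keep
    an empty list, which is how I detect them later.
    """
    clusters = {i: [] for i in range(len(centroids))}
    for name, likes in friends.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(likes, centroids[i]))
        clusters[nearest].append(name)
    return clusters
```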
Any thoughts on why this is happening? Am I missing something or doing something wrong? Or is this data set just not amenable to k-means, and should I try a different algorithm?
Thanks!
edit: Still getting all friends in the same cluster despite
(i) creating a Euclidean distance function
```python
def distance(self, other):
    dist = 0.0
    for i, l in enumerate(self.likes):
        if l != other.likes[i]:
            dist += 1
    return math.sqrt(dist)
```
The likes variables hold tuples where each item is 1 or 0 (like or dislike), and all users' tuples have the same length (the number of all possible items to like). I add 1 whenever they differ, since that's the only possible squared difference (1 squared, of course).
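As a sanity check on (i): for 0/1 vectors the squared differences are all 0 or 1, so this Euclidean distance should just be the square root of the Hamming distance (the count of differing positions). A small demonstration, with illustrative names:

```python
import math

def euclidean_binary(a, b):
    """Euclidean distance between two equal-length 0/1 tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions where the two tuples differ."""
    return sum(x != y for x, y in zip(a, b))
```

So `euclidean_binary(u, v)` and `math.sqrt(hamming(u, v))` should always agree on this data, which means the cluster ordering can't change between the two.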
(ii)
On initialisation, assigning a random friend's likes to the first centroid, then for each remaining centroid (up to k) assigning the likes of the friend farthest from all previous centroids.
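In other words, something like this farthest-first traversal (a sketch assuming `points` is a list of like vectors and `distance` is the function from (i)):

```python
import random

def farthest_first_init(points, k, distance):
    """Pick k initial centroids: the first at random, each subsequent
    one being the point farthest from all centroids chosen so far."""
    centroids = [random.choice(points)]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids
```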
(iii)
On each iteration, if a centroid has no members, it's assigned the likes of the friend furthest from all centroids.
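Concretely, for (iii) I'm doing roughly this after each assignment pass (names illustrative, same `distance` as before):

```python
def reseed_empty(centroids, clusters, points, distance):
    """Move each empty cluster's centroid onto the point farthest
    from every current centroid."""
    for i, members in clusters.items():
        if not members:
            centroids[i] = max(
                points,
                key=lambda p: min(distance(p, c) for c in centroids))
    return centroids
```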
(iv)
No longer including the centroid's own likes when calculating the average position of its members.
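So the update in (iv) is now the plain mean of the members' 0/1 vectors, with no centroid self-vote. Roughly:

```python
def update_centroid(member_vectors):
    """Mean of the members' like vectors: each component becomes the
    fraction of members who have that like (a float in [0, 1])."""
    n = len(member_vectors)
    length = len(member_vectors[0])
    return tuple(sum(v[i] for v in member_vectors) / n
                 for i in range(length))
```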
I've put my code on Pastebin in case I've overlooked something: http://pastebin.com/ZFvFfFe3
Here's some example output:
```
--- iteration 4
{'cluster 0': {'dist': 1256.4364795929926, 'members': 174},
 'cluster 1': {'dist': 0.0, 'members': 0},
 'cluster 2': {'dist': 0.0, 'members': 0},
 'cluster 3': {'dist': 0.0, 'members': 0},
 'cluster 4': {'dist': 0.0, 'members': 0}}
--- iteration 5
{'cluster 0': {'dist': 0.0, 'members': 0},
 'cluster 1': {'dist': 1256.4364795929926, 'members': 174},
 'cluster 2': {'dist': 0.0, 'members': 0},
 'cluster 3': {'dist': 0.0, 'members': 0},
 'cluster 4': {'dist': 0.0, 'members': 0}}
```
Thanks for all of your input, I'm very grateful. Please keep posting feedback, I'm set on solving this!
edit 2:
Fixed the distance function:
```python
def distance(self, other):
    total_dist = 0.0
    for i, l in enumerate(self.likes):
        if l != other.likes[i]:
            total_dist += (l - other.likes[i]) ** 2
    return math.sqrt(total_dist / len(self.likes))
```
And fixed the error in the centroid update calculation (line 126):
```python
new_likes[i] += user_vote
```
The Pastebin code is updated; it's still behaving as before.