Hi r/ML,
First, I'm not sure if this subreddit accepts questions like this. Please redirect me if it's better posted elsewhere.
The presenting problem I have is that running k-means clustering on my data set puts almost everything into a single cluster. I've tried varying the number of clusters, but it keeps happening. I'd like to understand why, and what my options are for getting a more even, or at least more informative, distribution across clusters.
Some background: I've gathered a list of friends and their likes from Facebook, making a sparse matrix like this:
```python
data = {
    'friend1': ['like1', 'like2', 'like3'],
    'friend2': ['like4'],
    'friend3': ['like1', 'like4'],
}
```
There are about 100 friends, each with between 1 and 300 likes. I create k centroids, give each a random set of likes, and use the count of shared likes for my distance function (dist = 1 - total_shared_likes / 1000).
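To make that concrete, here's roughly what I mean by the shared-likes distance (function and variable names are just illustrative, and 1000 is the arbitrary normaliser from above):

```python
def shared_likes_distance(friend_likes, centroid_likes):
    """Distance based on overlap of likes: more shared likes = closer.

    Both arguments are collections of like IDs. Dividing by 1000 keeps
    the distance roughly in [0, 1] for my data sizes.
    """
    total_shared = len(set(friend_likes) & set(centroid_likes))
    return 1 - total_shared / 1000
```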
Each iteration, I calculate the distance from each friend to each centroid and assign them to the nearest; almost all of them end up at the same centroid. To move a centroid to the average position of its members, I total all the members' and the centroid's likes, then divide by the number of members + 1 (the +1 accounting for the centroid's own likes). Each user's likes count as 1, but the centroid's likes are floats.
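The assignment step I'm describing is basically this (a minimal sketch, assuming `friends` maps names to like collections and `distance` is any callable like the one above):

```python
def assign_to_nearest(friends, centroids, distance):
    """Assign each friend to its nearest centroid.

    Returns {centroid_index: [friend_name, ...]}; empty clusters keep
    an empty list, which is how I detect them later.
    """
    clusters = {i: [] for i in range(len(centroids))}
    for name, likes in friends.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(likes, centroids[i]))
        clusters[nearest].append(name)
    return clusters
```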
Any thoughts on why this is happening? Am I missing something or doing something wrong? Or is this data set just not amenable to k-means, and should I try a different algorithm?
Thanks!
edit: Still getting all friends in the same cluster despite
(i) creating a Euclidean distance function
```python
def distance(self, other):
    dist = 0.0
    for i, l in enumerate(self.likes):
        if l != other.likes[i]:
            dist += 1
    return math.sqrt(dist)
```
The likes variables hold tuples where each item is 1 or 0 (like or dislike), and all users' tuples have the same length (the number of all possible items to like). I add 1 whenever they differ, since that's the only possible squared difference (1 squared, of course).
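As a sanity check on (i): for 0/1 vectors the squared differences are all 0 or 1, so this Euclidean distance should just be the square root of the Hamming distance (the count of differing positions). A small demonstration, with illustrative names:

```python
import math

def euclidean_binary(a, b):
    """Euclidean distance between two equal-length 0/1 tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions where the two tuples differ."""
    return sum(x != y for x, y in zip(a, b))
```

So `euclidean_binary(u, v)` and `math.sqrt(hamming(u, v))` should always agree on this data, which means the cluster ordering can't change between the two.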
(ii)
On initialisation, assigning a random friend's likes to the first centroid, then for each remaining centroid (up to k) assigning the likes of the friend farthest from all previous centroids.
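In other words, something like this farthest-first traversal (a sketch assuming `points` is a list of like vectors and `distance` is the function from (i)):

```python
import random

def farthest_first_init(points, k, distance):
    """Pick k initial centroids: the first at random, each subsequent
    one being the point farthest from all centroids chosen so far."""
    centroids = [random.choice(points)]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids
```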
(iii)
On each iteration, if a centroid has no members, it's assigned the likes of the friend furthest from all centroids.
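Concretely, for (iii) I'm doing roughly this after each assignment pass (names illustrative, same `distance` as before):

```python
def reseed_empty(centroids, clusters, points, distance):
    """Move each empty cluster's centroid onto the point farthest
    from every current centroid."""
    for i, members in clusters.items():
        if not members:
            centroids[i] = max(
                points,
                key=lambda p: min(distance(p, c) for c in centroids))
    return centroids
```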
(iv)
No longer including the centroid's own likes when calculating the average position of its members.
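So the update in (iv) is now the plain mean of the members' 0/1 vectors, with no centroid self-vote. Roughly:

```python
def update_centroid(member_vectors):
    """Mean of the members' like vectors: each component becomes the
    fraction of members who have that like (a float in [0, 1])."""
    n = len(member_vectors)
    length = len(member_vectors[0])
    return tuple(sum(v[i] for v in member_vectors) / n
                 for i in range(length))
```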
I've put my code on Pastebin in case I've overlooked something: http://pastebin.com/ZFvFfFe3
Here's some example output:
```
--- iteration 4
{'cluster 0': {'dist': 1256.4364795929926, 'members': 174},
 'cluster 1': {'dist': 0.0, 'members': 0},
 'cluster 2': {'dist': 0.0, 'members': 0},
 'cluster 3': {'dist': 0.0, 'members': 0},
 'cluster 4': {'dist': 0.0, 'members': 0}}
--- iteration 5
{'cluster 0': {'dist': 0.0, 'members': 0},
 'cluster 1': {'dist': 1256.4364795929926, 'members': 174},
 'cluster 2': {'dist': 0.0, 'members': 0},
 'cluster 3': {'dist': 0.0, 'members': 0},
 'cluster 4': {'dist': 0.0, 'members': 0}}
```
Thanks for all of your input, I'm very grateful. Please keep posting feedback, I'm set on solving this!
edit 2:
Fixed the distance function:
```python
def distance(self, other):
    total_dist = 0.0
    for i, l in enumerate(self.likes):
        if l != other.likes[i]:
            total_dist += (l - other.likes[i]) ** 2
    return math.sqrt(total_dist / len(self.likes))
```
And fixed the error in the centroid update calculation (line 126):
```python
new_likes[i] += user_vote
```
The Pastebin code is updated; it's still behaving as before.