
Need help with a clustering algorithm


Hello, fellow redditors! Sorry if this isn't the place to post this, but I'm not sure where else to get help.

I'm an undergraduate student from Brazil doing research with clustering algorithms. I have an instructor, and my job was simply to implement a clustering algorithm in her framework (which she developed for her PhD). The problem is that it never returns the correct results, even on its own sample dataset. After a lot of time debugging the code, I found there is no bug... the thing is, it's not returning the correct results because the idea behind the algorithm looks strange, and wrong, to me, and it returns just one cluster (out of as many as you want).

The algorithm is a variation of the CLIQUE algorithm. It's a subspace, density-based algorithm that uses grids and works by exploiting something called "dimension monotonicity". Basically, it divides the dataset into a grid and then tries to find the cells of the grid that contain more objects than a certain threshold. It marks these cells as dense, then combines them to form higher-dimensional candidate clusters (candidate because they may not form a cluster if they are not dense as well), and repeats the process. To combine the clusters, the CLIQUE algorithm uses the monotonicity lemma:

Lemma: If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of this space.

Proof: A k-dimensional cluster C includes the points that fall inside a union of k-dimensional dense units. Since the units are dense, the selectivity (the fraction of total data points contained in the unit) of each one is at least r (the threshold). All the projections of any unit u in C have at least as large a selectivity, because they include all points inside u, and therefore are also dense. Since the units of the cluster are connected, their projections are also connected. It follows that the projections of the points in C lie in the same cluster in any (k-1)-dimensional projection. QED.
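To make the bottom-up step concrete, here's a minimal Python sketch of CLIQUE-style candidate generation with a single fixed selectivity threshold tau, which is the setting in which the lemma holds (the function and variable names are mine, not from any paper):

    import itertools
    import numpy as np

    def dense_units_1d(X, bin_size, tau):
        # Per dimension, return the set of 1-D grid cells whose
        # selectivity (fraction of all points) is at least tau.
        n, d = X.shape
        dense = {}
        for dim in range(d):
            cells, counts = np.unique(
                np.floor(X[:, dim] / bin_size).astype(int),
                return_counts=True)
            dense[dim] = {c for c, cnt in zip(cells, counts) if cnt >= tau * n}
        return dense

    def candidate_units_2d(dense):
        # Combine dense 1-D units into 2-D candidates; by the
        # monotonicity lemma every dense 2-D unit is among them.
        cands = []
        for d1, d2 in itertools.combinations(sorted(dense), 2):
            cands += [((d1, c1), (d2, c2))
                      for c1 in dense[d1] for c2 in dense[d2]]
        return cands

Monotonicity is what lets the second function restrict itself to combinations of already-dense units instead of enumerating every cell of the full grid.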

I can agree with that lemma! The problem is that it assumes a fixed threshold (an input parameter). The algorithm I'm working with doesn't use a fixed threshold; instead, it calculates a different threshold for each dimension, and then, for each candidate cluster created from the lower dimensions, it checks whether it's dense with respect to the threshold of each dimension in the cluster (i.e. it checks whether the candidate cluster's selectivity is higher than the threshold for each dimension that forms the cluster). The threshold formula is this (the paper doesn't prove it works, it just points out that it should work because of the lemma above, but I don't think the lemma applies because of the variable threshold):

Threshold = N*a*1.5/D 

Here 'N' is the number of objects in the dataset, 'a' the bin size, 'D' the size of the dimension, and 1.5 a constant. What it's trying to do is check whether the bin is 1.5 times denser than it would be if the dimension were uniformly distributed. It looks like a good idea. The problem is that the monotonicity lemma doesn't apply here, and the author doesn't even prove that it does. I'm going to give an example where I think it doesn't apply, i.e. where there is a dense cluster in a higher dimension that isn't found in the lower dimensions:
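In code, my reading of the per-dimension threshold is just this (names are mine):

    def threshold(n_points, bin_size, dim_size, factor=1.5):
        # expected count in a bin of width bin_size if the n_points
        # were uniform over a dimension of extent dim_size, times 1.5
        return factor * n_points * bin_size / dim_size

With the numbers from the example below, threshold(108, 1, 10) = 16.2 and threshold(108, 5, 10) = 81.0.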

Suppose I have a dataset with 2 clusters and 2 dimensions, both dimensions with size 10. The first cluster has 100 points uniformly distributed over 0~5 on the first dimension and 0~10 on the second. The second cluster has 8 objects, from 6~7 on dimension 1 and from 0~4 on dimension 2. Both clusters have the same density (area 50 for 100 objects, area 4 for 8 objects), but the first cluster is so big that it "hides" the second one. The threshold for the second cluster on the first dimension is 108*1*1.5/10 = 16.2, which, as we can see, is a lot more than its 8 points. The threshold for the first cluster on the first dimension is 108*5*1.5/10 = 81, so the first cluster would be considered dense, even though its density is the same as the second one's. It looks really flawed to build a formula like this, because it neglects that a cluster can be really big along the other dimensions, and therefore really dense in some parts of its projection onto the lower dimensions (we don't want to find the biggest cluster, but the dense ones).
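Here's that counterexample checked numerically; this is only a sketch of the 1-D projection step under my reading of the threshold rule, not the framework's actual grid pipeline:

    import numpy as np

    rng = np.random.default_rng(0)
    # cluster 1: 100 points on [0,5) x [0,10) -> density 2 per unit area
    c1 = np.column_stack([rng.uniform(0, 5, 100), rng.uniform(0, 10, 100)])
    # cluster 2:   8 points on [6,7) x [0,4)  -> density 2 per unit area
    c2 = np.column_stack([rng.uniform(6, 7, 8), rng.uniform(0, 4, 8)])
    X = np.vstack([c1, c2])  # N = 108

    def threshold(n, bin_size, dim_size, factor=1.5):
        return factor * n * bin_size / dim_size

    # cluster 2 projects onto a single bin of width 1 on dimension 1
    in_bin = np.sum((X[:, 0] >= 6) & (X[:, 0] < 7))
    print(in_bin, threshold(len(X), 1, 10))   # 8 < 16.2 -> pruned

    # cluster 1 projects onto the interval [0, 5) on dimension 1
    in_span = np.sum(X[:, 0] < 5)
    print(in_span, threshold(len(X), 5, 10))  # 100 > 81.0 -> kept

Both clusters have exactly the same density, yet only the big one survives the first pass, so the small cluster can never even appear as a 2-D candidate.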

I think that is the problem, because the only cluster it finds is the biggest one, while all the clusters have the same density. But this was someone's PhD thesis, and it's really intelligent in its other parts, so it's strange for it to have such an error (and he ran experiments and all those things, showing good results)... could someone help me find the flaw in my logic? Or is it really just wrong?

Thanks in advance!

P.S.: Yeah, I avoided naming the algorithm because it's not well known (only a few papers cite it), and it would be bad if one of the first search results for it was something negative, but I can say it if necessary. Sorry for the poor English and the wall of text.

submitted by Sohakes