Hi all!
I am researching clustering algorithms for data that is not numeric. I am not at liberty to say exactly what data I'm clustering (it's proprietary), but I can give you an idea of the kind of data I have. Also, I am somewhat new to this area of study. Please let me know if my understanding is off and I should be trying different things.
So, pretty much all of the clustering algorithms I've come across so far use some notion of distance to find clusters. This is fine if your data is numeric and the values carry intrinsic meaning. For example, we would consider a housing price meaningful because we can compare two instances of it in a helpful way: $200k - $150k = $50k tells us that the first price is higher than the second, and by how much. A distance measure is therefore helpful in this case. This is not so for categorical data. Categorical data is data where the values of a feature all fall into a set with no natural ordering. Car makes, for example, would be considered categorical data. We would find no meaning in comparing the names of two manufacturers the way we compared the housing prices: Ford - Toyota = ??? is not very helpful. Even if we assigned numbers to the categories, the distances we would get would depend entirely on the arbitrary assignment and would hence be meaningless.
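To make the point about arbitrary numbering concrete, here is a small Python sketch. The two integer encodings and the helper names are made up for illustration. The 0/1 mismatch function is just the simplest encoding-free comparison I could think of. It is not taken from any particular paper.

```python
from itertools import combinations

makes = ["Ford", "Toyota", "Honda"]

# Two equally valid (hypothetical) ways to number the same categories.
encoding_a = {"Ford": 0, "Toyota": 1, "Honda": 2}
encoding_b = {"Ford": 2, "Toyota": 0, "Honda": 1}

def encoded_distance(x, y, encoding):
    """Absolute difference between the integer codes of two categories."""
    return abs(encoding[x] - encoding[y])

def mismatch_distance(x, y):
    """0 if the categories are identical, 1 otherwise (no encoding needed)."""
    return 0 if x == y else 1

for x, y in combinations(makes, 2):
    print(
        f"{x} vs {y}:",
        "encoding A =", encoded_distance(x, y, encoding_a),
        "| encoding B =", encoded_distance(x, y, encoding_b),
        "| mismatch =", mismatch_distance(x, y),
    )
```

Under encoding A, Ford is closer to Toyota than to Honda; under encoding B it is the other way around. The mismatch distance treats every pair of distinct makes the same, so it does not depend on how the categories happen to be numbered.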
My data is a mix of these two things. I have numeric, categorical, and boolean features in my data sets (I consider boolean data a subset of categorical data, although we might find boolean distances meaningful with proper scaling). My research so far has turned up these two promising-ish papers:
Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach
ROCK: A Robust Clustering Algorithm for Categorical Attributes
I don't know if these are quite what I'm looking for. They seem a bit hand-wavy about certain things (especially the first), but the other papers I found were much worse. Does anyone here have experience clustering this type of data? If so, could you point me in the direction of some decent papers to read up on?
Thanks in advance!
tl;dr - I need to cluster mixed categorical and numeric data. Know of any good papers?
edit: I was told you guys would be more suited to answer this question. Please let me know if somewhere else would be better. Thanks in advance.
edit2: Also, if you know of a good paper but it's behind a paywall, don't worry. Just link it, as I can get access to almost any publication through our library.
edit3: Thanks all! This was extremely helpful.