Hello all, I'm working on a data mining project that ends with me clustering bag-of-words type data.
The majority of the project so far has been pre-processing (the data is an awesome web-crawled data set of tweets from middle eastern countries during the arab spring!). I have a dictionary made of word counts, so I can assign some sort of weight to each word.
I'm getting to the point now where I need to actually cluster the data. The vectors are very sparse (each feature is a word :/ Maybe I should try something else for this? Kernel method to map it onto some subspace??) After alllll the work I've done preprocessing rough, incomplete, arabic/french/english mixtures of tweets I feel like I've got to find SOME algorithm that's more complicated than the k-means that the professor spoon fed us.
Any thoughts? If anyone knows of an algorithm that's particularly good on sparse data, I will upvote you and your family.
[link] [comment]