Channel: Machine Learning

Mixture of Gaussians with TFIDF sparse vectors


Hi guys,

I'm a complete newbie when it comes to Machine Learning (and CS in general; I've only had two semesters' worth of courses). I'm trying to write an algorithm for document classification for an internship, and I'm feeling out of my league.

Right now I've got approximately 2000 documents I need to classify, and that number is expected to grow over time. I've got tfidf weightings for each document, so right now I'm trying to write a Gaussian mixture model using the sparse vector of each document's tfidf weights (right now there are about 44000 unique words after normalization, so that's how many dimensions I've got).
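For context, here's a minimal sketch of the kind of tf-idf sparse matrix I mean (using scikit-learn's TfidfVectorizer; the toy documents are placeholders, not my real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning methods for document classification",
    "tfidf weighting of sparse document vectors",
]

vectorizer = TfidfVectorizer()       # builds the vocabulary and idf weights
X = vectorizer.fit_transform(docs)   # scipy.sparse CSR matrix, shape (n_docs, n_terms)

print(X.shape)  # one row per document, one column per unique term
```

With my real corpus the matrix would be roughly 2000 x 44000, but each row only has nonzero entries for the terms that actually appear in that document.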

Things seem to be blowing up, though--I can't reasonably compute a 44000x44000 covariance matrix for each Gaussian per iteration, so I'm just using diagonal matrices (the variance of each term). But then the variances turn out to be so small that when it's time to evaluate the exp() part of this density: http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Non-degenerate_case, multiplying by the inverse covariance makes the resulting scalar too huge to compute. Right now I'm trying to standardize my tfidf scores by subtracting the mean and dividing by the standard deviation, but that hasn't seemed to help with the size of the resulting scalar.
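To make the overflow concrete: in high dimensions the exponent in the Gaussian density is a sum of ~44000 terms, so exp() of it overflows (or underflows) almost immediately. A common workaround is to stay in log space and only exponentiate differences of log-densities via log-sum-exp. A sketch of that for a diagonal-covariance mixture (the means, variances, and weights here are made-up stand-ins, not fitted values):

```python
import numpy as np
from scipy.special import logsumexp

def diag_gaussian_logpdf(X, mean, var):
    """Log-density of a diagonal-covariance Gaussian at each row of X.

    log N(x; mu, diag(var)) =
        -0.5 * ( d*log(2*pi) + sum(log var) + sum((x - mu)**2 / var) )
    """
    d = X.shape[1]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((X - mean) ** 2 / var, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000))                 # 5 points, 1000 dimensions
means = [np.zeros(1000), np.ones(1000)]
vars_ = [np.full(1000, 0.5), np.full(1000, 2.0)]
log_weights = np.log([0.5, 0.5])

# Per-component log joint: log pi_k + log N(x; mu_k, diag(var_k)).
# These values are hugely negative, but that's fine -- we never exp() them directly.
log_joint = np.stack(
    [lw + diag_gaussian_logpdf(X, m, v)
     for lw, m, v in zip(log_weights, means, vars_)],
    axis=1,
)

# E-step responsibilities via log-sum-exp: subtracting the per-row
# normalizer in log space keeps every exponent <= 0, so np.exp is safe.
log_resp = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
resp = np.exp(log_resp)
print(resp.sum(axis=1))  # each row sums to 1
```

The other standard guard is a variance floor (e.g. `var = np.maximum(var, 1e-6)`) so near-zero per-term variances don't blow up the `1/var` terms in the first place.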

I really don't know what I'm doing. Is MoG the wrong approach to this? Apparently LDA is the best thing for this sort of thing, but from what I understand that would definitely be out of my league.

I guess I don't know what I'm asking, exactly, but if anyone could provide some insight as to how I'm approaching this incorrectly, or what might be a better strategy, I would really appreciate it.

submitted by ColonelHapablap
