Hi r/machinelearning,
I'm experimenting with sentiment analysis (positive/negative classification) on review text for a commercial application.
As a training set I have ~200k labeled reviews from a popular domain-specific website. I intend to experiment with training a classifier at the sentence level and at the paragraph (or complete-review) level.
The data to be classified is ~300k labeled reviews from the same domain. For legal reasons I am not able to train my classifier(s) on this data.
The approaches I am considering for constructing the feature vectors include: plain uni/bigrams, part-of-speech-filtered uni/bigrams, and part-of-speech-tagged uni/bigrams.
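For concreteness, here is a rough sketch of the three feature sets I have in mind, assuming NLTK for tokenization/tagging and scikit-learn for vectorization (the particular tag filter below is just one possible choice, not a fixed decision):

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# one-time setup:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

# keep adjectives, adverbs, and verbs (one possible filter)
CONTENT_TAGS = {'JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS',
                'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def pos_filtered_tokens(text):
    """Keep only tokens whose POS tag is in CONTENT_TAGS."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w.lower() for w, t in tagged if t in CONTENT_TAGS]

def pos_tagged_tokens(text):
    """Append the POS tag to every token, e.g. 'great' -> 'great_JJ'."""
    return [f"{w.lower()}_{t}" for w, t in nltk.pos_tag(nltk.word_tokenize(text))]

# 1) plain uni/bigrams
plain_vec = CountVectorizer(ngram_range=(1, 2))

# 2) POS-filtered uni/bigrams (lowercase=False so the tagger sees original case)
filtered_vec = CountVectorizer(tokenizer=pos_filtered_tokens, lowercase=False,
                               ngram_range=(1, 2))

# 3) POS-tagged uni/bigrams
tagged_vec = CountVectorizer(tokenizer=pos_tagged_tokens, lowercase=False,
                             ngram_range=(1, 2))
```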
Anyway, on to my questions:
Is it even feasible to train a model with a feature vector this large? Even if each review were only 100 words, 200k reviews works out to ~20 million tokens, so the unigram feature space of my training set could in principle be as high as 20 million dimensions (and bigrams only push that higher).
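To show the scale I mean, here is the back-of-the-envelope calculation plus one option I've been looking at (feature hashing via scikit-learn's HashingVectorizer, with an n_features value I picked arbitrarily):

```python
# Rough scale check with my numbers (~100 tokens per review on average).
n_reviews = 200_000
tokens_per_review = 100
print(n_reviews * tokens_per_review)  # ~20M tokens: an upper bound on distinct unigrams

# Hashing features into a fixed-width sparse matrix sidesteps the vocabulary size.
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**20)  # ~1M hashed columns
X = vec.transform(["the food was great", "service was terribly slow"])
print(X.shape, X.nnz)  # sparse matrix: only non-zero entries are stored
```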
If my feature vectors are all essentially bags of words, how can I use a model trained on the vocabulary of my training set to classify the reviews in my test set? That is to say, won't there be a problem when the vocabulary of a review to be classified only partially overlaps with the vocabulary of the reviews used to train the model?
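To make the question concrete, this is the workflow I assume (toy data): fit the vocabulary on the training reviews, then transform the reviews to be classified with that same fixed vocabulary, so tokens outside it would simply be dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great food and service", "terrible, would not return"]
train_labels = [1, 0]
new_texts = ["service was great", "completely unheard-of adjective here"]

vec = CountVectorizer(ngram_range=(1, 2))
X_train = vec.fit_transform(train_texts)   # vocabulary is fixed here
clf = LogisticRegression().fit(X_train, train_labels)

X_new = vec.transform(new_texts)           # tokens outside the training vocab are ignored
print(clf.predict(X_new))
```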
Thanks :)