Hi ML community,
I am trying to perform sentiment analysis (positive, negative, neutral) on tweets from a particular domain (say finance). I have two sets of entities belonging to the domain (e.g., financial organizations). My training set consists of 12k tweets mentioning entities from set 1. The first test set is 2.5k tweets about entities from set 1; the second test set is 1k tweets about entities from set 2.
My feature set consists of the standard bag of words from the training set (about 15k features), along with some orthographic and lexical features (50 in total). These include flags for punctuation, exclamation marks, and capitalization, plus counts of known positive/negative words from a lexicon.
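For concreteness, the kind of feature setup I mean looks roughly like this (the lexicons and flag definitions below are placeholders, not my actual code):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
import numpy as np

# Placeholder lexicons -- stand-ins for whatever sentiment lexicon is used.
POS_WORDS = {"gain", "profit", "up"}
NEG_WORDS = {"loss", "down", "crash"}

class LexicalFeatures(BaseEstimator, TransformerMixin):
    """Orthographic/lexical features: punctuation, exclamation, and
    capitalization flags, plus lexicon hit counts."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for text in X:
            tokens = text.split()
            rows.append([
                int("!" in text),                             # exclamation flag
                int(any(c in ".,;?" for c in text)),          # punctuation flag
                int(any(t.isupper() for t in tokens)),        # all-caps token flag
                sum(t.lower() in POS_WORDS for t in tokens),  # positive-word count
                sum(t.lower() in NEG_WORDS for t in tokens),  # negative-word count
            ])
        return np.array(rows)

# Bag of words (capped at ~15k) stacked with the hand-crafted features.
features = FeatureUnion([
    ("bow", CountVectorizer(max_features=15000)),
    ("lex", LexicalFeatures()),
])
```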
The problem is that although all the data belongs to the same domain, the classifier performs well on test set 1 (accuracy ~80%) but poorly on test set 2 (accuracy ~55%).
I want to know what's causing this difference in performance (through some computation/graphs if possible) and how to fix it.
So far, I have performed the following analysis on the features:

* Use only the bag-of-words features. In this case, performance on test set 1 drops by 5% and on test set 2 improves by 4%.
* Check the average proportion of bag-of-words and lexical/orthographic features activated per document for these sets. They all turn out to be pretty close to each other.
I use scikit-learn's SVM with a linear kernel. Any help or tips are appreciated.
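For reference, a minimal sketch of that classifier setup (variable names are placeholders for my data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# LinearSVC is the usual scikit-learn choice for a linear-kernel SVM on
# sparse text features; the bag of words is capped at ~15k as described.
clf = make_pipeline(CountVectorizer(max_features=15000),
                    LinearSVC(C=1.0))
# clf.fit(train_texts, train_labels)
# clf.score(test_texts, test_labels)
```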