I have 3 categories (low, mid, high), with 75, 27, and 12 documents respectively. The documents are fairly short (around 80-150 words).
I am using WEKA, and this is what I tried:
- convert the text to word vectors using the StringToWordVector filter
- hold out 3 documents per category (3-3-3) as a test set
- apply SMOTE to the remaining documents so that all three categories have the same number of instances (72-72-72, up from 72-24-9)
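To make the oversampling step concrete, here is a rough sketch of the idea behind SMOTE in plain Python. It is not WEKA's filter, just a simplified illustration: the function name `smote`, the default `k=5`, and the toy 4-feature vectors are all assumptions made up for the example. Each synthetic point is placed somewhere on the line between an existing minority document and one of its nearest neighbours.

```python
import random

def smote(samples, per_sample, k=5, seed=0):
    """Simplified SMOTE sketch: for every sample, create `per_sample`
    synthetic points by interpolating towards one of its k nearest neighbours."""
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)  # cannot use more neighbours than there are other samples
    synthetic = []
    for i, base in enumerate(samples):
        # Other samples sorted by squared Euclidean distance to the base point
        others = [s for j, s in enumerate(samples) if j != i]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbours = others[:k]
        for _ in range(per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # random position in [0, 1) along the segment
            synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Made-up stand-in for the 9 "high" training documents:
# 4 hypothetical word counts per document (real word vectors are far wider).
toy_rng = random.Random(1)
high = [[float(toy_rng.randint(0, 5)) for _ in range(4)] for _ in range(9)]

new_high = smote(high, per_sample=7)   # 9 originals * 7 = 63 synthetic points
balanced_high = high + new_high        # 9 + 63 = 72, matching the "low" class
print(len(balanced_high))
```

Going from 9 to 72 means 7 synthetic points per original, and going from 24 to 72 means 2 per original. Because every synthetic point is an interpolation between two existing documents, a word that never appears in the 9 original "high" documents still has a count of zero in all 72.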
After classifying with several algorithms (IBk, SMO, J48, Bayes), I had very little success when testing on the held-out set.
The performance was only around 20-40%, because the classifiers tended to classify the documents as "low". I am pretty sure this is because there simply weren't enough initial instances of "mid" and "high", so the classifiers couldn't learn a wide enough range of those categories.
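As a sanity check on those numbers, assuming the 20-40% figure is plain accuracy on the 3-3-3 test set, here is a small sketch of what a classifier that always answers "low" would score. The always-"low" predictions are a hypothetical worst case, not actual output from any of the classifiers above.

```python
classes = ["low", "mid", "high"]

# The 3-3-3 held-out set described above
y_true = ["low"] * 3 + ["mid"] * 3 + ["high"] * 3
# Hypothetical classifier that labels every document "low"
y_pred = ["low"] * len(y_true)

# Overall accuracy: fraction of test documents labelled correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall: fraction of each category's documents that were found
recall = {}
for c in classes:
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    recall[c] = hits / y_true.count(c)

# With only 9 test documents, one document moves accuracy by 1/9
step = 1 / len(y_true)

print(f"accuracy: {accuracy:.2f}")          # accuracy: 0.33
print(recall)                               # {'low': 1.0, 'mid': 0.0, 'high': 0.0}
print(f"one document is worth {step:.1%}")  # one document is worth 11.1%
```

So always guessing "low" already lands at 33%, and a range of 20-40% is only a difference of one or two documents on a test set this small. Per-class recall (or the confusion matrix) shows the problem more clearly than the overall percentage.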
Is there anything I could try to improve the performance?
Thanks