Channel: Machine Learning

What are the best approaches to text classification when the training set is very small?

I have 3 categories (low, mid, high) with 75, 27, and 12 documents respectively. The documents are fairly short (roughly 80-150 words each).

I am using WEKA, and this is what I tried:

  • converted the documents to word vectors with the StringToWordVector filter
  • held out 3 documents from each category (3-3-3) as a test set
  • applied SMOTE to the remaining documents so that all three categories have the same number of instances (72-72-72, up from 72-24-9)
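To make the steps above concrete, here is a minimal sketch of the same idea outside WEKA. It is not WEKA's implementation: the function names `smote` and `holdout_then_balance` are made up for illustration, the feature vectors are random toy data standing in for StringToWordVector output, and the neighbour count `k=5` is an assumed default. Synthetic points are created by interpolating between a minority sample and one of its nearest neighbours, and the test documents are set aside before any oversampling.

```python
import math
import random


def smote(samples, n_new, k=5, seed=0):
    """Return n_new synthetic vectors, each interpolated between a randomly
    chosen sample and one of its k nearest neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank all other samples by distance to the chosen base sample.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def holdout_then_balance(by_class, n_test=3, seed=0):
    """Hold out n_test vectors per class, then oversample every remaining
    class up to the size of the largest one."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, vectors in by_class.items():
        shuffled = vectors[:]
        rng.shuffle(shuffled)
        test[label] = shuffled[:n_test]
        train[label] = shuffled[n_test:]
    target = max(len(v) for v in train.values())
    balanced = {
        label: vectors + smote(vectors, target - len(vectors), seed=seed)
        for label, vectors in train.items()
    }
    return balanced, test


# Toy stand-in for the word vectors: 75 / 27 / 12 random 10-dimensional points.
data_rng = random.Random(42)
by_class = {
    label: [[data_rng.random() for _ in range(10)] for _ in range(count)]
    for label, count in (("low", 75), ("mid", 27), ("high", 12))
}

train, test = holdout_then_balance(by_class)
print({label: len(v) for label, v in train.items()})
print({label: len(v) for label, v in test.items()})
```

Note that the 63 extra "high" vectors are all interpolations of just 9 real documents, so they cannot add vocabulary or phrasing that those 9 documents do not already contain.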

After training several classifiers (IBk, SMO, J48, Bayes), I had very little success when testing on the held-out set.

Accuracy was around 20-40%, and even that was only because the classifiers tended to label documents as "low". I am fairly sure the cause is that there simply weren't enough initial instances of "mid" and "high", so the classifiers couldn't learn a wide enough range of examples for those categories.
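As a sanity check on that range, the sketch below scores a hypothetical classifier that always answers "low" on a 3-3-3 test set. The predictions are invented for illustration, not actual WEKA output, and the helper names `accuracy` and `per_class_recall` are made up here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its documents labelled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# The held-out set: 3 documents per category.
y_true = ["low"] * 3 + ["mid"] * 3 + ["high"] * 3
# Hypothetical classifier that predicts "low" for every document.
y_pred = ["low"] * 9

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

Always answering "low" gets 3 of 9 documents right, which sits inside the 20-40% I observed, while recall for "mid" and "high" is zero. With only 9 test documents, each single prediction also moves accuracy by about 11 percentage points, so per-class recall is more informative here than the overall number.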

Is there anything I could try to improve the performance?

Thanks

submitted by morgandix