I'm trying to do a text classification task.
Here are some specs:
- Context file size = 1M+ documents already labeled
- Number of top-labels = 17
- Number of sub-labels = around 130
- Each document consists of: a short text of customer feedback for a retailer (about 15-20 words on average) + a number of topics related to the feedback text + the sub-label and top-label it belongs to.
I'm using scikit-learn. So far I've tried several things:
- different vectorizers: CountVectorizer (counts occurrences) or TfidfVectorizer (computes TF-IDF weights)
- different n-gram ranges: unigrams, bigrams, trigrams
- various algorithms: SVM, Logistic Regression, Multinomial Naive Bayes, Random Forest
- a cascade of classifiers using the label hierarchy in the data
I can't get past 0.57 accuracy (SVM with an L1 penalty, trained on 500k documents) with these settings. The other configurations all land around 0.5, and accuracy barely improves beyond 100k training documents.
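For reference, the SVM setup looks roughly like the following. This is a minimal sketch, not the exact code: the toy texts and labels are made up to stand in for the real corpus, and the `ngram_range`, `C` and `random_state` values are placeholder assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the real corpus has 1M+ labeled feedback texts.
texts = [
    "thank you very much, satisfied client",
    "thank you for the great service",
    "many thanks, very satisfied",
    "the product broke after two days",
    "the product is too expensive",
    "the product arrived damaged",
]
labels = ["thanks", "thanks", "thanks", "product", "product", "product"]

# TF-IDF over unigrams and bigrams, then a linear SVM with an L1 penalty.
# In LinearSVC, penalty="l1" requires dual=False (with the default squared hinge loss).
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(penalty="l1", dual=False, C=10.0, random_state=0),
)
model.fit(texts, labels)

pred = model.predict(["thank you, I am a satisfied client"])
print(pred[0])
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or `LinearSVC` for another estimator, only changes one line of the pipeline, which is how the different configurations were compared.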
I'm trying to do some error analysis. I've computed a classification report for the top-labels only (test file size = 2k docs, algorithm = SVM):
top-label    precision  recall  f1-score  support
TL1               0.57    0.25      0.35       16
TL2               0.00    0.00      0.00        1
TL3               0.57    0.55      0.56      258
TL4               0.61    0.47      0.53      277
TL5               0.46    0.41      0.43       27
TL6               0.61    0.37      0.46       38
TL7               0.69    0.31      0.43       35
TL8               0.84    0.84      0.84      130
TL9               0.50    0.06      0.11       31
TL10              0.71    0.63      0.67      111
TL11              0.64    0.34      0.45      143
TL12              0.73    0.93      0.82      815
TL13              1.00    0.17      0.29       12
TL14              0.72    0.72      0.72       32
TL15              0.47    0.14      0.22       56
TL16              0.51    0.88      0.65       81
TL17              0.80    0.86      0.83       14
avg / total       0.67    0.68      0.65     2077
As you can see, one top-label attracts a lot of the confusion: "product" (TL12), which, as you can imagine, is very frequent in the context file (about 50% of the corpus). Some other top-labels that are semantically related are also frequently confused with each other.
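To see which top-labels get mixed up with which, the per-class numbers can be recomputed alongside raw (true, predicted) counts. Below is a minimal pure-Python sketch; the function name `per_class_report` and the six-document example are made up for illustration and are not my actual predictions.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return ({label: (precision, recall, f1, support)}, confusion counts)."""
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = confusion[(label, label)]
        predicted = sum(n for (t, p), n in confusion.items() if p == label)
        support = sum(n for (t, p), n in confusion.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support)
    return report, confusion

# Hypothetical predictions where a frequent label absorbs the rarer ones.
y_true = ["TL12", "TL12", "TL12", "TL3", "TL3", "TL4"]
y_pred = ["TL12", "TL12", "TL12", "TL12", "TL3", "TL12"]

report, confusion = per_class_report(y_true, y_pred)
for (true_label, pred_label), n in sorted(confusion.items()):
    if true_label != pred_label:
        print(f"{true_label} predicted as {pred_label}: {n}")
```

In this toy case the frequent label gets perfect recall but reduced precision, because documents from the rarer labels are pulled into it. That is the same pattern as TL12 in my numbers (precision 0.73, recall 0.93).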
Also, when I check the misclassified documents, I see that the classifier often gets it wrong even when a very distinctive phrase appears in the document, such as 'satisfied client' or 'thank you' in a document that should have been classified as 'congratulations, thanks'. I don't really understand why. Too much noise?
Do you have any advice?