Any advice on how to improve my accuracy rate in text classification?

I'm trying to do a text classification task.

Here are some specs:

Context file size = 1M+ documents already labeled
Number of top-labels = 17
Number of sub-labels = around 130
Each document consists of: a short text containing a retailer's client feedback (about 15-20 words on average) + a number of topics related to the text of the feedback + the sub-label and top-label it belongs to.

I'm using scikit-learn. So far I've tried several things:

different vectorizers: CountVectorizer (counts occurrences) or TfidfVectorizer (computes TF-IDF weights)
different n-gram ranges: unigrams, bigrams, trigrams
various algorithms: SVM, Logistic Regression, Multinomial Naive Bayes, Random Forest.
cascade of classifiers using the hierarchy in the data.
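
To make the last item concrete, a cascade over a two-level hierarchy could be sketched as below. This is only an illustration under assumptions: the documents, label names and the choice of LogisticRegression are invented for the example and are not the actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (text, top-label, sub-label) triples, invented for illustration.
docs = [
    ("the product broke after a week", "product", "product/defect"),
    ("product stopped working quickly", "product", "product/defect"),
    ("product is too expensive", "product", "product/price"),
    ("price of the product is too high", "product", "product/price"),
    ("delivery was late", "delivery", "delivery/delay"),
    ("late delivery again", "delivery", "delivery/delay"),
    ("parcel arrived damaged", "delivery", "delivery/damage"),
    ("box was damaged on arrival", "delivery", "delivery/damage"),
]


def make_model():
    # Each stage gets its own vectorizer and classifier.
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


# Stage 1: one classifier over the top-labels.
top_model = make_model().fit([t for t, _, _ in docs], [top for _, top, _ in docs])

# Stage 2: one classifier per top-label, trained only on that label's documents.
sub_models = {}
for top_label in {top for _, top, _ in docs}:
    subset = [(t, sub) for t, top, sub in docs if top == top_label]
    sub_models[top_label] = make_model().fit(
        [t for t, _ in subset], [sub for _, sub in subset]
    )


def predict(text):
    """Predict the top-label first, then a sub-label within it."""
    top = top_model.predict([text])[0]
    sub = sub_models[top].predict([text])[0]
    return top, sub


top, sub = predict("my parcel was late")
print(top, sub)
```

Note that in a cascade like this, any mistake made at the top level cannot be recovered at the sub-label level.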

I can't get past 0.57 accuracy (the best result is an SVM with an L1-norm penalty trained on 500k documents). Most configurations land around 0.5, and accuracy doesn't really improve beyond 100k training documents.
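
For reference, the kind of configuration described here (TF-IDF features feeding a linear SVM with an L1 penalty) can be sketched roughly as follows. The texts, labels, n-gram range and C value are assumptions made up for the example, not the real data or tuned parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy feedback snippets and top-labels, invented for illustration.
texts = [
    "thank you very satisfied client",
    "great service thank you",
    "the product broke after a week",
    "product quality is poor",
    "delivery was late again",
    "late delivery and no tracking",
]
labels = ["thanks", "thanks", "product", "product", "delivery", "delivery"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    # In LinearSVC, the L1 penalty requires the primal formulation (dual=False).
    LinearSVC(penalty="l1", dual=False, C=1.0),
)
model.fit(texts, labels)

predictions = model.predict(["thank you for the great service"])
print(predictions)
```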

I'm trying to do some error analysis. I've computed per-class precision, recall and F1 on the top-labels only (test file size = 2k docs, algorithm = SVM):

top-label    precision  recall  f1-score  support
TL1               0.57    0.25      0.35       16
TL2               0.00    0.00      0.00        1
TL3               0.57    0.55      0.56      258
TL4               0.61    0.47      0.53      277
TL5               0.46    0.41      0.43       27
TL6               0.61    0.37      0.46       38
TL7               0.69    0.31      0.43       35
TL8               0.84    0.84      0.84      130
TL9               0.50    0.06      0.11       31
TL10              0.71    0.63      0.67      111
TL11              0.64    0.34      0.45      143
TL12              0.73    0.93      0.82      815
TL13              1.00    0.17      0.29       12
TL14              0.72    0.72      0.72       32
TL15              0.47    0.14      0.22       56
TL16              0.51    0.88      0.65       81
TL17              0.80    0.86      0.83       14
avg / total       0.67    0.68      0.65     2077
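
The table above has the shape of scikit-learn's classification_report output. A confusion matrix is a separate call and shows which labels get mistaken for which. A small sketch of both, on made-up true and predicted labels (the values below are assumptions, not the real test set):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted top-labels, for illustration only.
y_true = ["TL12", "TL12", "TL3", "TL4", "TL4", "TL3"]
y_pred = ["TL12", "TL12", "TL12", "TL4", "TL3", "TL3"]
label_order = ["TL3", "TL4", "TL12"]

# Per-class precision / recall / F1 / support.
report = classification_report(
    y_true, y_pred, labels=label_order, zero_division=0
)

# Rows are true labels, columns are predicted labels, both in label_order.
matrix = confusion_matrix(y_true, y_pred, labels=label_order)

print(report)
print(matrix)
```

In the matrix, an off-diagonal cell at (row TL3, column TL12) counts documents whose true label is TL3 but which were predicted as TL12.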

As you can see, one top-label creates a lot of confusion: "product" (TL12), which, as you can imagine, is very frequent in the context file (about 50% of the corpus). Some other top-labels that are semantically related are also frequently confused with each other.
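
To put a number on this kind of imbalance, scikit-learn can compute "balanced" class weights, which are inversely proportional to class frequency. The sketch below uses an invented label distribution, so the counts are assumptions. Whether such weights would actually help on this corpus is untested here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented label distribution: one dominant class and two rare ones.
y = np.array(["product"] * 8 + ["delivery"] * 2 + ["thanks"] * 2)
classes = np.unique(y)

# "balanced" weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_by_class = dict(zip(classes, weights))

print(weight_by_class)
```

The same weighting can be requested directly from many scikit-learn classifiers through the class_weight="balanced" parameter.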

Also, when I look at the misclassified documents, I see that the classifier often gets it wrong even when a very distinctive phrase appears in the document, such as 'satisfied client' or 'thank you' in a document that should have been classified as 'congratulations, thanks'. I don't really understand why. Too much noise?

Do you have any advice?

submitted by orangejaipur