High False Negative Rate in Binary Text Classification

I am looking for advice on how to achieve a lower false negative rate in a binary classifier. I was taught to always start with Naive Bayes (if it has any hope of being decent) and only then move on, which is why I have not yet tried anything more complex like LDA or sLDA.

I have 3,500 web pages for training a classifier that distinguishes safe-for-work (SFW) content from NSFW content. I parsed them into unigrams, removed punctuation, stop words, words that appear only once, etc., and initially used all 37,500 remaining unigrams as features in a Naive Bayes model. About 65% of the examples are safe for work (negative) and 35% are not safe for work (positive). Each feature is binary: 1 if the unigram appears in the document, 0 otherwise. I then used chi-squared feature selection to rank the features by usefulness, and ran 10-fold cross-validation on Naive Bayes with 5, 10, ..., 2000 features (in steps of 5) to find the best number of features to use. Unfortunately, the lowest false negative rate I can achieve is 40%, while the false positive rate is only 10%.
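For reference, a minimal sketch of that pipeline in scikit-learn could look like the following. The documents and labels are tiny placeholders rather than the real corpus, and CountVectorizer, SelectKBest with chi2, and BernoulliNB stand in for whatever implementation was actually used.

```python
# Sketch: binary unigram features -> chi-squared feature selection ->
# Bernoulli Naive Bayes, with 10-fold CV over the number of features.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

# Placeholder data; replace with the parsed pages and SFW/NSFW labels.
docs = ["safe article about cooking recipes"] * 20 + ["explicit adult content page"] * 20
labels = np.array([0] * 20 + [1] * 20)          # 0 = SFW (negative), 1 = NSFW (positive)

# 0/1 presence features for each unigram, as described in the post.
X = CountVectorizer(binary=True).fit_transform(docs)

def mean_false_negative_rate(X, y, k, n_splits=10):
    """Mean FN rate of Bernoulli NB trained on the top-k chi-squared features."""
    rates = []
    for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
        sel = SelectKBest(chi2, k=min(k, X.shape[1])).fit(X[train], y[train])
        clf = BernoulliNB().fit(sel.transform(X[train]), y[train])
        pred = clf.predict(sel.transform(X[test]))
        tn, fp, fn, tp = confusion_matrix(y[test], pred, labels=[0, 1]).ravel()
        rates.append(fn / (fn + tp) if (fn + tp) else 0.0)
    return float(np.mean(rates))

# Sweep the number of selected features: 5, 10, ..., 2000.
for k in range(5, 2001, 5):
    print(k, mean_false_negative_rate(X, labels, k))
```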

In this case a false negative is far worse than a false positive, since I am labeling pages as safe for work or NSFW. I also tried upweighting the NSFW examples so that, by weight, 50% of the data is safe and 50% is not safe, but this did not help.
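The reweighting I tried was along these lines; this is only a sketch of the idea, using scikit-learn's BernoulliNB (whose fit method accepts per-sample weights) and the same kind of placeholder data as above, not the exact code I ran.

```python
# Sketch: upweight the minority (NSFW) class so both classes contribute
# equally to the Naive Bayes fit.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Placeholder data with roughly the 65% SFW / 35% NSFW split from the post.
docs = ["safe article about cooking recipes"] * 26 + ["explicit adult content page"] * 14
labels = np.array([0] * 26 + [1] * 14)

X = CountVectorizer(binary=True).fit_transform(docs)

def balanced_sample_weights(y):
    """Weight each example inversely to its class frequency, so the two
    classes carry equal total weight (the 50/50 reweighting described above)."""
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts[y])

clf = BernoulliNB().fit(X, labels, sample_weight=balanced_sample_weights(labels))
```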

What techniques can I use to get a lower false negative rate? Am I wrong to start with Naive Bayes, and should I try some other method instead?

submitted by LADataJunkie
