I'm trying to classify short texts using nltk and scikit-learn, but I am not yet sure exactly how to approach it, and I am looking for advice. A particular text may belong to more than one class, or it may not belong to any. The dataset I have is about 100k items, with a relatively small number of items per category (thousands in a few cases, hundreds in many, far fewer in most). For a given class I can easily generate samples of items that should be in it, but I am not sure what to do about counterexamples (or whether I need them at all). So far I am experimenting with naive Bayes classification: for each class separately, I train a classifier on known sample items plus a random selection of known items that don't belong to that class. As a result, classification works well for items that are a good match, but it generates a lot of false positives. Is there a better way of doing this?
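The per-class setup described above could be sketched roughly as follows. This is only an illustration under assumptions, not the code actually used: it assumes scikit-learn's `OneVsRestClassifier` wrapping `MultinomialNB` on TF-IDF features, and the texts, label names, and `threshold` value are made up. Texts with an empty label list act as counterexamples for every class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each text has zero, one, or several labels.
texts = [
    "the team won the football match",
    "parliament votes on the new budget",
    "minister attends the football final",
    "recipe for a quick tomato soup",
    "striker scores twice in the cup game",
    "senate debates the election law",
]
labels = [
    ["sports"],
    ["politics"],
    ["sports", "politics"],
    [],  # belongs to no class
    ["sports"],
    ["politics"],
]

# Turn the label lists into a binary indicator matrix, one column per class.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One independent naive Bayes classifier per class (one-vs-rest).
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(texts, Y)

# Per-class probabilities for new texts, one column per class.
new_texts = ["the football cup final", "tomato soup for dinner"]
probs = model.predict_proba(new_texts)

# Assumed cutoff: a text is assigned to every class whose probability
# reaches the threshold, so it may end up with several labels or none.
threshold = 0.5
predicted = probs >= threshold
predicted_labels = [
    [cls for cls, hit in zip(mlb.classes_, row) if hit] for row in predicted
]
```

In a sketch like this, the cutoff is the one knob that directly trades false positives against missed matches, so it would normally be chosen per class on held-out data rather than fixed at 0.5.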