So I have 500 positive-labelled lines of text and 500 negative-labelled lines of text in two different files, and I have to build a naive Bayes classifier for them.
My proposed course of action is as follows:
1. Divide both files into 5 parts each (for 5-fold cross-validation).
2. Take 1 part from each of the negative and positive texts and extract a bag of words from both.
3. Use naive Bayes to estimate the word probabilities.
4. Pick a threshold for deciding between positive and negative.
5. Run it over the remaining 4 parts and check the labelling error.
6. Change which part is taken initially and repeat.
I don't know how I should go about extracting the bag of words and selecting the threshold, or even whether my approach is right.
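To make the bag-of-words and threshold steps concrete, here is a rough sketch of what they could look like by hand, using only the standard library. The function names (`tokenize`, `train`, `log_score`, `classify`), the regex tokenizer, and the add-one (Laplace) smoothing are all assumptions made for illustration, not part of the assignment. It also assumes both classes have at least one training line.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into word tokens (a simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", line.lower())


def train(pos_lines, neg_lines):
    """Build one bag of words (token -> count) per class, plus the shared vocabulary."""
    counts = {"pos": Counter(), "neg": Counter()}
    for label, lines in (("pos", pos_lines), ("neg", neg_lines)):
        for line in lines:
            counts[label].update(tokenize(line))
    vocab = set(counts["pos"]) | set(counts["neg"])
    n_docs = {"pos": len(pos_lines), "neg": len(neg_lines)}
    return counts, vocab, n_docs


def log_score(tokens, label, counts, vocab, n_docs):
    """log P(label) + sum of log P(token | label), with add-one smoothing."""
    score = math.log(n_docs[label] / sum(n_docs.values()))
    denom = sum(counts[label].values()) + len(vocab)
    for tok in tokens:
        if tok in vocab:  # tokens never seen in training are skipped
            score += math.log((counts[label][tok] + 1) / denom)
    return score


def classify(line, model):
    """Return "pos" or "neg", whichever class has the higher log score."""
    counts, vocab, n_docs = model
    tokens = tokenize(line)
    pos = log_score(tokens, "pos", counts, vocab, n_docs)
    neg = log_score(tokens, "neg", counts, vocab, n_docs)
    return "pos" if pos >= neg else "neg"
```

In this sketch there is no separately tuned threshold: picking the class with the larger log score is the same as thresholding the log-odds `pos - neg` at 0. A different cut-off could be chosen on held-out data if one kind of error matters more than the other.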
I could certainly use help here, and also suggestions on Python libraries so that I don't have to do all of this manually.
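On the library side, one option is scikit-learn, which covers the bag of words (`CountVectorizer`), the classifier (`MultinomialNB`), and the cross-validation loop (`cross_val_score`). The sketch below assumes scikit-learn is installed and that the two files have already been read into lists of strings; the function name `cross_validate_nb` and its parameters are made up for illustration. Note that it uses the usual k-fold arrangement, training on 4 parts and testing on the remaining 1, which is the reverse of the split described in the plan above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def cross_validate_nb(pos_lines, neg_lines, n_splits=5, seed=0):
    """Return one accuracy score per fold for a bag-of-words naive Bayes model."""
    texts = list(pos_lines) + list(neg_lines)
    labels = [1] * len(pos_lines) + [0] * len(neg_lines)  # 1 = positive, 0 = negative

    # The pipeline refits the vectorizer inside every fold, so the vocabulary
    # is learned from that fold's training lines only.
    model = make_pipeline(CountVectorizer(), MultinomialNB())

    # Stratified folds keep the positive/negative ratio the same in each part.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
```

With the real data, `pos_lines` and `neg_lines` would come from something like `open(path).read().splitlines()` on each file, and the mean of the returned scores gives the cross-validated accuracy.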