I'm working on a program that classifies whether or not a web page is related to movies, using Naive Bayes. I got a fairly good result of 85% accuracy, but I felt this wasn't high enough, so I had an idea to just come up with my own simple formula.
My formula has a list of good keywords and bad keywords, and each keyword has a weight. So if the word "movie" appears on a web page, it contributes a high value of +5, and if a negative word like "medical" appears, it contributes a negative value of -5.
All the weights are added up: if the total is positive, the page is classified as being about movies; if it's negative, the page is classified as unrelated to movies.
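The scoring scheme described above can be sketched roughly like this (the keyword lists and weights here are just illustrative placeholders, not my actual lists):

```python
# A minimal sketch of the weighted-keyword classifier described above.
# These example keywords and weights are made up for illustration.
WEIGHTS = {
    "movie": 5, "film": 4, "actor": 3, "director": 3,   # good keywords
    "medical": -5, "finance": -4, "surgery": -3,        # bad keywords
}

def score(page_text: str) -> int:
    """Sum the weights of every known keyword found in the page text."""
    words = page_text.lower().split()
    return sum(WEIGHTS.get(word, 0) for word in words)

def is_movie_page(page_text: str) -> bool:
    """Positive total score means the page is classified as movie-related."""
    return score(page_text) > 0

print(is_movie_page("this movie stars a famous actor"))  # True
print(is_movie_page("medical site about surgery"))       # False
```

A real version would want proper tokenization (punctuation stripping, handling of multi-word phrases), but the core of the method really is just this weighted sum against a threshold of zero.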
This may seem oversimplified, but I tested it on over 300,000 web pages and got a 99% accuracy rate; it basically destroyed the Naive Bayes classifier on real data. This approach also uses very little memory and processes pages incredibly fast. It seems almost too good to be true, and I worry that it may not scale properly.