In a project I'm currently working on, I'm trying to classify statistics from bump test process data to decide whether a section of the process data is a good candidate for fitting a model (the details don't matter much here; the point is that I'm doing classification).
I've been manually looking at model fits and classifying them based on my own judgment, and I originally had 3 classes of fits (Good, Unclear, Poor). While doing this I noticed there are really more like 5 classes (Excellent/Perfect, Good, Unclear, Poor, and ohmygodgetitawayfromme/Atrocious).
Ideally I would like to use the 5 classes, but I'm worried that this splits my data up very unevenly, and I don't know what effect that imbalance will have on my classifier (see below). Is scaling the importance of the classes (i.e. class weighting) a viable option here? As a rough estimate, I have ~2000 data instances; probably only ~50-100 will fall in the Excellent class and ~100-200 in Atrocious, while the other classes will be more evenly distributed. I can generate more data quite easily, but I am working alone, and manually classifying it takes quite some time.
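To show what I mean by scaling class importance, here is a rough sketch. It assumes a scikit-learn-style random forest (not the GPL code I actually found) and uses made-up data with roughly the class proportions above, just to illustrate the idea:

```python
# Sketch only: hypothetical data with ~2000 instances and 5 imbalanced classes
# (0=Excellent, 1=Good, 2=Unclear, 3=Poor, 4=Atrocious).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                      # placeholder features
y = rng.choice(5, size=2000, p=[0.04, 0.32, 0.32, 0.24, 0.08])  # imbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights each class inversely to its frequency,
# so the rare Excellent/Atrocious instances count for more during training.
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                             random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

Per-class precision/recall (rather than overall accuracy) is what I'd look at to judge whether the rare classes are actually being learned.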
I am using a random forest right now, but will probably switch to a neural net later, because we probably can't release our source code, which the GNU General Public License on the random forest code I found would require (and I don't know nearly enough about random forests to write my own implementation).
Is this something I need to worry about?