I’m working on a medical dataset with a great deal of randomness. What I’m trying to do is identify a smaller subset of highly probable classifications for automatic diagnoses, hence the need for high precision. In my use case, it’s fine for the majority of cases to come back as “don’t know” and be sent to a physician for manual review.
I suspect that only a few of my features are highly relevant, so my first approach has been to use random forests to identify which ones those are. However, my first few attempts have been largely unsuccessful, I’m guessing because of the high degree of randomness. The best result I’ve gotten is about 55% accuracy on my test set. I also tried restricting evaluation to the test cases where the forest assigned a probability of 60% or greater to its top class, and got ~63% on that subset.
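For context, here's roughly what I'm doing now (scikit-learn). The variable names `X_train`/`X_test`/`y_train`/`y_test` and the hyperparameters are just stand-ins for my actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder hyperparameters -- not my exact settings.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

# Plain accuracy on the full test set (~55% in my case).
y_pred = clf.predict(X_test)
print("overall accuracy:", accuracy_score(y_test, y_pred))

# Keep only the test cases where the forest assigns >= 60% probability
# to its top class, and score accuracy on that subset (~63% for me).
proba = clf.predict_proba(X_test)
confident = proba.max(axis=1) >= 0.60
print("coverage:", confident.mean())
print("accuracy on confident subset:",
      accuracy_score(y_test[confident], y_pred[confident]))
```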
What I’m wondering is: how should I tweak my approach to this problem? Is having so much highly random data in the training set causing problems? Should I be using some sort of abstaining classifier instead? Any general tips or guidance would be much appreciated.
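To be concrete about what I mean by "abstaining", something like this thin wrapper is what I have in mind (a rough sketch, assuming integer class labels so that -1 can stand in for "don't know"; the threshold is arbitrary):

```python
import numpy as np

def predict_or_abstain(clf, X, threshold=0.80, abstain_label=-1):
    """Return the forest's prediction where it is confident enough,
    and abstain_label (handed off to a physician) everywhere else."""
    proba = clf.predict_proba(X)
    preds = clf.classes_[proba.argmax(axis=1)]
    return np.where(proba.max(axis=1) >= threshold, preds, abstain_label)
```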