I am working on building a robust binary classifier from a dataset of around 40,000 samples, each with two feature sets.
The first is a vector of around 15 related continuous variables that every sample has; the second is a set of around 70 categorical variables for which a sample may have no entries at all.
My current approach uses a combination of domain knowledge and association rule learning on the categorical variables to build a list of penalize/reward rules. These rules are applied either as a preprocessing step (directly classifying the ~10% of samples that meet strong rules) or as a postprocessing step (reweighting the probability score output by a Random Forest or SVM model trained on the continuous data).
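For context, here's a minimal sketch of that pipeline in scikit-learn. The specific rules (`strong_pos`, `weak_penalty`), the 0.8 penalty weight, and the synthetic data are illustrative assumptions, not my real rules:

```python
# Sketch of the rule-based pre/postprocessing around a continuous-feature model.
# The rules and weights below are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_cont = rng.normal(size=(1000, 15))          # ~15 continuous features (all samples)
X_cat = rng.integers(0, 3, size=(1000, 70))   # ~70 categorical features (0 = no entry)
y = (X_cont[:, 0] + 0.5 * (X_cat[:, 0] == 2) > 0).astype(int)

# --- Preprocess: samples matching a strong rule are classified outright ---
strong_pos = X_cat[:, 0] == 2                 # hypothetical strong "reward" rule
decided = strong_pos
rule_pred = np.where(strong_pos, 1, 0)

# --- Model on continuous features for the undecided remainder ---
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_cont[~decided], y[~decided])
proba = clf.predict_proba(X_cont[~decided])[:, 1]

# --- Postprocess: weaker rules reweight the model's probability score ---
weak_penalty = X_cat[~decided, 1] == 1        # hypothetical "penalize" rule
proba = np.clip(proba * np.where(weak_penalty, 0.8, 1.0), 0.0, 1.0)

# Combine rule-decided and model-decided predictions
final = np.empty(len(y), dtype=int)
final[decided] = rule_pred[decided]
final[~decided] = (proba >= 0.5).astype(int)
```

The main weakness I see is that the two feature sets never interact inside a single model, which is part of why I'm asking.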
Does anyone have some other approaches I might try?