Hi,
I am trying to identify the best predictive model that yields accurate probabilities of positives (the data's labels are binary) on a severely imbalanced data set.
I have 2 main problems.
Different loss functions tell me different models are better. In particular, I get 2-5% better ROC-AUC if I go with model 1 over model 2, but 7-20% better log loss if I go with model 2 over model 1. RMSE is about 1% better for model 2. I realize log loss and RMSE reward well-calibrated probabilities, but because the data is so imbalanced I am concerned these metrics will be more affected by noise in the data. I also care to some degree about how well the predictions are ordered (another reason to favor AUC).
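For concreteness, here is a simplified sketch of how I compute the three metrics on a validation fold (using scikit-learn; y_val and p_val are just placeholder names for the fold's labels and predicted probabilities of the positive class):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, mean_squared_error

def evaluate(y_val, p_val):
    # y_val: true binary labels, p_val: predicted probabilities of the positive class
    auc = roc_auc_score(y_val, p_val)                 # ranking quality only
    ll = log_loss(y_val, p_val)                       # penalizes miscalibrated probabilities
    rmse = np.sqrt(mean_squared_error(y_val, p_val))  # square root of the Brier score
    return {"roc_auc": auc, "log_loss": ll, "rmse": rmse}
```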
I am not sure I am doing the log loss and RMSE calculations correctly. The data is severely imbalanced (the ratio of positives to negatives is about 1 to 1000), and it is pre-filtered before I get it to remove 99 out of every 100 negative training instances (I can't recover the removed instances later). During training I give all instances equal weight, but when I calculate model losses during cross validation I give negative instances a compensatory weight of 100 and positive instances a weight of 1. My concern is that this weighting scheme is wrong for a couple of reasons. First, I worry it will encourage overfitting to the few negative examples that survived the downsampling. Second, I worry it only makes sense to evaluate a model on a test set that uses the same weighting scheme as the training data.
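And here is a simplified sketch of the weighted evaluation I described, where each surviving negative gets a compensatory weight of 100 to stand in for the negatives that were filtered out (again, the names are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, mean_squared_error

def evaluate_weighted(y_val, p_val):
    # Each surviving negative stands in for 100 original negatives; positives keep weight 1.
    w = np.where(y_val == 1, 1.0, 100.0)
    auc = roc_auc_score(y_val, p_val, sample_weight=w)
    ll = log_loss(y_val, p_val, sample_weight=w)
    rmse = np.sqrt(mean_squared_error(y_val, p_val, sample_weight=w))
    return {"roc_auc": auc, "log_loss": ll, "rmse": rmse}
```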
Any thoughts and advice would be greatly appreciated. Alex