What is the best loss function to use when comparing discriminative predictive (probabilistic) models applied to severely imbalanced data sets?

Hi,

I am trying to identify the predictive model that yields the most accurate probabilities for the positive class (the labels are binary) on a severely imbalanced data set.

I have 2 main problems.

  1. Different loss functions tell me different models are better. In particular, model 1 gives a 2-5% better ROC-AUC than model 2, but model 2 gives a 7-20% better log loss than model 1, and model 2's RMSE is about 1% better. I realize log loss and RMSE are proper scoring rules that reward well-calibrated probabilities, but because the data is so imbalanced, I am concerned these losses will be more affected by noise in the data. Also, I do care to some degree about how well the predictions are ordered (another reason to favor AUC). The sketch after this list shows how I compute all three losses on each cross-validation fold.

  2. I am not sure I am computing the log loss and RMSE correctly. Because the data is severely imbalanced (the ratio of positives to negatives is about 1 to 1000), the data I work with is pre-filtered before I get it to remove 99 out of every 100 negative training instances (and I cannot recover the removed instances later). During training I give all instances equal weight, but when I compute losses during cross-validation I give negative instances a compensatory weight of 100 and positive instances a weight of 1 (the weight construction is sketched below, after the metric code). My concern is that this weighting scheme is wrong for a couple of reasons. First, I worry it encourages overfitting to the few negative examples that survived the downsampling. Second, I worry that it only makes sense to evaluate a model on a test set that uses the same weighting scheme as the training data.
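
For concreteness, here is roughly how I compute all three losses on a cross-validation fold (a sketch using scikit-learn; score_fold, p_hat, and w are just my placeholder names, and I take RMSE as the square root of the Brier score since the labels are binary):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss

def score_fold(y_true, p_hat, w=None):
    """Score one validation fold with all three metrics, passing the
    same sample weights to each so the comparison is consistent."""
    # Rank-based metric: depends only on the ordering of predictions.
    auc = roc_auc_score(y_true, p_hat, sample_weight=w)
    # Proper scoring rules: penalize miscalibrated probabilities.
    ll = log_loss(y_true, p_hat, sample_weight=w)
    rmse = np.sqrt(brier_score_loss(y_true, p_hat, sample_weight=w))
    return {"roc_auc": auc, "log_loss": ll, "rmse": rmse}
```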

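And this is how I build the compensatory evaluation weights described in point 2 (again a sketch; compensatory_weights is a placeholder name). They are passed as the w argument when scoring folds, while training itself uses equal weights:

```python
import numpy as np

# Negatives were pre-filtered at a 1-in-100 keep rate before I received
# the data, so each surviving negative stands in for ~100 negatives.
NEG_KEEP_RATE = 0.01

def compensatory_weights(y_true):
    # Weight 1 for positives, 1 / keep-rate (= 100) for negatives.
    # Used only when scoring cross-validation folds, not during training.
    return np.where(y_true == 1, 1.0, 1.0 / NEG_KEEP_RATE)

# e.g. score_fold(y_val, p_val, w=compensatory_weights(y_val))
```
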
Any thoughts and advice would be greatly appreciated. Alex

submitted by AlexTHawk
