I am working on a model with 400,000 samples, of which only 4,000 are classified as the event I'm looking to predict (approximately 1 in 100, or 1%). I'm struggling to build a successful model, but I have had some success with logistic regression by setting the class weights inversely proportional to the class frequencies in the training set. If you're familiar with sklearn, I am using LogisticRegression(class_weight='auto'). The model correctly identifies approximately 60% of the rare events, which is great, but the number of false positives is enormous. How should I think about reducing the number of false positives?
Can someone provide some guidance on rare-event binary classification problems?
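
To make the setup concrete, here is a minimal sketch of the approach described above. It is not the actual code or data: the dataset is synthetic (generated with make_classification at roughly 1% positives, in line with the 4,000-in-400,000 counts), the feature counts and the thresholds 0.5, 0.7 and 0.9 are arbitrary assumptions, and class_weight='balanced' is used because newer scikit-learn releases replaced 'auto' with it. The sketch only shows how recall and precision can be inspected at different decision thresholds, since the default 0.5 cutoff on a class-weighted model tends to flag many negatives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 1% positives.
X, y = make_classification(
    n_samples=20_000,
    n_features=10,
    n_informative=5,
    weights=[0.99, 0.01],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# 'balanced' weights each class inversely to its frequency in the training set.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Predicted probability of the rare class for each held-out sample.
proba = clf.predict_proba(X_test)[:, 1]

# Compare a few decision thresholds (illustrative values, not recommendations).
results = []
for threshold in (0.5, 0.7, 0.9):
    pred = (proba >= threshold).astype(int)
    results.append((
        threshold,
        int(pred.sum()),                                 # samples flagged as events
        recall_score(y_test, pred),                      # share of true events caught
        precision_score(y_test, pred, zero_division=0),  # share of flags that are real
    ))

for threshold, flagged, recall, precision in results:
    print(f"threshold={threshold:.1f} flagged={flagged} "
          f"recall={recall:.2f} precision={precision:.2f}")
```

Raising the threshold can only shrink the set of flagged samples, so false positives drop at the cost of some recall. Where to set the cutoff depends on how expensive a false alarm is relative to a missed event, which the sketch does not try to decide.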