How to handle highly imbalanced label uncertainty (if that's the right thing to call it)?

I'm currently working with a dataset of about 1,500 samples in which the label/response variable for each sample takes the form of the number of attempts made for that sample and the number of those attempts that succeeded.

In other words, you might have Sample 1 be "attempted 25 times and succeeded five times", while Sample 2 is "attempted 12 times and succeeded once".
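
For concreteness, here's a minimal sketch of the data layout (the column names are just for illustration, not the actual ones):

    import pandas as pd

    # One row per sample: number of attempts and number of successes observed.
    df = pd.DataFrame({
        "sample_id": [1, 2],
        "attempts":  [25, 12],
        "successes": [5, 1],
    })
    df["success_rate"] = df["successes"] / df["attempts"]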

Unfortunately, the number of attempts varies greatly. For some samples it may be as high as 300, while for others it is only a single attempt. For the large-count samples we have a decent amount of confidence in the sample's overall performance, but for the single-attempt samples we have almost none.
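
To give a sense of the scale of that gap, here's a quick sketch comparing 95% credible intervals for the success rate under a uniform Beta(1, 1) prior (just my illustration of the uncertainty difference, not anything built into the data):

    from scipy.stats import beta

    # Posterior for the success rate under a Beta(1, 1) prior is
    # Beta(1 + successes, 1 + failures).
    def credible_interval(successes, attempts, level=0.95):
        return beta.interval(level, 1 + successes, 1 + attempts - successes)

    print(credible_interval(60, 300))  # roughly (0.16, 0.25): fairly tight
    print(credible_interval(1, 1))     # roughly (0.16, 0.99): almost uninformative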

When I try to use this data to build a model, the model becomes heavily biased by single-attempt samples whose one attempt happened to succeed. So far my solution has just been to drop any sample with fewer than a minimum number of attempts (I arbitrarily chose 10), but this costs a significant fraction of the samples (around 30%).
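
Concretely, that thresholding step is just (continuing the hypothetical df from above):

    MIN_ATTEMPTS = 10  # arbitrary cutoff

    kept = df[df["attempts"] >= MIN_ATTEMPTS]
    print(f"dropped {1 - len(kept) / len(df):.0%} of samples")  # around 30% on my data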

Any ideas on how I can better handle this problem?

submitted by Omega037
