I'm currently working with a dataset of about 1,500 samples in which each label/response variable is a pair: the number of attempts made for that sample and the number of those attempts that succeeded.
In other words, Sample 1 might be "attempted 25 times, succeeded 5 times", while Sample 2 is "attempted 12 times, succeeded once".
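
For concreteness, here's a minimal sketch of the data layout (the column names are just illustrative, not from any particular library):

```python
import pandas as pd

# Toy version of the dataset: one row per sample, recording how many
# times that sample was attempted and how many attempts succeeded.
df = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "attempts":  [25, 12, 300, 1],
    "successes": [5, 1, 120, 1],
})

# The observed success rate is what I'd like the model to learn, but
# its reliability varies wildly with the number of attempts.
df["success_rate"] = df["successes"] / df["attempts"]
```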
Unfortunately, the number of attempts varies greatly across samples: some have as many as 300 attempts, while others have only a single attempt. For the high-attempt samples we have reasonable confidence in the observed success rate, but for the single-attempt samples we have almost none.
When I try to use this data to build a model, the model becomes heavily biased by single-attempt samples that happen to succeed (one success out of one attempt looks like a 100% success rate). So far my solution has just been to drop any sample with fewer than a minimum number of attempts (I arbitrarily chose 10), but this costs a significant fraction of the data (around 30% of samples).
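
Roughly, my current workaround looks like this (continuing from the toy DataFrame above):

```python
# Drop samples with fewer than a minimum number of attempts;
# the threshold of 10 is arbitrary.
MIN_ATTEMPTS = 10

filtered = df[df["attempts"] >= MIN_ATTEMPTS]
dropped = 1 - len(filtered) / len(df)
print(f"dropped {dropped:.0%} of samples")  # ~30% on the real dataset
```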
Any ideas on how I can better handle this problem?