I have a dataset with both labeled and unlabeled examples. From my knowledge of the domain, I know that some of the examples' features strongly affect whether an example ends up labeled or unlabeled, so the labeled subset is a heavily biased sample and the resulting predictions have serious errors. How can I use this knowledge of the relationship between the features and the probability that an example has a label to reduce the bias and the prediction errors?
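To make the question concrete, here is a minimal sketch of one technique that might apply, inverse propensity weighting: estimate P(labeled | features) and weight each labeled example by its inverse. This is an assumption about what could work, not a known answer. The toy data, the single binary feature, and the function names `estimate_propensity` and `ipw_mean` are all hypothetical and only for illustration.

```python
from collections import Counter

def estimate_propensity(features, is_labeled):
    """Estimate P(labeled | feature value) by counting (hypothetical helper)."""
    total = Counter(features)
    labeled = Counter(f for f, lab in zip(features, is_labeled) if lab)
    return {f: labeled[f] / total[f] for f in total}

def ipw_mean(features, labels, is_labeled, propensity):
    """Weighted mean of the labels, using weights 1 / P(labeled | feature)."""
    num = den = 0.0
    for f, y, lab in zip(features, labels, is_labeled):
        if not lab:
            continue  # unlabeled examples contribute only to the propensity estimate
        w = 1.0 / propensity[f]
        num += w * y
        den += w
    return num / den

# Toy data (made up): 100 examples with feature 0 and 100 with feature 1.
# Only 20% of the feature-0 examples are labeled, but 80% of the feature-1 ones.
features = [0] * 100 + [1] * 100
is_labeled = [True] * 20 + [False] * 80 + [True] * 80 + [False] * 20
# In this toy setup the label equals the feature; unlabeled examples get None.
labels = [f if lab else None for f, lab in zip(features, is_labeled)]

labeled_values = [y for y, lab in zip(labels, is_labeled) if lab]
naive = sum(labeled_values) / len(labeled_values)  # biased toward feature 1

propensity = estimate_propensity(features, is_labeled)
weighted = ipw_mean(features, labels, is_labeled, propensity)

print(f"naive estimate of P(y=1):    {naive:.2f}")
print(f"weighted estimate of P(y=1): {weighted:.2f}")
```

In this toy setup the naive estimate comes out near 0.8 while the weighted one is near the true population rate of 0.5. The same weights could presumably be passed to a classifier as per-example weights, for instance through the `sample_weight` argument that scikit-learn estimators accept in `fit`. With continuous or many features, the counting step would have to be replaced by a model of P(labeled | features), and very small propensities would make the weights unstable.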