Edit: forgot to mention: this is about auto-filling missing values in incomplete data for classification/clustering type problems
I don't really encounter this problem in my ML work, but I'm noticing this in Kaggle problems -- basically getting incomplete datasets that need to be sanitized or filled appropriately.
I wrote a few scripts (Python, R) to deal with the basic probs and it occurred to me that this might be useful to others. Wanted to make sure this hasn't already been done and that I'm not reinventing the wheel.
Current simple use case: MLFill("incomplete_file.csv") -> fits RandomForestClassifier ensembles, going through the csv starting from the most-filled columns and working outwards, so each newly filled column can help predict the next -> returns "filled_file.csv".
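To make that flow concrete, here is a rough sketch of how it might look with pandas and scikit-learn. This is not the actual MLFill script: the names `ml_fill` and `ml_fill_csv` are made up for illustration, and it assumes every column holds integer-coded class labels (real CSVs with strings or continuous values would need encoding or a regressor first). When no complete column exists yet to use as a feature, it falls back to the column's most common value, which is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def ml_fill(df: pd.DataFrame, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs column by column, most-complete first.

    Assumes all columns are integer-coded class labels.
    """
    filled = df.copy()
    # Order columns from fewest to most missing values.
    order = filled.isna().sum().sort_values(kind="stable").index.tolist()
    done = []  # columns that are complete and usable as features

    for col in order:
        missing = filled[col].isna()
        if not missing.any():
            done.append(col)
            continue
        known = filled.loc[~missing, col]
        if known.empty:
            continue  # nothing to learn from; leave the column untouched
        if done:
            # Train on rows where this column is known, predict the rest.
            model = RandomForestClassifier(
                n_estimators=100, random_state=random_state
            )
            model.fit(filled.loc[~missing, done], known)
            filled.loc[missing, col] = model.predict(filled.loc[missing, done])
        else:
            # Assumed fallback: no feature columns yet, use the most common value.
            filled.loc[missing, col] = known.mode().iloc[0]
        done.append(col)
    return filled


def ml_fill_csv(src: str, dst: str) -> None:
    """Hypothetical wrapper mirroring the file-in, file-out use case."""
    ml_fill(pd.read_csv(src)).to_csv(dst, index=False)
```

Calling `ml_fill_csv("incomplete_file.csv", "filled_file.csv")` would mirror the use case above. Note that errors compound: a bad guess in an early column becomes a training feature for every later one.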
Obviously this has huge garbage in, garbage out potential, but I figured if extended further it could be interesting -- push it as a simple web service, or maybe show confidence levels for each prediction. I dunno, someone else on Github can figure it out.
Besides, as a half-assed developer I can finally contribute something that's actually more up my alley. It's either this or doing some Kaggle contests I guess. :)