I'm working on the Kaggle Avazu CTR Prediction competition. The training dataset is much larger than anything I'm used to dealing with; it won't fit in my machine's memory. I'm looking for some pointers on how to approach the problem, and this forum seems like a better place to pose this than StackOverflow (where I can go for implementation details). I've cross-posted this on the Kaggle forums as well, so we'll see if that helps.
Conceptually, here's what I think I have to do:

1) Split the training csv into two sets using a random 70/30 split, ideally a different split for each training run.
2) Use a generator to yield the lines of the resulting sets one by one for encoding.
3) Use another generator to feed the output of the previous generator into sklearn for training / cross-validation.
4) Use another dual-generator setup on the test set to write out predictions.
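To make that concrete, here's a rough sketch of what I was picturing for steps 1 and 2. I'm streaming the split on the fly rather than actually writing two files, which may or may not be the right tradeoff, and the fraction, seed, and function names are just placeholders:

```python
import csv
import random

def split_rows(path, train_frac=0.7, seed=None):
    """Stream rows from the csv and assign each one to 'train' or 'valid'
    on the fly, so I never materialize two big intermediate files."""
    rng = random.Random(seed)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield ("train" if rng.random() < train_frac else "valid"), row

def rows_in_split(path, which="train", **kwargs):
    """Yield only the rows that landed in the requested split."""
    for split, row in split_rows(path, **kwargs):
        if split == which:
            yield row
```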
Is there a Pythonic way to handle this problem at scale? Does this put me outside the realm of sklearn and into hand-rolled model implementations? I'd love any advice you can offer on this problem.
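For what it's worth, I did notice that some sklearn estimators expose partial_fit, so steps 3 and 4 in my head look roughly like the sketch below (it reuses the generator from above; the FeatureHasher size, the loss name, the batch size, and the way I handle the 'click' label and 'id' columns are just guesses on my part):

```python
from itertools import islice
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2 ** 20)        # hashes each "column=value" pair into a fixed-size space
clf = SGDClassifier(loss="log_loss")              # logistic regression via SGD ("log" in older sklearn)

def minibatches(rows, size=10_000):
    """Group the streamed rows into small batches for partial_fit."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch

for batch in minibatches(rows_in_split("train.csv", "train", seed=42)):
    y = [int(row.pop("click")) for row in batch]  # 'click' is the label column in the Avazu data
    for row in batch:
        row.pop("id", None)                       # drop the unique row id so it isn't hashed as a feature
    X = hasher.transform(batch)                   # remaining string fields become hashed categorical features
    clf.partial_fit(X, y, classes=[0, 1])
```

My thinking was that the same FeatureHasher would then be applied to the streamed test rows and clf.predict_proba used to write out predictions, if I understand the API correctly.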