I'm working on the Kaggle Avazu CTR Prediction competition. The training dataset is much larger than anything I'm used to dealing with; it won't fit in my machine's memory. I'm looking for some pointers on how to approach the problem, and this forum seems like a better place to pose this than StackOverflow (where I can go for implementation details). I've cross-posted this on the Kaggle forums as well, so we'll see if that helps.
Conceptually, here's what I think I have to do:

1) Split the training csv into two sets using a random 70/30 split, ideally a different split for each training run.
2) Use a generator to yield the lines of the resulting sets one by one for encoding.
3) Use another generator to feed the output of the previous generator into sklearn for training / cross-validation.
4) Use another dual-generator setup on the test set to write out predictions.
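To make that concrete, here's a rough sketch of what I was picturing for steps 1 and 2. I'm streaming the split on the fly rather than actually writing two files, which may or may not be the right tradeoff, and the fraction, seed, and function names are just placeholders:

```python
import csv
import random

def split_rows(path, train_frac=0.7, seed=None):
    """Stream rows from the csv and assign each one to 'train' or 'valid'
    on the fly, so I never materialize two big intermediate files."""
    rng = random.Random(seed)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield ("train" if rng.random() < train_frac else "valid"), row

def rows_in_split(path, which="train", **kwargs):
    """Yield only the rows that landed in the requested split."""
    for split, row in split_rows(path, **kwargs):
        if split == which:
            yield row
```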
Is there a Pythonic way to handle this problem at scale? Does this put me outside the realm of sklearn and into hand-rolled model implementations? I'd love any advice you can offer on this problem.
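For what it's worth, I did notice that some sklearn estimators expose partial_fit, so steps 3 and 4 in my head look roughly like the sketch below (it reuses the generator from above; the FeatureHasher size, the loss name, the batch size, and the way I handle the 'click' label and 'id' columns are just guesses on my part):

```python
from itertools import islice
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2 ** 20)        # hashes each "column=value" pair into a fixed-size space
clf = SGDClassifier(loss="log_loss")              # logistic regression via SGD ("log" in older sklearn)

def minibatches(rows, size=10_000):
    """Group the streamed rows into small batches for partial_fit."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch

for batch in minibatches(rows_in_split("train.csv", "train", seed=42)):
    y = [int(row.pop("click")) for row in batch]  # 'click' is the label column in the Avazu data
    for row in batch:
        row.pop("id", None)                       # drop the unique row id so it isn't hashed as a feature
    X = hasher.transform(batch)                   # remaining string fields become hashed categorical features
    clf.partial_fit(X, y, classes=[0, 1])
```

My thinking was that the same FeatureHasher would then be applied to the streamed test rows and clf.predict_proba used to write out predictions, if I understand the API correctly.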