Proper splitting of data set for Ensemble methods (question)

I have 10,000 documents. Each document has a label (Y) that is either 0 or 1 (the 0-1 split is pretty much 50/50 over my 10,000 documents). Each document has 10 fields. Each field can have any number of words in it. I create 10 feature-spaces by fitting a tf-idf over each field individually over all documents. It looks something like this:

    For f_1:                          For f_2:
               34,974                            113,351
           -------------------               -------------------
           |             |   |               |             |   |
    10,000 |     X_1     | Y |        10,000 |     X_2     | Y |
           |             |   |               |             |   |
           -------------------               -------------------

And so on for each field. Each matrix will have 10,000 rows, but a different number of columns. The Y column is always the same. I'm interested in using each of these matrices as the input to a classifier, and using some ensemble of them to predict the labels Y.
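For concreteness, the per-field setup described above could be sketched in sklearn roughly as follows. This is only an illustration: the toy `docs` list, the field names `title` and `body`, and the helper `build_field_matrices` are all made up here, and the real data has 10 fields and 10,000 documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real corpus (hypothetical field names and contents).
docs = [
    {"title": "cheap flights to paris", "body": "book cheap flights and hotels today"},
    {"title": "paris travel guide", "body": "museums cafes and walking tours"},
    {"title": "learn python fast", "body": "python tutorials for beginners"},
    {"title": "python data science", "body": "pandas numpy and sklearn tutorials"},
]
fields = ["title", "body"]


def build_field_matrices(docs, fields):
    """Fit one TfidfVectorizer per field over all documents.

    Returns a dict mapping field name -> sparse tf-idf matrix.
    """
    matrices = {}
    for field in fields:
        vectorizer = TfidfVectorizer()
        matrices[field] = vectorizer.fit_transform([d[field] for d in docs])
    return matrices


X_by_field = build_field_matrices(docs, fields)
for field, X in X_by_field.items():
    # Every matrix has one row per document; the column count is the
    # size of that field's vocabulary, so it differs between fields.
    print(field, X.shape)
```

Each field gets its own vocabulary, which is why the matrices share a row count but not a column count.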

My initial approach was to choose a random 70% of the 10,000 documents as the training set, and then use the other 30% as my prediction set. My plan was to train a logistic regression model on 70% of X_1 and then have that model predict the labels of the remaining 30% of X_1 to give me Y'_lr1. I would use the same 70% to train a random forest, and then have the random forest predict the 30% to give Y'_rf1. I would use the same 70%/30% split of rows to train/predict a logistic regression and a random forest on X_2 through X_10. In the end I would have some matrix:

                                  20
          -------------------------------------------------------
          |                                                 |   |
    3,000 | Y'_lr1 Y'_rf1 Y'_lr2 Y'_rf2 ... Y'_lr10 Y'_rf10 | Y |
          |                                                 |   |
          -------------------------------------------------------
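A minimal sketch of how such a prediction matrix might be assembled, assuming small random dense arrays as stand-ins for the real tf-idf matrices (three fields instead of ten, 200 rows instead of 10,000). The names `X_fields`, `train_idx`, `test_idx` and `Y_prime` are illustrative and not from the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_docs = 200
y = rng.integers(0, 2, size=n_docs)

# Synthetic stand-ins for X_1, X_2, X_3: same rows, different widths,
# shifted slightly by the label so the models have something to learn.
X_fields = [rng.normal(size=(n_docs, width)) + 0.5 * y[:, None]
            for width in (5, 8, 12)]

# One shared 70/30 split of row indices, reused for every field.
train_idx, test_idx = train_test_split(
    np.arange(n_docs), test_size=0.3, random_state=0
)

columns = []
for X in X_fields:
    base_models = (
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    for model in base_models:
        model.fit(X[train_idx], y[train_idx])
        columns.append(model.predict(X[test_idx]))

# One row per held-out document, one column per (field, model) pair.
Y_prime = np.column_stack(columns)
print(Y_prime.shape)
```

Splitting the row indices once and indexing every field matrix with them keeps the rows of all the Y' columns aligned with the same held-out labels `y[test_idx]`.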

I then trained a final logistic regression to predict Y from the 20 Y' columns. Is this a normal technique? Or should I be training many models and adding a further voting stage to get each Y'?
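The final step might look something like the sketch below. The Y' matrix here is synthetic (each column is a copy of Y with roughly 30% of the labels flipped), so the sizes and the name `meta` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_rows, n_base = 300, 20
y = rng.integers(0, 2, size=n_rows)

# Synthetic stand-in for the Y' matrix: each of the 20 columns copies Y
# but flips about 30% of its entries, mimicking imperfect base models.
flips = rng.random(size=(n_rows, n_base)) < 0.3
Y_prime = np.where(flips, 1 - y[:, None], y[:, None])

# Final logistic regression that predicts Y from the 20 Y' columns.
meta = LogisticRegression(max_iter=1000)
meta.fit(Y_prime, y)

# One learned weight per base-model column.
print(meta.coef_.shape)
```

Because the final model in this sketch is fit on every row of Y', there are no rows left over on which to score it.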

Any help is appreciated. Most of the sources I find talk about drawing N rows with replacement, and then merging those models, but in this case, I have many models training on different features. I don't know how much of a difference this makes.

I'm using sklearn on Python if that makes any advice easier to relay!

Sorry for the length, but I wanted to be detailed. Thanks!

submitted by Serious_is_a_star