Channel: Machine Learning

Proper splitting of data set for Ensemble methods (question)

I have 10,000 documents. Each document has a label (Y) that is either 0 or 1 (the 0/1 split is roughly 50/50 across the 10,000 documents). Each document has 10 fields, and each field can contain any number of words. I create 10 feature spaces by fitting a separate tf-idf on each field across all documents. It looks something like this:

For f_1:                                For f_2:

              34,974                                  113,351
         -----------------------                 -----------------------
         |               |     |                 |               |     |
  10,000 |      X_1      |  Y  |          10,000 |      X_2      |  Y  |
         |               |     |                 |               |     |
         -----------------------                 -----------------------

And so on for each field. Each matrix has 10,000 rows but a different number of columns. The Y column is always the same. I'm interested in using each of these matrices as the input to a classifier, and using some ensemble of those classifiers to predict the labels Y.
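To make the setup concrete, here is a minimal sketch of the per-field tf-idf step. The documents, field names ("title", "body") and labels are made-up stand-ins for my real data, and I only show two fields instead of ten.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the 10,000 documents: each document maps field name -> text.
# Field names, texts and labels are invented for illustration only.
docs = [
    {"title": "cheap flights to paris", "body": "book cheap flights today"},
    {"title": "machine learning news", "body": "new ensemble methods released"},
    {"title": "paris travel guide", "body": "what to see in paris"},
    {"title": "learning python", "body": "python for machine learning"},
]
labels = [0, 1, 0, 1]
fields = ["title", "body"]

# One vectorizer per field, each fitted on that field across all documents.
vectorizers = {}
matrices = {}
for field in fields:
    vec = TfidfVectorizer()
    matrices[field] = vec.fit_transform([d[field] for d in docs])
    vectorizers[field] = vec

# Every matrix has one row per document but its own number of columns.
for field in fields:
    print(field, matrices[field].shape)
```

Each entry of `matrices` plays the role of one X_i above: the row count is shared, the column count depends on that field's vocabulary.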

My initial approach was to choose a random 70% of the 10,000 documents as the training set and use the other 30% as the test set. My plan was to train a logistic regression model on the 70% of X_1 and then have that model predict the labels of the remaining 30% of X_1 to give me Y'_lr1. I would use the same 70% to train a random forest, and then have the random forest predict the 30% to give Y'_rf1. I would use the same 70%/30% split of rows to train/predict a logistic regression and a random forest on each of X_2 through X_10. In the end I would have a matrix like this:

                                    20
         ---------------------------------------------------------------
         |                                                       |     |
   3,000 | Y'_lr1  Y'_rf1  Y'_lr2  Y'_rf2  ...  Y'_lr10  Y'_rf10 |  Y  |
         |                                                       |     |
         ---------------------------------------------------------------
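In code, the procedure above would look roughly like the sketch below. It is not my actual script: the data is synthetic (`make_classification`), I use 3 feature blocks of 10 dense columns instead of 10 sparse tf-idf matrices, and the model settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

seed = 0
n_docs, n_fields = 500, 3

# Synthetic stand-ins for X_1..X_k: column blocks of one generated matrix.
X_all, y = make_classification(
    n_samples=n_docs, n_features=30, n_informative=12, random_state=seed
)
X_fields = np.split(X_all, n_fields, axis=1)  # three blocks of 10 columns

# One shared 70/30 split of row indices, reused for every field.
train_idx, test_idx = train_test_split(
    np.arange(n_docs), test_size=0.3, random_state=seed, stratify=y
)

# For each field, train both models on the 70% and predict the 30%.
columns = []
for X in X_fields:
    for model in (
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=50, random_state=seed),
    ):
        model.fit(X[train_idx], y[train_idx])
        columns.append(model.predict(X[test_idx]))

meta_X = np.column_stack(columns)  # one column per (field, model) pair
meta_y = y[test_idx]
print(meta_X.shape)  # (150, 6)
```

With 10 fields and the real split this gives the 3,000 x 20 matrix drawn above.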

I then trained a final logistic regression to predict Y from the 20 Y' columns. Is this a normal technique? Or should I be training many models per field and adding a further voting stage to get each Y'?
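For reference, the final stage is sketched below. The 3,000 x 20 matrix is simulated (each column agrees with Y about 70% of the time, a number I picked arbitrarily), and scoring the final model with 5-fold cross-validation over those 3,000 rows is my assumption about how to evaluate it, since fitting and scoring on the same rows would presumably look better than it should.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_rows, n_base = 3000, 20

# Simulated labels and base-model predictions (not real results):
# each of the 20 columns matches y in roughly 70% of rows.
y = rng.integers(0, 2, size=n_rows)
agree = rng.random((n_rows, n_base)) < 0.7
meta_X = np.where(agree, y[:, None], 1 - y[:, None])

# Final logistic regression on the 20 Y' columns, scored on held-out folds.
meta_model = LogisticRegression(max_iter=1000)
scores = cross_val_score(meta_model, meta_X, y, cv=5)
print(scores.mean())
```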

Any help is appreciated. Most of the sources I find talk about drawing N rows with replacement, training a model on each sample, and then merging those models. In my case, though, the models differ in the features they are trained on rather than in the rows. I don't know how much of a difference this makes.
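As I understand it, the row-resampling approach those sources describe looks something like this sketch. The data is synthetic, and the choices of 11 models and a majority vote as the merge step are mine, purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# Every model sees all features but a different bootstrap sample of rows.
models = []
for _ in range(11):
    rows = rng.integers(0, len(X_train), size=len(X_train))  # N rows, with replacement
    models.append(LogisticRegression(max_iter=1000).fit(X_train[rows], y_train[rows]))

# Merge the models by majority vote over their predicted labels.
votes = np.stack([m.predict(X_test) for m in models])  # shape (11, 100)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_pred == y_test).mean())
```

Here the models differ only in which rows they saw, whereas in my setup every model sees the same rows and the feature space is what changes.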

I'm using sklearn in Python, if that makes any advice easier to relay!

Sorry for the length, but I wanted to be detailed. Thanks!

submitted by Serious_is_a_star