Why is it that I get a better accuracy score when using different datasets with a Random Forest classifier?


Why do I get a better accuracy score on the test set when I run a Random Forest classifier on a dataset where the target is binary ('1' or '0') and only 20% of the 10,000-point training set has target '1', but a much lower accuracy score when I use a training set where the target is '1' 50% of the time?

For example, say I am trying to predict whether a car's value is greater than $15,000. I assign a '1' to every car (data point) valued at $15,000 or more and a '0' to anything less. With this labeling, 20% of the cars in the dataset get a '1'. I train on 70% of the data and test on the remaining 30%.

Then say I run a second experiment and instead predict whether a car's value is greater than $10,000: I assign a '1' to cars above $10,000 and a '0' to everything else. This time about 50% of the cars are above $10,000, so the target is '1' roughly half the time.
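To make the two labelings concrete, here is a minimal sketch with a synthetic stand-in for the car values (the lognormal distribution and its parameters are purely illustrative, picked so that roughly 20% of values exceed $15,000 and roughly 50% exceed $10,000):

```python
import numpy as np

# Hypothetical stand-in for the real car values; parameters are
# illustrative only, tuned so ~20% exceed $15k and ~50% exceed $10k.
rng = np.random.default_rng(0)
value = rng.lognormal(mean=9.21, sigma=0.48, size=10_000)

y_15k = (value > 15_000).astype(int)  # experiment 1: ~20% of targets are '1'
y_10k = (value > 10_000).astype(int)  # experiment 2: ~50% of targets are '1'
print(f"positives at $15k: {y_15k.mean():.0%}, at $10k: {y_10k.mean():.0%}")
```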

After evaluating both experiments with the 70% train / 30% test split, and even with 5-fold cross-validation, I get a significantly better accuracy score in the first experiment (predicting whether the value is greater than $15,000) than in the second.
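For reference, below is a minimal, self-contained sketch of the comparison using scikit-learn on synthetic stand-in data (the features, the value model, and all parameters are hypothetical, not my real dataset). I also put a majority-class DummyClassifier next to the Random Forest, since accuracy is usually read against that baseline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000

# Toy features (think age, mileage, ...) plus noise; 'value' depends on
# them, tuned so ~20% of cars exceed $15k and ~50% exceed $10k.
X = rng.normal(size=(n, 4))
value = np.exp(9.21 + 0.3 * X[:, 0] - 0.25 * X[:, 1] + 0.3 * rng.normal(size=n))

for threshold in (15_000, 10_000):
    y = (value > threshold).astype(int)  # '1' if above the threshold
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=0, stratify=y)

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

    print(f"threshold ${threshold:,}: positives={y.mean():.0%}  "
          f"RF accuracy={rf.score(X_te, y_te):.3f}  "
          f"majority-class baseline={dummy.score(X_te, y_te):.3f}")
```

Note that the majority-class baseline by itself already scores about 80% on the 20/80 labeling but only about 50% on the balanced one, so the two accuracy numbers start from very different floors.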

Any idea why?

submitted by mrlovell
