Why do I get a better accuracy score on the test set when running a Random Forest classifier on a dataset where the target is either '1' or '0' and only 20% of the 10,000 training points are labeled '1', yet a much lower accuracy score when I use a training set where the target is '1' 50% of the time?
For example, say I am trying to predict whether a car's value is at least $15,000. In my data I assign a '1' to every car (data point) valued at $15,000 or more and a '0' to anything less. About 20% of the cars in the dataset meet that threshold, and I train on 70% of the data and test on the remaining 30%.
Then, in a second experiment, I change what I am trying to predict: whether a car's value is greater than $10,000 instead of $15,000, so I assign a '1' to cars worth more than $10,000 and a '0' to everything else. This time about 50% of the cars are above $10,000, so the target is '1' 50% of the time.
After doing the 70% train / 30% test split, and even 5-fold cross-validation, in both cases, I get a significantly better accuracy score in the first experiment (predicting whether a car's value is greater than $15,000) than in the second.
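Here's roughly my setup, sketched with scikit-learn. The file name and column names are made up, but the thresholds, split, and cross-validation match what I described:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# 'cars.csv' and the 'value' column are placeholders for my real data
cars = pd.read_csv("cars.csv")
X = cars.drop(columns=["value"])

for threshold in (15000, 10000):
    # Label a car '1' if its value is at or above the threshold, else '0'
    # (~20% positives at $15,000, ~50% positives at $10,000 in my data)
    y = (cars["value"] >= threshold).astype(int)

    # 70% train / 30% test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, clf.predict(X_test))

    # 5-fold cross-validation accuracy on the same data
    cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

    print(f"threshold ${threshold}: test acc {test_acc:.3f}, "
          f"5-fold CV acc {cv_acc:.3f}")
```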
Any idea why?