I ran into some odd behavior while training sklearn's GradientBoostingRegressor and making predictions. I will demonstrate the issue on a reduced dataset, but it also occurs on the larger dataset. Below are two small datasets adapted from the larger one. As you can see, the target variable is identical in both cases; the input variables differ, although their values are close to each other.
Dataset 1:

Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | target |
---|---|---|---|---|---|
101869.2 | 102119.9 | 102138.0 | 101958.3 | 101903.7 | 12384900 |
101809.1 | 102031.3 | 102061.7 | 101930.0 | 101935.2 | 11930700 |
101978.0 | 102208.9 | 102209.8 | 101970.0 | 101878.6 | 12116700 |
101869.2 | 102119.9 | 102138.0 | 101958.3 | 101903.7 | 12301200 |
102125.5 | 102283.4 | 102194.0 | 101884.8 | 101806.0 | 10706100 |
102215.5 | 102351.9 | 102214.0 | 101769.3 | 101693.6 | 10116900 |

Dataset 2:

Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | target |
---|---|---|---|---|---|
101876.0 | 102109.8 | 102127.6 | 101937.0 | 101868.4 | 12384900 |
101812.9 | 102021.2 | 102058.8 | 101912.9 | 101896.4 | 11930700 |
101982.5 | 102198.0 | 102195.4 | 101940.2 | 101842.5 | 12116700 |
101876.0 | 102109.8 | 102127.6 | 101937.0 | 101868.4 | 12301200 |
102111.3 | 102254.8 | 102182.8 | 101832.7 | 101719.7 | 10706100 |
102184.6 | 102320.2 | 102188.9 | 101699.9 | 101548.1 | 10116900 |
I have the following code:

    from sklearn import ensemble

    # Model trained and evaluated on the first dataset
    re1 = ensemble.GradientBoostingRegressor(n_estimators=40, max_depth=None, random_state=1)
    re1.fit(X1, Y)
    pred1 = re1.predict(X1)

    # Model trained and evaluated on the second dataset
    re2 = ensemble.GradientBoostingRegressor(n_estimators=40, max_depth=None, random_state=3)
    re2.fit(X2, Y)
    pred2 = re2.predict(X2)
where X1 is a pandas DataFrame holding Column 1 through Column 5 of the first dataset, X2 is a pandas DataFrame holding Column 1 through Column 5 of the second dataset, and Y is the target column.

The issue I am facing is that I cannot explain why pred1 is exactly the same as pred2. Since X1 and X2 are not the same, pred1 and pred2 should also be different, shouldn't they? Please help me find my false assumption.
P.S. To help you build the DataFrames, I wrote this code:

    import pandas as pd

    # Each list is one row: five feature values followed by the target
    d1 = {'0': [101869.2, 102119.9, 102138.0, 101958.3, 101903.7, 12384900],
          '1': [101809.1, 102031.3, 102061.7, 101930.0, 101935.2, 11930700],
          '2': [101978.0, 102208.9, 102209.8, 101970.0, 101878.6, 12116700],
          '3': [101869.2, 102119.9, 102138.0, 101958.3, 101903.7, 12301200],
          '4': [102125.5, 102283.4, 102194.0, 101884.8, 101806.0, 10706100],
          '5': [102215.5, 102351.9, 102214.0, 101769.3, 101693.6, 10116900]}
    data1 = pd.DataFrame(d1).T
    # .ix has been removed from pandas; .iloc[:, :5] selects the same columns 0-4
    X1 = data1.iloc[:, :5]
    Y = data1[5]

    d2 = {'0': [101876.0, 102109.8, 102127.6, 101937.0, 101868.4, 12384900],
          '1': [101812.9, 102021.2, 102058.8, 101912.9, 101896.4, 11930700],
          '2': [101982.5, 102198.0, 102195.4, 101940.2, 101842.5, 12116700],
          '3': [101876.0, 102109.8, 102127.6, 101937.0, 101868.4, 12301200],
          '4': [102111.3, 102254.8, 102182.8, 101832.7, 101719.7, 10706100],
          '5': [102184.6, 102320.2, 102188.9, 101699.9, 101548.1, 10116900]}
    data2 = pd.DataFrame(d2).T
    X2 = data2.iloc[:, :5]
    Y = data2[5]  # same target values as in data1
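For anyone who wants to reproduce this quickly, here is a minimal self-contained sketch of the comparison. It uses plain NumPy arrays instead of DataFrames, which is a simplification and not part of my original code. The helper `fit_and_predict` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Feature values copied from the two datasets in the question.
X1 = np.array([
    [101869.2, 102119.9, 102138.0, 101958.3, 101903.7],
    [101809.1, 102031.3, 102061.7, 101930.0, 101935.2],
    [101978.0, 102208.9, 102209.8, 101970.0, 101878.6],
    [101869.2, 102119.9, 102138.0, 101958.3, 101903.7],
    [102125.5, 102283.4, 102194.0, 101884.8, 101806.0],
    [102215.5, 102351.9, 102214.0, 101769.3, 101693.6],
])
X2 = np.array([
    [101876.0, 102109.8, 102127.6, 101937.0, 101868.4],
    [101812.9, 102021.2, 102058.8, 101912.9, 101896.4],
    [101982.5, 102198.0, 102195.4, 101940.2, 101842.5],
    [101876.0, 102109.8, 102127.6, 101937.0, 101868.4],
    [102111.3, 102254.8, 102182.8, 101832.7, 101719.7],
    [102184.6, 102320.2, 102188.9, 101699.9, 101548.1],
])
# Shared target column.
Y = np.array([12384900, 11930700, 12116700, 12301200, 10706100, 10116900], dtype=float)


def fit_and_predict(X, y, seed):
    """Fit on (X, y) and predict on the same X, as in the question."""
    model = GradientBoostingRegressor(n_estimators=40, max_depth=None, random_state=seed)
    model.fit(X, y)
    return model.predict(X)


pred1 = fit_and_predict(X1, Y, seed=1)
pred2 = fit_and_predict(X2, Y, seed=3)

print("inputs identical:     ", np.array_equal(X1, X2))
print("predictions identical:", np.allclose(pred1, pred2))
print(np.column_stack([Y, pred1, pred2]))
```

Note that both models are evaluated on the very rows they were trained on, and `max_depth=None` places no limit on tree depth. One hypothesis worth checking, which I have not confirmed, is that in this setup the predictions are driven mostly by Y and by which rows share identical feature values, rather than by the feature values themselves.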