This may be naive on my part, but I thought I'd ask anyway...
One of the steps in gradient boosting algorithms is to form the "pseudo-residuals" by taking the negative partial derivative of the loss function with respect to f(x), evaluated at the current value of f(x), like so: http://upload.wikimedia.org/math/0/b/e/0bebe45631e9a1c4ed693590d60829c0.png
My question is: why wouldn't you just fit a new model to the actual residuals (the difference between the training labels and the current training predictions)? Why wouldn't that work well?
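To make the two quantities concrete, here is a minimal NumPy sketch, not taken from any particular boosting library. It computes the actual residuals alongside the pseudo-residuals (the negative gradient of the loss with respect to the current predictions) for two losses. The function names, the toy arrays `y` and `f`, and the finite-difference helper are all hypothetical and only for illustration. Under these assumptions, the pseudo-residuals for the squared-error loss 0.5 * (y - f)^2 equal the actual residuals, while for the absolute-error loss they reduce to the sign of the residuals.

```python
import numpy as np


def squared_error(y, f):
    # Per-example loss L = 0.5 * (y - f)^2 (the 0.5 is a convention assumed here).
    return 0.5 * (y - f) ** 2


def absolute_error(y, f):
    # Per-example loss L = |y - f|.
    return np.abs(y - f)


def negative_gradient(loss, y, f, eps=1e-6):
    # Hypothetical helper: central finite-difference estimate of -dL/df,
    # evaluated element-wise at the current predictions f.
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


# Toy training labels and current predictions (made-up numbers).
y = np.array([3.0, -1.0, 2.0, 10.0])
f = np.array([2.5, 0.0, 2.0, 4.0])

residuals = y - f  # the "actual residuals" from the question

sq_pseudo = negative_gradient(squared_error, y, f)
abs_pseudo = negative_gradient(absolute_error, y, f)

print("actual residuals:         ", residuals)
print("pseudo-residuals (squared):", np.round(sq_pseudo, 4))
print("pseudo-residuals (abs):    ", np.round(abs_pseudo, 4))
```

In this sketch the large residual on the last example dominates the squared-error targets but contributes only a magnitude-one value under absolute error, which is one way to see that the two sets of targets can differ once the loss is not squared error.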