They train pretty fast to begin with and, during training, at each new tree node a portion of the observation matrix is evaluated one feature at a time. Unlike bagging, the trees can't be trained in parallel, but since the merit function is evaluated one feature at a time (i.e. looping through each feature), each feature (i.e. each column of the current node's observation matrix) could be sent to a different GPU, which would return the position and merit of its best dividing point. Then a simple max over the merit values returned across all the GPUs/features (a vector with one element per feature, as returned by the GPUs) does the trick. As I recall, all other operations are insignificant from a computational-expense point of view.
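To make the per-feature step concrete, here's a minimal single-GPU sketch (one thread per feature rather than one feature per GPU) under a couple of assumptions not in the post: a regression tree with squared loss, so the merit is variance reduction, and feature columns that are already sorted with the targets permuted to match. The kernel name and layout are illustrative, not from any existing library.

    // Sketch: each CUDA thread finds the best split point for one feature.
    // Assumes values[f*n .. f*n+n-1] holds feature f's values sorted ascending
    // and targets_sorted holds the targets permuted into the same order.
    #include <cfloat>
    #include <cuda_runtime.h>

    __global__ void best_split_per_feature(const float *values,          // n_features x n, sorted per feature
                                            const float *targets_sorted, // n_features x n, same permutation
                                            int n_features, int n,
                                            float *best_merit,           // out: one merit per feature
                                            int   *best_pos)             // out: split index per feature
    {
        int f = blockIdx.x * blockDim.x + threadIdx.x;
        if (f >= n_features) return;

        const float *x = values         + (size_t)f * n;
        const float *y = targets_sorted + (size_t)f * n;

        // Total sum of targets (recomputed per feature for simplicity).
        float total = 0.0f;
        for (int i = 0; i < n; ++i) total += y[i];

        float left_sum = 0.0f, merit = -FLT_MAX;
        int pos = -1;
        // Scan candidate splits: left = [0..i], right = [i+1..n-1].
        for (int i = 0; i < n - 1; ++i) {
            left_sum += y[i];
            if (x[i] == x[i + 1]) continue;      // can't split between equal values
            float right_sum = total - left_sum;
            int nl = i + 1, nr = n - nl;
            // Variance-reduction merit (up to a constant): S_L^2/n_L + S_R^2/n_R.
            float m = left_sum * left_sum / nl + right_sum * right_sum / nr;
            if (m > merit) { merit = m; pos = i; }
        }
        best_merit[f] = merit;
        best_pos[f]   = pos;
    }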
I would think this would result in a nearly linear speedup, maxing out at the number of GPU cores or features, whichever is less. The GPU code would be trivial (e.g. sort a feature vector and evaluate a merit function on it). This small routine could be buried in any standard implementation (i.e. it's just a small part of the tree; the rest of the tree code, not to mention all the boosting, would be generic/non-GPU as desired; a host-side sketch of that "simple max" step follows).
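And this is roughly what the host side of that buried routine could look like, continuing the same assumptions as the kernel above: launch, copy the per-feature merits back, and take the simple max on the CPU. Everything around it (node bookkeeping, boosting) stays ordinary non-GPU code.

    // Host-side driver sketch: pick the winning feature from the per-feature merits.
    #include <algorithm>
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    void choose_split(const float *d_values, const float *d_targets_sorted,
                      int n_features, int n)
    {
        float *d_merit; int *d_pos;
        cudaMalloc(&d_merit, n_features * sizeof(float));
        cudaMalloc(&d_pos,   n_features * sizeof(int));

        int threads = 128;
        int blocks  = (n_features + threads - 1) / threads;
        best_split_per_feature<<<blocks, threads>>>(d_values, d_targets_sorted,
                                                    n_features, n, d_merit, d_pos);

        std::vector<float> merit(n_features);
        std::vector<int>   pos(n_features);
        cudaMemcpy(merit.data(), d_merit, n_features * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(pos.data(),   d_pos,   n_features * sizeof(int),   cudaMemcpyDeviceToHost);

        // The "simple max on the merit values" across features.
        int best_f = std::max_element(merit.begin(), merit.end()) - merit.begin();
        printf("best feature %d, split after sorted index %d, merit %f\n",
               best_f, pos[best_f], merit[best_f]);

        cudaFree(d_merit);
        cudaFree(d_pos);
    }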
...and yet I can't seem to find a CUDA version. Am I missing something?