I am looking into building a random forest model that will use both continuous and categorical variables. I have used scikit-learn for forests in the past, but I've never had to use categorical variables. Based on what I've read, these seem to be the immediately obvious pros and cons of R versus scikit-learn:
- R can handle categorical variables without any transformation
- R can't handle categorical variables that have more than 32 levels (and I'm using state), so for those variables the ability to use categoricals directly might not matter
- R can choose multiple levels from a category at a specific branch. If I have a dummy variable for each level in a category, this is not possible
- Creating a single numeric variable for categories in scikit-learn implies ordinality. Creating a dummy variable for each level of each categorical variable could drastically increase data size and dimensionality

It seems like there are pros and cons to each.
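
To make the two encodings I'm comparing concrete, here is a minimal sketch in plain Python (no scikit-learn or R calls). The `states` column and its values are made up for illustration; a real state variable would have around 50 levels rather than 4.

```python
# Hypothetical toy column; the values are made up for illustration.
states = ["CA", "NY", "TX", "CA", "WA", "NY"]

# Distinct levels in a fixed (alphabetical) order.
levels = sorted(set(states))

# Option 1: a single integer code per level.
# This is one column, but it implies an order: CA < NY < TX < WA.
code_of = {level: i for i, level in enumerate(levels)}
integer_coded = [code_of[s] for s in states]

# Option 2: one 0/1 dummy column per level.
# No implied order, but the width grows with the number of levels.
dummy_coded = [[1 if s == level else 0 for level in levels] for s in states]

print(integer_coded)         # one value per row
print(len(dummy_coded[0]))   # one column per level
```

With the integer codes, a tree can only split on a threshold of that arbitrary order; with the dummies, each split can only ask about one level at a time.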
Unless my understanding of ordinality in integer-labeled categories is wrong, it seems like a single branch in a scikit-learn tree can't accomplish the same level of interaction that one in R could, because R could split on multiple values of a categorical variable instead of one value at a time. At the same time, I wouldn't even be able to pass a few of my categories into R because of the level limit. With sufficiently deep trees and sufficiently large forests, would this difference in implementation even matter?
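
As a rough way to quantify the gap at a single branch, this sketch counts how many distinct two-way partitions of `k` levels each approach can express in one split. The function names are hypothetical, and it assumes the three split types are: a threshold on integer codes, a split on one dummy column, and an arbitrary subset of levels.

```python
def threshold_partitions(k):
    # Integer codes 0..k-1: a split is "code <= t", giving k-1 cut points.
    return k - 1

def dummy_partitions(k):
    # One dummy per level: each split isolates one level from the rest.
    # With exactly 2 levels both dummies describe the same partition.
    return 1 if k == 2 else k

def subset_partitions(k):
    # Any non-empty proper subset of levels versus its complement,
    # counted once per unordered pair: 2^(k-1) - 1.
    return 2 ** (k - 1) - 1

for k in (4, 32, 50):
    print(k, threshold_partitions(k), dummy_partitions(k), subset_partitions(k))
```

This only counts what one split can express. It says nothing about what a sequence of splits in a deeper tree can recover, which is the part I'm unsure about.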
Any thoughts would be appreciated. Thanks.