This is a simple question, but for some reason it is throwing me off. By "granularity" I mean the level of the data. For example, in the classic spam-classification example you have data at the email level (sender's IP address, sender's email domain, time of day sent, etc.) and data at the individual-word level.
So, the prediction you want to make is at the email level, but you have data at the word granularity for each email. At the email level, do you view each possible word as a categorical variable that takes the value 1 if the word is present in a given email and 0 otherwise? If you wanted to take a shot at modeling this with logistic regression, does that leave you with something like 100,000 dummy variables (if your dictionary has 100,000 words) for each email?
How does a model like this stay tractable, especially if the email-level data is informative? Would that data get drowned out by fitting 100,000 dummy variables?
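Here's a minimal sketch of what I'm picturing, just to make the question concrete (assuming scikit-learn/scipy; the corpus, labels, and email-level features are made up):

```python
# A minimal sketch, assuming scikit-learn/scipy; all data below is toy data.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

corpus = [
    "win a free prize now",
    "meeting rescheduled to friday",
    "free free free click here",
]
y = np.array([1, 0, 1])  # 1 = spam, 0 = not spam

# binary=True gives the {0,1} "word present or not" encoding:
# one column per word in the vocabulary, stored as a sparse matrix,
# so only the words actually present in each email take up memory.
vectorizer = CountVectorizer(binary=True)
X_words = vectorizer.fit_transform(corpus)   # shape (n_emails, vocab_size)

# Hypothetical email-level features: hour of day sent, some sender flag.
email_feats = np.array([
    [23, 1],
    [ 9, 0],
    [ 2, 1],
])

# Stack the word columns and the email-level columns side by side.
X = hstack([X_words, csr_matrix(email_feats)])

# Regularization (L1 or L2) is what I assume keeps ~100k word
# coefficients under control; liblinear handles sparse input.
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
clf.fit(X, y)
print(clf.coef_.shape)  # one coefficient per word + per email-level feature
```

Is that roughly the right mental model, or is there a smarter way to keep the handful of email-level features from being swamped by the word columns?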
I'm asking about this scenario in general; I know I could just Google methods for email classification.