As the title says: how should I deal with a categorical feature vector whose feature groups have a growing number of categories?
Background: As an exercise, I would like to do binary classification on URLs, to figure out whether they are spam or benign.
(Some people say this may not be the best way to detect spam, but that's another issue. This is an exercise more than anything.)
So, given URIs like these:
Ftp://foobarhost:123/some/path
Http://1234Host.com/some/other/path?some=query&string=true
One of the feature groups will be the scheme (protocol), i.e. FTP | HTTP | HTTPS, etc.
I am representing this as a binary categorical feature vector, with one position per scheme, where each value is either 0 or 1:
<FTP, HTTP, HTTPS>
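To make that concrete, here is a minimal sketch of how I build the Scheme vector. The names `SCHEMES` and `scheme_one_hot` are just illustrative, and the fixed list of three schemes is an assumption for this example:

```python
from urllib.parse import urlsplit

# Assumed fixed list of known schemes; the order defines the vector positions.
SCHEMES = ["ftp", "http", "https"]

def scheme_one_hot(url, schemes=SCHEMES):
    """Return a 0/1 vector with a 1 at the position of the URL's scheme."""
    scheme = urlsplit(url).scheme.lower()
    return [1 if scheme == s else 0 for s in schemes]

ftp_vec = scheme_one_hot("Ftp://foobarhost:123/some/path")
http_vec = scheme_one_hot("Http://1234Host.com/some/other/path?some=query&string=true")
```

With this version, a scheme that is not in the list simply produces an all-zero vector.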
Then we have other feature vectors based on the other parts of the URI (host, path, query, etc.).
Note: The reason we keep a separate feature vector for each feature group is that, for example, the word Foo means completely different things depending on where in the URI we see it.
Once we have fully built all of these feature group vectors, we combine (concatenate) them to form our input. This works well.
BUT what can we do if, for example, we come across a new scheme/protocol such as FooBar:// and have to grow the Scheme feature group? If we grow that feature vector, then in the combined vector every index from the start of the next feature group onward would mean something completely different.
Example: Before growing, this is our categorical input vector:
[Scheme][Host bag of words ]
<Http, Ftp, foo, bar, moo, woof,....>
After growing our Scheme feature group, this is what it will look like:
[ Scheme ][Host bag of words ]
<Http, Ftp, **HTTPS**, foo, bar, moo, woof,....>
So everything from index 2 onward means something completely different than it did before growing the feature vector.
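Here is a small sketch that reproduces the shift, assuming the combined vector is a plain concatenation of the groups in a fixed order. The vocabularies and the helper `combined_index` are made up to match the example above:

```python
# Hypothetical vocabularies matching the example above.
scheme_vocab = ["http", "ftp"]
host_vocab = ["foo", "bar", "moo", "woof"]

def combined_index(groups):
    """Map (group name, token) to its position in the concatenated vector."""
    index = {}
    for name, vocab in groups:
        for token in vocab:
            index[(name, token)] = len(index)
    return index

before = combined_index([("scheme", scheme_vocab), ("host", host_vocab)])
after = combined_index([("scheme", scheme_vocab + ["https"]), ("host", host_vocab)])

# Features whose position moved when the Scheme group grew.
shifted = {key: (before[key], after[key]) for key in before if before[key] != after[key]}
```

Every host feature ends up in `shifted`, so anything trained on the old positions would be reading the wrong columns.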
You might say: just add it to the end of the combined feature vector. But then how do you keep track of which feature lives at which index?
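The only bookkeeping I can think of for the "add it to the end" idea is an append-only lookup keyed by (group, token), roughly like the sketch below. `FeatureIndex` is a hypothetical helper I made up, not something from a library, and I am not sure this is the right approach:

```python
class FeatureIndex:
    """Append-only map from (group, token) to a column in the combined vector."""

    def __init__(self):
        self._index = {}

    def add(self, group, token):
        """Register a feature if it is new and return its column."""
        key = (group, token)
        if key not in self._index:
            # New features always take the next free column at the end.
            self._index[key] = len(self._index)
        return self._index[key]

    def encode(self, features):
        """Build a 0/1 vector for an iterable of (group, token) pairs."""
        vec = [0] * len(self._index)
        for group, token in features:
            col = self._index.get((group, token))
            if col is not None:  # unknown features are ignored
                vec[col] = 1
        return vec

index = FeatureIndex()
for token in ["http", "ftp"]:
    index.add("scheme", token)
for token in ["foo", "bar", "moo", "woof"]:
    index.add("host", token)

old_vec = index.encode([("scheme", "http"), ("host", "foo")])
https_col = index.add("scheme", "https")  # the new category goes to the end
new_vec = index.encode([("scheme", "https"), ("host", "foo")])
```

Existing columns keep their meaning this way, but the groups are no longer contiguous, and vectors encoded before the growth are shorter and would need zero-padding to the new length.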
Sorry for the wall of text. I'm trying to explain the issue as best I can.