Channel: Machine Learning

How should I deal with a categorical feature vector whose feature groups have a growing number of categories?


So, as the title says: how should I deal with a categorical feature vector whose feature groups have a growing number of categories?

Background: As an exercise, I would like to do binary classification on URLs, to figure out whether they are spam or benign.
(Some people say this may not be the best way to detect spam... but that's another issue. This is an exercise more than anything.)

So, given URIs like these:

Ftp://foobarhost:123/some/path

Http://1234Host.com/some/other/path?some=query&string=true

One of the feature groups will be the scheme (protocol), i.e. FTP | HTTP | HTTPS etc.

I am representing this as a binary categorical feature vector, where each value is either 0 or 1:

<FTP, HTTP, HTTPS>
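
A minimal sketch of that encoding, assuming a fixed vocabulary in the order shown above and case-insensitive matching. The names `SCHEMES` and `encode_scheme` are made up for illustration.

```python
# Hypothetical fixed vocabulary for the Scheme feature group (order taken from the post).
SCHEMES = ["ftp", "http", "https"]

def encode_scheme(scheme, vocabulary=SCHEMES):
    """Return a 0/1 list with a 1 at the position of the matching scheme."""
    scheme = scheme.lower()
    return [1 if scheme == known else 0 for known in vocabulary]

print(encode_scheme("Ftp"))     # [1, 0, 0]
print(encode_scheme("Http"))    # [0, 1, 0]
print(encode_scheme("FooBar"))  # [0, 0, 0], an unseen scheme has no slot yet
```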

Then we have other feature vectors based on the other parts of the URI (host, path, query, etc.).

Note: The reason we keep separate feature vectors for each feature group is that, for example, the word Foo means completely different things depending on where in the URI we see it.
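
One way to get those per-group tokens is sketched below using the standard library's `urllib.parse.urlsplit`. The tokenization rule (lowercase, split on anything that is not a letter or digit) and the helper name `feature_groups` are assumptions, not something the setup above prescribes.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters (assumed rule)."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

def feature_groups(uri):
    """Split a URI into named groups of tokens, one group per URI part."""
    parts = urlsplit(uri)
    return {
        "scheme": [parts.scheme],
        "host": tokenize(parts.hostname or ""),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }

groups = feature_groups("Http://1234Host.com/some/other/path?some=query&string=true")

# Keying each token by its group keeps "some" in the path apart from "some" in the query.
keyed = {(group, token) for group, tokens in groups.items() for token in tokens}
print(("path", "some") in keyed, ("query", "some") in keyed)  # True True
```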

Once all of these feature group vectors are fully built, we combine them into a single input vector. This works well.

BUT what can we do if, for example, we come across a new scheme/protocol such as FooBar:// and have to grow the Scheme feature group? If we grow that feature vector, then in the combined vector the index where the next feature group used to start now means something completely different.

Example: Before growing, this is our categorical input vector:

[Scheme][Host bag of words ]

<Http, Ftp, foo, bar, moo, woof,....>

After growing our Scheme feature group, this is what it looks like:

[ Scheme ][Host bag of words ]

<Http, Ftp, **HTTPS**, foo, bar, moo, woof,....>

So every feature after the second position now sits at a different index and means something completely different from what it did before the feature vector grew.
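
The shift can be reproduced in a few lines, using the example vocabularies above and 0-based indices. The `combine` helper is hypothetical and just concatenates the group vocabularies.

```python
def combine(scheme_vocab, host_vocab):
    """Concatenate the group vocabularies into one combined feature layout."""
    return scheme_vocab + host_vocab

host_vocab = ["foo", "bar", "moo", "woof"]
before = combine(["Http", "Ftp"], host_vocab)
after = combine(["Http", "Ftp", "HTTPS"], host_vocab)

print(before.index("foo"))  # 2
print(after.index("foo"))   # 3
print(after[2])             # HTTPS now occupies the slot that used to mean "foo"
```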

You might say: just add it to the end of the combined feature vector... but then how do you keep track of which index that particular feature ended up at?
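
One possible way to do that bookkeeping, sketched under assumptions, is an append-only dictionary keyed by (group, value), so every feature keeps the column it was first given no matter which group grows later. The `FeatureIndex` class and its method names are invented for this sketch and are not from any library.

```python
class FeatureIndex:
    """Append-only mapping from (group, value) to a column index (hypothetical helper)."""

    def __init__(self):
        self._index = {}

    def __len__(self):
        return len(self._index)

    def lookup(self, group, value, grow=True):
        """Return the column for a feature, appending a new column if allowed."""
        key = (group, value)
        if key not in self._index:
            if not grow:
                return None
            self._index[key] = len(self._index)  # new features always go at the end
        return self._index[key]

    def encode(self, features):
        """Return a 0/1 vector of the current length for (group, value) pairs."""
        vector = [0] * len(self)
        for group, value in features:
            column = self.lookup(group, value, grow=False)
            if column is not None:
                vector[column] = 1
        return vector

index = FeatureIndex()
for key in [("scheme", "http"), ("scheme", "ftp"), ("host", "foo"), ("host", "bar")]:
    index.lookup(*key)

foo_before = index.lookup("host", "foo")     # 2
https_col = index.lookup("scheme", "https")  # 4, appended at the end
foo_after = index.lookup("host", "foo")      # still 2
print(index.encode([("scheme", "https"), ("host", "foo")]))  # [0, 0, 1, 0, 1]
```

With this layout the groups are no longer contiguous blocks in the combined vector, so the dictionary itself has to be stored next to the data. Vectors built before a new feature was added are simply shorter and can be padded with zeros.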

Sorry for the wall of text; I'm trying to explain the issue as best I can.

submitted by kingnebula
