How should I deal with a categorical feature vector whose feature groups have a growing number of categories?

So, what the title says: how should I deal with a categorical feature vector whose feature groups have a growing number of categories?

Background: As an exercise, I would like to do binary classification on URLs, to figure out whether they are spam or benign.
(Some people say this may not be the best way to detect spam... but that's another issue. This is an exercise more than anything.)

So, given URIs like these:

Ftp://foobarhost:123/some/path

Http://1234Host.com/some/other/path?some=query&string=true

One of the feature groups will be the Scheme (Protocol), i.e. FTP | HTTP | HTTPS etc.

I am representing this as a binary categorical feature vector where each value is either 0 or 1:

<FTP, HTTP, HTTPS>
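
To make that concrete, here is a minimal sketch of the Scheme encoding in Python. The `SCHEMES` list and the `encode_scheme` helper are names made up for illustration, not anything from an existing library:

```python
from urllib.parse import urlsplit

# Hypothetical fixed vocabulary for the Scheme feature group.
SCHEMES = ["ftp", "http", "https"]

def encode_scheme(uri):
    """Return a 0/1 vector with a 1 at the position of the URI's scheme."""
    scheme = urlsplit(uri).scheme.lower()
    return [1 if scheme == known else 0 for known in SCHEMES]

print(encode_scheme("Ftp://foobarhost:123/some/path"))
print(encode_scheme("Http://1234Host.com/some/other/path?some=query&string=true"))
```

With this sketch, a scheme that is not in `SCHEMES` simply encodes as all zeros, which is exactly the case the rest of the question is about.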

Then we have other feature vectors based on the other parts of the URI (host, path, query, etc.).

Note: The reason we keep separate feature vectors for each feature group is that, for example, the word Foo means completely different things depending on where in the URI we see it.

... When we have fully built all of these feature group vectors, we combine them into a single input vector. This works well.

BUT, what can we do if, for example, we come across a new Scheme/Protocol FooBar:// and have to grow the Scheme feature group? If we grow that feature vector, then in the combined vector the index where the next feature group starts shifts, so every index from there on would mean something completely different.

Example: Before growing, this is our categorical input vector:

[Scheme][Host bag of words ]

<Http, Ftp, foo, bar, moo, woof,....>

After growing our Scheme feature group, this is what it will look like:

[ Scheme ][Host bag of words ]

<Http, Ftp, **HTTPS**, foo, bar, moo, woof,....>

So everything after index 2 means something completely different from what it meant before growing the feature vector.
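
A small Python sketch of the shift, assuming the combined vector is built by plain concatenation of the group vocabularies. `build_index` is a hypothetical helper, and the positions it prints are 0-based:

```python
def build_index(scheme_vocab, host_vocab):
    """Concatenate the group vocabularies and map each (group, value) to its position."""
    combined = [("scheme", s) for s in scheme_vocab] + [("host", w) for w in host_vocab]
    return {feature: position for position, feature in enumerate(combined)}

host_words = ["foo", "bar", "moo", "woof"]
before = build_index(["http", "ftp"], host_words)
after = build_index(["http", "ftp", "https"], host_words)

# The same host word lands on a different column once the Scheme group grows.
print(before[("host", "foo")], after[("host", "foo")])
```

Any weights learned against the `before` layout would now line up with the wrong features in the `after` layout.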

You might say, just add it to the end of the combined feature vector... but how do you keep track of that particular feature being at that index?
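
One way the "add it to the end" idea could be sketched is an append-only lookup keyed by (group, value), so the group name travels with the feature instead of being implied by position. This is only an illustration under that assumption; `FeatureIndex` and its methods are hypothetical names:

```python
class FeatureIndex:
    """Append-only map from (group, value) to a column index."""

    def __init__(self):
        self._index = {}

    def get_or_add(self, group, value):
        """Return the column for (group, value), assigning the next free one if new."""
        key = (group, value)
        if key not in self._index:
            self._index[key] = len(self._index)
        return self._index[key]

    def encode(self, features):
        """Build a 0/1 vector for an iterable of (group, value) pairs."""
        positions = [self.get_or_add(group, value) for group, value in features]
        vector = [0] * len(self._index)
        for position in positions:
            vector[position] = 1
        return vector


index = FeatureIndex()
for key in [("scheme", "http"), ("scheme", "ftp"), ("host", "foo"), ("host", "bar")]:
    index.get_or_add(*key)

foo_before = index.get_or_add("host", "foo")
https_column = index.get_or_add("scheme", "https")  # new category goes at the end
foo_after = index.get_or_add("host", "foo")
print(foo_before, https_column, foo_after)
```

In this sketch existing columns never move, but vectors encoded earlier are shorter than later ones and would need zero-padding, and a model that was already trained would still need its input layer or weight vector extended to cover the new column.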

Sorry for the wall of text... trying to explain the issue the best I can.

submitted by kingnebula