Binary feature vector vs. Integer feature vector

I have a very basic, newbie question about binary feature vectors.

Suppose I want to classify 6-letter words into one of two classes (0/1). For a feature vector, I'd like to do one of two things:

 

1) Each letter has its own binary string associated with it: A -> (1,0,0,...,0), B -> (0,1,0,...,0), ..., Z -> (0,0,...,0,1), so 'BATMAN' would be represented as a 26*6 = 156 dimensional vector
(0,1,0,...,0, // B
 1,0,0,...,0, // A
 0,0,...,1,...,0, // T
 ..., // M
 ..., // A
 ...) // N

 

2) Each letter has a single integer representing its value: A -> 1, B -> 2, ..., Z -> 26, so 'BATMAN' would be represented as the 6-dimensional vector
(2, 1, 20, 13, 1, 14)
(both representations are sketched in code just below)
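To make the two options concrete, here is a minimal Python sketch of both encodings applied to 'BATMAN'. The helper names (one_hot_word, integer_word) and the fixed A-Z alphabet are just illustrative, not from any particular library.

    import numpy as np

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def one_hot_word(word):
        # Representation 1: concatenate one 26-dim binary (one-hot) block per letter.
        vec = np.zeros(len(ALPHABET) * len(word))
        for i, letter in enumerate(word):
            vec[i * len(ALPHABET) + ALPHABET.index(letter)] = 1.0
        return vec

    def integer_word(word):
        # Representation 2: map each letter to its 1-based position in the alphabet.
        return [ALPHABET.index(letter) + 1 for letter in word]

    print(one_hot_word("BATMAN").shape)   # (156,)
    print(integer_word("BATMAN"))         # [2, 1, 20, 13, 1, 14]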

 

Both representations encode all the information I want them to, but instinct tells me to go with the first one.
My sense of linear algebra makes me instantly notice that in the first representation, each letter has a linearly independent vector representing it. I'm sure that's relevant, but I'm not sure exactly how.
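As a quick check on that observation, here's a small NumPy sketch (variable names are just illustrative): the 26 per-letter binary vectors are exactly the rows of the 26x26 identity matrix, so they are mutually orthogonal and therefore linearly independent, whereas the integer codes 1..26 all lie along a single axis.

    import numpy as np

    letters = np.eye(26)            # row i is the binary vector for the i-th letter
    gram = letters @ letters.T      # matrix of all pairwise dot products
    print(np.array_equal(gram, np.eye(26)))   # True: orthonormal, hence linearly independent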

 

If my alphabet set becomes larger (not just 26 letters, but say 26,000), then the second representation still stays 6-dimensional, while the first one becomes crazy large (26,000 * 6 = 156,000 dimensions).

 

So my questions are:
1. Is there something fundamentally wrong with the second representation?
2. What could possibly go wrong when training a classifier?
3. Is there any situation where the second representation would be favored over the first representation?

submitted by eptheta
