Channel: Machine Learning

How to deal with text input variables so that they can be understood by ML or NN techniques?

I'm gonna list my scenario here because I'm fairly certain someone will say that my question depends on my scenario.

I have a database of orders that is pretty messy. I would like to clean it up by using some ML/NN techniques to categorize each order by who was the buyer.

There are ~7500 known buyers and if the order doesn't come from one of the 7500, then I am hoping my ML/NN categorizes the order as 'other'.

The variables come from a form that is filled out. So I have the following details:

  • 1) the Buyer's name
  • 2) the Shipping Company they will use
  • 3) an identifier for the order
  • 4) a discount identifier
  • 5) a comment.

Life would be easy if 1) were filled out exactly the same way each time. The problem is that there are many naming variations (e.g. Microsoft vs Microsoft Inc vs Microsoft Inc. vs Microsoft Incorporated vs Microsoft Corp, etc.). They may also put the name of a subsidiary instead of the parent company (John Hancock vs Manulife), or use an outdated name (Blackberry vs Research in Motion). On top of all that, there are often spelling errors.
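To make the naming problem concrete, here is a rough Python sketch of one way such variations could be collapsed before any ML step: lowercase the name, strip trailing legal suffixes, map known aliases, then fuzzy-match against the buyer list with the standard library's difflib. The suffix list, the alias table, the three-entry buyer list, the 0.8 cutoff, and the function names are all assumptions made up for illustration; a real list would hold the ~7500 buyers.

```python
import difflib
import re

# Hypothetical lookup data, for illustration only.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "ltd", "llc", "co"}
ALIASES = {"john hancock": "manulife", "research in motion": "blackberry"}
KNOWN_BUYERS = ["microsoft", "manulife", "blackberry"]


def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_buyer(raw, cutoff=0.8):
    """Return the closest known buyer, or 'other' if nothing is close enough."""
    name = normalize(raw)
    name = ALIASES.get(name, name)  # subsidiary or outdated name -> canonical name
    hits = difflib.get_close_matches(name, KNOWN_BUYERS, n=1, cutoff=cutoff)
    return hits[0] if hits else "other"
```

With this sketch, "Microsoft Inc." and the misspelled "Microsft Corp" would both resolve to the same buyer, while a name that resembles nothing on the list falls through to 'other'. The cutoff is a guess and would need tuning against real orders.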

Variables 2) and 5) are usually useless, except that whoever filled out the form will sometimes accidentally put the shipping company's name in 1) and the Buyer's name in 2). Sometimes they leave 1) and/or 2) blank (or filled with 'n/a') and put the company name in 5). So variables 2) and 5) only matter when the Buyer's name is not in 1).
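The fallback order across fields 1), 2) and 5) could be sketched as below. The dictionary keys (`buyer`, `shipper`, `comment`) and the placeholder set are hypothetical names chosen for this example, not the actual schema.

```python
# Hypothetical set of values that count as "not filled in".
PLACEHOLDERS = {"", "n/a", "na", "none"}


def candidate_names(order):
    """Return non-empty candidate buyer strings, in priority order 1), 2), 5)."""
    fields = (order.get("buyer"), order.get("shipper"), order.get("comment"))
    return [
        f.strip()
        for f in fields
        if f and f.strip().lower() not in PLACEHOLDERS
    ]
```

Each candidate could then be tried against the buyer list in turn. This does not by itself detect a shipping company sitting in field 1); that would need a separate list of known shippers.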

3) and 4) are random letter and digit combinations. However, despite being unpredictable, two orders with the same order identifier or the same discount identifier must belong to the same company. So if one order has a blank Buyer's name and the same identifier appears on another order that shows Manulife, the ML program should be confident that the companies are the same and predict Manulife.
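Because this identifier rule is deterministic, it may not need ML at all. A minimal sketch, assuming each order is a dict with hypothetical keys `order_id`, `discount_id` and `buyer` (with `buyer` set to None when unknown):

```python
def propagate_buyers(orders):
    """Fill in missing buyers from other orders that share an identifier."""
    known = {}
    # First pass: remember which buyer each identifier value belongs to.
    for o in orders:
        if o["buyer"]:
            for key in ("order_id", "discount_id"):
                if o.get(key):
                    known[(key, o[key])] = o["buyer"]

    # Second pass: resolve blank buyers through a shared identifier.
    resolved = []
    for o in orders:
        buyer = o["buyer"]
        if not buyer:
            for key in ("order_id", "discount_id"):
                buyer = known.get((key, o.get(key)))
                if buyer:
                    break
        resolved.append(buyer)
    return resolved
```

This only follows one hop. If blank orders are linked to a known buyer through a chain of other blank orders, something like a union-find over the identifiers would be needed instead.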

That's the input data I am working with. Variable 1) is the most important, and if I can just figure out how to represent the text in a meaningful way, I think the rest would follow.

Would it make sense to use numbers representing the ASCII character codes, so that microsoft would be represented by [109 105 99 114 111 115 111 102 116]? Do you see any issues with that? Alternatively, can you suggest an approach that I can research in more depth?
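The encoding described above can be sketched in Python as follows; `ascii_codes` is a hypothetical helper name used only for this example.

```python
def ascii_codes(text):
    """Map each character of the string to its character code."""
    return [ord(ch) for ch in text]


print(ascii_codes("microsoft"))
# [109, 105, 99, 114, 111, 115, 111, 102, 116]
print(ascii_codes("Microsoft Inc"))
# [77, 105, 99, 114, 111, 115, 111, 102, 116, 32, 73, 110, 99]
```

Two things show up right away: the vectors have different lengths for different spellings, and a change of case alone shifts the first value from 109 to 77, even though both strings name the same company.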

submitted by myMLaccount
