Channel: Machine Learning

Request for Help: Making Sense of an Unstructured Variable as Well as Categorizing Transactions.

So here is the issue:

I have a dataset of loosely organized transaction data with little to no demographic information on the purchasers. The dataset is massive, with millions of transactions. A transaction can be anything purchased online, from shoes to groceries to sex toys. My ultimate goal is a program that classifies each transaction into its industry, so a TV would be placed into "Consumer Electronics" and a T-shirt into "Apparel". There will be about 10-15 possible industries, plus a catch-all "Not Applicable" for industries that won't be looked at. After this, I'd like to break the transactions down into categories within each industry, so the TV that went into "Consumer Electronics" would also be in the "TV" category.

As for the item information itself, I have the merchant and the price, but most information on the item sits in a free-text "description" variable with no consistent layout. A pair of Nike shoes could show up as "Nike: Size - 9.5: Dunks Silver/Grey/Green" or could just read "NIKE". It's a grab bag as to what's there and how it's arranged. However, for about 1% of the transactions in each industry I have expanded information, including brand, features, category and all relevant details.
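To make the "grab bag" problem concrete, here is a minimal sketch of normalizing the description field into tokens before any modeling. The `tokenize` helper and its regular expression are assumptions made for illustration, not something already in my pipeline; real descriptions would likely need more rules (units, SKUs, abbreviations).

```python
import re

def tokenize(description):
    """Lowercase a raw description and split it into word-like tokens.

    Assumption: letters form words, and digits (optionally with one
    decimal part, e.g. "9.5") form number tokens. Punctuation such as
    ":", "-", and "/" is treated purely as a separator.
    """
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", description.lower())

# The two example layouts of the same kind of item.
examples = ["Nike: Size - 9.5: Dunks Silver/Grey/Green", "NIKE"]
for text in examples:
    print(tokenize(text))
```

Both layouts end up sharing the token "nike", which is the kind of overlap a bag-of-words classifier can exploit even when the rest of the field is missing.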

Does anyone have a good idea of where to start with this process?

My current line of thinking is a Naive Bayes classifier trained on the 1% of more detailed data, classifying by industry first and then by category within the industry. Another idea is to use inverse document frequency (IDF) weighting, as in TF-IDF, to pick apart the description field and identify the terms most useful for classifying a transaction.
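A rough sketch of the two-stage idea, using only the Python standard library: one multinomial Naive Bayes model picks the industry, then a per-industry model picks the category. Everything here is an assumption for illustration: the `NaiveBayes` class, the `classify` and `idf` helpers, and the eight labeled rows are made up, and a real run would train on the 1% of detailed records instead.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercased words and numbers; punctuation acts as a separator.
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def idf(texts):
    """Inverse document frequency, log(N / df), for each token."""
    df = Counter(tok for text in texts for tok in set(tokenize(text)))
    return {tok: math.log(len(texts) / n) for tok, n in df.items()}

# Hypothetical labeled rows: (description, industry, category).
labeled = [
    ("Samsung 55 inch LED TV", "Consumer Electronics", "TV"),
    ("Sony Bravia 4K TV", "Consumer Electronics", "TV"),
    ("Sony wireless headphones", "Consumer Electronics", "Audio"),
    ("Bose noise cancelling headphones", "Consumer Electronics", "Audio"),
    ("Nike Dunks size 9.5", "Apparel", "Shoes"),
    ("Adidas running shoes size 10", "Apparel", "Shoes"),
    ("Hanes cotton T-shirt large", "Apparel", "Shirts"),
    ("Gap T-shirt medium blue", "Apparel", "Shirts"),
]
descriptions = [d for d, _, _ in labeled]

# Stage 1: a single model over all rows predicts the industry.
industry_model = NaiveBayes().fit(descriptions, [i for _, i, _ in labeled])

# Stage 2: one model per industry, trained only on that industry's rows.
category_models = {}
for industry in {i for _, i, _ in labeled}:
    rows = [(d, c) for d, i, c in labeled if i == industry]
    category_models[industry] = NaiveBayes().fit(
        [d for d, _ in rows], [c for _, c in rows]
    )

def classify(description):
    industry = industry_model.predict(description)
    return industry, category_models[industry].predict(description)

weights = idf(descriptions)
print(classify("LG LED TV 42 inch"))
print(classify("NIKE"))
print(weights["samsung"], weights["tv"])
```

The sketch always returns one of the trained labels; routing low-confidence items to the "Not Applicable" bucket would need something extra, such as an explicit "Not Applicable" training class or a threshold on the winning score. The IDF weights are only printed here, to show that a rare brand token scores higher than a common one; feeding them into the classifier is a separate design choice.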

I'm currently looking into R's text mining package as well as a program named DataFlux, but I'm curious whether anyone knows of another program that may prove useful, a different approach I hadn't thought of, or has some words of advice in general.

submitted by murdahmamurdah
