I'm a bit stuck on a problem, and it would be awesome if you could offer some help or opinions on it.
My use case:
My labeled training data looks like this ('#'-separated: the first field is the target, the second is the training text). Whitespace was introduced for readability.
Example dataset on pastebin
Nokia 101 Dual Sim#Nokia 101 Feature Phone
Nokia 101 Dual Sim#Nokia 101
Nokia 101 Dual Sim#Nokia 101
Nokia 101 Dual Sim#Nokia 101 Dual Sim Mobile
Nokia 101 Dual Sim#Nokia 101 Dual sim Mobile Phone Black with Original Tax Paid Invoice and Express Shipping (Sourced From Brand)
Nokia 101 Dual Sim#Nokia 101 (P. Black)
Nokia 207         #Nokia Asha 207 Mobile Phone - Orange
Nokia 207         #Nokia Asha 207 Mobile Phone - Yellow
Nokia 207         #Nokia 207
Nokia 208         #Nokia Asha 208 Mobile Phone - Black
Nokia 2610        #Brand New Nokia 2610 GSM 100% Genuine Product.Lowest Price A Must Have One
Nokia 301         #Nokia 301 (Black)
Nokia 301         #Nokia 301 (White)
Samsung ATIV S    #Samsung Ativ S I8750 16 GB - Grey
Samsung ATIV S Neo#Samsung ATIV S Neo
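For illustration, here is a minimal sketch of how records in this format could be loaded into (target, text) pairs. The function name `parse_records` and the `sample` string are assumptions made for this example, with one record per line; they are not taken from the actual pipeline.

```python
from typing import List, Tuple


def parse_records(raw: str) -> List[Tuple[str, str]]:
    """Split '#'-separated lines into (target, text) pairs."""
    records = []
    for line in raw.splitlines():
        if "#" not in line:
            continue  # skip blank or malformed lines
        # Split on the first '#' only, in case the listing text contains one.
        target, text = line.split("#", 1)
        # strip() removes the padding that was added for readability.
        records.append((target.strip(), text.strip()))
    return records


# Hypothetical sample: three records copied from the dataset above.
sample = """Nokia 101 Dual Sim#Nokia 101 Feature Phone
Nokia 207         #Nokia Asha 207 Mobile Phone - Orange
Samsung ATIV S    #Samsung Ativ S I8750 16 GB - Grey"""

records = parse_records(sample)
```

Stripping both fields means the alignment padding does not end up in the class labels.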
What I'm trying to do here is to map a product name from the Internet to a product I already know about.
So, the target variable is the product name that I already know.
Now, when I give it a string like "Nokia 207 Mobile with Dual Sim and FM Radio", I need the output to say that the recognized match is "Nokia 207".
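To make the expected behaviour concrete, here is a rough baseline sketch, not a learned model and not the Random Forest approach described in this post. It assumes a simple token-coverage heuristic: score each known product by the fraction of its name's tokens that appear in the query, and break ties in favour of the longer name. The names `tokenize`, `best_match`, and `known` are hypothetical.

```python
import re
from typing import List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(query: str, known_products: List[str]) -> Optional[str]:
    """Return the known product whose name tokens are best covered by the query."""
    query_tokens = tokenize(query)
    best, best_key = None, (0.0, 0)
    for product in known_products:
        product_tokens = tokenize(product)
        if not product_tokens:
            continue
        coverage = len(product_tokens & query_tokens) / len(product_tokens)
        if coverage == 0:
            continue  # no shared tokens: not a candidate
        # Prefer higher coverage; on ties prefer the more specific (longer) name.
        key = (coverage, len(product_tokens))
        if key > best_key:
            best, best_key = product, key
    return best


# The targets that appear in the example dataset above.
known = ["Nokia 101 Dual Sim", "Nokia 207", "Nokia 208", "Nokia 2610",
         "Nokia 301", "Samsung ATIV S", "Samsung ATIV S Neo"]

match = best_match("Nokia 207 Mobile with Dual Sim and FM Radio", known)
print(match)  # Nokia 207
```

On this query, "Nokia 101 Dual Sim" shares three tokens with the listing ("nokia", "dual", "sim") but only covers 3 of its 4 tokens, while "Nokia 207" is fully covered, so the latter wins. A heuristic like this is only a sanity check; it would likely break on misspellings or on listings that omit the model number.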
Problem:
It is pretty hard to extract features from product names found on the Internet, mostly because people will write anything to name their products, especially on eBay. As a result, I got pretty poor results when I tried to tackle this as a multi-class classification problem using a Random Forest classifier. The toughest part is properly extracting features from the training data, which is why, as you can see in the dataset above, I did not do any feature extraction.
I was thinking of using autoencoders or maybe Restricted Boltzmann Machines to find a solution, but unfortunately I don't have much theoretical knowledge of either (ML wasn't part of my coursework at college).
It'd be really helpful if you could provide some insight into how this can be done.