Hey guys, I'm working through Metacademy's roadmap for machine learning. The recommended project after the second book is to take a dataset of about 500,000 reviews and predict a given review's rating from its other attributes. I'm struggling with the size of the dataset, so rather than waste time on large and possibly inappropriate computations, I figured I'd ask here: which three ML techniques would you use? Here's an example review:
product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
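For what it's worth, here is a rough sketch of reading records like this one at a time instead of loading everything into memory. It assumes the raw file stores one "key: value" field per line with a blank line between reviews, which I haven't confirmed against the actual file. The name iter_reviews and the second sample record are made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def iter_reviews(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per review, consuming the input line by line.

    Assumes one "key: value" field per line and a blank line between
    records (an assumption about the file layout, not a confirmed fact).
    """
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if sep:  # lines without a "key: value" shape are skipped
            record[key] = value
    if record:  # last record when the input has no trailing blank line
        yield record


# Tiny in-memory sample; the second record is hypothetical.
sample = """product/productId: B001E4KFG0
review/score: 5.0
review/summary: Good Quality Dog Food

product/productId: HYPOTHETICAL01
review/score: 1.0
review/summary: Not as Advertised
""".splitlines()

reviews = list(iter_reviews(sample))
```

With a real file you would pass the open file object straight in, e.g. `for review in iter_reviews(open(path, encoding="utf-8", errors="replace")):`, so only one review is held in memory at a time.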
I figured I'd use a naive Bayes classifier on the summary (or on the full text, but that took too long for me). Another idea was to use the most frequent terms as features and then run k-nearest neighbors. My main struggle right now is just the size of the dataset: turning the summaries into a bag of words took an extremely long time for what looks like a pet-project idea.
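To make the naive Bayes idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier that keeps word counts per rating in Counter objects and is updated one summary at a time, so no full bag-of-words matrix is ever built. Everything here is an assumption for illustration: the class name NaiveBayes, the simple regex tokenizer, the add-one smoothing, and the toy training summaries (only the first comes from the example review above).

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # reviews seen per rating
        self.word_counts = defaultdict(Counter)  # word counts per rating
        self.token_totals = Counter()            # total tokens per rating
        self.vocab = set()

    def update(self, text, label):
        """Fold one labelled summary into the counts (streaming-friendly)."""
        words = tokenize(text)
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.token_totals[label] += len(words)
        self.vocab.update(words)

    def predict(self, text):
        """Return the rating with the highest log-posterior score."""
        words = tokenize(text)
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = self.token_totals[label] + vocab_size
            score = math.log(n / total)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: only the first summary is from the post, the rest are made up.
train = [
    ("Good Quality Dog Food", 5),
    ("great taste good value", 5),
    ("bad smell", 1),
    ("terrible quality bad taste", 1),
]

model = NaiveBayes()
for summary, score in train:
    model.update(summary, score)

pred_positive = model.predict("good food")
pred_negative = model.predict("bad smell")
```

Because update only touches dictionaries, training is a single pass over the summaries, which should scale to 500,000 short texts much better than building a dense matrix first.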
So, how should I deal with the size of the dataset, and which basic techniques should I be using?
Sorry if this question is a little too "do my homework for me", but I hoped it'd be okay since it's an independent project without any public solution.