Hello,
I'm trying to solve a problem of recommending/predicting 'my favorite drink' and I'm hoping to get some support from this community.
Problem definition: There are 20 different drinks, e.g. Pepsi, Coke, Fanta, etc. There are millions of customers of a supermarket who have been buying those drinks over, let's say, the last three months.
Data definition:
Drinks (id, name): 100:Pepsi, 101:Coke, ...
Transactions (customer_id, list of bought drink ids):
1: 100,100,100,101,101,101
2: 100,102,106,106,106
...
The definition of 'my favorite drink' is a bit foggy. We don't have any training data we can learn from (e.g. a list of fans for a given drink); the only things we have are the transactions and customer data (id, age, postcode). A customer may not have a favorite drink at all, and this should be predicted as well.
These are the 4 approaches I came up with for predicting 'my favorite drink'.
1) 50% ratio - the drink I buy the most. If a single drink accounts for more than 50% of my transactions, then it is my favorite drink. Otherwise I don't have a favorite drink.
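A minimal sketch of this rule in Python (the function name and the assumption that a customer's transactions arrive as a plain list of drink ids are mine):

```python
from collections import Counter

def favorite_by_ratio(drink_ids, threshold=0.5):
    """Return the most-bought drink id if its share of transactions exceeds the threshold, else None."""
    counts = Counter(drink_ids)
    drink, n = counts.most_common(1)[0]
    return drink if n / len(drink_ids) > threshold else None

print(favorite_by_ratio([100, 100, 100, 101, 101, 101]))  # None, top drink is exactly 50%
print(favorite_by_ratio([100, 100, 100, 100, 101, 102]))  # 100
```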
2) Gini index - a more clever version of the 50% ratio. If I bought Pepsi 4 times and 6 other drinks once each, then Pepsi is my favorite drink. Gini index = 1 minus the sum of squares of the drink probabilities; in this case Gini = 1 - ((4/10)^2 + 6*(1/10)^2). I have a favorite drink if the Gini index is < 0.7.
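And a corresponding sketch for the Gini version (same assumptions as above; the 0.7 cutoff is just the value I mentioned, not a tuned one):

```python
from collections import Counter

def favorite_by_gini(drink_ids, cutoff=0.7):
    counts = Counter(drink_ids)
    total = len(drink_ids)
    # Gini index = 1 - sum of squared drink shares; lower means more concentrated buying.
    gini = 1.0 - sum((n / total) ** 2 for n in counts.values())
    return counts.most_common(1)[0][0] if gini < cutoff else None

# shares 4/6, 1/6, 1/6 -> Gini = 1 - (16 + 1 + 1)/36 = 0.5 -> favorite is 100
print(favorite_by_gini([100, 100, 100, 100, 101, 102]))
```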
3) Rationale - my favorite drink doesn't necessarily have to be the one I buy the most. For example, if I bought 49 CopaCopa drinks and 51 Pepsi drinks, then CopaCopa is more likely my favorite one. This is based on the observation that customers who buy CopaCopa are more likely to also buy Pepsi (because Pepsi is a generally popular drink) than the other way round. So if I buy roughly the same number of the unpopular CopaCopa and the popular Pepsi, it probably means I'm more likely a fan of CopaCopa.
Method 3a: Naive Bayes text classifier. For this approach I calculate priors, i.e. the probability of buying a given drink, using Maximum Likelihood over all customers' transaction data, e.g. P(Pepsi)=0.2, P(CopaCopa)=0.02. Then I calculate the conditional probabilities of buying a drink given that I also bought something else, e.g. P(CopaCopa | Pepsi) = 0.03 and P(Pepsi | CopaCopa) = 0.07.
A customer has a favorite drink if a posterior, e.g. P(CopaCopa | Pepsi, Pepsi, CopaCopa, CopaCopa) (the probability of being a fan of CopaCopa given that I bought both Pepsi and CopaCopa twice), is > 50%.
Data for the Bayes classification (one record per single drink transaction). These five records represent one customer who bought three drinks (101, 101, 102) and another customer who bought drink 105 twice:

drink_id (prior) | all drink ids bought by the customer of this transaction (prediction record)
101 | 101,101,102
101 | 101,101,102
102 | 101,101,102
105 | 105,105
105 | 105,105
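A rough sketch of method 3a using scikit-learn's MultinomialNB on exactly this data layout (the toy customers, the drink vocabulary, and the test customer are made up; sklearn's Laplace smoothing makes the probabilities slightly different from raw Maximum Likelihood estimates):

```python
from collections import Counter
import numpy as np
from sklearn.naive_bayes import MultinomialNB

drinks = [100, 101, 102, 105, 106]                 # the drink "vocabulary"
customers = {1: [101, 101, 102], 2: [105, 105]}    # customer_id -> bought drink ids

def bag(drink_ids):
    """Count vector over the drink vocabulary."""
    c = Counter(drink_ids)
    return [c.get(d, 0) for d in drinks]

# One training record per drink transaction: class = the drink of that transaction,
# features = everything the same customer bought (as in the table above).
X, y = [], []
for bought in customers.values():
    for d in bought:
        X.append(bag(bought))
        y.append(d)

model = MultinomialNB().fit(np.array(X), np.array(y))

# A customer has a favorite drink only if the top posterior exceeds 0.5.
probs = model.predict_proba([bag([100, 100, 101, 101])])[0]
best = int(np.argmax(probs))
favorite = model.classes_[best] if probs[best] > 0.5 else None
print(favorite, probs[best])
```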
Method 3b: Logistic regression. I represent transactions as follows:
Target = the drink id of the transaction, predictor variables = the percentages of drinks for the customer who placed this transaction. E.g. for a single customer who bought Pepsi, Pepsi, and CopaCopa, we have three classification records (one per transaction):
target, %pepsi, %copacopa, %coke, ...
pepsi, 2/3, 1/3, 0, 0, ...
pepsi, 2/3, 1/3, 0, 0, ...
copacopa, 2/3, 1/3, 0, 0, ...
A customer has a favorite drink if the logistic regression predicts a drink with > 50% confidence. E.g. I take a customer represented by the classification record 0.1 (CopaCopa), 0.7 (Pepsi), 0 (Coke), ..., and the model says I'm a fan of Pepsi with a confidence of 0.64.
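A rough sketch of method 3b with scikit-learn's LogisticRegression (the toy data and the id-to-name mapping for CopaCopa are made up; the > 0.5 cutoff is the one above):

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

drinks = [100, 101, 102]     # 100 = Pepsi, 101 = Coke, 102 = CopaCopa (illustrative ids)
customers = {1: [100, 100, 102], 2: [101, 101, 101], 3: [100, 102, 102]}

def shares(drink_ids):
    """Percentage of each drink in a customer's transactions."""
    c = Counter(drink_ids)
    return [c.get(d, 0) / len(drink_ids) for d in drinks]

# target = the drink of each transaction, predictors = that customer's drink shares
X, y = [], []
for bought in customers.values():
    for d in bought:
        X.append(shares(bought))
        y.append(d)

model = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

probs = model.predict_proba([shares([100, 102, 102, 102, 101])])[0]
best = int(np.argmax(probs))
favorite = model.classes_[best] if probs[best] > 0.5 else None
print(favorite, probs[best])
```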
I would appreciate any feedback on the presented approaches. Maybe there is a better way to address this problem? I would also be glad to hear about papers describing similar prediction problems in various domains.
Regards.