
This seems too good to be true, but I think I've found a way to make the input attributes in a dataset statistically independent, making any dataset meet the assumptions of Naive Bayes, and it's much simpler than a Bayesian network learner - tell me I'm wrong...


I wanted to get some preliminary feedback about this before spending (and possibly wasting) a few days implementing it.

As you know, a Naive Bayes Classifier makes the assumption that all input attributes are statistically independent (meaning that given the value of one attribute, you can't predict anything about the values of the other attributes). This is almost never true, but in many situations Naive Bayes works reasonably well despite this.

The typical solution is to use a Bayesian network learner which captures the interdependencies between attributes, but this is far more complicated than Naive Bayes.

I think I've thought of an alternate approach using a technique from economics for removing "selection bias" from a dataset.

Let's say we have 4 nominal input attributes, A, B, C, and D, and an output attribute Z. We don't know the relationships between the input attributes, but they are probably somewhat dependent on each other.

My proposed approach is to effectively "filter" the interdependence out of the input attributes. How?

Let's take A and B first. If A and B were independent, then knowledge of A's value would not affect the probabilities of the various values that B might take.

By looking at the data we can see the impact that A has on B. For example, we might see that if A is "dog", then the likelihood of B being "house" is 0.3, but if A is "cat", then the likelihood of B being "house" is 0.4.
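To make this concrete, here's a rough sketch (in Python) of estimating P(B|A) by simple counting. The records, attribute names, and values are all made up for illustration:

    # Toy sketch: estimate P(B|A) from data by counting.
    # The records and attribute values here are made up for illustration.
    from collections import Counter

    records = [
        {"A": "dog", "B": "house"},
        {"A": "dog", "B": "yard"},
        {"A": "cat", "B": "house"},
        {"A": "cat", "B": "house"},
        {"A": "cat", "B": "yard"},
    ]

    pair_counts = Counter((r["A"], r["B"]) for r in records)
    a_counts = Counter(r["A"] for r in records)

    def p_b_given_a(b, a):
        """Empirical conditional probability P(B=b | A=a)."""
        return pair_counts[(a, b)] / a_counts[a]

    print(p_b_given_a("house", "dog"))  # 0.5 on this toy data
    print(p_b_given_a("house", "cat"))  # ~0.67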

We can view this as there being a selection bias for the value of B, and economics gives us a well-understood way to remove this bias called Heckman correction.

While the theory behind it is more complicated, applying this correction is very simple. We take the probability of B having its current value given A's current value - P(B|A) - and we weight that sample by 1/P(B|A) (setting a maximum weight of, say, 20, to guard against very small values of P(B|A) screwing things up).
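The weighting itself would be something like this minimal sketch - the cap of 20 is just the arbitrary maximum I mentioned above:

    # Clipped inverse-probability weight for one sample's B value.
    # p is P(B=b | A=a), estimated however you like; MAX_WEIGHT is the cap.
    MAX_WEIGHT = 20.0

    def attribute_weight(p):
        """Weight a sample by 1/p, capped to guard against tiny probabilities."""
        if p <= 0.0:
            return MAX_WEIGHT
        return min(1.0 / p, MAX_WEIGHT)

    print(attribute_weight(0.3))   # ~3.33, matching the P(B|A) example above
    print(attribute_weight(0.01))  # 20.0, capped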

Note that these weights only apply when calculating probabilities for the attribute B, so the weight is associated with this specific attribute, not with the entire sample (which is more common).

So now in our dataset we have weighted our attribute B such that it is independent of A. Next we want to do the same thing for the attribute C, but in this case we need to weight its samples by the inverse of the probability that C takes its value given A's and B's values, i.e. 1/P(C|A,B). Since A and our corrected B are now statistically independent, we can safely use a Naive Bayes classifier with A and B as inputs and C as the output to determine this weight for each sample.

And then once we've got weights for attribute C in every sample, we can repeat this for attribute D, assigning weights to each of its samples using Naive Bayes with A, B, and C as inputs and D as the output.
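Here's a rough sketch of how I imagine the whole chaining step looking, with a Laplace-smoothed Naive Bayes whose counts respect the per-attribute weights. Again, all the names and toy records are purely illustrative:

    # Sketch of the chaining step: for each attribute in turn, estimate
    # P(attr | earlier attrs) with a weighted Naive Bayes, then set that
    # sample's weight for attr to min(1/P, MAX_WEIGHT). Toy data only.
    from collections import defaultdict

    MAX_WEIGHT = 20.0

    records = [
        {"A": "dog", "B": "house", "C": "red",  "D": "big"},
        {"A": "dog", "B": "yard",  "C": "blue", "D": "small"},
        {"A": "cat", "B": "house", "C": "red",  "D": "small"},
        {"A": "cat", "B": "house", "C": "blue", "D": "big"},
    ]
    attrs = ["A", "B", "C", "D"]

    # attr_weights[i][a] = weight of record i when counting attribute a.
    attr_weights = [{a: 1.0 for a in attrs} for _ in records]

    def weighted_nb_prob(inputs, target, record):
        """P(record[target] | inputs) from a Laplace-smoothed Naive Bayes
        whose counts for each attribute use that attribute's weights."""
        class_w = defaultdict(float)       # weighted count of each target value
        attr_class_w = defaultdict(float)  # (a, t) -> total weight of a in class t
        cond_w = defaultdict(float)        # (a, v, t) -> weight of a=v in class t
        n_vals = {a: len({r[a] for r in records}) for a in inputs}
        for i, r in enumerate(records):
            t = r[target]
            class_w[t] += attr_weights[i][target]
            for a in inputs:
                w = attr_weights[i][a]
                attr_class_w[(a, t)] += w
                cond_w[(a, r[a], t)] += w
        total = sum(class_w.values())
        scores = {}
        for t in class_w:
            score = class_w[t] / total
            for a in inputs:
                score *= (cond_w[(a, record[a], t)] + 1.0) \
                         / (attr_class_w[(a, t)] + n_vals[a])
            scores[t] = score
        return scores[record[target]] / sum(scores.values())

    # B is weighted against A, C against {A, B}, D against {A, B, C}.
    # Probabilities are computed for all samples before any weights change.
    for i in range(1, len(attrs)):
        target, earlier = attrs[i], attrs[:i]
        probs = [weighted_nb_prob(earlier, target, r) for r in records]
        for j, p in enumerate(probs):
            attr_weights[j][target] = min(1.0 / p, MAX_WEIGHT) if p > 0 else MAX_WEIGHT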

Finally, we can use Naive Bayes to predict our output attribute Z using A, B, C, and D as inputs - and we are now statistically justified in doing so, because A, B, C, and D are independent. The result, I would hope, would be superior predictive performance.
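The final prediction step might look like this, continuing from the sketch above and assuming each training record also carries a "Z" key - each P(attribute | Z) is estimated with that attribute's per-sample weights, while Z's own prior stays unweighted:

    # Predict Z with a Naive Bayes whose estimate of each P(attr | Z)
    # uses that attribute's per-sample weights; Z's prior is unweighted.
    # Continues from the previous sketch (records, attrs, attr_weights),
    # and assumes each record also has a "Z" key.
    def predict_z(query):
        class_n = defaultdict(float)       # plain counts of each Z value
        attr_class_w = defaultdict(float)  # (a, z) -> total weight of a in class z
        cond_w = defaultdict(float)        # (a, v, z) -> weight of a=v in class z
        n_vals = {a: len({r[a] for r in records}) for a in attrs}
        for i, r in enumerate(records):
            z = r["Z"]
            class_n[z] += 1.0
            for a in attrs:
                w = attr_weights[i][a]
                attr_class_w[(a, z)] += w
                cond_w[(a, r[a], z)] += w
        scores = {}
        for z in class_n:
            score = class_n[z] / len(records)
            for a in attrs:
                score *= (cond_w[(a, query[a], z)] + 1.0) \
                         / (attr_class_w[(a, z)] + n_vals[a])
            scores[z] = score
        return max(scores, key=scores.get)

    # e.g. predict_z({"A": "dog", "B": "house", "C": "red", "D": "big"})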

This seems a bit too good to be true though, where have I screwed up?

submitted by sanity
