I am comparing two generative models in a cross-validation setting, using the posterior probability of the held-out data as the performance metric, i.e. P(held-out data | model 1) vs. P(held-out data | model 2). What is the appropriate statistical test for this? So far I have used a paired t-test and a Wilcoxon signed-rank test, with a null hypothesis of equal mean probability, but I'm wondering whether a likelihood ratio test would be more appropriate. Thoughts?
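To make the setup concrete, here is a minimal sketch of the paired comparison, assuming one held-out log-probability per fold for each model. The fold values, variable names, and helper functions are hypothetical placeholders, not my real results. It uses only the standard library, so instead of looking up a t-distribution it pairs the t statistic with an exact sign-flip permutation p-value. That is a swapped-in nonparametric stand-in, not the t-test or Wilcoxon test themselves.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(diffs):
    """t statistic for H0: the mean paired difference is zero."""
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation p-value for the mean difference.

    Under H0 each paired difference is equally likely to be positive or
    negative, so every sign pattern is enumerated (2**n patterns, which is
    only practical for a small number of folds).
    """
    n = len(diffs)
    observed = abs(mean(diffs))
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        # Small tolerance so floating-point ties count as "as extreme".
        if flipped >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Hypothetical per-fold held-out log-probabilities (placeholders only).
logp_model1 = [-102.3, -98.7, -110.1, -95.4, -101.8]
logp_model2 = [-104.0, -99.9, -111.5, -97.0, -102.1]

# Pair by fold: both models are scored on the same held-out data.
diffs = [a - b for a, b in zip(logp_model1, logp_model2)]

t_stat = paired_t_statistic(diffs)
p_perm = sign_flip_p_value(diffs)
print(f"paired t = {t_stat:.3f}, sign-flip p = {p_perm:.4f}")
```

Working with log-probabilities rather than raw probabilities is my assumption here; the pairing by fold is the part that matches the tests I described.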