An idea I have for improving ensemble and neural network models. May work well with nonconvex optimization.

This is an idea I had today, so I figured I'd post it here. If it's been done before or wouldn't work, let me know.

The TL;DR is to add a term to each model's cost function that prevents the members of an ensemble from learning the same representation of the data. I've thought about this for ensemble models and for neural networks (connected ensembles), and I think it could work for both.

If the ensembled models are parameterized by some vector theta, then we can tell how similar two models are by applying a similarity measure to their parameter vectors: RBF, cosine, squared difference, whatever. We then penalize each model for being too similar to the other models by introducing a new term into the cost function. For each model, this could be some constant times the sum of its similarity measures with all of the other models. If we don't want the O(n^2) complexity, maybe we sample the other models at random, or fix a few pairings at the beginning so that over multiple time steps each model can still affect the others. A rough sketch of what I mean is below.
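
Here's a minimal sketch of the ensemble version in numpy, just to make the idea concrete. The function names (diversity_penalty, total_cost), the cosine choice, and the lam constant are my own illustration, not a reference implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flat parameter vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def diversity_penalty(thetas, i, lam=0.1, n_samples=None, rng=None):
    """Penalty for model i: lam times the summed similarity with
    (optionally a random subset of) the other models' parameters."""
    rng = rng or np.random.default_rng()
    others = [j for j in range(len(thetas)) if j != i]
    if n_samples is not None:
        # Subsample the comparisons to avoid the full O(n^2) cost.
        others = rng.choice(others, size=min(n_samples, len(others)), replace=False)
    return lam * sum(cosine_similarity(thetas[i], thetas[j]) for j in others)

def total_cost(base_cost, thetas, i, **kw):
    """Cost for model i = its usual loss plus the similarity penalty."""
    return base_cost(thetas[i]) + diversity_penalty(thetas, i, **kw)
```

Each model would then be trained on total_cost instead of base_cost, so lowering its own loss and staying dissimilar from the rest of the ensemble are traded off through lam.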

This would also work for neural nets, where we use the similarity penalty to stop feature detectors (e.g. the rows of a hidden layer's weight matrix) from learning the same features; something like the sketch below. The reason I mentioned nonconvex optimization is that forcing the models toward different local minima could make for a better exploration of the feature space.
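
A sketch of the neural-net version, again in numpy and with names of my own choosing (feature_similarity_penalty, lam), assuming the feature detectors are the rows of a layer's weight matrix W:

```python
import numpy as np

def feature_similarity_penalty(W, lam=0.01):
    """Penalize pairs of feature detectors (rows of W) that point in
    similar directions: lam * sum of squared off-diagonal cosine similarities."""
    # Normalize each row so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(W, axis=1, keepdims=True) + 1e-12
    Wn = W / norms
    gram = Wn @ Wn.T                       # gram[i, j] = cos(w_i, w_j)
    off_diag = gram - np.eye(W.shape[0])   # drop each unit's similarity with itself
    return lam * np.sum(off_diag ** 2)

# Usage: add the penalty for every hidden layer to the data loss, e.g.
# loss = data_loss + sum(feature_similarity_penalty(W) for W in hidden_weights)
```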

Has this been investigated before? If not, I suppose I could put a test together to see how it impacts model performance, and whether the cost in training efficiency is worth it.

submitted by dhammack
