I wrote an evolutionary algorithm that performs feature selection on a fairly feature-rich dataset (300-500 features), using an SVM as the fitness function.
It works correctly, but it is slow, and I'm looking for suggestions on how to parallelize or scale up the code. Specifically, each generation has 200-400 chromosomes whose fitnesses (area under the ROC curve) need to be computed, and each one can be evaluated independently of the others.
An additional challenge is that the dataset changes each generation (it is both resampled and perturbed with noise), so the data might need to be synchronized across machines if the solution is network-based.
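For concreteness, the per-generation workload can be sketched roughly as below. This is an illustrative sketch, not the actual implementation: `perturb`, `fitness`, and `evaluate_generation` are made-up names, the resampling and noise scheme is a guess at the general idea, and `fitness` is a toy stand-in for training an SVM and computing the AUC. The point is only that the data is regenerated once per generation and then every chromosome is scored by an independent call, so the `map` step is the part to parallelize.

```python
import random
from functools import partial


def perturb(data, seed, noise=0.01):
    """Build this generation's dataset: resample rows with replacement,
    then add Gaussian noise. Placeholder for the real perturbation scheme."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]
    return [[x + rng.gauss(0.0, noise) for x in row] for row in sample]


def fitness(mask, data):
    """Toy stand-in for 'train an SVM on the selected features, return AUC'.
    Here it just averages the selected columns so the sketch runs anywhere."""
    selected = [i for i, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0
    total = sum(row[i] for row in data for i in selected)
    return total / (len(data) * len(selected))


def evaluate_generation(population, data, generation, map_fn=map):
    """Score every chromosome against this generation's perturbed data.
    Each fitness call is independent, so map_fn can be a parallel map."""
    gen_data = perturb(data, seed=generation)  # derived once per generation
    return list(map_fn(partial(fitness, data=gen_data), population))
```

With the default `map_fn` this runs serially. On a single multi-core machine, passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` should spread the calls across cores. Seeding the perturbation from the generation number is one possible way to let each worker rebuild the same dataset locally instead of shipping it over the network.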
The algorithm itself is not my design, and thus I cannot change the way it fundamentally functions to speed it up. I have heard of MPI, OpenMP, and Hadoop, but I would like to get some input before I learn a new technique.
I should note that I have decent experience with C/C++, Java, Python, C#, and R, and did a little PHP/MySQL a while ago. I dual-boot Windows and Linux and can use both proficiently. I have a hexacore processor in my personal desktop, and I also have access to 10-20 desktop machines (friends' computers).
Any advice?