I've stumbled across a couple of papers (Sexton et al. 1999; Wang, Smith, et al. 2000) that seem to praise the practical benefits of simulated annealing (SA) for training neural networks.
The supposed benefit? The end solution will better approximate the global minimum.
So I'm wondering: why isn't SA more popular? Obviously it will often take longer to converge, since it's a global search rather than highly optimized gradient descent. But it seems like much of the work could be parallelized. Do huge networks just take too long to train this way?
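To be concrete, here's roughly the kind of training loop I have in mind. This is just my own toy sketch, not what the papers implement: the network (one hidden layer on XOR), the geometric cooling schedule, and the perturbation scale are all arbitrary choices of mine. The point is the Metropolis acceptance step, which occasionally accepts worse weight vectors so the search can climb out of local minima.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def unpack(w):
    # 17 parameters: 2x4 hidden weights, 4 hidden biases,
    # 4 output weights, 1 output bias
    return w[:8].reshape(2, 4), w[8:12], w[12:16], w[16]

def forward(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)                 # hidden layer
    return 1 / (1 + np.exp(-(h @ W2 + b2)))  # sigmoid output

def loss(w):
    return np.mean((forward(w) - y) ** 2)    # mean squared error

w = rng.normal(scale=0.5, size=17)
best_w, best_loss = w.copy(), loss(w)
T = 1.0          # initial temperature (arbitrary)
cooling = 0.999  # geometric cooling rate (arbitrary)
step = 0.1       # scale of random weight perturbations (arbitrary)

for it in range(20000):
    candidate = w + rng.normal(scale=step, size=w.shape)  # random move
    dE = loss(candidate) - loss(w)
    # Metropolis criterion: always accept improvements; accept worse
    # moves with probability exp(-dE / T) so we can escape local minima.
    if dE < 0 or rng.random() < np.exp(-dE / T):
        w = candidate
        if loss(w) < best_loss:
            best_w, best_loss = w.copy(), loss(w)
    T *= cooling

print("best MSE:", best_loss)
print("predictions:", np.round(forward(best_w), 3))
```

The loss evaluations for different candidate perturbations are independent, which is the part I'd expect to parallelize well. Still, every evaluation is a full forward pass over the data, with no gradient information reused, which I assume is why it looks expensive for big networks.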