Consider a standard gradient descent approach to optimizing neural networks. In a discussion with colleagues, I heard the following statement:
'Local minima become less of a problem if you increase the dimension of your architecture (more parameters to optimize).'
The argument is that in a high-dimensional parameter space it is less likely that there is no direction in which the error function decreases (compared to a low-dimensional architecture), so there should be fewer local minima.
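To make that intuition concrete, here is a minimal sketch (my own illustration, under the purely illustrative assumption that the Hessian at a random critical point behaves like a random symmetric Gaussian matrix; the helper name fraction_of_minima is just made up): it estimates how often all eigenvalues are positive, i.e. how often such a critical point would be a local minimum rather than a saddle point, and this fraction shrinks very quickly as the dimension grows.

```python
# Sketch: model the Hessian at a random critical point as a random
# symmetric (Gaussian) matrix and estimate how often it is positive
# definite, i.e. how often the critical point is a local minimum
# rather than a saddle point.
import numpy as np

rng = np.random.default_rng(0)

def fraction_of_minima(dim, trials=10_000):
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2  # symmetrize to get a random symmetric matrix
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

for dim in (1, 2, 3, 5, 8):
    print(dim, fraction_of_minima(dim))
```

Under this toy model the fraction of "all directions curve upward" critical points drops off rapidly with dimension, which is exactly the intuition behind the statement; whether real loss landscapes behave like this is of course the open part of the question.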
Now I know that this is not true in general, as one can come up with counterexamples. However, as a general 'heuristic', is there any truth to it?