No matter what values I try for the learning rate, momentum, and batch size, the error always levels off at some point with a lot of noise. In this particular plot, the learning rate is 0.0001, momentum is 0.95, and the batch size is 100. I know it's possible to do much better, because when I run batch gradient descent the error falls much farther. Any tips?
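For reference, here's a minimal sketch of the kind of update loop I'm running (classical SGD with momentum, with those same hyperparameters). The toy linear-regression data and all the variable names are just stand-ins for my real setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy linear regression, a stand-in for my real problem.
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

lr, momentum, batch_size = 1e-4, 0.95, 100
w = np.zeros(5)   # weights
v = np.zeros(5)   # momentum (velocity) buffer

for epoch in range(50):
    perm = rng.permutation(len(X))          # shuffle before each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch MSE gradient
        v = momentum * v - lr * grad                # classical momentum update
        w += v

print(np.mean((X @ w - y) ** 2))  # final training MSE
```

With a fixed learning rate, the mini-batch gradient noise keeps the weights bouncing around the minimum, which is consistent with the plateau I'm seeing.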
EDIT: I'm shuffling the training set before each epoch. Here's the graph: http://imgur.com/gallery/1YKV1eF/new