Hi all. I have been working with large neural networks lately and have been training them online on live data using mini-batch stochastic gradient descent with a fixed learning rate.
I am seeing fairly slow convergence, which is consistent with my experience with fixed learning rates. Heuristic methods like iRPROP have worked much better for me, since they adapt the step size on a parameter-by-parameter basis, and I have found they converge significantly faster.
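For reference, this is roughly the per-parameter rule I mean (a quick iRPROP- style sketch, not from any particular library; the eta factors and step-size bounds are just the usual textbook defaults, nothing specific to my setup):

```python
import numpy as np

def irprop_minus_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                      step_min=1e-6, step_max=1.0):
    sign_change = grad * prev_grad
    # Grow the per-parameter step where the gradient sign is stable,
    # shrink it where the sign flipped.
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # iRPROP-: zero the gradient where the sign flipped, so no move is made
    # along that coordinate this iteration.
    grad = np.where(sign_change < 0, 0.0, grad)
    w = w - np.sign(grad) * step
    return w, grad, step  # returned grad becomes prev_grad on the next call
```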
I am wondering if anyone has seen resilient propagation used in conjunction with mini-batch SGD?
I would imagine running iRPROP on the batch for a while, then incorporating the result like:
g1 = (1 - lr) * g0 + lr * g_irprop
where g1 is the updated gradient, g0 is the original gradient, and g_irprop is the gradient obtained after some amount of iRPROP.
lr is the "learning rate", which gets implicitly scaled by the per-parameter step sizes learned via resilient propagation. So this seems "smarter" than pure SGD, but it might run into issues because iRPROP would be running on fairly small batches.
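Concretely, something like this is what I have in mind (rough, untested sketch; the function name, lr, and the SGD step size are just placeholders):

```python
import numpy as np

def blended_update(w, g0, g_irprop, lr=0.1, sgd_step=0.01):
    # g0:       gradient of the loss on the current mini-batch
    # g_irprop: effective gradient after running a few iRPROP iterations
    #           on that same batch (e.g. the net parameter change, treated
    #           as a per-parameter-scaled step direction)
    # Convex combination of the raw batch gradient and the iRPROP-shaped one;
    # lr controls how much of the per-parameter scaling leaks into SGD.
    g1 = (1.0 - lr) * g0 + lr * g_irprop
    return w - sgd_step * g1
```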
Thoughts?