I am implementing a generic module for Stochastic Gradient Descent that takes as arguments: the training dataset, loss(x,y) (the per-sample loss), and dw(x,y) (the per-sample gradient).
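For context, the interface I have in mind looks roughly like this (a Python sketch; the names, and the fact that I pass the weight vector `w` explicitly to `loss`/`dw`, are just placeholders, not the final design):

```python
import random
import numpy as np

def sgd(dataset, loss, dw, w0, lr=0.01, max_epochs=100):
    # `dataset` is a list of (x, y) pairs; `loss(w, x, y)` is the per-sample
    # loss and `dw(w, x, y)` the per-sample gradient. Passing w explicitly
    # is just how this sketch is written.
    w = np.array(w0, dtype=float)
    for epoch in range(max_epochs):
        random.shuffle(dataset)        # visit samples in a random order each epoch
        for x, y in dataset:
            w -= lr * dw(w, x, y)      # single-sample update
        # <-- a convergence check would be plugged in here (or every k samples)
    return w
```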
Now, for the convergence criteria, I have thought of:
a) Checking the loss after every 10% of the dataset size, averaged over some window
b) Checking the norm of the difference between successive weight vectors, after every 10-20% of the dataset size
c) Stabilization of error on the training set.
d) Change in the sign of the gradient (again, checked at fixed intervals)
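For concreteness, (a) and (b) could look roughly like this when called at each checkpoint of the loop sketched above (the window size and tolerances are made-up numbers and would need tuning per problem):

```python
import numpy as np

def loss_stabilized(loss_history, window=5, rel_tol=1e-4):
    # (a): stop when the windowed average of the checkpoint losses stops moving.
    # `window` and `rel_tol` are arbitrary placeholder values.
    if len(loss_history) < 2 * window:
        return False
    prev = np.mean(loss_history[-2 * window:-window])
    curr = np.mean(loss_history[-window:])
    return abs(curr - prev) <= rel_tol * max(abs(prev), 1e-12)

def weights_stabilized(w, w_prev_checkpoint, rel_tol=1e-4):
    # (b): stop when the weight vector has barely moved since the last
    # checkpoint, relative to its own norm.
    return np.linalg.norm(w - w_prev_checkpoint) <= rel_tol * (1.0 + np.linalg.norm(w))
```

The idea would be to call these every 10-20% of the dataset, appending the averaged training loss to `loss_history` and snapshotting `w` at each checkpoint.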
I have noticed that these checks (the tolerance of the check, etc.) also depend on other things, like the step size / learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind on what a generic stopping criterion should be, regardless of the training set, f(x), and df/dw thrown at the SGD module. What do you guys do?
Also, for (d), what would "change in sign" mean for an n-dimensional vector?
Edit: formatting