Hey guys,
I am programming a neural network from scratch in Python, and when I run a numerical check on the backpropagation gradient, I get results like the following example:
Gradient: -0.0375543629722 Finite difference: 0.0187042723576
The analytical gradient is around twice as big as the finite difference (ignoring the sign).
I am using the following formula for the numerical gradient:
(f(x + eps) - f(x - eps)) / (2 * eps)
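In code, the check looks roughly like this sketch (numerical_grad, loss, weights and idx are placeholder names for illustration, not my actual code):

    import numpy as np

    def numerical_grad(loss, weights, idx, eps=1e-4):
        # central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
        w_plus = weights.copy()
        w_minus = weights.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        return (loss(w_plus) - loss(w_minus)) / (2 * eps)

    # example with a dummy quadratic "loss": d/dw0 of sum(w**2) at w0 = 0.5 is ~1.0
    w = np.array([0.5, -1.0])
    print(numerical_grad(lambda v: np.sum(v ** 2), w, 0))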
For the single-unit sigmoid output layer I am using the following analytical gradient: (target - prediction) * activation_of_hidden_unit
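For context, the backprop step for that hidden-to-output weight is essentially this (the names target, prediction and hidden_activation are just illustrative; it only mirrors the formula above, it's not a proposed fix):

    def output_weight_grad(target, prediction, hidden_activation):
        # the quantity I use as the gradient of the cross-entropy loss
        # w.r.t. the hidden-to-output weight: (t - y) * h
        return (target - prediction) * hidden_activation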
The loss is cross-entropy, and I'm training with stochastic gradient descent (maybe that has something to do with it?).
The network trains and generalizes fine, but I am curious and a little worried about this gradient discrepancy.
If you need more details, please let me know.
Thank you for your help!