I've been banging my head against this problem for a while now, and I'm about to give up and just use the one method I have that works. However, I think I also have evidence that something in my implementation is broken. I'm asking for help here because my options for soliciting feedback/advice are pretty limited, so apologies for the multiple posts on the same subject.
Here's an album of some simple experimental results from building an autoencoder with 25 hidden units and training it on 8x8 grayscale patches from Bruno Olshausen's whitened natural images dataset:
Ideally, such an autoencoder should resolve 25 edge detectors in this configuration. The first image shows this, and it's the result of training the network with "stochastic gradient descent", i.e. simple fixed-step gradient descent wherein the batch size is low (100 training examples), and only one step is taken per batch. The second figure shows the objective function versus the training iteration, and you can see the random walk downwards over 24,000 batch iterations. This took a little over 2 minutes to run.
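For concreteness, here's a minimal sketch of that fixed-step, one-step-per-minibatch procedure. It assumes a plain sigmoid-hidden / linear-output autoencoder with squared-error loss and no sparsity or weight-decay terms (details I haven't spelled out above), and random data standing in for the whitened patches, so it's only meant to show the training loop, not my actual code.

```python
import numpy as np

# Sketch only: sigmoid hidden layer, linear reconstruction, squared-error loss.
# Random data stands in for the 8x8 whitened image patches.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64)) * 0.1

n_vis, n_hid = 64, 25
W1 = rng.standard_normal((n_vis, n_hid)) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_hid, n_vis)) * 0.1
b2 = np.zeros(n_vis)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, batch_size = 0.1, 100
for it in range(24000):
    batch = X[rng.integers(0, len(X), batch_size)]

    # forward pass
    h = sigmoid(batch @ W1 + b1)          # hidden activations
    out = h @ W2 + b2                     # linear reconstruction
    err = out - batch
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))

    # backward pass
    d_out = err / batch_size
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    dW1 = batch.T @ d_h
    db1 = d_h.sum(axis=0)

    # single fixed-step update per minibatch
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if it % 1000 == 0:
        print(it, loss)
```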
The last picture is a typical example of the results I get from running any of three algorithms in a more typical fashion (i.e., with the batch size equal to the training-set size and multiple steps taken per batch). Both L-BFGS and Conjugate Gradient Descent manage to quickly (within 50 iterations) find a minimum on the order of 0.5 (equivalent to the finishing value of stochastic gradient descent), but the result looks like the third figure. Standard gradient descent with a large batch also does this. L-BFGS in particular (I'm using the implementation from the RISO project) will iterate a few times and then fail when it has a nonzero gradient but ends up taking a step of length 0.
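For comparison, the full-batch setup looks roughly like this: flatten all the parameters into one vector and hand a joint objective/gradient callable to an off-the-shelf batch optimizer. The sketch below uses SciPy's L-BFGS-B rather than the RISO implementation I'm actually using, with the same simplified objective as above, purely to illustrate the wiring.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: full-batch training (batch = whole training set, many steps per batch).
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64)) * 0.1   # stand-in for the whitened patches
n_vis, n_hid = 64, 25

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta):
    i = 0
    W1 = theta[i:i + n_vis * n_hid].reshape(n_vis, n_hid); i += n_vis * n_hid
    b1 = theta[i:i + n_hid]; i += n_hid
    W2 = theta[i:i + n_hid * n_vis].reshape(n_hid, n_vis); i += n_hid * n_vis
    b2 = theta[i:]
    return W1, b1, W2, b2

def cost_and_grad(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2
    err = out - X
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    d_out = err / len(X)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)
    grad = np.concatenate([dW1.ravel(), db1, dW2.ravel(), db2])
    return loss, grad

theta0 = rng.standard_normal(n_vis * n_hid + n_hid + n_hid * n_vis + n_vis) * 0.1
res = minimize(cost_and_grad, theta0, jac=True, method="L-BFGS-B",
               options={"maxiter": 50})
print(res.fun, res.message)
```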
My gradient calculation has been tested and I have high confidence that it is working properly. My objective function calculation seems to be the only thing separating CGD and L-BFGS from fixed-step gradient descent, but I've been staring at it for many hours now and it just isn't complex enough to convince me that there's a bug hidden in there. I would blame the data, but this exact experiment is solved using L-BFGS in Andrew Ng's tutorial here.
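One thing worth noting: fixed-step gradient descent only ever consumes the gradient, while L-BFGS and CG also evaluate the objective during their line searches, so any mismatch between the objective and the gradient would hit exactly the batch methods. A central-difference check of the objective against the analytic gradient tests the two as a pair. The sketch below is generic; it could be pointed at a joint cost/gradient callable like the (hypothetical) cost_and_grad above.

```python
import numpy as np

# Central-difference check that an objective and its analytic gradient agree.
def check_gradient(f, theta, eps=1e-5, n_checks=20, rng=None):
    rng = rng or np.random.default_rng(0)
    _, grad = f(theta)
    max_rel_err = 0.0
    for _ in range(n_checks):
        i = rng.integers(len(theta))            # probe a random coordinate
        e = np.zeros_like(theta); e[i] = eps
        num = (f(theta + e)[0] - f(theta - e)[0]) / (2 * eps)
        rel = abs(num - grad[i]) / max(1e-12, abs(num) + abs(grad[i]))
        max_rel_err = max(max_rel_err, rel)
    return max_rel_err                          # should be ~1e-7 or smaller

# Trivial demo: quadratic with known gradient.
f = lambda t: (0.5 * np.dot(t, t), t)
print(check_gradient(f, np.arange(5.0)))
```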
I'm about to use this code on some much larger experiments and I don't want to start off with a buggy implementation, but I can't nail down where my method might be diverging from Ng's example. Any thoughts or suggestions would be appreciated.