I've been banging my head against this problem for a while now, and I'm about to give up and just use the one method I have that works. However, I think I also have evidence that something in my implementation is broken. I'm asking for help here because my options for soliciting feedback/advice are pretty limited, so apologies for the multiple posts on the same subject.
Here's an album of some simple experimental results from building an autoencoder with 25 hidden units and training it on 8x8 grayscale patches from Bruno Olshausen's whitened natural images dataset:
Ideally, such an autoencoder should resolve 25 edge detectors in this configuration. The first image shows this, and it's the result of training the network with "stochastic gradient descent", i.e. simple fixed-step gradient descent wherein the batch size is low (100 training examples), and only one step is taken per batch. The second figure shows the objective function versus the training iteration, and you can see the random walk downwards over 24,000 batch iterations. This took a little over 2 minutes to run.
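For concreteness, here's a minimal sketch of that fixed-step, one-step-per-minibatch procedure. It assumes a plain sigmoid-hidden / linear-output autoencoder with squared-error loss and no sparsity or weight-decay terms (details I haven't spelled out above), and random data standing in for the whitened patches, so it's only meant to show the training loop, not my actual code.

```python
import numpy as np

# Sketch only: sigmoid hidden layer, linear reconstruction, squared-error loss.
# Random data stands in for the 8x8 whitened image patches.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64)) * 0.1

n_vis, n_hid = 64, 25
W1 = rng.standard_normal((n_vis, n_hid)) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_hid, n_vis)) * 0.1
b2 = np.zeros(n_vis)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, batch_size = 0.1, 100
for it in range(24000):
    batch = X[rng.integers(0, len(X), batch_size)]

    # forward pass
    h = sigmoid(batch @ W1 + b1)          # hidden activations
    out = h @ W2 + b2                     # linear reconstruction
    err = out - batch
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))

    # backward pass
    d_out = err / batch_size
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    dW1 = batch.T @ d_h
    db1 = d_h.sum(axis=0)

    # single fixed-step update per minibatch
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if it % 1000 == 0:
        print(it, loss)
```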
The last picture is a typical example of the results I get from running any of three algorithms in a more typical fashion (i.e., with the batch size equal to the training-set size and multiple steps taken per batch). Both L-BFGS and Conjugate Gradient Descent manage to quickly (within 50 iterations) find a minimum on the order of 0.5 (equivalent to the finishing value of stochastic gradient descent), but the result looks like the third figure. Standard gradient descent with a large batch also does this. L-BFGS in particular (I'm using the implementation from the RISO project) will iterate a few times and then fail when it has a nonzero gradient but ends up taking a step of length 0.
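For comparison, the full-batch setup looks roughly like this: flatten all the parameters into one vector and hand a joint objective/gradient callable to an off-the-shelf batch optimizer. The sketch below uses SciPy's L-BFGS-B rather than the RISO implementation I'm actually using, with the same simplified objective as above, purely to illustrate the wiring.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: full-batch training (batch = whole training set, many steps per batch).
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64)) * 0.1   # stand-in for the whitened patches
n_vis, n_hid = 64, 25

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta):
    i = 0
    W1 = theta[i:i + n_vis * n_hid].reshape(n_vis, n_hid); i += n_vis * n_hid
    b1 = theta[i:i + n_hid]; i += n_hid
    W2 = theta[i:i + n_hid * n_vis].reshape(n_hid, n_vis); i += n_hid * n_vis
    b2 = theta[i:]
    return W1, b1, W2, b2

def cost_and_grad(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2
    err = out - X
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    d_out = err / len(X)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)
    grad = np.concatenate([dW1.ravel(), db1, dW2.ravel(), db2])
    return loss, grad

theta0 = rng.standard_normal(n_vis * n_hid + n_hid + n_hid * n_vis + n_vis) * 0.1
res = minimize(cost_and_grad, theta0, jac=True, method="L-BFGS-B",
               options={"maxiter": 50})
print(res.fun, res.message)
```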
My gradient calculation has been tested and I have high confidence that it is working properly. My objective function calculation seems to be the only thing separating CGD and L-BFGS from fixed-step gradient descent, but I've been staring at it for many hours now and it just isn't complex enough to convince me that there's a bug hidden in there. I would blame the data, but this exact experiment is solved using L-BFGS in Andrew Ng's tutorial here.
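One thing worth noting: fixed-step gradient descent only ever consumes the gradient, while L-BFGS and CG also evaluate the objective during their line searches, so any mismatch between the objective and the gradient would hit exactly the batch methods. A central-difference check of the objective against the analytic gradient tests the two as a pair. The sketch below is generic; it could be pointed at a joint cost/gradient callable like the (hypothetical) cost_and_grad above.

```python
import numpy as np

# Central-difference check that an objective and its analytic gradient agree.
def check_gradient(f, theta, eps=1e-5, n_checks=20, rng=None):
    rng = rng or np.random.default_rng(0)
    _, grad = f(theta)
    max_rel_err = 0.0
    for _ in range(n_checks):
        i = rng.integers(len(theta))            # probe a random coordinate
        e = np.zeros_like(theta); e[i] = eps
        num = (f(theta + e)[0] - f(theta - e)[0]) / (2 * eps)
        rel = abs(num - grad[i]) / max(1e-12, abs(num) + abs(grad[i]))
        max_rel_err = max(max_rel_err, rel)
    return max_rel_err                          # should be ~1e-7 or smaller

# Trivial demo: quadratic with known gradient.
f = lambda t: (0.5 * np.dot(t, t), t)
print(check_gradient(f, np.arange(5.0)))
```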
I'm about to use this code on some much larger experiments and I don't want to start off with a buggy implementation, but I can't nail down where my method might be diverging from Ng's example. Any thoughts or suggestions would be appreciated.