I've been building my own NN library, simply because I find that's the easiest way to learn about things. My approach has been to add every feature under the sun (dropout, momentum, adaptive learning rates, regularization, GPU, etc.). I've been using/testing it primarily on a Kaggle competition with a relatively small amount of labelled data (~15k examples).
So far things work quite well for standard training but I've never successfully trained a network using dropout. With the latter I either get:
1 - Stuck at around 60-70% training/validation error.
2 - Numerical instabilities (NaNs)
3 - Stuck in a bias-dominated regime.
By (3) I mean the network always selects one particular output (from a softmax layer) for all validation cases. The actual choice of output changes each epoch, but the network somehow gets stuck in a state where the biases dominate.
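To make (3) concrete, here is a rough NumPy sketch (hypothetical names, not code from my library) of the kind of check that flags it: the fraction of validation cases assigned to the single most popular class sits near 1.0 whenever the output has collapsed.

```
import numpy as np

def collapse_fraction(probs):
    """probs: (n_examples, n_classes) array of softmax outputs."""
    preds = probs.argmax(axis=1)
    counts = np.bincount(preds, minlength=probs.shape[1])
    # fraction of validation cases assigned to the most popular class;
    # values near 1.0 mean the output is effectively constant
    return counts.max() / float(len(preds))
```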
I've tried a whole bunch of things:
- Increasing the learning rate: leads to either (2) or (3). This happens already at smallish rates like 0.5.
- ReLU or channelout layers: the problem happens with both.
- Momentum: usually makes things worse
- Fan-in regularizer: makes some weights vanishingly small (~1e-100) and tends to lead to bias domination (but does avoid the numerical instabilities); see the max-norm sketch after this list.
- L2 regularizer: doesn't do much.
- Adaptive per-weight learning rates: tend to lead to more numerical instabilities. This really speeds up normal learning, but with dropout it seems to exacerbate (2) and (3).
- Random dropout rate: drawing it between e.g. 0.3 and 0.5 still gives all the same problems.
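For reference, the standard max-norm version of a fan-in constraint (the one used in the dropout paper) only rescales an incoming weight vector when its norm exceeds a cap c, rather than pushing everything towards zero. A rough NumPy sketch of that idea, assuming an (n_in, n_out) weight layout (names and layout are just for illustration):

```
import numpy as np

def max_norm_constraint(W, c=3.0):
    """Rescale each unit's incoming weight vector to have norm <= c.

    W is assumed to be (n_in, n_out), i.e. column j holds the weights
    feeding into hidden unit j. Applied after every weight update.
    """
    norms = np.sqrt((W ** 2).sum(axis=0, keepdims=True))
    scale = np.minimum(1.0, c / (norms + 1e-8))  # only shrink, never grow
    return W * scale
```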
I realize that the best thing to do would probably be to retreat to a more standard dataset like MNIST (which I will probably do), but the thing is that without dropout I can achieve 80% accuracy on the validation set (and 100% on the training set), so it's clear that there's enough data to do better.
To give more details: I'm using a relatively large network (4-5 layers of size 200, with inputs of size ~50 and outputs of size ~10), with either ReLU or channelout units and without any pretraining.
I refresh the choice of dropped units every minibatch (size 10-150), and I also apply 0.2 dropout to the input layer.
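For concreteness, here's a rough sketch (NumPy, illustrative names only, not my actual code) of the inverted-dropout forward pass I'm trying to match: the mask is redrawn for every minibatch, and the kept activations are scaled by 1/(1-p) so that nothing has to be rescaled at test time.

```
import numpy as np

def dropout_forward(acts, p_drop, rng, train=True):
    """acts: (batch, n_units) activations; p_drop: probability of dropping a unit."""
    if not train or p_drop <= 0.0:
        return acts
    keep = 1.0 - p_drop
    # fresh mask per minibatch; scaling keeps the expected activation unchanged
    mask = (rng.uniform(size=acts.shape) < keep) / keep
    return acts * mask

# e.g. p_drop = 0.2 on the inputs, 0.5 on the hidden layers
```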
I do intend to go back to MNIST and compare to some published results but if anyone can provide some thoughts or inspirations I would really appreciate it!