
I tried a variety of incrementally different neural network training techniques on a set of image data, and put together an album of the results.

I thought this might be helpful to some people. Maybe some others can comment on why certain techniques converge and others fail to do so (in particular, networks with ReLU units).

First, the data: I'm using a set of 8x8 whitened natural image patches. You can get them from Bruno Olshausen's website:

http://redwood.berkeley.edu/bruno/sparsenet/

In addition to the whitening, I scaled and offset these images so that their values fell within the range of 0.2 - 0.8.

Edit: This statement is only partially correct. The data was scaled so that (if I recall correctly) it has a mean of 0.5 and a standard deviation of 0.1 (meaning that the data falls within 0.2-0.8 to within three standard deviations). Here is a histogram.
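
For concreteness, here is a minimal sketch of that preprocessing step in Python/NumPy, assuming the patches arrive as a (num_patches, 64) array; the function and parameter names are mine, not taken from the original code.

    import numpy as np

    def rescale_patches(patches, target_mean=0.5, target_std=0.1):
        # Shift and scale whitened patches to roughly mean 0.5, std 0.1,
        # so values fall within about 0.2-0.8 out to three standard deviations.
        patches = patches - patches.mean()
        patches = patches / (patches.std() + 1e-8)
        return target_mean + target_std * patches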

All of these experiments involve an autoencoding task, where the network is asked simply to reproduce its input (each 8x8 patch is vectorized into a single 64-dimensional vector).

You can see all the results in an album here:

http://imgur.com/a/8g9ST

Let's begin with regular old backpropagation, as you would find in books about neural networks from the '90s. The network structure is 64-25-64, meaning the input and output layers have 64 neurons each, and in between is a single 25-neuron hidden layer. The neurons have a sigmoidal activation function. We'll train it using Stochastic Gradient Descent with the following parameters (a rough sketch of the training loop follows the list):

  • 3000 iterations with a learning step of 1.0.

  • 3000 iterations with a learning step of 0.1.

  • Minibatches of 100 examples.

  • Five gradient descent steps per minibatch.
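
A rough sketch of that training loop in Python/NumPy is below. The helper names, weight shapes, and initialization are my assumptions; this is not the original code, just an illustration of plain minibatch SGD on a 64-25-64 sigmoid autoencoder with a squared-error loss.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def init_weights(sizes=(64, 25, 64), scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        W1 = scale * rng.standard_normal((sizes[0], sizes[1]))
        W2 = scale * rng.standard_normal((sizes[1], sizes[2]))
        return W1, np.zeros(sizes[1]), W2, np.zeros(sizes[2])

    def train_sgd(X, W1, b1, W2, b2, schedule=((3000, 1.0), (3000, 0.1)),
                  batch_size=100, steps_per_batch=5):
        for iters, lr in schedule:
            for _ in range(iters):
                batch = X[np.random.choice(len(X), batch_size, replace=False)]
                for _ in range(steps_per_batch):
                    # forward pass
                    h = sigmoid(batch @ W1 + b1)        # hidden activations (batch, 25)
                    out = sigmoid(h @ W2 + b2)          # reconstruction (batch, 64)
                    # backward pass for the squared-error loss
                    d_out = (out - batch) * out * (1 - out)
                    d_h = (d_out @ W2.T) * h * (1 - h)
                    W2 -= lr * h.T @ d_out / batch_size
                    b2 -= lr * d_out.mean(axis=0)
                    W1 -= lr * batch.T @ d_h / batch_size
                    b1 -= lr * d_h.mean(axis=0)
        return W1, b1, W2, b2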

What comes out is a network with an average per-pixel error of 0.031, which is pretty good. However, Figure 1 in the image album shows a visualization of the weight matrix between the input and the hidden layer (each tile is made up of the weights incident on a single hidden neuron). The weights are more structured than noise, but they still don't tell us much.
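
For reference, a visualization like Figure 1 can be produced by reshaping each hidden neuron's 64 incoming weights back into an 8x8 tile. A sketch, assuming matplotlib and a weight matrix of shape (64, 25):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_hidden_weights(W1, rows=5, cols=5):
        # each column of W1 holds the weights feeding one hidden neuron
        fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
        for j, ax in enumerate(axes.ravel()):
            ax.imshow(W1[:, j].reshape(8, 8), cmap='gray')
            ax.axis('off')
        plt.show()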

Next, let's add a sparsity penalty to the model, as well as some weight regularization. Everything else stays the same, but now we penalize neurons whose average activation deviates from a target value (in our case, 0.01), and also penalize weights for growing too large. The technique used below is detailed here:

http://deeplearning.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity
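
In code, the extra penalty amounts to a KL-divergence term on each hidden unit's average activation plus an L2 weight-decay term. Here is a sketch following the UFLDL recipe; the penalty weights beta and lam are illustrative guesses, not the values used in these experiments.

    import numpy as np

    def sparsity_and_decay(h, W1, W2, rho=0.01, beta=3.0, lam=1e-4):
        # h: (batch, hidden) sigmoid activations; rho: target activation (0.01)
        rho_hat = h.mean(axis=0)                       # average activation per hidden unit
        kl = np.sum(rho * np.log(rho / rho_hat) +
                    (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        decay = 0.5 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
        # term added to each hidden unit's delta in the backward pass
        d_sparse = beta * (-(rho / rho_hat) + (1 - rho) / (1 - rho_hat))
        return beta * kl + lam * decay, d_sparse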

See Figure 2 for a picture of the weights that emerge after training. Average error per pixel is 0.88, so we've lost some performance. The weights seem to be trending toward edge detectors, but they aren't there yet. What happens if we keep everything the same, but train the model for twice as many iterations (6000 at each step size)? See Figure 3. That's more like it. Average error is now 0.083. Getting better.

So, great. Can we do better? It turns out we can. RMSProp is a variant of SGD that, when combined with Nesterov momentum, can converge very quickly. Figure 4 shows the result we get when we train the same network, with the same sparsity and weight regularization parameters, for 3000 iterations of RMSProp. Average error is 0.076. That's the best so far.
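
For reference, one way to write the update is below: a sketch of RMSProp combined with Nesterov-style look-ahead momentum, with illustrative hyperparameters rather than the ones used for Figure 4.

    import numpy as np

    def rmsprop_nesterov_step(w, grad_fn, ms, v, lr=1e-3, decay=0.9,
                              momentum=0.9, eps=1e-8):
        # grad_fn(w) returns the gradient of the loss at w;
        # ms and v hold the running squared-gradient average and the velocity
        g = grad_fn(w + momentum * v)                  # gradient at the look-ahead point
        ms[:] = decay * ms + (1 - decay) * g ** 2      # running average of squared gradients
        v[:] = momentum * v - lr * g / (np.sqrt(ms) + eps)
        w += v
        return w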

Can we do better still? There are three things left to try, all of which involve eschewing explicit sparsity penalties during training altogether. A technique called Dropout involves simply removing half the neurons during each minibatch. Figure 5 shows the result when we turn off all the sparsity and regularization penalties and implement dropout. It looks a little like Figure 1. The average error per pixel is 0.069. We're still using sigmoidal units and SGD, and we're still training for 6000 iterations with a learning step of 1.0 and another 6000 at 0.1. This is the best performance observed so far for a model that is attempting to be sparse, but we don't see edge detectors in the weights. Why?
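
A sketch of the dropout forward pass for this network is below; the names are mine. Each minibatch gets a fresh random mask that silences roughly half the hidden units, and at test time the hidden activations are scaled by the keep probability instead.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dropout_forward(batch, W1, b1, W2, b2, p_keep=0.5):
        h = sigmoid(batch @ W1 + b1)
        mask = np.random.rand(*h.shape) < p_keep       # drop ~half the hidden units
        h = h * mask
        out = sigmoid(h @ W2 + b2)
        return out, h, mask                            # mask is reused in the backward pass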

Next, let's add a different type of weight regularization. Instead of simply penalizing weights for growing too large, let's constrain them based on the L2 norm of all the weights feeding into each hidden neuron: if that norm exceeds a threshold, the weights are rescaled so that the threshold is no longer violated. Figure 6 shows the result when we threshold the norms at 1.0. These weights look much more like edge detectors. The average error per pixel for SGD, dropout, and this weight constraint is 0.087. Dropout seems to have a strong interaction with weight regularization.
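
A sketch of that constraint, typically applied after each gradient step: the columns of W1 (one per hidden neuron) are rescaled whenever their L2 norm exceeds the threshold.

    import numpy as np

    def apply_max_norm(W1, max_norm=1.0):
        # W1 has shape (inputs, hidden); column j feeds hidden neuron j
        norms = np.linalg.norm(W1, axis=0, keepdims=True)
        scale = np.minimum(1.0, max_norm / (norms + 1e-8))
        return W1 * scale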

Finally, we can try using Rectified Linear Units (ReLU) instead of sigmoids. Figure 7 shows the network trained with SGD, dropout, L2-norm weight regularization, and ReLU units. Not so good anymore. The average error per pixel is 0.41; the output is barely better than noise.

What went wrong? Maybe the regularization threshold was set too low. Figure 8 shows the result when the threshold is set to 5.0. Almost identical. The error is still 0.41.

Maybe ReLU units are only effective on large, deep networks and not small ones? Figure 9 shows the result when the network size is changed from 64-25-64 to 64-50-50-50-64 (other than that, the same configuration as Figure 7 is used). The weights between the input and the first hidden layer are shown. It looks like a few copies of one edge detector might be emerging, but not much else is being learned.

So what is the story with ReLU units? They're supposed to be the state of the art for deep learning, but in this case they don't seem to be able to learn much at all. Are they being used improperly in the examples above?

Edit: I originally misremembered how the data was preprocessed. See the correction above.

submitted by eubarch
