I was reading this paper by Xavier Glorot and have some questions about how to implement his methods:
http://eprints.pascal-network.org/archive/00008596/01/glorot11a.pdf
1. In the Experiments section he says he chose to flip half the unit activations to numerically equalize the layer activations. Since this is in a layer-wise pretraining framework, I wanted to confirm what this means: whenever we are learning a layer, are half of its units ReLU units and the rest negative ReLU? (A sketch of my interpretation is right below.)
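To make question 1 concrete, here is a minimal NumPy sketch of the interpretation I'm asking about. The function name `split_relu` and the even half/half split are my own assumptions, not something taken from the paper:

```python
import numpy as np

def split_relu(pre_activation):
    """One possible reading of "flipping half the unit activations":
    the first half of the units use the usual ReLU, the second half use a
    sign-flipped ReLU, so the layer's mean activation stays roughly centred.
    This is my interpretation, not necessarily what the paper actually does."""
    n_units = pre_activation.shape[1]
    half = n_units // 2
    out = np.empty_like(pre_activation)
    out[:, :half] = np.maximum(0.0, pre_activation[:, :half])    # standard ReLU: max(0, x)
    out[:, half:] = -np.maximum(0.0, pre_activation[:, half:])   # flipped units: -max(0, x)
    # another possible reading is min(0, x) = -max(0, -x) for the flipped half
    return out
```

Is this roughly what "flipping" means here, or does he flip the sign of the weights/biases instead of the activation function?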
2. The reconstruction functions also seem to have been problematic for him, and he tries out several methods. One of these is scaling the activations of the hidden layer to the [0, 1] range and then using sigmoids for the reconstruction.
a. How do we normalize the activations to [0, 1]: across a minibatch, or across the hidden layer?
b. Suppose I call my visible layer v: I learn hidden layer 1, h1, and then want to learn the second hidden layer, h2. So going from v to h1 I use ReLU, but coming down from h2 to h1 I use sigmoid? That is, the up and down activations of each hidden layer are different? (A sketch of what I mean is after this list.)
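To make 2a/2b concrete, here is a sketch of the up/down pass I have in mind for one pretraining step. The function name, the tied weights (`W.T`), the parameter names, and both normalization options are my assumptions; question 2a is exactly about which normalization axis is correct:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def forward_and_reconstruct(v, W, b_hid, b_vis, normalize_over="layer"):
    """Hypothetical encode/decode pass (v -> h1 -> v_recon) for layer-wise
    pretraining, with ReLU on the way up and sigmoid on the way down."""
    h1 = relu(v @ W + b_hid)                      # "up" pass: ReLU activations

    # scale the hidden activations into [0, 1] before reconstruction;
    # the axis is the thing question 2a asks about, so both options are shown
    if normalize_over == "layer":
        denom = h1.max(axis=1, keepdims=True)     # per example, across the hidden layer
    else:
        denom = h1.max(axis=0, keepdims=True)     # per unit, across the minibatch
    h1_scaled = h1 / (denom + 1e-8)

    v_recon = sigmoid(h1_scaled @ W.T + b_vis)    # "down" pass: sigmoid reconstruction
    return h1, v_recon
```

Is this the right picture, i.e. each hidden layer uses ReLU when it acts as an encoder output but sigmoid when it is being reconstructed from the layer above?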