I'm reading Ng et al.'s paper on Deep learning with COTS HPC systems and came across something I don't intuitively understand: when constructing a linear filter layer in a greedy fashion (i.e., by training an autoencoder for each layer), they use the transpose of the linear layer's weight matrix, W', as the "decoding matrix." See section 3 of the paper; it appears in the first optimization problem they describe.
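(If I'm reading section 3 right, the problem is roughly: minimize over W the sum over examples of ||W'*W*x - x||^2, plus a sparsity penalty on the codes W*x -- so W encodes and its own transpose W' decodes.)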
I understand the theory behind autoencoders, but can anyone describe intuitively how they can get away with using W' instead of a separately learned decoder matrix? It seems to me that this constrains W'*W to be close to a multiple of the identity matrix, but I can't visualize what this soft constraint does to the learned parameters.
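To make the setup concrete, here's a toy numpy sketch I put together (my own, not the paper's) of a tied-weights linear autoencoder, where x_hat = W'*W*x and all the names (n_in, n_hidden, lr, ...) are placeholders I made up:

    # Toy sketch (mine, not the paper's): a tied-weights linear autoencoder,
    # x_hat = W'*W*x, trained by plain gradient descent on reconstruction error.
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_samples = 20, 10, 1000
    X = rng.standard_normal((n_samples, n_in))  # stand-in for whitened inputs

    W = 0.1 * rng.standard_normal((n_hidden, n_in))
    lr = 1e-3
    for _ in range(2000):
        H = X @ W.T    # encode: h = W*x
        E = H @ W - X  # decode with the transpose W', take reconstruction error
        # gradient of sum_i ||W'*W*x_i - x_i||^2 w.r.t. W
        # (one term through the encoder, one through the tied decoder)
        grad = H.T @ E + W @ E.T @ X
        W -= lr * grad / n_samples

    # The "soft constraint" showing up: W*W' drifts toward the identity,
    # i.e. the rows of W become roughly orthonormal.
    print(np.round(W @ W.T, 2))

In this toy linear case the tied decoder does seem to push the rows of W toward an orthonormal-ish set; what I can't picture is what that pressure looks like for the actual sparse filters in the paper. Thanks in advance!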