In Socher & Manning's Deep Learning for NLP tutorial, they describe a simple feed-forward network with one hidden layer (slide 49) that is able to learn features which are (to some extent) context-aware.
The idea, briefly, is this: given a sequence of symbols where context matters, concatenate the vector representations of these symbols into a vector x. Make a copy of the sequence, randomly change one of the symbols to corrupt the context, and concatenate those representations to make x_hat. For each input vector, compute a scalar score by doing one feed-forward pass and taking the dot product of the hidden layer's output with an arbitrary vector. (At least I think it's arbitrary; they don't mention it anywhere, so input here is welcome.) The score of x is S(x), and for x_hat it's S(x_hat).
Features are then learnt by minimizing a hinge loss: J = max(0, 1 - S(x) + S(x_hat))
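For concreteness, here is a minimal sketch of how I'm reading that setup (PyTorch; the names, the sizes, and my choice to treat the scoring vector U as a learned parameter are my own assumptions, not something the slides state):

```python
import torch

torch.manual_seed(0)

vocab, dim, window, hidden = 100, 50, 5, 20
emb = torch.randn(vocab, dim, requires_grad=True)           # one vector per symbol (the "features")
W = torch.randn(hidden, window * dim, requires_grad=True)   # hidden-layer weights
b = torch.zeros(hidden, requires_grad=True)                 # hidden-layer bias
U = torch.randn(hidden, requires_grad=True)                 # scoring vector (I'm assuming it's learned too)

def score(word_ids):
    x = emb[word_ids].reshape(-1)        # concatenate the window's symbol vectors
    h = torch.tanh(W @ x + b)            # one feed-forward pass
    return U @ h                         # dot product -> scalar score

good = torch.tensor([3, 17, 42, 8, 99])  # an observed window (made-up ids)
bad = good.clone()
bad[2] = 7                               # corrupt one symbol to break the context

J = torch.clamp(1 - score(good) + score(bad), min=0)   # hinge loss
J.backward()                             # gradients reach emb, W, b and U alike
```

That last line is exactly what prompts my question below: the backward pass touches the word vectors and the network weights at the same time.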
I'm trying to implement it from the slides for a related toy problem, but I'm curious about a few things: when you back-propagate unlabeled information, why would you also change the network weights? Aren't we purely interested in learning features that make S(x) > S(x_hat) by some margin?
The next step in the process is back-propagating label information into the representations by using a softmax output layer. Again, what is the point of also changing the weights if we only care about the representations? Wouldn't we purely want to optimize the representation vectors so that they yield correct classifications, and then do weight learning separately at some later time?
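Here is how I'm picturing that supervised step (again only a sketch under my own assumptions: a single softmax layer on top of the concatenated representation, with made-up names and sizes):

```python
import torch
import torch.nn.functional as F

dim, window, n_classes = 50, 5, 3
rep = torch.randn(window * dim, requires_grad=True)             # stands in for the learned representation
W_s = torch.randn(n_classes, window * dim, requires_grad=True)  # softmax-layer weights
label = torch.tensor([1])                                       # a hypothetical class label

logits = (W_s @ rep).unsqueeze(0)        # shape (1, n_classes)
loss = F.cross_entropy(logits, label)    # softmax + negative log-likelihood
loss.backward()

# After backward(), both rep.grad and W_s.grad are non-zero: the same pass that
# adjusts the classifier weights also pushes label information into the representation.
```

So mechanically the label gradient lands in both places, and my question is why we want the weight part of that update at all rather than freezing the weights and only moving the representations.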
Thanks a bunch. Semi-supervised learning is one of the coolest ideas I've heard about in a while.