Hello,
Suppose we have the first layer of a neural network, h = activation(W*x). In general, I would regularize this part of the network by applying a small amount of dropout to x, constraining the row-norms of the weight matrix W, and applying a small amount of weight decay to W.
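To make that dense recipe concrete, here is a minimal NumPy sketch of what I mean (the sizes, dropout rate, norm cap, and decay strength are just placeholders, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 1000, 256
W = rng.normal(scale=0.01, size=(n_hidden, n_in))       # first-layer weights

def forward(W, x, p_drop=0.1, train=True):
    if train:
        mask = rng.random(x.shape) >= p_drop            # dropout on the inputs
        x = x * mask / (1.0 - p_drop)                   # inverted-dropout rescaling
    return np.maximum(0.0, W @ x)                       # h = activation(W*x), ReLU here

def regularize(W, lr=0.1, weight_decay=1e-4, max_norm=3.0):
    """Run after each gradient step: weight decay plus a row-norm constraint."""
    W = W - lr * weight_decay * W                       # small L2 weight decay on all of W
    norms = np.linalg.norm(W, axis=1, keepdims=True)    # row-norm per hidden unit
    return W * np.minimum(1.0, max_norm / (norms + 1e-12))  # rescale rows over the cap
```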
However, I'm not sure this is still the best strategy if x is very sparse. Suppose, in the most extreme case, that x is a single categorical variable with tens of thousands of possible values. It is also a practical necessity that the runtime of the training algorithm be proportional to the number of non-zero elements in x rather than to the total size of x.
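The forward pass itself is easy to keep at O(nnz(x)): with W of shape (n_hidden, n_in), computing W*x only touches the columns at the non-zero positions of x. A toy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 50_000, 256                        # tens of thousands of categories
W = rng.normal(scale=0.01, size=(n_hidden, n_in))

active = np.array([17, 4_031, 49_999])              # indices of the non-zero entries of x
values = np.ones(len(active))                       # their values (1.0 for one/multi-hot)

h_sparse = np.maximum(0.0, W[:, active] @ values)   # touches only nnz(x) columns of W

x = np.zeros(n_in)                                  # dense equivalent touches all n_in columns
x[active] = values
assert np.allclose(h_sparse, np.maximum(0.0, W @ x))
```

The question is really about how to make the regularizers respect the same cost budget.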
Dropout on the inputs. This is a sparse operation if implemented correctly. However, I am somewhat concerned that it will be too strong and noisy as a regularizer for sparse categorical features (since by definition the values of a sparse categorical feature are not positively correlated, whereas features like adjacent pixels in an image are highly positively correlated).
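Something like this sketch is what I have in mind, working directly on the (index, value) representation of x (the helper name and rate are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_dropout(active, values, p_drop=0.1):
    """active: indices of the non-zero entries of x; values: their values."""
    keep = rng.random(len(active)) >= p_drop        # only O(nnz(x)) Bernoulli draws
    # Inverted dropout: rescale kept values so the expected input is unchanged.
    return active[keep], values[keep] / (1.0 - p_drop)

# Example: a multi-hot categorical input with three active categories.
active, values = sparse_dropout(np.array([17, 4_031, 49_999]), np.ones(3))
```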
Weight constraints. Neither computing the row-norms of the weight matrix nor dividing the weights by those norms is a sparse operation. Also, a lot of researchers work on sparse linear models, and I've never seen any of them use weight constraints.
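The only sparse-friendly variant I can think of (my own speculation, not something I've seen in that literature) is to constrain the per-feature column norms instead of the per-unit row norms, since then only the active columns need to be touched:

```python
import numpy as np

def constrain_active_columns(W, active, max_norm=3.0):
    """Rescale only the columns of W touched by the non-zero entries of x.

    Note: this caps per-feature column norms, not the per-unit row norms
    from the dense recipe earlier in the post, so it is a different constraint.
    """
    cols = W[:, active]                                     # (n_hidden, nnz(x)) slice
    norms = np.linalg.norm(cols, axis=0, keepdims=True)     # one norm per active column
    W[:, active] = cols * np.minimum(1.0, max_norm / (norms + 1e-12))
    return W
```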
Weight decay. If one adds an L1/L2 penalty to the cost term, then the gradient and the update are not sparse. Alternatively, one could use a penalty that applies only when the corresponding term of x is non-zero.
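A sketch of that alternative, assuming the per-example gradient of the first layer is non-zero only on the columns of W at the active positions of x (names and hyperparameters are made up). One side effect is that the effective amount of decay then scales with how often a feature is active, so frequent features get regularized more heavily than rare ones.

```python
import numpy as np

def sparse_sgd_step(W, active, grad_active, lr=0.1, weight_decay=1e-4):
    """W: (n_hidden, n_in).  active: indices of the non-zero entries of x.
    grad_active: dLoss/dW[:, active], the only non-zero part of this example's
    gradient.  Everything here costs O(nnz(x) * n_hidden)."""
    cols = W[:, active]
    # L2 penalty applied only where the corresponding x term is non-zero.
    W[:, active] = cols - lr * (grad_active + weight_decay * cols)
    return W
```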
Any ideas / experience on what sorts of methods work well?