So I saw a talk by Matt Zeiler of Clarifai, "Visualizing and Understanding Deep Neural Networks", which showed some nice visualizations of what the deep layers of convnets are firing on.
According to him, the neurons in the deepest layers act as "object recognizers", corresponding to semantic classes.
One thing he showed particularly stood out to me: a "human face" filter which appeared in the deep layers, even though faces are not a class in the ImageNet labels. He said this filter emerges because it is useful for distinguishing the class "Neck Brace".
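If anyone wants to poke at this themselves, here's a rough sketch of the simplest version of the idea: hook an intermediate layer of a pretrained ImageNet convnet and see which images drive a given deep channel hardest. (This is just plain activation inspection, not the deconvnet visualization from the talk; the model, layer index, and channel number below are arbitrary choices for illustration.)

```python
# Sketch: inspect what a deep conv channel responds to by ranking images
# by that channel's mean activation. Not Zeiler's method, just a quick probe.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(pretrained=True).eval()

activations = {}
def save_activation(name):
    def hook(module, inp, out):
        activations[name] = out.detach()
    return hook

# Hook one of the last conv layers (index 28 = conv5_3 in VGG-16's feature stack).
model.features[28].register_forward_hook(save_activation("deep_conv"))

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def channel_response(image_path, channel):
    """Mean activation of one deep-layer channel for a single image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(x)
    return activations["deep_conv"][0, channel].mean().item()

# Rank a folder of images by how strongly they drive channel 123 (arbitrary choice);
# if the top-ranked images all contain faces, that unit is plausibly a "face" filter.
# top_images = sorted(paths, key=lambda p: channel_response(p, 123), reverse=True)
```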
I wonder if this kind of incidental correlation points to a problem with the way these networks are trained, and with the ImageNet dataset in particular: the annotations are ambiguous and often arguably incorrect, so the networks may be learning concepts that are associated with the labels only through statistical correlations specific to this dataset, and might generalize poorly to real-world images.
I'm sure the big companies have developed better annotation methods and training sets, but academic efforts still seem focused on this benchmark. Has progress reached a point where this may be counterproductive?