Building High-level Features Using Large-Scale Unsupervised Learning

(Annotations from the citing slide, translated from Japanese: "An example of the hierarchical structure of a multi-layer network"; "Units that respond selectively only to specific objects". Citation: http://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/38115.pdf)

and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among the 20 thresholds.

4.3. Recognition

Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy, while the best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%.

To understand their contribution, we removed the local contrast normalization sublayers and trained the network again. Results show that the accuracy of the best neuron drops to 78.5%. This agrees with previous study showing the importance of local contrast normalization (Jarrett et al., 2009).

We visualize histograms of activation values for face images and random images in Figure 2. It can be seen that, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output a value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output a value less than 0.

Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.

4.4. Visualization

In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near-optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus (Berkes & Wiskott, 2005; Erhan et al., 2009; Le et al., 2010). In particular, we find the norm-bounded input x which maximizes the output f of the tested neuron, by solving:

x* = arg max_x f(x; W, H), subject to ||x||_2 = 1.

Here, f(x; W, H) is the output of the tested neuron given learned parameters W, H and input x. In our experiments, this constrained optimization problem is solved by projected gradient descent with line search.

These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. The results, shown in Figure 3, confirm that the tested neuron indeed learns the concept of faces.

Figure 3. Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constraint optimization.

4.5. Invariance properties

We would like to assess the robustness of the face detector against common object transformations, e.g., translation, scaling and out-of-plane rotation. First, we chose a set of 10 face images and performed distortions to them, e.g., scaling and translating. For out-of-plane rotation, we used 10 images of faces rotating in 3D ("out-of-plane") as the test set. To check the robustness of the neuron, we plot its averaged response over the small test set with respect to changes in scale, 3D rotation (Figure 4), and translation (Figure 5).

(Footnote: Scaled, translated faces are generated by standard cubic interpolation. For 3D rotated faces, we used 10 sequences of rotated faces from The Sheffield Face Database – http://www.sheffield.ac.uk/eee/research/iel/research/face. Different sequences record rotated faces of different individuals. The dataset only contains rotated faces up to 90 degrees. See Appendix F for a sample sequence.)

Figure 4. Scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.

Figure 5. Translational invariance properties of the best feature. The x-axis is in pixels.

The results show that the neuron is robust against complex and difficult-to-hard-wire invariances such as out-of-plane rotation and scaling.

Control experiments on dataset without faces: As reported above, the best neuron achieves 81.7% accuracy in classifying faces against random distractors. What if we remove all images that have faces from the training set?

We performed the control experiment by running a face detector in OpenCV and removing those training images that contain at least one face. The recognition accuracy of the best neuron dropped to 72.5%, which is as low as that of the simple linear filters reported in Section 4.3.

5. Cat and human body detectors

Having achieved a face-sensitive neuron, we would like to understand if the network is also able to detect other high-level concepts. We observed that the most common objects in the YouTube dataset are body parts and pets, and hence suspected that the network also learns these concepts.

To verify this hypothesis and quantify selectivity properties of the network with respect to these concepts, we constructed two datasets: one for classifying human bodies against random backgrounds, and one for classifying cat faces against other random distractors.

The cat face images are collected from the dataset described in (Zhang et al., 2008). In this dataset, there are 10,000 positive images and 18,409 negative images (so that the positive-to-negative ratio is similar to the case of faces). The negative images are chosen randomly from the ImageNet dataset.

Negative and positive examples in our human body dataset are subsampled at random from a benchmark dataset (Keller et al., 2009). In the original dataset, each example is a pair of stereo black-and-white images, but for simplicity we keep only the left images. In total, like in the case of human faces, we have 13,026 positive and 23,974 negative examples. For ease of interpretation, these datasets have a positive-to-negative ratio identical to the face dataset.

We then followed the same experimental protocols as before. The results, shown in Figure 6, confirm that the network learns not only the concept of faces but also the concepts of cat faces and human bodies.

Figure 6. Visualization of the cat face neuron (left) and human body neuron (right).

Our high-level detectors also outperform standard baselines in terms of recognition rates, achieving 74.8% and 76.7% on cat faces and human bodies respectively. In comparison, the best linear filters (sampled from the training set) only achieve 67.2% and 68.1% respectively.

In Table 1, we summarize all previous numerical results comparing the best neurons against other baselines such as linear filters and random guesses. To understand the effects of training, we also measure the performance of the best neurons in the same network at random initialization.

During the development process of our algorithm, we also tried several other algorithms, such as deep autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007) and K-means (Coates et al., 2011). In our implementation, the deep autoencoders are also locally connected and use a sigmoidal activation function. For K-means, we downsample images to 40x40 in order to lower computational costs. We also varied the parameters of the autoencoders and K-means and chose them to maximize performances
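The evaluation protocol at the top of this excerpt — sweep 20 equally spaced thresholds between a neuron's minimum and maximum activation values and report the best classification accuracy — can be sketched as follows. The activations and labels below are synthetic stand-ins, not the paper's data:

```python
import numpy as np

def best_threshold_accuracy(activations, labels, n_thresholds=20):
    """Sweep equally spaced thresholds between the min and max
    activation and return the best classification accuracy."""
    thresholds = np.linspace(activations.min(), activations.max(), n_thresholds)
    best = 0.0
    for t in thresholds:
        preds = activations > t          # predict "face" above the threshold
        best = max(best, np.mean(preds == labels))
    return best

# Synthetic stand-in data: positives tend to activate above 0, negatives below.
rng = np.random.default_rng(0)
acts = np.concatenate([rng.normal(1.0, 0.5, 500), rng.normal(-1.0, 0.5, 500)])
labels = np.concatenate([np.ones(500, bool), np.zeros(500, bool)])
print(round(best_threshold_accuracy(acts, labels), 3))
```

Reporting the best accuracy over the sweep avoids committing to a single arbitrary operating point for an unsupervised feature.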
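A minimal sketch of the constrained optimization used for visualization in Section 4.4: maximize the neuron output f(x; W, H) subject to ||x||_2 = 1 by taking gradient steps and projecting back onto the unit sphere. The quadratic stand-in for f and the fixed step size are illustrative assumptions; the paper optimizes the trained neuron's output and uses a line search:

```python
import numpy as np

def projected_gradient_ascent(grad_f, x0, step=0.1, n_iters=200):
    """Maximize f subject to ||x||_2 = 1: take a gradient step,
    then project back onto the unit sphere by renormalizing."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iters):
        x = x + step * grad_f(x)
        x = x / np.linalg.norm(x)       # projection onto ||x||_2 = 1
    return x

# Illustrative stand-in for the neuron: f(x) = x^T A x with A symmetric;
# the constrained maximizer is the eigenvector of A's largest eigenvalue.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M + M.T
x_star = projected_gradient_ascent(lambda x: 2 * A @ x, rng.normal(size=5))

top_eigvec = np.linalg.eigh(A)[1][:, -1]
print(abs(x_star @ top_eigvec))          # close to 1: directions agree up to sign
```

For the toy quadratic the answer is known in closed form, which makes it easy to check that the projection step is doing its job.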
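The best-linear-filter baseline compared against in Sections 4.3 and 5 — sample candidate filters at random from the training set and keep the one whose best decision threshold gives the highest accuracy — can be sketched as below. The dataset is a synthetic stand-in, not the paper's face data:

```python
import numpy as np

def best_linear_filter(train_imgs, test_imgs, test_labels, n_filters=100, seed=0):
    """Sample filters at random from the training set, score test images
    by correlation with each filter, and keep the best filter/threshold."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(train_imgs), size=n_filters)
    best_acc, best_filter = 0.0, None
    for i in idx:
        w = train_imgs[i].ravel()
        w = (w - w.mean()) / (w.std() + 1e-8)          # normalized template
        scores = test_imgs.reshape(len(test_imgs), -1) @ w
        for t in np.linspace(scores.min(), scores.max(), 20):
            acc = np.mean((scores > t) == test_labels)
            if acc > best_acc:
                best_acc, best_filter = acc, w
    return best_acc, best_filter

# Synthetic stand-in: positives share a bright-centre pattern, negatives are noise.
rng = np.random.default_rng(1)
pattern = np.zeros((8, 8)); pattern[2:6, 2:6] = 1.0
pos = pattern + 0.3 * rng.normal(size=(200, 8, 8))
neg = 0.3 * rng.normal(size=(200, 8, 8))
train = np.concatenate([pos[:100], neg[:100]])
test = np.concatenate([pos[100:], neg[100:]])
labels = np.concatenate([np.ones(100, bool), np.zeros(100, bool)])
acc, _ = best_linear_filter(train, test, labels)
```

Because a sampled filter is just a training image used as a template, this baseline measures how far simple template matching gets without any learned features.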
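The invariance protocol of Section 4.5 — distort a small set of images (here by translation, using the same standard cubic interpolation the paper uses) and plot the detector's averaged response at each distortion level — might look like the sketch below. `centre_neuron` is a hypothetical stand-in for the trained face neuron:

```python
import numpy as np
from scipy import ndimage

def response_vs_translation(neuron, images, shifts):
    """Average a detector's response over a small image set at each
    horizontal translation; order=3 gives cubic spline interpolation."""
    curve = []
    for dx in shifts:
        resp = [neuron(ndimage.shift(img, (0, dx), order=3)) for img in images]
        curve.append(np.mean(resp))
    return np.array(curve)

# Hypothetical stand-in "neuron": responds to mass in the image centre.
def centre_neuron(img):
    h, w = img.shape
    return img[h // 4: 3 * h // 4, w // 4: 3 * w // 4].sum()

imgs = [np.pad(np.ones((8, 8)), 8) for _ in range(10)]   # centred blobs in 24x24
curve = response_vs_translation(centre_neuron, imgs, [0, 2, 4, 6, 8])
print(curve[0] > curve[-1])   # response decays as the blob moves off-centre
```

A flat curve over a range of shifts would indicate translational invariance; this toy detector instead decays, which is the kind of behaviour the averaged-response plots are designed to expose.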