Convolutional Neural Networks
Su-A Kim, 12th August 2014 @CVLAB

Table of contents
- Introduction to Convolutional Neural Networks
- Introduction to an application paper: “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”, CVPR 2014

History
In 1995, Yann LeCun and Yoshua Bengio introduced the concept of convolutional neural networks.
Note: in 1989 there were already attempts to solve such problems with neural networks trained by back-propagation, but CNNs came into use from 1995.

Recap of ConvNet
Neural network with a specialized connectivity structure.
Feed-forward:
- Convolve input
- Non-linearity (rectified linear)
- Pooling (local max)
Supervised: train the convolutional filters by back-propagating the classification error.
Figure: Input image → Convolution (learned) → Non-linearity → Pooling → Feature maps
Note: convolving is the same operation as filtering.
Slide: R. Fergus

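The feed-forward stage recapped above (convolve, rectify, pool) can be sketched in plain Python. This is a minimal illustration, not code from the talk: the function names, the toy image, and the hand-picked edge kernel are all assumptions, and in a real ConvNet the kernel values would be learned by back-propagation. Like most ConvNet implementations, the sketch computes cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu_map(fmap):
    """Rectified linear non-linearity, max(0, x), applied element-wise."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Local max over non-overlapping size x size windows."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def conv_stage(image, kernel):
    """One ConvNet stage: convolution -> non-linearity -> pooling."""
    return max_pool(relu_map(conv2d_valid(image, kernel)))


# Toy 5x5 image with a vertical edge between columns 1 and 2 (assumed data).
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Hand-picked kernel that responds to a dark-to-bright step (assumed, not learned).
kernel = [[-1.0, 1.0], [-1.0, 1.0]]
feature_map = conv_stage(image, kernel)
```

Here the 5x5 input gives a 4x4 convolution output and a 2x2 pooled feature map, with the non-zero responses in the column that covers the edge.
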
Connectivity & weight sharing depend on the layer
Figure labels: All different weights | Shared weights
- Weights drawn in the same color are shared.
- Why share weights? Features can be extracted regardless of where they appear in the input (e.g. under translation).
- A convolution layer has far fewer parameters thanks to local connections and weight sharing.

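To make the parameter saving concrete, the following sketch counts the weights needed for one output feature map under the three connectivity patterns. The sizes (a 32x32 input, a 5x5 window, a 28x28 output map) are illustrative assumptions, and biases are ignored.

```python
def fully_connected_weights(n_inputs, n_outputs):
    """Every output unit sees every input unit, all weights different."""
    return n_inputs * n_outputs


def locally_connected_weights(n_outputs, window_size):
    """Every output unit sees only a local window, but owns its weights."""
    return n_outputs * window_size


def convolutional_weights(window_size):
    """Local window, and all output units share the same weights."""
    return window_size


n_inputs = 32 * 32    # assumed input size
n_outputs = 28 * 28   # assumed output map size (5x5 window, no padding)
window = 5 * 5        # assumed window size

counts = {
    "fully connected": fully_connected_weights(n_inputs, n_outputs),
    "locally connected": locally_connected_weights(n_outputs, window),
    "convolutional": convolutional_weights(window),
}
```

Under these assumptions the counts are 802,816, 19,600 and 25 weights respectively.
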
Convolution layer
Detect the same feature at different positions in the input image.
Figure: Input → Filter (kernel) → Feature map
Slide: R. Fergus

Non-linearity
- Tanh (hyperbolic tangent)
- Sigmoid: 1/(1+exp(-x))
- Rectified linear (ReLU): max(0,x)
  - Simplifies backprop
  - Makes learning faster
  - Makes features sparse
  → Preferred option
Slide: R. Fergus

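The three non-linearities listed above, with the derivatives used during backprop, can be written as a short sketch. The helper names and the sample inputs are assumptions for illustration; the formulas are the standard ones.

```python
import math


def sigmoid(x):
    """1 / (1 + exp(-x)); fine for moderate x, may overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative is 0 or 1, which is why ReLU simplifies backprop."""
    return 1.0 if x > 0 else 0.0


# Assumed sample pre-activations, used only to illustrate sparsity.
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
relu_outputs = [relu(x) for x in inputs]
zero_fraction = relu_outputs.count(0.0) / len(relu_outputs)
```

For these sample inputs three of the five ReLU outputs are exactly zero, whereas sigmoid and tanh outputs are almost never exactly zero; that is the sense in which ReLU features are sparse.
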
Sub-sampling layer
Spatial pooling
- Average or max
- Boureau et al., ICML’10 for theoretical analysis → a study indicating that max works better
Role of pooling
- Invariance to small transformations
- Reduces the effect of noise, shift, and distortion
Figure: Max, Sum
Slide: R. Fergus

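A small sketch of spatial pooling and of the invariance it provides follows. The function names and the toy feature maps are assumptions; the point is only that an activation which moves within one pooling window leaves the pooled output unchanged, while a larger shift does not.

```python
def pool2d(fmap, size=2, mode="max"):
    """Pool non-overlapping size x size windows with max or average."""
    if mode == "max":
        reduce_window = max
    elif mode == "average":
        def reduce_window(values):
            return sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    pooled = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


def single_activation(rows, cols, r, c):
    """Toy feature map that is zero everywhere except at (r, c)."""
    return [[1.0 if (i, j) == (r, c) else 0.0 for j in range(cols)]
            for i in range(rows)]


original = single_activation(4, 4, 0, 0)
small_shift = single_activation(4, 4, 1, 1)   # stays inside the same 2x2 window
large_shift = single_activation(4, 4, 0, 2)   # moves to the next window

same_after_small_shift = pool2d(original) == pool2d(small_shift)
same_after_large_shift = pool2d(original) == pool2d(large_shift)
```
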
Normalization
Contrast normalization (between/across feature maps)
- Equalizes the feature maps → captures coarse features rather than fine detail
Figure: feature maps, and feature maps after contrast normalization
Slide: R. Fergus

LeNet 5
- C1, C3, C5: convolutional layers (5 × 5 convolution kernels)
- S2, S4: subsampling layers (by a factor of 2)
- F6: fully connected layer
About 187,000 connections. About 14,000 trainable weights.

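The layer list above fixes how the feature-map size shrinks through the network: a 5 × 5 convolution without padding removes 4 pixels per side length, and each subsampling layer halves it. The sketch below traces this, assuming the 32 × 32 input of the original LeNet-5 paper (the input size is not stated on the slide).

```python
def conv_output_size(size, kernel=5):
    """Side length after a kernel x kernel convolution, no padding, stride 1."""
    return size - kernel + 1


def subsample_output_size(size, factor=2):
    """Side length after subsampling by `factor`."""
    return size // factor


def lenet5_map_sizes(input_size=32):
    """Side length of the feature maps after each layer C1..C5."""
    sizes = [("input", input_size)]
    size = input_size
    for name in ("C1", "S2", "C3", "S4", "C5"):
        if name.startswith("C"):
            size = conv_output_size(size)
        else:
            size = subsample_output_size(size)
        sizes.append((name, size))
    return sizes


sizes = lenet5_map_sizes()
```

With a 32 × 32 input the side lengths are 28 (C1), 14 (S2), 10 (C3), 5 (S4) and 1 (C5), so C5 already produces a plain feature vector that F6 can consume.
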
LeNet 5
Also robust to noise.

About CNNs
- A special kind of multi-layer neural network.
- Implicitly extract relevant features.
- A feed-forward network that can extract topological properties from an image.
- Like almost every other neural network, CNNs are trained with a version of the back-propagation algorithm.

DeepFace: Closing the Gap to Human-Level Performance in Face Verification
Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf
Facebook AI Research, Tel Aviv University
- A face representation that generalizes to other datasets (a learning method developed on a very large face dataset)
- Reaches an accuracy of 97.35% (human performance: 97.5%)

Architecture
Face Alignment → Representation (CNN)

Face Alignment
(1) 2D alignment
- Detect the face region first, then extract 6 fiducial points (eyes, nose, mouth) inside it.
- The fiducial points are refined over several iterations.
- In each iteration the points are extracted by a pre-trained SVR (Support Vector Regressor) that predicts the point configuration from an image descriptor (an LBP histogram).
- The induced similarity matrix T warps the current image into a new image, and the fiducial points are extracted again on the new image. Repeating this converges to accurate fiducial point locations.
- The result of this process is the 2D-aligned crop image (b).
(2) 3D alignment
Figure: 67 landmarks → Landmark mapping → 2D-3D alignment → Frontalization → 2D projection

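The warp used in each 2D alignment iteration is a similarity transform: a scale, a rotation and a translation. The sketch below only shows how such a transform moves a set of points. The function name, the parameter names and all numeric values are assumptions for illustration; how DeepFace estimates T from the detected fiducial points is not covered here.

```python
import math


def apply_similarity(points, scale, angle, tx, ty):
    """Map each (x, y) to scale * R(angle) * (x, y) + (tx, ty)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        (scale * (cos_a * x - sin_a * y) + tx,
         scale * (sin_a * x + cos_a * y) + ty)
        for x, y in points
    ]


# Made-up fiducial points (not real detections).
fiducials = [(1.0, 0.0), (0.0, 1.0)]
# Made-up transform: scale by 2, rotate by 90 degrees, shift by (1, 1).
warped = apply_similarity(fiducials, scale=2.0, angle=math.pi / 2, tx=1.0, ty=1.0)
```

With these made-up values the point (1, 0) lands at approximately (1, 3) and (0, 1) at approximately (-1, 1).
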
Representation: C1-M2-C3
Input: 152x152
- Extract low-level features (simple edges and texture)
- Max-pooling is applied only to the first convolution layer. Why?
Notes:
- The purpose of these three layers is to extract low-level features such as edges and texture.
- The max-pooling layer (M2) makes the output of the convolutional network more robust; more specifically, applied to aligned face images it makes the network more robust to small registration errors.
- Max-pooling is applied only to the output of the first convolution layer because pooling several times can lose information about the precise position of detailed facial structure and micro-textures.
- These three layers account for a large part of the computation, yet they hold few parameters.
- They merely expand the input into a set of simple local features.

Representation: L4-L5-L6 (locally connected)
152x152 input → C1-M2-C3 (shared weights) → L4-L5-L6 (locally connected, all different weights)
- C1-M2-C3: low-level feature extraction (simple edges and texture); max-pooling only after the first convolution layer
- Why use locally connected layers? Each region has different local statistics.
Notes:
- Like a convolutional layer, a locally connected layer applies a filter bank, but every location in the feature map learns a different set of filters (in a convolutional layer, every location in a feature map shares the same filter).
- In an aligned image, different regions have different local statistics, so the spatial stationarity assumption of convolution does not hold. For example, the area between the eyes and the eyebrows looks very different from the area between the nose and the mouth, and is far more discriminative.
- Using local layers does not affect the cost of feature extraction, but it does affect the number of parameters that must be trained.

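The difference between a convolutional layer and a locally connected layer can be shown in one dimension. This is a sketch under assumed names and made-up numbers, not the DeepFace implementation: the convolution reuses one kernel at every position, while the locally connected layer holds one kernel per output position.

```python
def conv1d(signal, kernel):
    """Shared weights: the same kernel is applied at every position."""
    k = len(kernel)
    return [sum(signal[p + i] * kernel[i] for i in range(k))
            for p in range(len(signal) - k + 1)]


def locally_connected1d(signal, kernels):
    """Unshared weights: kernels[p] is used only at output position p."""
    k = len(kernels[0])
    n_out = len(signal) - k + 1
    if len(kernels) != n_out:
        raise ValueError("need exactly one kernel per output position")
    return [sum(signal[p + i] * kernels[p][i] for i in range(k))
            for p in range(n_out)]


signal = [1.0, 2.0, 3.0, 4.0]                        # made-up input
shared_kernel = [1.0, -1.0]                          # 2 parameters in total
local_kernels = [[1.0, -1.0], [0.0, 1.0], [2.0, 0.0]]  # 6 parameters in total

conv_out = conv1d(signal, shared_kernel)
local_out = locally_connected1d(signal, local_kernels)
```

Both layers perform the same number of multiplications per output, but the locally connected layer stores three times as many weights in this toy case. Passing the same kernel for every position makes the two layers coincide.
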
Representation: F7-F8 (fully connected)
152x152 input → C1-M2-C3 → L4-L5-L6 (locally connected) → F7-F8 (fully connected)
- These layers can capture correlations between features extracted from distant parts of the face image (for example, the position and shape of the eyes and the position and shape of the mouth).
- Output of F7: used as the raw face representation feature vector.
- Output of F8: fed to a K-way softmax that produces a probability distribution over the class labels.

Training
- The goal is to maximize the probability of the correct class.
- The loss is minimized over the parameters: its gradient is computed by back-propagation and the parameters are updated with stochastic gradient descent (SGD).
- Neurons use ReLU (Rectified Linear Unit) as the activation function instead of tanh or sigmoid, so the features produced by this network are very sparse.

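The training objective above can be sketched for a single fully connected softmax layer. Maximizing the probability of the correct class is equivalent to minimizing the cross-entropy loss -log p(correct class). The function names, the zero initial weights, the toy feature vector and the learning rate are assumptions for illustration; a real network would back-propagate this gradient through all earlier layers as well.

```python
import math


def softmax(logits):
    """K-way softmax, shifted by the max logit for numerical stability."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(weights, features):
    """One logit per class: dot product of that class's weights with the features."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cross_entropy(weights, features, label):
    """Loss for one sample: -log of the probability of the correct class."""
    return -math.log(softmax(class_logits(weights, features))[label])


def sgd_step(weights, features, label, learning_rate=0.1):
    """One SGD update; d(loss)/d(logit_k) = p_k - [k == label]."""
    probs = softmax(class_logits(weights, features))
    updated = []
    for k, row in enumerate(weights):
        error = probs[k] - (1.0 if k == label else 0.0)
        updated.append([w - learning_rate * error * f
                        for w, f in zip(row, features)])
    return updated


features = [1.0, 2.0]                        # made-up feature vector
weights = [[0.0, 0.0] for _ in range(3)]     # 3 classes, assumed zero init
label = 2                                    # assumed correct class

loss_before = cross_entropy(weights, features, label)
weights = sgd_step(weights, features, label)
loss_after = cross_entropy(weights, features, label)
```

With zero weights the three classes are equally likely, so the initial loss is log 3; after one update the probability of the correct class rises and the loss falls.
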
Result
- LFW: Labeled Faces in the Wild database (the de facto standard; "de facto" means "in practice")
- YTF: YouTube Faces
- Reduces the error of the previous best methods by more than 50%
- Note: YouTube Faces contains about 100 mislabeled items; taking those into account, the accuracy is about 92.5%.

Reference
[1] Bouchain, David. "Character recognition using convolutional neural networks." Institute for Neural Information Processing 2007 (2006).
[2] Bouvrie, Jake. "Notes on convolutional neural networks." (2006).
[3] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier networks." Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP Vol. 15. 2011.
[4] Ahonen, Timo, Abdenour Hadid, and Matti Pietikainen. "Face description with local binary patterns: Application to face recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 28.12 (2006): 2037-2041.
[5] Bengio, Yoshua. "Learning deep architectures for AI." Foundations and Trends® in Machine Learning 2.1 (2009): 1-127.