Deep Convolutional Nets Jiaxin Shi Tsinghua University 11th March 2015
A Brief Introduction to CNN: The replicated feature approach
Use many different copies of the same feature detector with different positions.
– Replication greatly reduces the number of free parameters to be learned.
Use several different feature types, each with its own map of replicated detectors.
– Allows each patch of image to be represented in several ways.
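To make the parameter saving concrete, here is a rough count. It is only a sketch: the 32x32 single-channel image, the 5x5 detectors, the 4 feature types and the stride of 1 without padding are illustrative assumptions, not values from the slides.

```python
def detector_positions(image_size, kernel_size):
    """Positions one detector can occupy along one axis (stride 1, no padding)."""
    return image_size - kernel_size + 1


def free_parameters(image_size, kernel_size, feature_types, replicated):
    """Count weights plus one bias per detector.

    Replicated detectors share one set of weights across all positions;
    unreplicated detectors need a separate set at every position.
    """
    per_detector = kernel_size * kernel_size + 1
    if replicated:
        return feature_types * per_detector
    positions = detector_positions(image_size, kernel_size) ** 2
    return feature_types * positions * per_detector


# Hypothetical sizes, chosen only for illustration.
shared = free_parameters(32, 5, feature_types=4, replicated=True)
unshared = free_parameters(32, 5, feature_types=4, replicated=False)
print(shared, unshared)
```

Under these assumptions the replicated version needs 104 parameters, while giving every position its own detector needs 81536.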
A Brief Introduction to CNN: What does replicating the feature detectors achieve?
Equivariant activities: Replicated features do not make the neural activities invariant to translation. The activities are equivariant: when the input shifts, the activities shift with it.
Invariant knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
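The difference between equivariance and invariance can be checked on a toy 1D signal. This is a minimal sketch; the detector weights and the signal are made-up values.

```python
def correlate1d(signal, kernel):
    """Apply one replicated detector at every valid position (stride 1, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [1, -1, 2]                    # hypothetical detector weights
signal = [0, 0, 3, 1, 4, 0, 0, 0, 0]   # a pattern surrounded by zeros
shifted = [0, 0] + signal[:-2]         # the same pattern moved 2 steps to the right

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

# Equivariance: the activities move by the same 2 steps as the input.
assert out_shifted[2:] == out[:-2]
# Not invariance: the activities themselves have changed.
assert out_shifted != out
```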
A Brief Introduction to CNN: Pooling the outputs of replicated feature detectors
Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.
– This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.
– Taking the maximum of the four works slightly better.
Problem: After several levels of pooling, we have lost information about the precise positions of things.
– This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
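A small sketch of pooling four neighboring detectors, with both averaging and taking the maximum. The feature map values are invented for illustration; the point is that moving an active detector within its 2x2 block leaves the pooled output unchanged, which is exactly the position information that gets lost.

```python
def pool2x2(feature_map, reduce_fn):
    """Pool non-overlapping 2x2 blocks (four neighboring detectors) into one output each."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[reduce_fn([feature_map[r][c], feature_map[r][c + 1],
                        feature_map[r + 1][c], feature_map[r + 1][c + 1]])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


def average(values):
    return sum(values) / len(values)


fmap = [[0, 9, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 5],
        [0, 0, 0, 0]]
# Each active detector moved by one position inside its own 2x2 block.
fmap_moved = [[0, 0, 0, 0],
              [9, 0, 0, 0],
              [0, 0, 5, 0],
              [0, 0, 0, 0]]

max_pooled = pool2x2(fmap, max)
avg_pooled = pool2x2(fmap, average)

# The small shift is invisible after pooling.
assert pool2x2(fmap_moved, max) == max_pooled
assert pool2x2(fmap_moved, average) == avg_pooled
```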
A Brief Introduction to CNN: Terminology
Kernel: 5x5
Stride: 2
Padding: 1
Convolution Layer (5x5, 2, 1, 4): (kernel size, stride, padding, number of kernels)
[Figure: an image convolved by 4 kernels into 4 feature maps (feature map 0 to feature map 3).]
A Brief Introduction to CNN: Terminology
Pooling Layer (4x4, 4, 0): (pooling size, stride, padding)
[Figure: the pooling layer applied to each of the 4 feature maps (feature map 0 to feature map 3).]
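The layer tuples above determine the spatial size of each feature map through the standard formula floor((input + 2 * padding - kernel) / stride) + 1. The sketch below applies it to Convolution Layer (5x5, 2, 1, 4) followed by Pooling Layer (4x4, 4, 0); the 33x33 input size is an assumption, since the slides do not state the image size.

```python
def output_size(input_size, kernel_size, stride, padding):
    """Output positions along one axis: floor((input + 2*padding - kernel) / stride) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


image_size = 33  # hypothetical; not given in the slides

# Convolution Layer (5x5, 2, 1, 4): kernel 5, stride 2, padding 1, 4 kernels.
conv_out = output_size(image_size, kernel_size=5, stride=2, padding=1)

# Pooling Layer (4x4, 4, 0): pooling size 4, stride 4, padding 0.
pool_out = output_size(conv_out, kernel_size=4, stride=4, padding=0)

print(conv_out, pool_out)
```

With this input the convolution yields 4 feature maps of 16x16, and pooling reduces each to 4x4.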
A Brief Introduction to CNN: An example, a ‘VW’ detector
Input image: 1 channel.
Layer 1 filters (detectors): 3. Layer 1 output: 3 channels.
Layer 2 filters (detectors): 2x3 (a ‘V’ detector and a ‘W’ detector). Output: 2 channels.
[Figure: an input image showing ‘V’, then ‘W’, passed through the layer 1 detectors for the strokes \, / and ^ and then through the layer 2 ‘V’ and ‘W’ detectors.]
History
1979, Neocognitron (Fukushima): the first convolutional net. Fukushima, however, did not set the weights by supervised backpropagation, but by local unsupervised learning rules.
1989, LeNet (LeCun): backpropagation (BP) for convolutional NNs. LeCun re-invented the CNN with BP.
1992, Cresceptron (Weng et al., 1992): max pooling, later integrated with CNNs (MPCNN).
2006, CNN trained on GPU (Chellapilla et al., 2006).
2011, Multi-Column GPU-MPCNNs (Ciresan et al., 2011): the first system to achieve superhuman visual pattern recognition, in the IJCNN 2011 traffic sign recognition contest.
2012, ImageNet breakthrough (Krizhevsky et al., 2012): AlexNet, trained on GPUs, won the ImageNet competition.
Outline
Recent Progress of Supervised Convolutional Nets
– AlexNet
– GoogLeNet
– VGGNet
Small Break: Microsoft’s Tricks
Representation Learning and Bayesian Approach
– Deconvolutional Networks
– Bayesian Deep Deconvolutional Networks
Recent Progress of Supervised Convolutional Nets: AlexNet, 2012
The architecture that made the 2012 ImageNet breakthrough.
NIPS 2012, ImageNet Classification with Deep Convolutional Neural Networks.
A general practical guide to training deep supervised convnets.
Main techniques
– ReLU nonlinearity
– Data augmentation
– Dropout
– Overlapping pooling
– Mini-batch SGD with momentum and weight decay
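Recent Progress of Supervised Convolutional Nets: AlexNet, 2012, Dropout
Reduces overfitting.
Model averaging: a brief proof, where $x_i$ is the input $x$ under the $i$th of the $2^N$ dropout masks.
$$
\begin{aligned}
P_{avg}(y=k \mid x) &= \left( \prod_{i=1}^{2^N} P_i(y=k \mid x) \right)^{\frac{1}{2^N}}
= \left( \prod_{i=1}^{2^N} \frac{e^{w_k x_i + b_k}}{\sum_{k'=1}^{K} e^{w_{k'} x_i + b_{k'}}} \right)^{\frac{1}{2^N}} \\
&\propto \left( \prod_{i=1}^{2^N} e^{w_k x_i + b_k} \right)^{\frac{1}{2^N}}
= e^{\frac{1}{2^N} \sum_{i=1}^{2^N} (w_k x_i + b_k)}
= e^{w_k \frac{1}{2^N} \sum_{i=1}^{2^N} x_i + b_k}
= e^{\frac{1}{2} w_k x + b_k}
\end{aligned}
$$
Prediction: $y = \arg\max_{k'} (out_{k'})$.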
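The model-averaging argument can be checked numerically for a single softmax layer with dropout on its inputs. The sketch below enumerates all 2^N masks, takes the renormalised geometric mean of their predictions, and compares it with one pass using halved weights. The weights, biases and input are hypothetical toy values, and the check covers only this one-layer case.

```python
import itertools
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x, scale=1.0):
    """scale * (w_k . x) + b_k for every class k."""
    return [scale * sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


# Hypothetical toy values: N = 3 inputs, K = 2 classes.
weights = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
biases = [0.1, -0.2]
x = [1.0, 2.0, -1.0]

# Geometric mean of the predictions of all 2^N dropout masks (computed in log space).
masks = list(itertools.product([0, 1], repeat=len(x)))
log_mean = [0.0] * len(biases)
for mask in masks:
    dropped = [m * xi for m, xi in zip(mask, x)]
    probs = softmax(class_scores(weights, biases, dropped))
    for k, p in enumerate(probs):
        log_mean[k] += math.log(p) / len(masks)
geometric = softmax(log_mean)  # renormalise over the classes

# A single pass with the weights halved and the biases unchanged.
halved = softmax(class_scores(weights, biases, x, scale=0.5))

assert all(abs(g - h) < 1e-9 for g, h in zip(geometric, halved))
```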
Recent Progress of Supervised Convolutional Nets: AlexNet, 2012, Dropout
Encourages sparsity.
Recent Progress of Supervised Convolutional Nets: GoogLeNet, 2014
The 2014 ImageNet competition winner.
CNN can go further if carefully tuned.
Main techniques
– Carefully designed Inception architecture
– Network in Network
– Deeply Supervised Nets: associate a “companion” classification output with each hidden layer.
Recent Progress of Supervised Convolutional Nets: VGGNet, 2014
A simple architecture that is consistently state of the art, in contrast to GoogLeNet-like structures, which are very hard to tune.
Developed at Oxford by people who later joined DeepMind.
Based on Zeiler & Fergus’s 2013 work.
Most widely used now.
Small filter (3x3) and small stride (1).
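One commonly cited reason for preferring small 3x3 filters with stride 1 is that a stack of them covers the same receptive field as one large filter with fewer weights. The slide does not spell this out, so the comparison below is an illustrative sketch; the channel count of 64 is an assumption and biases are ignored.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    return num_layers * (kernel_size - 1) + 1


def weight_count(num_layers, kernel_size, channels):
    """Weights (no biases) when every layer maps `channels` inputs to `channels` outputs."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # hypothetical channel count
stacked = weight_count(num_layers=3, kernel_size=3, channels=channels)
single = weight_count(num_layers=1, kernel_size=7, channels=channels)

# Three 3x3 layers see the same 7x7 window as one 7x7 layer.
assert receptive_field(3, 3) == receptive_field(1, 7) == 7
print(stacked, single)
```

Under these assumptions the stack uses 110592 weights against 200704 for the single 7x7 layer, and it also applies three nonlinearities instead of one.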
Representation Learning and Bayesian Approach: Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
Deep layered model for representation learning.
Optimization perspective.
Results are better than previous representation learning methods, but still fall short of supervised CNN models.
Representation Learning and Bayesian Approach: Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
$X^{(n,c)} = \sum_{k=1}^{K} D^{(k,c)} * W^{(n,k)}$
– $K$: number of filters (dictionaries).
– $X^{(n,c)}$: channel $c$ of the $n$th image.
– $D^{(k,c)}$: channel $c$ of the $k$th filter (dictionary).
– $W^{(n,k)}$: sparse; indicates the position and pixel-wise strength of $D^{(k,c)}$.
Cost function of the first layer
$C_1(X^{(n)}) = \frac{\lambda}{2} \sum_{c=1}^{K_0} \left\| \sum_{k=1}^{K_1} W^{(n,k)} * D^{(k,c)} - X^{(n,c)} \right\|_2^2 + \sum_{k=1}^{K_1} \left| W^{(n,k)} \right|^p$
– $K_0$: number of channels.
– $K_1$: number of filters (dictionaries).
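The first-layer cost can be evaluated directly for tiny inputs. The sketch below is not the authors' implementation: it assumes a 'valid' 2D convolution with feature maps slightly larger than the image, uses nested lists instead of an array library, and all values (image, filter, feature map, lambda) are made up.

```python
def convolve_valid(feature_map, kernel):
    """2D 'valid' convolution (kernel flipped) of a feature map with a small filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [[sum(feature_map[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def first_layer_cost(image, filters, feature_maps, lam, p=1):
    """C_1 for one image.

    image[c] is channel c, filters[k][c] is channel c of filter k,
    feature_maps[k] is the sparse map for filter k.
    """
    data_term = 0.0
    for c, channel in enumerate(image):
        parts = [convolve_valid(feature_maps[k], filters[k][c])
                 for k in range(len(feature_maps))]
        for r, row in enumerate(channel):
            for col, pixel in enumerate(row):
                recon = sum(part[r][col] for part in parts)
                data_term += (recon - pixel) ** 2
    sparsity = sum(abs(v) ** p for fmap in feature_maps for row in fmap for v in row)
    return lam / 2 * data_term + sparsity


# Hypothetical toy problem: K0 = 1 channel, K1 = 1 filter.
image = [[[2.0, 0.0], [0.0, 1.0]]]
filters = [[[[1.0, 0.0], [0.0, 0.0]]]]
feature_maps = [[[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]]

cost = first_layer_cost(image, filters, feature_maps, lam=4.0)
print(cost)
```

Here the single active feature reconstructs the pixel of value 2 exactly and misses the pixel of value 1, so the data term is 1 and the sparsity term is 2.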
Representation Learning and Bayesian Approach: Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
Stack layer
$C_l(W_{l-1}^{(n)}) = \frac{\lambda}{2} \sum_{c=1}^{K_{l-1}} \left\| \sum_{k=1}^{K_l} W_l^{(n,k)} * D^{(k,c)} - W_{l-1}^{(n,c)} \right\|_2^2 + \sum_{k=1}^{K_l} \left| W_l^{(n,k)} \right|^p$
– $K_l$: layer $l$’s number of channels.
Learning process
Optimize $C_l$ layer by layer.
– Optimize over feature maps $W_l^{(n,k)}$. When $p=1$, convex. But poorly conditioned due to being coupled to one another by filters. (Why?)
– Optimize over filters (dictionaries) $D^{(k,c)}$. Using gradient descent.
Representation Learning and Bayesian Approach: Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
Learning process
Optimize $C_l$ layer by layer.
– Optimize over feature maps $W_l^{(n,k)}$. When $p=1$, convex. But poorly conditioned due to being coupled to one another by filters.
Solution: add auxiliary variables $x_l^{(n,k)}$ that carry the sparsity penalty and are tied to $W_l^{(n,k)}$ by a quadratic term.
$C_l(W_{l-1}^{(n)}) = \frac{\lambda}{2} \sum_{c=1}^{K_{l-1}} \left\| \sum_{k=1}^{K_l} W_l^{(n,k)} * D^{(k,c)} - W_{l-1}^{(n,c)} \right\|_2^2 + \sum_{k=1}^{K_l} \left| x_l^{(n,k)} \right|^p + \sum_{k=1}^{K_l} \left\| x_l^{(n,k)} - W_l^{(n,k)} \right\|_2^2$
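With the auxiliary variables, the update of each $x$ entry decouples from the filters. As a sketch, assume $p=1$ and a unit weight on the quadratic coupling term exactly as written above (the original paper may weight this term differently). Then each entry minimises $|x| + (x - w)^2$, whose closed form is soft-thresholding with threshold 1/2. The function names below are hypothetical.

```python
def shrink(w, threshold=0.5):
    """Minimiser of |x| + (x - w)**2 over x: soft-thresholding with threshold 1/2."""
    magnitude = max(abs(w) - threshold, 0.0)
    return magnitude if w >= 0 else -magnitude


def objective(x, w):
    return abs(x) + (x - w) ** 2


# Brute-force check of the closed form on a fine grid.
w = 1.3
grid = [i / 1000 for i in range(-3000, 3001)]
best = min(grid, key=lambda x: objective(x, w))

assert abs(shrink(w) - best) < 1e-3
```

Small entries (here, those with magnitude at most 1/2) are set exactly to zero, which is how the sparsity of the feature maps arises in this subproblem.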
Representation Learning and Bayesian Approach: Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
Performance: slightly outperforms SIFT-based approaches and CDBN.
Representation Learning and Bayesian Approach: Bayesian Deep Deconvolutional Learning, Yunchen, 2015
Deep layered model for representation learning.
Bayesian perspective.
Claims state-of-the-art classification performance using the learned representation.
Representation Learning and Bayesian Approach: Bayesian Deep Deconvolutional Learning, Yunchen, 2015
$X^{(n)} = \sum_{k=1}^{K} D^{(k)} * \left( Z^{(n,k)} \odot W^{(n,k)} \right) + E^{(n)}$
– $X^{(n)}$: the $n$th image.
– $Z^{(n,k)}$: indicates which shifted version of $D^{(k)} \odot W^{(n,k)}$ is used to represent $X^{(n)}$.
– $W^{(n,k)}$: indicates the pixel-wise strength of $D^{(k)}$.
Compared to the Deconvolutional Networks paper
– $Z^{(n,k)} \odot W^{(n,k)}$ here is actually an explicit version of the sparse $W^{(n,k)}$ in the 2010 paper.
Representation Learning and Bayesian Approach: Bayesian Deep Deconvolutional Learning, Yunchen, 2015
Priors
$z_{i,j}^{(n,k)} \sim \mathrm{Bernoulli}(\pi_{i,j}^{(n,k)})$, $\pi_{i,j}^{(n,k)} \sim \mathrm{Beta}(a_0, b_0)$
$w_{i,j}^{(n,k)} \sim \mathcal{N}(0, \gamma_w^{-1})$, $D^{(k)} \sim \mathcal{N}(0, \gamma_d^{-1} I)$, $E^{(n)} \sim \mathcal{N}(0, \gamma_e^{-1} I)$
$\gamma_w \sim \mathrm{Ga}(a_w, b_w)$, $\gamma_d \sim \mathrm{Ga}(a_d, b_d)$, $\gamma_e \sim \mathrm{Ga}(a_e, b_e)$
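To get a feel for how these priors produce a sparse feature map, the sketch below draws one map $Z \odot W$ from them with Python's standard `random` module. It only samples from the priors and does no inference. The hyperparameter values and the map size are hypothetical, and Ga(a, b) is read as shape and rate, which is an assumption about the slide's notation.

```python
import random


def sample_feature_map(height, width, a0, b0, gamma_w, rng):
    """Draw Z (Bernoulli with Beta-distributed rates) and W (Gaussian); return Z * W elementwise."""
    result = []
    for _ in range(height):
        row = []
        for _ in range(width):
            pi = rng.betavariate(a0, b0)          # pi ~ Beta(a0, b0)
            z = 1 if rng.random() < pi else 0     # z ~ Bernoulli(pi)
            w = rng.gauss(0.0, gamma_w ** -0.5)   # w ~ N(0, 1/gamma_w); gauss takes a std dev
            row.append(z * w)
        result.append(row)
    return result


rng = random.Random(0)

# Hypothetical hyperparameters, chosen only so that most entries are switched off.
a_w, b_w = 1.0, 1.0
gamma_w = rng.gammavariate(a_w, 1.0 / b_w)  # gammavariate takes a scale, so rate b becomes 1/b

sparse_map = sample_feature_map(6, 6, a0=1.0, b0=20.0, gamma_w=gamma_w, rng=rng)
active = sum(1 for row in sparse_map for v in row if v != 0.0)
print(active, "of 36 entries are non-zero")
```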
Representation Learning and Bayesian Approach: Bayesian Deep Deconvolutional Learning, Yunchen, 2015
Pooling
$S^{(n,k_l,l)} = Z^{(n,k_l,l)} \odot W^{(n,k_l,l)}$
Within each block of $S^{(n,k_l,l)}$, either all $n_x n_y$ pixels are zero, or only one pixel is non-zero, with the position of that pixel selected stochastically via a multinomial distribution.
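The block constraint described for pooling can be sketched as follows. This is an illustration, not the paper's sampler: the probability that a block is active and the uniform choice of position are assumptions (in the model the multinomial probabilities are inferred), and the function names are hypothetical.

```python
import random


def sample_block(block_h, block_w, p_active, rng):
    """One pooling block: all zero, or exactly one non-zero pixel at a random position."""
    block = [[0.0] * block_w for _ in range(block_h)]
    if rng.random() < p_active:
        # Uniform draw over the n_x * n_y positions (a multinomial with equal probabilities).
        position = rng.randrange(block_h * block_w)
        block[position // block_w][position % block_w] = rng.gauss(0.0, 1.0)
    return block


def pool_block(block):
    """With at most one non-zero pixel, pooling simply forwards that pixel (or zero)."""
    nonzero = [v for row in block for v in row if v != 0.0]
    return nonzero[0] if nonzero else 0.0


rng = random.Random(1)
blocks = [sample_block(2, 2, p_active=0.5, rng=rng) for _ in range(20)]
pooled = [pool_block(b) for b in blocks]

# Every block satisfies the constraint: at most one non-zero pixel.
assert all(sum(1 for row in b for v in row if v != 0.0) <= 1 for b in blocks)
```

Because at most one pixel per block is non-zero, the pooled value loses nothing except the position of that pixel inside the block.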
Representation Learning and Bayesian Approach: Learning Process
Bottom to top: Gibbs sampling, with MAP samples selected.
Top to bottom: refinement.
Representation Learning and Bayesian Approach: Bayesian Deep Deconvolutional Learning, Yunchen, 2015
Intuition of Deconvolutional Networks (generative)
– An image is made up of patches.
– These patches are weighted transformations of dictionary elements.
– We learn dictionaries from training data.
– A new image is then represented by the positions and weights of dictionary elements.
Intuition of Convolutional Networks
– We can learn feature detectors for various kinds of patches.
– Then we use these feature detectors to scan a new image, and classify it based on the features (kinds of patches) detected.
Both are translation equivariant.
Representation Learning and Bayesian Approach: Performance
Discussion
Deep supervised CNNs still have limits. Where does further improvement lie?
Why does Bayesian learning of deconvolutional representations work much better than learning them from an optimization perspective?
Thank you.