Understanding Convolutional Neural Networks for Object Recognition


Understanding Convolutional Neural Networks for Object Recognition Domen Tabernik University of Ljubljana Faculty of Computer and Information Science Visual Cognitive Systems Laboratory

Visual object recognition. How do we capture a representation of objects in a computational/mathematical model? It is impossible to explicitly model each one: there are too many objects and too many variations, so we let the machine learn the model from samples, taking inspiration from the biological (human/primate) visual system. The key element is a hierarchy: raw pixels (x, y, RGB/gray) → ? → objects (a category, plus position or segmentation). A hierarchy brings biological plausibility, part sharing between objects/categories (an efficient representation), and objects/parts expressed as compositions of other parts (a compositional interpretation). Kruger et al., Deep Hierarchies in the Primate Visual Cortex: What Can We Learn For Computer Vision?, PAMI 2012

Deep learning – a sigmoid neuron. The basic element is a sigmoid neuron (an improved perceptron from the 60s). Mathematical form: a weighted linear combination of the inputs plus a bias, passed through the sigmoid function. Why sigmoid? It is equivalent to a smooth threshold function: smoothness gives nice mathematical properties (derivatives), and the threshold adds non-linearity when neurons are stacked, which captures more complex representations. Example: from the image pixels the neuron outputs a probability of a car, y = 0.78.
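The formulas on this slide are images and did not survive the transcript; a standard reconstruction of the sigmoid neuron described above, with inputs x_j, weights w_j and bias b (notation assumed), is:

```latex
z = \sum_j w_j x_j + b, \qquad
y = \sigma(z) = \frac{1}{1 + e^{-z}}
```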

Deep learning – a sigmoid neuron. Learning process: the known values are x and y (the learning inputs), the unknown values are w and b (the learned parameters). Which w, b will, for ALL learning images x_n, produce the correct output y_n? Basically, the cost is the average difference between the neuron's outputs and the actual correct outputs.
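The cost the slide describes (the average difference between the neuron's outputs and the correct outputs) is usually written as a quadratic cost; a standard form over N training pairs (x_n, y_n), with a(x_n) the neuron's output (the exact cost used on the slide may differ), is:

```latex
C(w, b) = \frac{1}{2N} \sum_{n=1}^{N} \bigl\lVert y_n - a(x_n) \bigr\rVert^2
```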

Deep learning – optimization. The best solution is where the cost is lowest, so our goal is a minimal C(w, b). Basic optimization problem: when is a function at a minimum? (A high-school math problem: when its derivative is ZERO.) How do we find the zero derivative? Analytically we would need N > num(w), and it is not possible for stacked neurons. A naive iterative approach: start at a random position and look for small combinations of Δw that decrease C, but if num(w) is big there are far too many combinations to check. Use gradient descent instead!

Deep learning – gradient descent. Iterative process: start at a random position; compute the activations y_n for all samples; find the partial derivative/gradient for each parameter w (and b); move each w_i (and b) along its gradient direction (actually along the negative gradient); repeat until the cost is low enough. Stochastic gradient / mini-batch: take a smaller subset of samples at each step; it still carries enough gradient information (see the sketch below). Not perfect: the cost has multiple solutions, with local minima and plateaus.
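A minimal sketch of the mini-batch loop described above; the `grad_C` callback, the data container and the hyper-parameter values are illustrative assumptions rather than anything taken from the slides:

```python
import numpy as np

def sgd(w, b, data, grad_C, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch stochastic gradient descent.

    grad_C(w, b, batch) is assumed to return (dC/dw, dC/db) averaged
    over the samples in `batch`.
    """
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)                 # new sample order each epoch
        for i in range(0, n, batch_size):
            batch = data[i:i + batch_size]      # small subset of the samples
            dw, db = grad_C(w, b, batch)        # gradient estimate from the batch
            w -= lr * dw                        # step against the gradient
            b -= lr * db
    return w, b
```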

Deep learning – gradient descent. Heuristics to avoid local minima and plateaus: (1) Choose the step size carefully: too small and convergence is slow and cannot escape local minima; too big and it will not converge at all. (2) Momentum: considers gradients from previous steps to increase or decrease the step size, and helps escape local minima without manually increasing the step-size parameter. (3) Weight decay (regularization): the main goal is to keep only a small number of weights active/large; it is primarily used to fight overfitting but helps escape local minima as well. (4) Second-order derivatives: the gradient for w_i takes the other gradients w_j (where i != j) into account; in practice, approximations to second-order information are used: Nesterov's algorithm, AdaGrad, AdaDelta, … A sketch of momentum with weight decay follows below.
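A small sketch of the momentum and weight-decay heuristics as they are commonly implemented (classical momentum plus L2 decay); the function name and the constants are illustrative:

```python
import numpy as np

def momentum_step(w, v, dw, lr=0.1, mu=0.9, decay=1e-4):
    """One update with classical momentum and L2 weight decay.

    The velocity v accumulates past gradients, so steps in a consistent
    direction grow and the update can roll through plateaus and shallow
    local minima; the decay term pulls every weight slightly towards zero.
    """
    v = mu * v - lr * (dw + decay * w)   # old momentum + new (regularized) gradient
    w = w + v                            # move along the velocity
    return w, v

# usage: the velocity starts at zero, same shape as the weights
w, v = np.zeros(10), np.zeros(10)
w, v = momentum_step(w, v, dw=np.ones(10))
```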

Deep learning – back-propagation. Single neuron: apply the chain rule for the derivative of f(g(x)). Stacked (deep) neurons: keep repeating the chaining process from top to bottom, taking into account all paths in which w_i appears.
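A minimal worked form of the chaining the slide refers to, for one weight w_i of a sigmoid neuron with input z = Σ_j w_j x_j + b and output a = σ(z) (notation assumed):

```latex
\frac{\partial C}{\partial w_i}
  = \frac{\partial C}{\partial a}\,
    \frac{\partial a}{\partial z}\,
    \frac{\partial z}{\partial w_i}
  = \frac{\partial C}{\partial a}\;\sigma'(z)\;x_i
```

When the neuron sits deeper in the stack, the factor ∂C/∂a itself expands by the same rule into a sum over every path from a to the output, which is exactly the repeated top-to-bottom chaining described above.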

Deep learning – convolutional net. The previous slides were all general (not specific to computer vision). Applying a fully connected deep neural network to an image is not feasible: an image of size 128x128 has ~16k pixels, so 16k input neurons; with 4k neurons in the first layer (say we reduce the dimensions by 2x at each layer), the number of weights is 16k x 4k ≈ 64 million, for the first layer alone (a short calculation follows below). We can exploit the spatial locality of images: features are local, so only a small neighborhood of pixels is needed, and features repeat throughout the image. Local connections and weight sharing: divide the neurons into sub-features (each RGB channel is a separate feature); one neuron looks only at a small local neighborhood of pixels (3x3, 5x5, 7x7, …); and neurons of the same feature at different positions share weights.
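A quick back-of-the-envelope check of the counts above, and of what weight sharing buys; the convolutional configuration (64 features with 5x5 filters) is an illustrative assumption, not taken from the slides:

```python
# Fully connected first layer on a 128x128 grayscale image
inputs = 128 * 128                 # ~16k input neurons
hidden = 64 * 64                   # ~4k neurons in the first layer
fc_weights = inputs * hidden       # ~67 million weights (the slide rounds 16k x 4k to 64M)

# Convolutional first layer with shared local filters
features, k = 64, 5                # 64 feature maps, 5x5 filters
conv_weights = features * k * k    # only 1,600 weights, independent of image size

print(fc_weights, conv_weights)
```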

Deep learning – ReLU. How does the sigmoid function affect learning? It enables easier computation of the derivative, but it has negative effects: the neuron never quite reaches 1 or 0, i.e. it saturates, and the gradient shrinks the magnitude of the error. This leads to two problems: slow learning when neurons are saturated (i.e. at large z values), and the vanishing gradient problem (each sigmoid layer passes on at most 25% of the error from the layer above!).
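The '25%' figure follows from the derivative of the sigmoid, which is bounded as:

```latex
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}
```

so each sigmoid layer scales the back-propagated error by at most a quarter, and by far less when the neuron is saturated (|z| large).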

Deep learning – ReLU. Krizhevsky et al. popularized the Rectified Linear Unit (ReLU) in place of the sigmoid function (AlexNet, 2012). The main purpose of ReLU is to reduce the saturation and vanishing-gradient issues. It is still not perfect: learning stops at negative z values (a piecewise-linear variant can be used instead: the Parametric ReLU, He et al. 2015 from Microsoft), and since the positive side is unbounded there is a bigger risk of activations growing extremely large.
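A minimal sketch of the two activations mentioned above; the negative slope of 0.25 is the initial value used for the Parametric ReLU by He et al. (2015) and is an assumption here, since in their formulation it is a learned parameter:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: identity for positive z, zero otherwise."""
    return np.maximum(0.0, z)

def prelu(z, a=0.25):
    """Parametric ReLU: a small (learned) slope `a` on the negative side,
    so the unit keeps a nonzero gradient even for negative z."""
    return np.where(z > 0, z, a * z)
```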

Deep learning – dropout. Too many weights cause overfitting issues. Weight decay (regularization) helps but is not perfect, and it adds another hyper-parameter to set manually. Srivastava et al. (2014) proposed a kind of 'bagging' for deep nets (Alex Krizhevsky had in fact already used it in AlexNet, 2012). Main point: robustify the network by disabling neurons. Each neuron has a probability, usually around 0.5, of being disabled, so the remaining neurons must adapt to work without it. Dropout is applied only to the fully connected layers; conv layers are less susceptible to overfitting. Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
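A sketch of dropout as it is commonly implemented ('inverted' dropout, which rescales the surviving activations at training time so that nothing changes at inference); the drop probability and the function name are illustrative:

```python
import numpy as np

def dropout(a, p_drop=0.5, training=True):
    """Zero each activation with probability p_drop during training and
    scale the survivors by 1 / (1 - p_drop); do nothing at inference."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask
```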

Deep learning – batch norm. The input needs to be whitened, i.e. normalized (LeCun 1998, Efficient BackProp), but this is usually done on the first-layer input only, even though the same reason for normalization exists at every other layer as well. Ioffe and Szegedy (2015) therefore normalize the input to each layer to reduce internal covariate shift. It is too slow to normalize over all the input data (>1M samples), so they normalize within each mini-batch only: during learning, the normalization uses mini-batch statistics; at inference, it uses statistics estimated over the whole training input. This gives better results while allowing a higher learning rate, higher decay, no dropout and no LRN. Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
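A sketch of the training-time forward pass described above (per-feature statistics over the mini-batch, followed by a learned scale and shift); the shapes and names are assumptions:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization of x with shape (batch, features).

    At inference time the mini-batch statistics below are replaced by
    running averages collected during training.
    """
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learned scale and shift
```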

Deep learning – residual learning. The current state of the art on ImageNet classification is a CNN with ~150 layers (by Microsoft Research Asia). Key features: it requires reducing internal covariate shift (Batch Normalization); it has only ~2M parameters (using many small kernels, 1x1 and 3x3), whereas a CNN with 1500 layers had ~20M parameters and ran into overfitting issues; and it adds an identity bypass. Why a bypass? If a layer is not needed it can simply be ignored: it just forwards its input as its output, because by default the weights are very small and F(x, {W_i}) is negligible compared to x. He et al., Deep Residual Learning for Image Recognition, CVPR 2016
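A minimal sketch of the identity bypass: the block outputs F(x) + x, so if the residual branch contributes nothing the input simply passes through unchanged; the residual branch used here is a placeholder:

```python
import numpy as np

def residual_block(x, F):
    """Identity-bypass block: output = F(x) + x."""
    return F(x) + x

# usage with a trivial residual branch: F(x) ~ 0, so the output equals x
x = np.ones(8)
y = residual_block(x, lambda v: 0.0 * v)
```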

Deep learning – visualizing features. It is difficult to understand the internals of CNNs, and there have been many visualization attempts, most quite complex: Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014; Simonyan et al., Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR 2014; Mahendran et al., Understanding Deep Image Representations by Inverting Them, CVPR 2015; Yosinski et al., Understanding Neural Networks Through Deep Visualization, ICML 2015. CNNs also have strange properties: adversarial examples. Adding imperceptible perturbations to the pixels yields completely incorrect classifications, e.g. a prediction flips between 'a car' and 'unknown' even though the image difference is invisible. Szegedy et al., Intriguing properties of neural networks, ICLR 2014

Deep learning – visualizing features. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014

Deep learning – visualizing features. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014

Deep learning – visualizing features. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014

Deep learning - visualizing features Simonyan et al., Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR2014

Deep learning - visualizing features Mahendran et al., Understanding Deep Image Representations by Inverting Them, CVPR2015

Deep learning - visualizing features Yosinski et al., Understanding Neural Networks Through Deep Visualization, ICML2015

PART II

Convolutional neural networks. Trained filters for a part on the second layer: which parts on the first layer are important? Can we deduce anything about the object/part modeled this way? Is there a compositional interpretation? The CNN is hierarchical, but not compositional.

Our approach. A CNN might learn compositions, but the compositions are not explicit, so the advantages of compositions cannot be utilized. We therefore capture compositions as structure in the filters through a weight parametrization, using a Gaussian distribution as the model:
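A hedged sketch of what such a parametrization could look like: each 2-D filter is written as a small sum of Gaussian components whose amplitudes, positions and widths are learned together (the exact form used in this work may differ):

```latex
w(x, y) = \sum_{k=1}^{K} a_k
  \exp\!\left( -\frac{(x - \mu_k)^2 + (y - \nu_k)^2}{2\sigma_k^2} \right)
```

A filter is then described by a few amplitudes a_k, positions (μ_k, ν_k) and widths σ_k instead of a dense grid of independent weights.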

Compositional neural network. Compositional deep nets meet convolutional neural nets. Possible benefits: the model can be interpreted as a hierarchical composition; a reduced number of parameters (faster learning? fewer training samples?); generative learning (co-occurrence statistics from compositional hierarchies) combined with discriminative optimization (gradient descent from CNNs); and visualizations based on compositions, without additional data or complex optimization.

Compositional neural network. Back-propagation remains the same: minimize the loss function C with respect to the weights, the means and the variances.

Compositional neural network

First layer – activations. Input image (normalized). The first and last filters are only blob detectors (with positive or negative weights); the ones in the middle are edge detectors, which can also be deduced just by looking at the filters!

Second layer – weights. 16 different channel filters per feature and 16 different features at the second layer; Gaussian CNN versus standard CNN.

Second layer – activations. Input image (normalized). Some features are still just edges, but some are already corner points.