Understanding Convolutional Neural Networks for Object Recognition


Understanding Convolutional Neural Networks for Object Recognition Domen Tabernik University of Ljubljana Faculty of Computer and Information Science Visual Cognitive Systems Laboratory

Visual object recognition. How do we capture a representation of objects in a computational/mathematical model? It is impossible to explicitly model each one: there are too many objects and too many variations, so we let the machine learn the model from samples, taking inspiration from the biological (human/primate) visual system. The key element is a hierarchy: raw pixels (x, y, RGB/gray) → ? → objects (a category, plus position or segmentation). A hierarchy brings biological plausibility, part sharing between objects/categories (an efficient representation), and objects/parts expressed as compositions of other parts (a compositional interpretation). Kruger et al., Deep Hierarchies in the Primate Visual Cortex: What Can We Learn For Computer Vision?, PAMI 2012

Deep learning – a sigmoid neuron. The basic element is a sigmoid neuron (an improved perceptron from the 60s). Mathematical form: a weighted linear combination of the inputs plus a bias, passed through the sigmoid function. Why sigmoid? It is equivalent to a smooth threshold function: smoothness gives nice mathematical properties (derivatives), and the threshold adds non-linearity when neurons are stacked, which captures more complex representations. Example: from the image pixels the neuron outputs a probability of a car, y = 0.78.
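The formulas on this slide are images and did not survive the transcript; a standard reconstruction of the sigmoid neuron described above, with inputs x_j, weights w_j and bias b (notation assumed), is:

```latex
z = \sum_j w_j x_j + b, \qquad
y = \sigma(z) = \frac{1}{1 + e^{-z}}
```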

Deep learning – a sigmoid neuron. Learning process: the known values are x and y (the learning inputs), the unknown values are w and b (the learned parameters). Which w, b will, for ALL learning images x_n, produce the correct output y_n? Basically, the cost is the average difference between the neuron's outputs and the actual correct outputs.
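The cost the slide describes (the average difference between the neuron's outputs and the correct outputs) is usually written as a quadratic cost; a standard form over N training pairs (x_n, y_n), with a(x_n) the neuron's output (the exact cost used on the slide may differ), is:

```latex
C(w, b) = \frac{1}{2N} \sum_{n=1}^{N} \bigl\lVert y_n - a(x_n) \bigr\rVert^2
```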

Deep learning – optimization. The best solution is where the cost is lowest, so our goal is a minimal C(w, b). Basic optimization problem: when is a function at a minimum? (A high-school math problem: when its derivative is ZERO.) How do we find the zero derivative? Analytically we would need N > num(w), and it is not possible for stacked neurons. A naive iterative approach: start at a random position and look for small combinations of Δw that decrease C, but if num(w) is big there are far too many combinations to check. Use gradient descent instead!

Deep learning – gradient descent. Iterative process: start at a random position; compute the activations y_n for all samples; find the partial derivative/gradient for each parameter w (and b); move each w_i (and b) along its gradient direction (actually along the negative gradient); repeat until the cost is low enough. Stochastic gradient / mini-batch: take a smaller subset of samples at each step; it still carries enough gradient information (see the sketch below). Not perfect: the cost has multiple solutions, with local minima and plateaus.
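A minimal sketch of the mini-batch loop described above; the `grad_C` callback, the data container and the hyper-parameter values are illustrative assumptions rather than anything taken from the slides:

```python
import numpy as np

def sgd(w, b, data, grad_C, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch stochastic gradient descent.

    grad_C(w, b, batch) is assumed to return (dC/dw, dC/db) averaged
    over the samples in `batch`.
    """
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)                 # new sample order each epoch
        for i in range(0, n, batch_size):
            batch = data[i:i + batch_size]      # small subset of the samples
            dw, db = grad_C(w, b, batch)        # gradient estimate from the batch
            w -= lr * dw                        # step against the gradient
            b -= lr * db
    return w, b
```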

Deep learning – gradient descent. Heuristics to avoid local minima and plateaus: (1) Choose the step size carefully: too small and convergence is slow and cannot escape local minima; too big and it will not converge at all. (2) Momentum: considers gradients from previous steps to increase or decrease the step size, and helps escape local minima without manually increasing the step-size parameter. (3) Weight decay (regularization): the main goal is to keep only a small number of weights active/large; it is primarily used to fight overfitting but helps escape local minima as well. (4) Second-order derivatives: the gradient for w_i takes the other gradients w_j (where i != j) into account; in practice, approximations to second-order information are used: Nesterov's algorithm, AdaGrad, AdaDelta, … A sketch of momentum with weight decay follows below.
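A small sketch of the momentum and weight-decay heuristics as they are commonly implemented (classical momentum plus L2 decay); the function name and the constants are illustrative:

```python
import numpy as np

def momentum_step(w, v, dw, lr=0.1, mu=0.9, decay=1e-4):
    """One update with classical momentum and L2 weight decay.

    The velocity v accumulates past gradients, so steps in a consistent
    direction grow and the update can roll through plateaus and shallow
    local minima; the decay term pulls every weight slightly towards zero.
    """
    v = mu * v - lr * (dw + decay * w)   # old momentum + new (regularized) gradient
    w = w + v                            # move along the velocity
    return w, v

# usage: the velocity starts at zero, same shape as the weights
w, v = np.zeros(10), np.zeros(10)
w, v = momentum_step(w, v, dw=np.ones(10))
```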

Deep learning – back-propagation. Single neuron: apply the chain rule for the derivative of f(g(x)). Stacked (deep) neurons: keep repeating the chaining process from top to bottom, taking into account all paths in which w_i appears.
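A minimal worked form of the chaining the slide refers to, for one weight w_i of a sigmoid neuron with input z = Σ_j w_j x_j + b and output a = σ(z) (notation assumed):

```latex
\frac{\partial C}{\partial w_i}
  = \frac{\partial C}{\partial a}\,
    \frac{\partial a}{\partial z}\,
    \frac{\partial z}{\partial w_i}
  = \frac{\partial C}{\partial a}\;\sigma'(z)\;x_i
```

When the neuron sits deeper in the stack, the factor ∂C/∂a itself expands by the same rule into a sum over every path from a to the output, which is exactly the repeated top-to-bottom chaining described above.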

Deep learning – convolutional net. The previous slides were all general (not specific to computer vision). Applying a fully connected deep neural network to an image is not feasible: an image of size 128x128 has ~16k pixels, so 16k input neurons; with 4k neurons in the first layer (say we reduce the dimensions by 2x at each layer), the number of weights is 16k x 4k ≈ 64 million, for the first layer alone (a short calculation follows below). We can exploit the spatial locality of images: features are local, so only a small neighborhood of pixels is needed, and features repeat throughout the image. Local connections and weight sharing: divide the neurons into sub-features (each RGB channel is a separate feature); one neuron looks only at a small local neighborhood of pixels (3x3, 5x5, 7x7, …); and neurons of the same feature at different positions share weights.
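A quick back-of-the-envelope check of the counts above, and of what weight sharing buys; the convolutional configuration (64 features with 5x5 filters) is an illustrative assumption, not taken from the slides:

```python
# Fully connected first layer on a 128x128 grayscale image
inputs = 128 * 128                 # ~16k input neurons
hidden = 64 * 64                   # ~4k neurons in the first layer
fc_weights = inputs * hidden       # ~67 million weights (the slide rounds 16k x 4k to 64M)

# Convolutional first layer with shared local filters
features, k = 64, 5                # 64 feature maps, 5x5 filters
conv_weights = features * k * k    # only 1,600 weights, independent of image size

print(fc_weights, conv_weights)
```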

Deep learning – ReLU. How does the sigmoid function affect learning? It enables easier computation of the derivative, but it has negative effects: the neuron never quite reaches 1 or 0, i.e. it saturates, and the gradient shrinks the magnitude of the error. This leads to two problems: slow learning when neurons are saturated (i.e. at large z values), and the vanishing gradient problem (each sigmoid layer passes on at most 25% of the error from the layer above!).
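The '25%' figure follows from the derivative of the sigmoid, which is bounded as:

```latex
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}
```

so each sigmoid layer scales the back-propagated error by at most a quarter, and by far less when the neuron is saturated (|z| large).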

Deep learning – ReLU. Krizhevsky et al. popularized the Rectified Linear Unit (ReLU) in place of the sigmoid function (AlexNet, 2012). The main purpose of ReLU is to reduce the saturation and vanishing-gradient issues. It is still not perfect: learning stops at negative z values (a piecewise-linear variant can be used instead: the Parametric ReLU, He et al. 2015 from Microsoft), and since the positive side is unbounded there is a bigger risk of activations growing extremely large.
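A minimal sketch of the two activations mentioned above; the negative slope of 0.25 is the initial value used for the Parametric ReLU by He et al. (2015) and is an assumption here, since in their formulation it is a learned parameter:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: identity for positive z, zero otherwise."""
    return np.maximum(0.0, z)

def prelu(z, a=0.25):
    """Parametric ReLU: a small (learned) slope `a` on the negative side,
    so the unit keeps a nonzero gradient even for negative z."""
    return np.where(z > 0, z, a * z)
```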

Deep learning – dropout. Too many weights cause overfitting issues. Weight decay (regularization) helps but is not perfect, and it adds another hyper-parameter to set manually. Srivastava et al. (2014) proposed a kind of 'bagging' for deep nets (Alex Krizhevsky had in fact already used it in AlexNet, 2012). Main point: robustify the network by disabling neurons. Each neuron has a probability, usually around 0.5, of being disabled, so the remaining neurons must adapt to work without it. Dropout is applied only to the fully connected layers; conv layers are less susceptible to overfitting. Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
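A sketch of dropout as it is commonly implemented ('inverted' dropout, which rescales the surviving activations at training time so that nothing changes at inference); the drop probability and the function name are illustrative:

```python
import numpy as np

def dropout(a, p_drop=0.5, training=True):
    """Zero each activation with probability p_drop during training and
    scale the survivors by 1 / (1 - p_drop); do nothing at inference."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask
```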

Deep learning – batch norm. The input needs to be whitened, i.e. normalized (LeCun 1998, Efficient BackProp), but this is usually done on the first-layer input only, even though the same reason for normalization exists at every other layer as well. Ioffe and Szegedy (2015) therefore normalize the input to each layer to reduce internal covariate shift. It is too slow to normalize over all the input data (>1M samples), so they normalize within each mini-batch only: during learning, the normalization uses mini-batch statistics; at inference, it uses statistics estimated over the whole training input. This gives better results while allowing a higher learning rate, higher decay, no dropout and no LRN. Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
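A sketch of the training-time forward pass described above (per-feature statistics over the mini-batch, followed by a learned scale and shift); the shapes and names are assumptions:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization of x with shape (batch, features).

    At inference time the mini-batch statistics below are replaced by
    running averages collected during training.
    """
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learned scale and shift
```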

Deep learning – residual learning. The current state of the art on ImageNet classification is a CNN with ~150 layers (by Microsoft Research Asia). Key features: it requires reducing internal covariate shift (Batch Normalization); it has only ~2M parameters (using many small kernels, 1x1 and 3x3), whereas a CNN with 1500 layers had ~20M parameters and ran into overfitting issues; and it adds an identity bypass. Why a bypass? If a layer is not needed it can simply be ignored: it just forwards its input as its output, because by default the weights are very small and F(x, {W_i}) is negligible compared to x. He et al., Deep Residual Learning for Image Recognition, CVPR 2016
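A minimal sketch of the identity bypass: the block outputs F(x) + x, so if the residual branch contributes nothing the input simply passes through unchanged; the residual branch used here is a placeholder:

```python
import numpy as np

def residual_block(x, F):
    """Identity-bypass block: output = F(x) + x."""
    return F(x) + x

# usage with a trivial residual branch: F(x) ~ 0, so the output equals x
x = np.ones(8)
y = residual_block(x, lambda v: 0.0 * v)
```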

Deep learning – visualizing features. It is difficult to understand the internals of CNNs, and there have been many visualization attempts, most quite complex: Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014; Simonyan et al., Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR 2014; Mahendran et al., Understanding Deep Image Representations by Inverting Them, CVPR 2015; Yosinski et al., Understanding Neural Networks Through Deep Visualization, ICML 2015. CNNs also have strange properties: adversarial examples. Adding imperceptible perturbations to the pixels yields completely incorrect classifications, e.g. a prediction flips between 'a car' and 'unknown' even though the image difference is invisible. Szegedy et al., Intriguing properties of neural networks, ICLR 2014

Deep learning – visualizing features. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014

Deep learning – visualizing features. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014

Deep learning – visualizing features. Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014

Deep learning - visualizing features Simonyan et al., Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR2014

Deep learning - visualizing features Mahendran et al., Understanding Deep Image Representations by Inverting Them, CVPR2015

Deep learning - visualizing features Yosinski et al., Understanding Neural Networks Through Deep Visualization, ICML2015

PART II

Convolutional neural networks. Trained filters for a part on the second layer: which parts on the first layer are important? Can we deduce anything about the object/part modeled this way? Is there a compositional interpretation? The CNN is hierarchical, but not compositional.

Our approach. A CNN might learn compositions, but the compositions are not explicit, so the advantages of compositions cannot be utilized. We therefore capture compositions as structure in the filters through a weight parametrization, using a Gaussian distribution as the model:
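A hedged sketch of what such a parametrization could look like: each 2-D filter is written as a small sum of Gaussian components whose amplitudes, positions and widths are learned together (the exact form used in this work may differ):

```latex
w(x, y) = \sum_{k=1}^{K} a_k
  \exp\!\left( -\frac{(x - \mu_k)^2 + (y - \nu_k)^2}{2\sigma_k^2} \right)
```

A filter is then described by a few amplitudes a_k, positions (μ_k, ν_k) and widths σ_k instead of a dense grid of independent weights.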

Compositional neural network. Compositional deep nets meet convolutional neural nets. Possible benefits: the model can be interpreted as a hierarchical composition; a reduced number of parameters (faster learning? fewer training samples?); generative learning (co-occurrence statistics from compositional hierarchies) combined with discriminative optimization (gradient descent from CNNs); and visualizations based on compositions, without additional data or complex optimization.

Compositional neural network. Back-propagation remains the same: minimize the loss function C with respect to the weights, the means and the variances.

Compositional neural network

First layer – activations. Input image (normalized). The first and last filters are only blob detectors (with positive or negative weights); the ones in the middle are edge detectors, which can also be deduced just by looking at the filters!

Second layer – weights. 16 different channel filters per feature and 16 different features at the second layer; Gaussian CNN versus standard CNN.

Second layer – activations. Input image (normalized). Some features are still just edges, but some are already corner points.