CS 4501: Introduction to Computer Vision Training Neural Networks II

CS 4501: Introduction to Computer Vision Training Neural Networks II Connelly Barnes

Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning

Preprocessing Common: zero-center the data; can sometimes also normalize the variance. Slide from Stanford CS231n
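A minimal NumPy sketch of this step (illustrative, not from the slides; the function and variable names are my own). Statistics are computed on the training set only and reused for test data:

```python
import numpy as np

# Sketch: zero-center and normalize a data matrix of shape (num_samples, num_features).
def preprocess(X_train, X_test):
    mean = X_train.mean(axis=0)        # per-feature mean from training data only
    std = X_train.std(axis=0) + 1e-8   # small epsilon avoids division by zero
    X_train = (X_train - mean) / std   # zero mean, unit variance
    X_test = (X_test - mean) / std     # apply the same statistics to test data
    return X_train, X_test
```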

Preprocessing Can also decorrelate the data using PCA, or whiten the data. Slide from Stanford CS231n
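A sketch of PCA decorrelation and whitening with NumPy (illustrative names; the small constant added to the eigenvalues is an assumption to avoid division by zero):

```python
import numpy as np

# Sketch: decorrelate with PCA, then whiten to unit variance per component.
def whiten(X):
    X = X - X.mean(axis=0)                           # zero-center first
    cov = np.dot(X.T, X) / X.shape[0]                # covariance matrix
    U, S, _ = np.linalg.svd(cov)                     # eigenvectors U, eigenvalues S
    X_decorrelated = np.dot(X, U)                    # rotate into the eigenbasis (PCA)
    X_whitened = X_decorrelated / np.sqrt(S + 1e-5)  # scale each component to unit variance
    return X_whitened
```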

Preprocessing for Images Center the data using a mean image: compute a mean image of the training set (think of the classic mean-face examples), either grayscale or with a separate mean per channel (RGB), and subtract it from your dataset. Or center the RGB channels only: subtract the mean training value of each channel.
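A short sketch of the two centering options (dummy data; shapes and array names are illustrative assumptions):

```python
import numpy as np

images = np.random.rand(100, 32, 32, 3).astype(np.float32)  # (N, H, W, C) training set

# Option 1: subtract a full mean image (AlexNet-style).
mean_image = images.mean(axis=0)        # shape (32, 32, 3)
centered = images - mean_image

# Option 2: subtract only the per-channel mean (VGG-style).
mean_rgb = images.mean(axis=(0, 1, 2))  # shape (3,)
centered = images - mean_rgb
```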

Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning

Initialization We need to start gradient descent from an initial guess. What happens if we initialize all weights to zero? (Every neuron in a layer then computes the same output and receives the same gradient, so the neurons never differentiate: the symmetry is never broken.) Slide from Stanford CS231n

Initialization Idea: initialize with random numbers, e.g. drawn from a normal distribution: w_ij = N(μ, σ) with μ = 0 and σ = const. OK for shallow networks, but what about deep networks?

Initialization, σ = 0.01 Simulation: multilayer perceptron, 10 fully-connected hidden layers, tanh() activation function. Hidden layer activation statistics (figure shows Hidden Layer 1 through Hidden Layer 10). Are there any problems with this? (Yes: the activations shrink toward zero with depth, so gradients in the early layers vanish.) Slide from Stanford CS231n

Initialization, σ = 1 Simulation: multilayer perceptron, 10 fully-connected hidden layers, tanh() activation function. Hidden layer activation statistics (figure shows Hidden Layer 1 through Hidden Layer 10). Are there any problems with this? (Yes: the activations saturate at ±1, where tanh has near-zero gradient, so gradients vanish.) Slide from Stanford CS231n
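A sketch reproducing the kind of simulation described on these two slides: a 10-layer tanh MLP with weights drawn from N(0, σ), reporting per-layer activation statistics. The layer width, batch size, and function names are illustrative assumptions, not the slide's exact setup:

```python
import numpy as np

def simulate(sigma, num_layers=10, width=500):
    x = np.random.randn(1000, width)                # random input batch
    for layer in range(num_layers):
        W = np.random.randn(width, width) * sigma   # w_ij = N(0, sigma)
        x = np.tanh(x.dot(W))
        print(f"sigma={sigma}: layer {layer + 1}: "
              f"mean {x.mean():.4f}, std {x.std():.4f}")

simulate(sigma=0.01)  # activations shrink toward zero with depth
simulate(sigma=1.0)   # activations saturate near -1/+1 with depth
```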

Xavier Initialization w_ij = N(μ = 0, σ = 1/√n_in), where n_in is the number of neurons feeding into a given neuron (actually, Xavier used a uniform distribution). Hidden layer activation statistics (figure shows Hidden Layer 1 through Hidden Layer 10): a reasonable initialization for the tanh() activation function. But what happens with ReLU? Slide from Stanford CS231n
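A one-line sketch of the normal-distribution variant shown above (layer sizes are illustrative):

```python
import numpy as np

n_in, n_out = 512, 256                             # illustrative layer sizes
W = np.random.randn(n_in, n_out) / np.sqrt(n_in)   # std = 1/sqrt(n_in)
```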

Xavier Initialization, ReLU n_in: number of neurons feeding into a given neuron. Hidden layer activation statistics (figure shows Hidden Layer 1 through Hidden Layer 10): with ReLU, the activations collapse toward zero as depth increases, because ReLU zeroes half of its inputs. Slide from Stanford CS231n

He et al. 2015 Initialization, ReLU w_ij = N(μ = 0, σ = √(2/n_in)), where n_in is the number of neurons feeding into a given neuron. Hidden layer activation statistics (figure shows Hidden Layer 1 through Hidden Layer 10): the extra factor of 2 compensates for ReLU, keeping activation statistics stable across layers. Slide from Stanford CS231n
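The corresponding one-line sketch for He initialization (layer sizes again illustrative):

```python
import numpy as np

n_in, n_out = 512, 256                                   # illustrative layer sizes
W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)   # std = sqrt(2/n_in) for ReLU
```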

Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning

Vanishing/exploding gradient problem Recall that the backpropagation algorithm uses the chain rule, so gradients are products of many derivatives, which can become very large or very small. Could this be problematic for learning?

Vanishing/exploding gradient problem Vanishing gradients problem: neurons in earlier layers learn more slowly than those in later layers. Image from Nielsen 2015

Vanishing/exploding gradient problem Vanishing gradients problem: neurons in earlier layers learn more slowly than those in later layers. Exploding gradients problem: gradients are significantly larger in earlier layers than in later layers. How to avoid? Use a good initialization. Do not use sigmoid for deep networks. Use momentum with carefully tuned schedules (see the illustrative sketch below). Image from Nielsen 2015
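The slide's momentum example is not reproduced here, so the following is only an illustrative sketch of what such a scheme can look like: SGD with momentum whose coefficient is ramped up over training. The specific schedule, numbers, and function names are assumptions, not the slide's formula:

```python
# Illustrative sketch only: the momentum schedule below is an assumption.
def momentum_schedule(epoch, mu_max=0.99):
    # Ramp momentum from 0.5 toward mu_max over roughly the first 50 epochs.
    return min(mu_max, 0.5 + 0.01 * epoch)

def sgd_momentum_step(w, grad_w, velocity, epoch, lr=0.01):
    mu = momentum_schedule(epoch)
    velocity = mu * velocity - lr * grad_w   # accumulate a running update direction
    return w + velocity, velocity
```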

Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning

Batch normalization It would be great if we could just whiten the inputs to all neurons in a layer, i.e. make them zero mean with variance 1: this helps avoid the vanishing gradients problem and allows higher learning rates. For each input k to the next layer, normalize: x̂_k = (x_k − E[x_k]) / √Var[x_k]. Slight problem: this reduces the representational ability of the network. Why?

Batch normalization It would be great if we could just whiten the inputs to all neurons in a layer, i.e. zero mean, variance 1: avoid the vanishing gradients problem, improve learning rates! Slight problem: this reduces the representational ability of the network. Why? The normalized inputs get stuck in the central, roughly linear part of the activation function (see the annotated curve in the figure), so the nonlinearity is effectively lost.

Batch normalization First whiten each input k independently, using statistics from the mini-batch B: x̂_k = (x_k − μ_B) / √(σ_B² + ε). Then introduce parameters γ_k, β_k to scale and shift each input: y_k = γ_k x̂_k + β_k. These parameters are learned by the optimization.
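A sketch of the batch-normalization forward pass for one layer, following the two equations above (variable names are illustrative; the full algorithm in Ioffe & Szegedy 2015 also keeps running statistics for test time):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) inputs to the layer
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # whiten each input k independently
    return gamma * x_hat + beta             # learned scale and shift
```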

Batch normalization

Dropout: regularization Randomly zero the outputs of a fraction of the neurons during training; each neuron's output is kept with probability p (and zeroed otherwise). Can we learn representations that are robust to the loss of neurons? Intuition: learn and remember useful information even if there are some errors in the computation (biological connection?). Slide from Stanford CS231n

Dropout Another interpretation: we are learning a large ensemble of models that share weights. Slide from Stanford CS231n

Dropout Another interpretation: we are learning a large ensemble of models that share weights. What can we do during testing to correct for the dropout process? Multiply all neuron outputs by p (the keep probability). Or equivalently (inverted dropout), simply divide the kept neuron outputs by p during training, so no correction is needed at test time. Slide from Stanford CS231n
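A short sketch of inverted dropout as described above (illustrative code; p is the keep probability):

```python
import numpy as np

p = 0.5  # probability of keeping a neuron's output

def dropout_train(x):
    mask = (np.random.rand(*x.shape) < p) / p  # drop and rescale in one step
    return x * mask

def dropout_test(x):
    return x  # no scaling needed at test time with inverted dropout
```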

Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning

Softmax Often used in the final output layer to convert neuron outputs into class probability scores that sum to 1. For example, we might want to convert the final network output to: P(dog) = 0.2, P(cat) = 0.8 (each probability is in the range [0, 1], and the probabilities sum to 1).

Softmax Softmax takes a vector z and outputs a vector of the same length: softmax(z)_i = exp(z_i) / Σ_j exp(z_j).

Softmax Example from Wikipedia: Input: [1, 2, 3, 4, 1, 2, 3] Output: [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175] Most of the probability mass is concentrated where the maximum element is.
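A sketch of a numerically stable softmax that reproduces the example; subtracting the maximum does not change the result but avoids overflow in exp():

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([1, 2, 3, 4, 1, 2, 3], dtype=float)))
# -> approximately [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]
```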

Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning

Transfer Learning / Fine-tuning Neural networks are data hungry: they need tons of data to avoid overfitting (e.g. ImageNet has millions of images). What if we have a problem with only a few images? Collect more data (use Amazon Mechanical Turk)? What if collecting more data is too expensive or difficult? Use transfer learning (fine-tuning). Image by Sam Howzit, Creative Commons

Transfer Learning / Fine-tuning Slides from Fei-Fei Li, Justin Johnson, Andrej Karpathy

Transfer Learning / Fine-tuning Use a lower learning rate when fine-tuning the pretrained layers (e.g. 1/10 of the original learning rate). Slides from Fei-Fei Li, Justin Johnson, Andrej Karpathy
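A minimal PyTorch sketch of this idea (assumes a recent torchvision; the network choice, class count, learning rates, and variable names are illustrative assumptions, and the data loading and training loop are omitted):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the final classification layer for a new task with, say, 10 classes.
model.fc = nn.Linear(model.fc.in_features, 10)

# Fine-tune with a lower learning rate for the pretrained layers than for the
# freshly initialized final layer (the 1/10 factor mirrors the slide).
pretrained_params = [p for name, p in model.named_parameters() if not name.startswith("fc")]
optimizer = torch.optim.SGD([
    {"params": pretrained_params, "lr": 1e-3},       # pretrained layers: small LR
    {"params": model.fc.parameters(), "lr": 1e-2},   # new layer: 10x larger LR
], momentum=0.9)
```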