CS 4501: Introduction to Computer Vision Training Neural Networks II Connelly Barnes
Overview: Preprocessing; Improving convergence (initialization, vanishing/exploding gradients problem); Improving generalization (batch normalization, dropout); Additional neuron type: softmax; Transfer learning / fine-tuning
Preprocessing: it is common to zero-center the data; can sometimes also normalize variance. Slide from Stanford CS231n
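A minimal numpy sketch of these two steps (the array X and its shape are illustrative assumptions, one example per row):

```python
import numpy as np

# X: (N, D) data matrix, one example per row (illustrative shapes).
X = np.random.rand(100, 3072)

# Zero-center: subtract the per-feature mean computed on the training set.
mean = X.mean(axis=0)
X_centered = X - mean

# Optionally also normalize variance so each feature has unit standard deviation.
std = X_centered.std(axis=0) + 1e-8   # small epsilon avoids division by zero
X_normalized = X_centered / std
```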
Preprocessing: can also decorrelate the data using PCA, or whiten the data. Slide from Stanford CS231n
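A sketch of PCA decorrelation and whitening, following the usual recipe (shapes and the epsilon value are illustrative assumptions):

```python
import numpy as np

# Zero-centered data matrix (illustrative shape).
X = np.random.rand(500, 200)
X = X - X.mean(axis=0)

# Covariance matrix and its eigendecomposition.
cov = X.T @ X / X.shape[0]             # (D, D)
U, S, Vt = np.linalg.svd(cov)

# Decorrelate (PCA): rotate the data into the eigenbasis.
X_rot = X @ U

# Whiten: additionally divide by the square root of the eigenvalues.
X_white = X_rot / np.sqrt(S + 1e-5)
```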
Preprocessing for Images. Option 1, center the data only: compute a mean image of the training set (e.g. the familiar "mean face" examples), either in grayscale or with a separate mean per channel (RGB), then subtract the mean from your dataset. Option 2, center the RGB channels only: subtract the mean training value of each channel.
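A sketch of both centering options for an image dataset (the array shapes are illustrative assumptions):

```python
import numpy as np

# images: (N, H, W, 3) training images (illustrative shapes and values).
images = np.random.randint(0, 256, size=(50, 32, 32, 3)).astype(np.float32)

# Option 1: mean image, same shape as one image.
mean_image = images.mean(axis=0)              # (H, W, 3)
centered = images - mean_image

# Option 2: one scalar mean per RGB channel.
channel_mean = images.mean(axis=(0, 1, 2))    # (3,)
centered = images - channel_mean              # broadcasts over H and W
```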
Initialization. We need to start gradient descent at an initial guess. What happens if we initialize all the weights to zero? Slide from Stanford CS231n
Initialization. Idea: draw random numbers, e.g. from a normal distribution: 𝑤_𝑖𝑗 ~ 𝒩(𝜇, 𝜎) with 𝜇 = 0 and 𝜎 = const. OK for shallow networks, but what about deep networks?
Initialization with 𝜎 = 0.01. Simulation: multilayer perceptron, 10 fully-connected hidden layers, tanh() activation function. [Plot: hidden-layer activation statistics, Hidden Layer 1 through Hidden Layer 10.] Are there any problems with this? Slide from Stanford CS231n
Initialization with 𝜎 = 1. Simulation: multilayer perceptron, 10 fully-connected hidden layers, tanh() activation function. [Plot: hidden-layer activation statistics, Hidden Layer 1 through Hidden Layer 10.] Are there any problems with this? Slide from Stanford CS231n
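A sketch of the kind of simulation described on these slides, assuming 500 units per layer and random input data; changing sigma between 0.01 and 1 reproduces the two failure modes:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 500)          # fake input data
hidden_sizes = [500] * 10
sigma = 0.01                            # try 0.01 vs 1.0

h = x
for layer, n_out in enumerate(hidden_sizes, start=1):
    n_in = h.shape[1]
    W = sigma * np.random.randn(n_in, n_out)   # the initialization under test
    h = np.tanh(h @ W)
    print(f"hidden layer {layer}: mean {h.mean():+.4f}, std {h.std():.4f}")
# With sigma = 0.01 the activation stds shrink toward 0;
# with sigma = 1 tanh saturates near +/-1.
```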
Xavier Initialization: set 𝜎 = 1/√𝑛_in, where 𝑛_in is the number of neurons feeding into a given neuron (actually, Xavier used a uniform distribution). [Plot: hidden-layer activation statistics, Hidden Layer 1 through Hidden Layer 10.] A reasonable initialization for the tanh() activation function. But what happens with ReLU? Slide from Stanford CS231n
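A minimal numpy sketch of this initialization (layer sizes are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out):
    # Scale by 1/sqrt(n_in) so the output variance roughly matches the input variance.
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

W = xavier_init(500, 500)
```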
Xavier Initialization with ReLU (𝑛_in: number of neurons feeding into a given neuron). [Plot: hidden-layer activation statistics, Hidden Layer 1 through Hidden Layer 10 — with ReLU, the activations collapse toward zero in the deeper layers.] Slide from Stanford CS231n
He et al. 2015 Initialization for ReLU: set 𝜎 = √(2/𝑛_in), where 𝑛_in is the number of neurons feeding into a given neuron. [Plot: hidden-layer activation statistics, Hidden Layer 1 through Hidden Layer 10 — the activations stay well-scaled across all layers.] Slide from Stanford CS231n
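A minimal sketch of the He et al. scaling, reusing the same 10-layer ReLU simulation setup (sizes illustrative):

```python
import numpy as np

def he_init(n_in, n_out):
    # Extra factor of 2 compensates for ReLU zeroing out half of its inputs.
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

np.random.seed(0)
h = np.random.randn(1000, 500)
for layer in range(1, 11):
    # Compare with randn / np.sqrt(n_in) (Xavier) to see the activations shrink.
    W = he_init(h.shape[1], 500)
    h = np.maximum(0.0, h @ W)            # ReLU
    print(f"hidden layer {layer}: std {h.std():.4f}")
```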
Vanishing/exploding gradient problem. Recall that the backpropagation algorithm uses the chain rule: the gradient at an early layer is a product of many per-layer derivatives, which could get very big or very small. Could this be problematic for learning?
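A tiny numeric illustration of these products (the per-layer factors 0.25 and 3 are chosen only for illustration):

```python
# Backprop multiplies one derivative factor per layer. If each factor has
# magnitude ~0.25 (sigmoid's maximum derivative) the gradient vanishes;
# if each factor has magnitude ~3 it explodes.
depth = 20
print("vanishing:", 0.25 ** depth)   # ~9.1e-13
print("exploding:", 3.0 ** depth)    # ~3.5e9
```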
Vanishing/exploding gradient problem. Vanishing gradients problem: neurons in earlier layers learn more slowly than those in later layers. Image from Nielsen 2015
Vanishing/exploding gradient problem. Vanishing gradients problem: neurons in earlier layers learn more slowly than those in later layers. Exploding gradients problem: gradients are significantly larger in earlier layers than in later layers. How to avoid these problems? Use a good initialization. Do not use sigmoid for deep networks. Use momentum with carefully tuned learning-rate schedules. Image from Nielsen 2015
Batch normalization. It would be great if we could just whiten the inputs to all neurons in a layer, i.e. give them zero mean and variance 1: avoid the vanishing gradients problem and improve learning rates! For each input k to the next layer, normalize: x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]). Slight problem: this reduces the representation ability of the network. Why? The whitened inputs get stuck in the central, roughly linear part of the activation function.
Batch normalization. First whiten each input k independently, using statistics from the mini-batch: x̂^(k) = (x^(k) − μ_B) / √(σ_B² + ε). Then introduce parameters to scale and shift each input: y^(k) = γ^(k) x̂^(k) + β^(k). These parameters are learned by the optimization.
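A forward-pass sketch of these two steps for one mini-batch (training mode only; a full implementation would also keep running statistics for test time, which this simplified version omits):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch; gamma, beta: (D,) learned scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # whiten each input independently
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(64, 100)
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
```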
Dropout: regularization. Randomly zero the outputs of a fraction p of the neurons during training. Can we learn representations that are robust to the loss of neurons? Intuition: learn and remember useful information even if there are some errors in the computation (biological connection?). Slide from Stanford CS231n
Dropout. Another interpretation: we are learning a large ensemble of models that share weights. Slide from Stanford CS231n
Dropout. Another interpretation: we are learning a large ensemble of models that share weights. What can we do during testing to correct for the dropout process? Multiply all neuron outputs by the keep probability (1 − p under the convention above). Or equivalently (inverted dropout), simply divide the neuron outputs by the keep probability during training. Slide from Stanford CS231n
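A sketch of inverted dropout at training time; here keep_prob denotes the probability of keeping a neuron (i.e. 1 − p under the convention above):

```python
import numpy as np

def dropout_forward(h, keep_prob=0.5, train=True):
    """h: activations of a layer. During training, randomly zero neurons and
    rescale by keep_prob so no correction is needed at test time."""
    if not train:
        return h                                   # test time: identity
    mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
    return h * mask

h = np.random.randn(32, 256)
h_train = dropout_forward(h, keep_prob=0.5, train=True)
h_test = dropout_forward(h, train=False)
```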
Softmax. Often used in the final output layer to convert neuron outputs into class probability scores that sum to 1. For example, we might want to convert the final network output to: P(dog) = 0.2, P(cat) = 0.8 (probabilities are in the range [0, 1], and the sum of all probabilities is 1).
Softmax. Softmax takes a vector z and outputs a vector of the same length: softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
Softmax. Example from Wikipedia: Input: [1, 2, 3, 4, 1, 2, 3]. Output: [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. Most of the probability mass is concentrated where the maximum element is.
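A minimal sketch that reproduces this example (the max-subtraction is a standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1, 2, 3, 4, 1, 2, 3])).round(3))
# [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```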
Transfer Learning / Fine-tuning. Neural networks are data hungry: they need tons of data to avoid overfitting (e.g. ImageNet has millions of images). What if we have a problem with few images? Collect more data (use Amazon Mechanical Turk)? What if collecting more data is too expensive or difficult? Use transfer learning (fine-tuning). Image by Sam Howzit, Creative Commons
Transfer Learning / Fine-tuning. When fine-tuning a pretrained network, use a lower learning rate (about 1/10 of the original). Slides from Fei-Fei Li, Justin Johnson, Andrej Karpathy
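A sketch of this recipe in PyTorch, assuming torchvision is available and a new task with 10 classes; the frozen layers, replaced classifier, and learning rate are illustrative choices, not the exact setup from the slides:

```python
import torch
import torch.nn as nn
import torchvision

# Start from a network pretrained on ImageNet.
# (Newer torchvision versions use the weights=... argument instead of pretrained=True.)
model = torchvision.models.resnet18(pretrained=True)

# Freeze the early layers; their generic features transfer to the new task.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with a new layer for the new task (10 classes here).
model.fc = nn.Linear(model.fc.in_features, 10)

# Train only the new layer; if fine-tuning more layers, use a lower
# learning rate (roughly 1/10 of the one used for training from scratch).
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```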