Batch Normalization

Feature Scaling

Given training examples $x^1, x^2, x^3, \dots, x^r, \dots, x^R$, compute for each dimension $i$ the mean $m_i$ and the standard deviation $\sigma_i$ over the training set, then rescale every example:

$x_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$

After scaling, the mean of every dimension is 0 and its variance is 1. In general, gradient descent converges much faster with feature scaling than without it.

(Reference: https://standardfrancis.wordpress.com/2015/04/16/batch-normalization/)
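A minimal NumPy sketch of this rescaling; the array values are made up for illustration, only the formula comes from the slide:

```python
import numpy as np

# Rows are training examples x^1 ... x^R; columns are the input dimensions i.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

m = X.mean(axis=0)           # mean m_i of each dimension
sigma = X.std(axis=0)        # standard deviation sigma_i of each dimension

X_scaled = (X - m) / sigma   # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))  # approximately 0 in every dimension
print(X_scaled.std(axis=0))   # approximately 1 in every dimension
```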

How about Hidden Layers?

The same feature scaling can be applied to the inputs of every hidden layer: $x^1$ enters Layer 1, its output $a^1$ enters Layer 2, which produces $a^2$, and so on. Normalizing each layer's inputs to zero mean and unit variance (a simple form of whitening) is what batch normalization does.

Difficulty: unlike the raw inputs, the statistics of the hidden activations change during training; this is the internal covariate shift. A smaller learning rate can help, but then training becomes slower.

Naive method: train on a batch, update the model parameters, then normalize the activations. This doesn't work: the gradient descent optimization does not take into account the fact that the normalization takes place, so, for example, the biases can keep growing (exploding) while the distribution parameters (mean, variance) of the normalized outputs don't change.

Batch

For a batch of examples $x^1, x^2, x^3$, layer 1 computes $z^i = W^1 x^i$, a sigmoid gives $a^i$, and $W^2$ takes the result to the next layer. Collecting the examples as the columns of a matrix, the whole batch is handled by a single matrix product:

$\begin{bmatrix} z^1 & z^2 & z^3 \end{bmatrix} = W^1 \begin{bmatrix} x^1 & x^2 & x^3 \end{bmatrix}$

(Reference: https://ntumlds.wordpress.com/2017/03/26/r05922005_rrrr/)
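A small NumPy sketch of this matrix form; the shapes and values are illustrative assumptions, not from the slide:

```python
import numpy as np

# Three example inputs x^1, x^2, x^3 with two features each (made-up values).
x1 = np.array([1.0, 2.0])
x2 = np.array([0.5, -1.0])
x3 = np.array([3.0, 0.0])

W1 = np.random.randn(4, 2)            # layer-1 weights: 4 hidden units, 2 inputs
X = np.stack([x1, x2, x3], axis=1)    # X = [x^1 x^2 x^3], shape (2, 3)
Z = W1 @ X                            # Z = [z^1 z^2 z^3] = W^1 X, shape (4, 3)
A = 1.0 / (1.0 + np.exp(-Z))          # elementwise sigmoid gives a^1, a^2, a^3
```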

Batch normalization

Over the batch, compute the element-wise mean and standard deviation of the pre-activations $z^1, z^2, z^3$ produced by $W^1$:

$\mu = \dfrac{1}{3}\sum_{i=1}^{3} z^i, \qquad \sigma = \sqrt{\dfrac{1}{3}\sum_{i=1}^{3} \left(z^i - \mu\right)^2}$

Note that $\mu$ and $\sigma$ depend on the $z^i$. Batch normalization cannot be applied to small batches, because the statistics estimated from only a few examples are too noisy.
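A sketch of these batch statistics in NumPy, assuming the $z^i$ are stored as the columns of a matrix Z (one row per hidden unit):

```python
import numpy as np

Z = np.random.randn(4, 3)   # columns are z^1, z^2, z^3 from a batch of three

mu = Z.mean(axis=1, keepdims=True)    # mu = (1/3) * sum_i z^i, one value per unit
sigma = Z.std(axis=1, keepdims=True)  # sigma = sqrt((1/3) * sum_i (z^i - mu)^2)

# With a batch of one example, sigma would be 0 and the normalization undefined,
# which is one reason batch normalization cannot be applied to very small batches.
```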

Batch normalization

Normalize the pre-activations and then apply the activation function:

$\tilde{z}^i = \dfrac{z^i - \mu}{\sigma}, \qquad a^i = \mathrm{sigmoid}(\tilde{z}^i)$

How do we do backpropagation? $\mu$ and $\sigma$ depend on the $z^i$, so the gradient has to flow through them as well; a sketch of the resulting backward pass follows.
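A hedged sketch of the backward pass through $\tilde{z} = (z - \mu)/\sigma$; the scale and shift introduced on the next slide are not included yet, and the layout with the batch along the second axis is my assumption:

```python
import numpy as np

def batchnorm_backward(dz_tilde, z_tilde, sigma):
    """Gradient w.r.t. z given the gradient w.r.t. z_tilde = (z - mu) / sigma.

    dz_tilde, z_tilde: (units, batch); sigma: (units, 1).
    The second term comes from differentiating through mu, the third through sigma.
    """
    N = dz_tilde.shape[1]
    return (1.0 / (N * sigma)) * (
        N * dz_tilde
        - dz_tilde.sum(axis=1, keepdims=True)
        - z_tilde * (dz_tilde * z_tilde).sum(axis=1, keepdims=True)
    )
```

If the learnable scale $\gamma$ of the next slide is included, the incoming gradient is simply multiplied by $\gamma$ before this step.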

Batch normalization

The network may not want its pre-activations forced to exactly zero mean and unit variance, so batch normalization adds a learnable scale $\gamma$ and shift $\beta$:

$\tilde{z}^i = \dfrac{z^i - \mu}{\sigma}, \qquad \hat{z}^i = \gamma \odot \tilde{z}^i + \beta$

$\mu$ and $\sigma$ depend on the $z^i$, while $\gamma$ and $\beta$ are learned along with the other network parameters.
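A minimal forward-pass sketch of one batch-normalized layer; the small epsilon for numerical stability is my addition and does not appear on the slide:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Z: (units, batch); gamma, beta: (units, 1) learnable parameters."""
    mu = Z.mean(axis=1, keepdims=True)
    sigma = np.sqrt(Z.var(axis=1, keepdims=True) + eps)
    Z_tilde = (Z - mu) / sigma          # z_tilde^i = (z^i - mu) / sigma
    Z_hat = gamma * Z_tilde + beta      # z_hat^i = gamma (elementwise) z_tilde^i + beta
    return Z_hat, mu, sigma

# Example usage with gamma = 1 and beta = 0 (recovers plain normalization):
Z = np.random.randn(4, 3)
Z_hat, mu, sigma = batchnorm_forward(Z, np.ones((4, 1)), np.zeros((4, 1)))
```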

Batch normalization

At the testing stage we still compute $\tilde{z} = \dfrac{z - \mu}{\sigma}$ and $\hat{z} = \gamma \odot \tilde{z} + \beta$, but $\mu$ and $\sigma$ come from a batch while $\gamma$ and $\beta$ are network parameters, and we do not have a batch at the testing stage.

Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.
Practical solution: keep a moving average of the $\mu$ and $\sigma$ of the batches during training (the slide sketches an accuracy-vs-updates curve with the $\mu$ estimated after 1, 100 and 300 updates).
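A sketch of the practical solution: keep an exponential moving average of $\mu$ and $\sigma$ during training and reuse it at test time. The momentum value and class name are assumptions, and real frameworks typically track the running variance rather than $\sigma$ directly:

```python
import numpy as np

class RunningStats:
    """Moving averages of mu and sigma collected during training."""

    def __init__(self, units, momentum=0.99):
        self.mu = np.zeros((units, 1))
        self.sigma = np.ones((units, 1))
        self.momentum = momentum

    def update(self, mu_batch, sigma_batch):
        # Called once per training batch with that batch's statistics.
        self.mu = self.momentum * self.mu + (1 - self.momentum) * mu_batch
        self.sigma = self.momentum * self.sigma + (1 - self.momentum) * sigma_batch

    def normalize(self, z, gamma, beta):
        # Test time: no batch is available, so use the stored averages.
        return gamma * (z - self.mu) / self.sigma + beta
```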

Batch normalization - Benefits

BN reduces training time and makes very deep networks trainable.
Because there is less internal covariate shift, we can use larger learning rates.
There are fewer exploding/vanishing gradients; this is especially helpful for saturating activations such as sigmoid and tanh.
Learning is less affected by initialization: if $W^1$ is multiplied by a factor $k$, then $z^i$, $\mu$ and $\sigma$ are all multiplied by $k$ as well, so the normalized $\tilde{z}^i = \dfrac{z^i - \mu}{\sigma}$ is kept unchanged (a small numerical check is sketched below).
BN reduces the demand for regularization: the means and variances are calculated over batches, so every normalized value depends on the current batch, and the network can no longer just memorize values and their correct answers.
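A small numerical check of the initialization claim above; the shapes and the factor k are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 8))          # a batch of 8 examples in the columns

def normalize(Z):
    mu = Z.mean(axis=1, keepdims=True)
    sigma = Z.std(axis=1, keepdims=True)
    return (Z - mu) / sigma

k = 10.0
# Scaling W^1 scales z, mu and sigma by the same k, so (z - mu) / sigma is unchanged.
print(np.allclose(normalize(W1 @ X), normalize((k * W1) @ X)))   # True
```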