Batch Normalization
Feature Scaling
Given training examples $x^1, x^2, x^3, \dots, x^r, \dots, x^R$, consider each dimension $i$ across the whole set and compute its mean $m_i$ and standard deviation $\sigma_i$. Then normalize every example:
$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$
After scaling, the means of all dimensions are 0 and the variances are all 1. In general, gradient descent converges much faster with feature scaling than without it.
Reference: https://standardfrancis.wordpress.com/2015/04/16/batch-normalization/
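A minimal NumPy sketch of this feature scaling; the data matrix, its shapes, and the scales are made up for illustration:

```python
import numpy as np

# Hypothetical data: R = 100 examples (rows), 3 feature dimensions (columns)
# with very different scales and offsets.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) * [1.0, 50.0, 0.01] + [0.0, 10.0, -5.0]

# For each dimension i: mean m_i and standard deviation sigma_i over all examples.
m = X.mean(axis=0)
sigma = X.std(axis=0)

# x_i^r <- (x_i^r - m_i) / sigma_i
X_scaled = (X - m) / sigma

print(X_scaled.mean(axis=0))  # close to 0 in every dimension
print(X_scaled.var(axis=0))   # close to 1 in every dimension
```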
How about Hidden Layers?
Feature scaling helps the input $x^1$ of Layer 1, but the input of Layer 2 is the activation $a^1$, the input of Layer 3 is $a^2$, and so on. We would like to normalize these as well. The difficulty is that their statistics change during training as the earlier layers are updated; this is known as Internal Covariate Shift.
Batch normalization normalizes layer inputs to zero mean and unit variance, similar to whitening.
Naive method: train on a batch, update the model parameters, then normalize the activations. This doesn't work: it leads to exploding biases while the distribution parameters (mean, variance) don't change. Done this way, the gradient always ignores the effect that the normalization would have on the next batch, i.e. "the gradient descent optimization does not take into account the fact that the normalization takes place."
A smaller learning rate can be helpful against Internal Covariate Shift, but then training would be slower.
Batch
Consider a batch of examples $x^1, x^2, x^3$. Each one goes through the same first layer, $z^i = W^1 x^i$, then a sigmoid, $a^i = \mathrm{sigmoid}(z^i)$, then $W^2$, and so on. Stacking the examples as columns lets the whole batch be handled by a single matrix product:
$$\left[\, z^1 \; z^2 \; z^3 \,\right] = W^1 \left[\, x^1 \; x^2 \; x^3 \,\right]$$
(Many students worked on this last semester: https://ntumlds.wordpress.com/2017/03/26/r05922005_rrrr/)
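The batch computation can be sketched in NumPy as below; the layer sizes, the batch size of 3, and the `sigmoid` helper are assumptions for illustration, and only the first layer is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy sizes: 4 inputs, 5 hidden units, batch of 3 examples.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))
x1, x2, x3 = rng.standard_normal((3, 4))

# [z^1 z^2 z^3] = W^1 [x^1 x^2 x^3]: one matrix product handles the whole batch.
X = np.column_stack([x1, x2, x3])
Z = W1 @ X
A = sigmoid(Z)                      # a^i = sigmoid(z^i), column by column

# Same result as processing the examples one by one.
assert np.allclose(Z[:, 0], W1 @ x1)
```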
Batch normalization
For the batch above, compute the mean and standard deviation of the pre-activations element-wise over the batch:
$$\mu = \frac{1}{3} \sum_{i=1}^{3} z^i, \qquad \sigma = \sqrt{\frac{1}{3} \sum_{i=1}^{3} \left(z^i - \mu\right)^2}$$
$\mu$ and $\sigma$ depend on all the $z^i$ in the batch. Note: batch normalization cannot be applied to small batches, because statistics computed from only a few examples are too noisy.
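A sketch of these batch statistics; the hidden width of 5 and the batch of 3 are assumed toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 3))        # columns are z^1, z^2, z^3 for a batch of 3

# mu = (1/3) * sum_i z^i  and  sigma = sqrt((1/3) * sum_i (z^i - mu)^2), element-wise.
mu = Z.mean(axis=1, keepdims=True)
sigma = Z.std(axis=1, keepdims=True)   # population std: divides by 3, matching the slide

# With only 3 samples, mu and sigma are very noisy estimates, which is why
# batch normalization does not work well on small batches.
print(mu.ravel())
print(sigma.ravel())
```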
Batch normalization
Normalize each $z^i$ with the batch statistics,
$$\tilde{z}^i = \frac{z^i - \mu}{\sigma},$$
and then apply the activation, $a^i = \mathrm{sigmoid}(\tilde{z}^i)$.
How do we do backpropagation? $\mu$ and $\sigma$ depend on the $z^i$, so they are not constants: the gradient has to flow through $\mu$ and $\sigma$ as well.
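One way to see what backpropagating through $\mu$ and $\sigma$ involves is the standard batch-norm backward pass. This is a sketch without $\gamma$ and $\beta$; the small $\epsilon$ added inside the square root and the toy shapes are assumptions:

```python
import numpy as np

EPS = 1e-5

def bn_forward(Z):
    # Z: (hidden units, batch); columns are z^1 ... z^N.
    mu = Z.mean(axis=1, keepdims=True)
    sigma = np.sqrt(Z.var(axis=1, keepdims=True) + EPS)
    Z_tilde = (Z - mu) / sigma
    return Z_tilde, (Z_tilde, sigma)

def bn_backward(dZ_tilde, cache):
    # Because mu and sigma depend on every z^i, the gradient of each z^i picks up
    # two extra terms beyond the naive dZ_tilde / sigma.
    Z_tilde, sigma = cache
    return (dZ_tilde
            - dZ_tilde.mean(axis=1, keepdims=True)
            - Z_tilde * (dZ_tilde * Z_tilde).mean(axis=1, keepdims=True)) / sigma

# Finite-difference check on one entry: the analytic gradient matches only because
# mu and sigma were NOT treated as constants.
rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 8))
dOut = rng.standard_normal((4, 8))          # upstream gradient dL/dZ_tilde
Z_tilde, cache = bn_forward(Z)
dZ = bn_backward(dOut, cache)

h = 1e-5
Z_pert = Z.copy()
Z_pert[0, 0] += h
numeric = ((bn_forward(Z_pert)[0] - Z_tilde) * dOut).sum() / h
assert abs(numeric - dZ[0, 0]) < 1e-4
```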
Batch normalization
Forcing exactly zero mean and unit variance can be too restrictive, so batch normalization adds a learnable scale $\gamma$ and shift $\beta$ that let the network recover a different mean and variance if that is useful:
$$\tilde{z}^i = \frac{z^i - \mu}{\sigma}, \qquad \hat{z}^i = \gamma \odot \tilde{z}^i + \beta$$
$\mu$ and $\sigma$ depend on the $z^i$ in the batch; $\gamma$ and $\beta$ are network parameters learned together with the weights.
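Adding the learnable scale and shift to the previous sketch; the initial values $\gamma = 1$ and $\beta = 0$ are an assumption, though a common choice:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 3))                       # columns z^1, z^2, z^3

mu = Z.mean(axis=1, keepdims=True)
sigma = np.sqrt(Z.var(axis=1, keepdims=True) + 1e-5)
Z_tilde = (Z - mu) / sigma                            # zero mean, unit variance

# gamma and beta are learnable network parameters, one pair per hidden unit.
# They are trained by backpropagation together with the weights, and let the
# network move away from zero mean / unit variance if that helps.
gamma = np.ones((5, 1))
beta = np.zeros((5, 1))
Z_hat = gamma * Z_tilde + beta                        # elementwise scale and shift
```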
Batch normalization - At the testing stage
$$\tilde{z} = \frac{z - \mu}{\sigma}, \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$$
$\gamma$ and $\beta$ are network parameters and are available at test time, but $\mu$ and $\sigma$ come from a batch, and we do not have a batch at the testing stage.
Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.
Practical solution: compute a moving average of the $\mu$ and $\sigma$ of the batches seen during training, and use it at test time.
[Figure: accuracy vs. number of updates, with $\mu_1$, $\mu_{100}$, and $\mu_{300}$ marked along the training curve.]
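A sketch of the practical solution. The momentum value is an assumption, and real frameworks typically track the running variance rather than the standard deviation, but the idea is the same:

```python
import numpy as np

MOMENTUM = 0.99                     # assumed decay; later batches dominate the average

class BatchNorm1D:
    def __init__(self, units):
        self.gamma = np.ones((units, 1))
        self.beta = np.zeros((units, 1))
        self.running_mu = np.zeros((units, 1))
        self.running_sigma = np.ones((units, 1))

    def train_step(self, Z):
        """Training: use the batch's mu/sigma and update the moving averages."""
        mu = Z.mean(axis=1, keepdims=True)
        sigma = np.sqrt(Z.var(axis=1, keepdims=True) + 1e-5)
        self.running_mu = MOMENTUM * self.running_mu + (1 - MOMENTUM) * mu
        self.running_sigma = MOMENTUM * self.running_sigma + (1 - MOMENTUM) * sigma
        return self.gamma * (Z - mu) / sigma + self.beta

    def test_step(self, z):
        """Testing: no batch is available, so reuse the accumulated averages."""
        return self.gamma * (z - self.running_mu) / self.running_sigma + self.beta
```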
Batch normalization - Benefits
- BN reduces training time and makes very deep networks trainable. Because there is less (Internal) Covariate Shift, we can use larger learning rates.
- Less exploding/vanishing gradient, which is especially helpful for saturating activations such as sigmoid and tanh.
- Learning is less affected by initialization: if $W^1$ is multiplied by $k$, then $z^i$, $\mu$, and $\sigma$ are all multiplied by $k$, so the normalized $\tilde{z}^i = (z^i - \mu)/\sigma$ is kept unchanged.
- BN reduces the demand for regularization: the means and variances are computed over batches, so every normalized value depends on the current batch, and the network can no longer simply memorize individual values and their correct answers.
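The initialization point can be checked directly: scaling $W^1$ by $k$ scales $z^i$, $\mu$, and $\sigma$ by the same $k$, so the normalized output is unchanged. The toy shapes and $k = 10$ below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))
X = rng.standard_normal((4, 3))            # a batch of 3 examples as columns
k = 10.0                                   # e.g. an unluckily large initialization

def normalize(Z):
    mu = Z.mean(axis=1, keepdims=True)
    sigma = Z.std(axis=1, keepdims=True)
    return (Z - mu) / sigma

# z^i, mu and sigma are all multiplied by k, so the normalized activations match.
assert np.allclose(normalize(W1 @ X), normalize((k * W1) @ X))
```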