Chap. 7 Regularization for Deep Learning (7.8~7.12)


Chap. 7 Regularization for Deep Learning (7.8~7.12)
16.11.07, Electrical & Computer Engineering, Parallel Software Design Lab., Taekhee Lee

Contents
7.8 Early Stopping
7.9 Parameter Tying and Parameter Sharing
7.10 Sparse Representations
7.11 Bagging and Other Ensemble Methods
7.12 Dropout

7.8 Early Stopping
When we train a model large enough to overfit the task, the training error keeps decreasing while the validation error eventually starts to rise again. We would therefore like to return to the parameter setting at the point in time with the lowest validation set error (a sketch of this loop follows below):
- Store a copy of the model parameters every time the error on the validation set improves.
- When the training algorithm terminates, return these parameters rather than the latest ones.
- The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations.
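A minimal sketch of this procedure, assuming a hypothetical `model` object with `train_one_epoch`, `validation_error`, `get_params`, and `set_params` methods standing in for whatever training framework is used; `patience` is the pre-specified number of iterations without improvement.

```python
import copy

def train_with_early_stopping(model, train_data, val_data, patience=10, max_epochs=1000):
    """Return the epoch and error of the best parameters seen on the validation set."""
    best_error = float("inf")
    best_params = copy.deepcopy(model.get_params())  # stored copy of the best parameters
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        model.train_one_epoch(train_data)          # one pass of the underlying optimizer
        error = model.validation_error(val_data)   # periodic validation-set evaluation

        if error < best_error:                     # validation error improved:
            best_error = error                     #   record it and copy the parameters
            best_params = copy.deepcopy(model.get_params())
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # no improvement for `patience` epochs

    model.set_params(best_params)                  # return the best, not the latest, parameters
    return best_epoch, best_error
```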

7.8 Early Stopping
This strategy is known as early stopping. It acts as a hyperparameter selection algorithm: the number of training steps is treated as a hyperparameter, and early stopping controls the effective capacity of the model by determining how many steps it may take to fit the training set.

7.8 Early Stopping
The cost of tuning the "training time" hyperparameter:
- The validation set must be evaluated periodically during training. Ideally this evaluation is done in parallel on a separate machine (CPU or GPU). With no extra resources available, use a small validation set or evaluate the validation set error less frequently.
- A copy of the best parameters must be maintained. This cost is generally negligible, since the best parameters are written infrequently and never read during training (e.g., train in GPU memory and store the copy in host memory or on disk).
Even so, early stopping is a very unobtrusive form of regularization: it is easy to use without damaging the learning dynamics, in contrast to weight decay.

7.8 Early Stopping
Early stopping requires a validation set, so some training data is never fed to the model. To exploit this extra data, we can perform extra training after the initial training with early stopping has completed. In this second, extra training step, all of the training data is included. There are two basic strategies for this second training procedure (sketched below):
[1] Initialize the model again and retrain on all the data. First train the model with early stopping (the training data is divided into a subtraining set and a validation set). In the second training pass, train for the same number of steps as early stopping determined in the first pass.
[2] Keep the parameters obtained from the first pass and continue training on all the data. First train the model with early stopping (the training data is divided into a subtraining set and a validation set). In the second training pass, do not re-initialize the model; continue training it using all the data.
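A sketch of the two strategies, reusing the hypothetical `train_with_early_stopping` helper above and assuming the datasets are plain Python lists of examples; `make_model` and `extra_epochs` are placeholders for whatever model constructor and second-pass budget are actually used.

```python
def retrain_strategy_1(make_model, subtrain, val):
    """[1] Re-initialize, then retrain on all data for the step budget found by early stopping."""
    model = make_model()
    best_epoch, _ = train_with_early_stopping(model, subtrain, val)
    fresh = make_model()                    # second pass: start again from scratch
    all_data = subtrain + val               # include the former validation data
    for _ in range(best_epoch + 1):         # train for the same number of epochs as the first pass chose
        fresh.train_one_epoch(all_data)
    return fresh

def retrain_strategy_2(make_model, subtrain, val, extra_epochs=10):
    """[2] Keep the early-stopped parameters and continue training on all data."""
    model = make_model()
    train_with_early_stopping(model, subtrain, val)
    all_data = subtrain + val
    for _ in range(extra_epochs):           # no re-initialization; just continue training
        model.train_one_epoch(all_data)
    return model
```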

7.9 Parameter Tying and Parameter Sharing
So far in this chapter, we have discussed adding constraints or penalties on the parameters, e.g. $L^2$ regularization penalizes model parameters for deviating from the fixed value of zero. Sometimes we know, from the domain and the model architecture, that there should be dependencies between the model parameters. For example, if two models A and B perform similar tasks, their parameters should be close to each other: $w_i^{(A)}$ should be close to $w_i^{(B)}$. We can express this with a parameter norm penalty of the form $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$, although other choices are also possible.
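A minimal numpy sketch of this tying penalty and its gradient with respect to $w^{(A)}$; the weight vectors and the value of `alpha` below are made-up placeholders.

```python
import numpy as np

def tying_penalty(w_a, w_b):
    """Omega(w_A, w_B) = ||w_A - w_B||_2^2 : penalizes the two parameter sets for differing."""
    diff = w_a - w_b
    return np.sum(diff ** 2)

def tying_penalty_grad_wrt_a(w_a, w_b):
    """Gradient of the penalty with respect to w_A (pulls w_A toward w_B)."""
    return 2.0 * (w_a - w_b)

# Example: parameters of two models trained on similar tasks, regularized to stay close.
w_a = np.array([0.5, -1.2, 0.30])
w_b = np.array([0.4, -1.0, 0.35])
alpha = 0.01
print(tying_penalty(w_a, w_b))                      # value added to the loss as alpha * Omega
print(alpha * tying_penalty_grad_wrt_a(w_a, w_b))   # contribution to the gradient of w_A
```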

7.9 Parameter Tying and Parameter Sharing
While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints that force sets of parameters to be equal. This is called parameter sharing. Only a subset of the parameters (the unique set) needs to be stored in memory, which gives a significant reduction in the memory footprint of models such as convolutional neural networks. In some cases the parameter sharing scheme should be relaxed, for instance when we expect completely different features to be learned at different spatial locations.
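A small numpy illustration of parameter sharing in the convolutional case mentioned above: a single 3-element kernel is reused at every spatial position, so only 3 weights are stored instead of one weight per (input, output) pair as in an unshared dense map. The sizes here are arbitrary.

```python
import numpy as np

x = np.random.randn(100)        # 1-D input signal (arbitrary length)
kernel = np.random.randn(3)     # the shared parameters: one 3-tap kernel

# The same 3 weights are applied at every position of the input (parameter sharing).
y = np.convolve(x, kernel, mode="valid")    # output has length 98

shared_params = kernel.size                 # 3 stored parameters
dense_params = x.size * y.size              # 100 * 98 = 9800 for an unshared linear map
print(shared_params, dense_params)
```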

7.10 Sparse Representations
Weight decay places a penalty directly on the model parameters. Another strategy is to place a penalty on the activations of the units in a neural network. An $L^1$ penalty on the parameters induces a sparse parametrization: many of the parameters become zero (or close to zero). Representational sparsity is different: many of the elements of the representation (the activations) become zero (or close to zero).

7.10 Sparse Representations
Representational regularization uses the same sorts of mechanisms as parameter regularization. Norm penalty regularization of representations is performed by adding to the loss function $J$ a norm penalty on the representation, with hyperparameter $\alpha \in [0, \infty)$:
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(\theta)$ : parameter regularization
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(h)$ : representational regularization
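A minimal numpy sketch of the representational penalty with $\Omega(h) = \|h\|_1$ added to the data loss; the toy network, its hidden representation `h`, the value of `alpha`, and the placeholder data loss are all assumptions for illustration.

```python
import numpy as np

def l1_representation_penalty(h, alpha=1e-3):
    """alpha * ||h||_1 : encourages many hidden activations to be (close to) zero."""
    return alpha * np.sum(np.abs(h))

# Toy forward pass: the hidden representation h is what gets penalized, not the weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))          # parameters (weight decay would penalize these instead)
x = rng.standard_normal(4)
h = np.maximum(0.0, x @ W)               # ReLU hidden representation
data_loss = 0.42                         # placeholder value for J(theta; X, y)

total_loss = data_loss + l1_representation_penalty(h)   # J~ = J + alpha * Omega(h)
print(total_loss)
```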

7.11 Bagging and Other Ensemble Methods
Bagging (bootstrap aggregating) is a technique for reducing generalization error by combining several models: train several different models separately, then have all of the models vote on the output for each test example. It is an example of a general strategy in machine learning called model averaging; techniques employing this strategy are known as ensemble methods. Why does it work? Different models will usually not make all the same errors on the test set.

7.11 Bagging and Other Ensemble Methods
Consider a set of $k$ regression models, where each model makes an error $\epsilon_i$ drawn from a zero-mean multivariate normal distribution with variances $E[\epsilon_i^2] = v$ and covariances $E[\epsilon_i \epsilon_j] = c$. The error made by the average prediction of the ensemble is $\frac{1}{k}\sum_i \epsilon_i$, so the expected squared error of the ensemble predictor is
$E\left[\left(\frac{1}{k}\sum_i \epsilon_i\right)^2\right] = \frac{1}{k^2} E\left[\sum_i \left(\epsilon_i^2 + \sum_{j \neq i}\epsilon_i \epsilon_j\right)\right] = \frac{1}{k}v + \frac{k-1}{k}c.$
If the errors are perfectly correlated and $c = v$, the mean squared error reduces to $v$, so model averaging does not help. If the errors are uncorrelated and $c = 0$, the expected squared error of the ensemble is only $\frac{1}{k}v$: it decreases linearly with the ensemble size.
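A quick numerical check of this result under assumed values of $v$, $c$, and $k$: errors are drawn from a zero-mean Gaussian with the stated covariance structure, and the empirical mean squared error of the ensemble average is compared with $\frac{1}{k}v + \frac{k-1}{k}c$.

```python
import numpy as np

k, v, c = 10, 1.0, 0.3          # ensemble size, error variance, error covariance (assumed values)
n_trials = 200_000

# Covariance matrix with v on the diagonal and c off the diagonal.
cov = np.full((k, k), c) + np.eye(k) * (v - c)
rng = np.random.default_rng(0)
eps = rng.multivariate_normal(mean=np.zeros(k), cov=cov, size=n_trials)

ensemble_error = eps.mean(axis=1)                 # (1/k) * sum_i eps_i, per trial
empirical_mse = np.mean(ensemble_error ** 2)
predicted_mse = v / k + (k - 1) / k * c           # (1/k) v + ((k-1)/k) c

print(empirical_mse, predicted_mse)               # the two values should closely agree
```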

7.11 Bagging and Other Ensemble Methods
Different ensemble methods construct the ensemble of models in different ways. Bagging is a method that allows the same kind of model, training algorithm, and objective function to be reused several times: construct $k$ different datasets, each with the same number of examples as the original dataset, by sampling with replacement from the original dataset.
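A minimal sketch of constructing the $k$ bootstrap datasets: each has the same number of examples as the original and is drawn by sampling indices with replacement. The arrays `X` and `y` below are toy placeholders.

```python
import numpy as np

def make_bootstrap_datasets(X, y, k, seed=0):
    """Return k datasets of the original size, each sampled with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    datasets = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)    # sample n indices with replacement
        datasets.append((X[idx], y[idx]))   # some examples repeat, others are left out
    return datasets

# Toy usage: 5 bootstrap replicates of a 100-example dataset.
X = np.random.randn(100, 3)
y = np.random.randn(100)
replicates = make_bootstrap_datasets(X, y, k=5)
print(len(replicates), replicates[0][0].shape)      # 5 datasets, each 100 x 3
```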

7.12 Dropout
Bagging involves training multiple models and evaluating multiple models on each test example. This becomes impractical when each model is a large neural network: it is common to use ensembles of five to ten neural networks, but more than this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.

7.12 Dropout
Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from the original base network. With wider layers, the probability of dropping all possible paths from the inputs to the outputs becomes smaller.

7.12 Dropout
To train with dropout, we sample a vector $\mu$ with one binary (0 or 1) entry for each input or hidden unit in the network; the entries of $\mu$ are sampled independently of each other. The probability of an entry being 1 (the unit being kept) is a hyperparameter, typically 0.5 for hidden units and 0.8 for input units.
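A minimal numpy sketch of sampling the mask $\mu$ during one forward pass, using the keep probabilities quoted above (0.8 for inputs, 0.5 for hidden units); the two-layer network and its weights are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)                 # input activations
W1 = rng.standard_normal((20, 50))          # input -> hidden weights
W2 = rng.standard_normal((50, 10))          # hidden -> output weights

p_input, p_hidden = 0.8, 0.5                # keep probabilities (entries of mu are 1 with this prob.)

mu_in = rng.binomial(1, p_input, size=x.shape)      # mask for the input units
h = np.maximum(0.0, (x * mu_in) @ W1)               # dropped inputs contribute nothing
mu_h = rng.binomial(1, p_hidden, size=h.shape)      # mask for the hidden units
out = (h * mu_h) @ W2                               # output of one sampled sub-network
print(out.shape)
```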

7.12 Dropout
In the case of bagging, the models are all independent. In the case of dropout, the models share parameters, so a neural network with $n$ units can be seen as a collection of $2^n$ thinned networks with extensive weight sharing.

7.12 Dropout
To make a prediction, a bagged ensemble must accumulate votes from all of its members; this step is known as inference. In the case of bagging, each model $i$ produces a probability distribution $p_i(y \mid x)$, and the prediction of the ensemble is the arithmetic mean of all of these distributions, $\frac{1}{k}\sum_{i=1}^{k} p_i(y \mid x)$. In the case of dropout, each sub-model is defined by a mask vector $\mu$ and produces $p(y \mid x, \mu)$, and the arithmetic mean over all masks is $\sum_{\mu} p(\mu)\, p(y \mid x, \mu)$. This sum includes an exponential number of terms, so it is not feasible to explicitly average the predictions.

7.12 Dropout
A very simple approximate averaging method works well in practice: use a single neural network at test time, without dropout, whose weights are scaled-down versions of the trained weights. If a unit is retained with probability $p$ during training, the outgoing weights of that unit are multiplied by $p$ at test time. This ensures that, for any hidden unit, the expected output under the distribution used to drop units at training time is the same as the actual output at test time.
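A sketch of this weight-scaling rule, continuing the toy two-layer network from the dropout example above (same placeholder weights and keep probabilities): outgoing weights of each unit are multiplied by that unit's keep probability, and no masks are sampled at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)
W1 = rng.standard_normal((20, 50))          # outgoing weights of the input units
W2 = rng.standard_normal((50, 10))          # outgoing weights of the hidden units
p_input, p_hidden = 0.8, 0.5                # keep probabilities used during training

# Test time: a single un-dropped network with scaled-down weights.
W1_test = W1 * p_input                      # input units were kept with prob. 0.8
W2_test = W2 * p_hidden                     # hidden units were kept with prob. 0.5

h = np.maximum(0.0, x @ W1_test)            # no mask is sampled at test time
out = h @ W2_test                           # approximates the average over all sub-networks
print(out.shape)
```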