Chap. 7 Regularization for Deep Learning (7.8~7.12)
16.11.07
Electrical & Computer Engineering, Parallel Software Design Lab.
Taekhee Lee
List
- 7.8 Early Stopping
- 7.9 Parameter Tying and Parameter Sharing
- 7.10 Sparse Representations
- 7.11 Bagging and Other Ensemble Methods
- 7.12 Dropout
7.8 Early Stopping
- When training a large model with enough capacity to overfit the task, training error typically keeps decreasing over time while validation error begins to rise again.
- A better model is obtained by returning to the parameter setting at the point in time with the lowest validation set error.
  - Store a copy of the model parameters every time the error on the validation set improves.
  - When the training algorithm terminates, return these parameters rather than the latest ones.
- The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations.
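The procedure on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the book's exact algorithm: `train_step`, `validation_error`, `patience`, `eval_every`, and `max_steps` are hypothetical names, and the toy usage replaces real training with a scripted validation curve.

```python
import copy

def early_stopping(params, train_step, validation_error,
                   patience=3, eval_every=1, max_steps=1000):
    """Train until validation error has not improved for `patience` evaluations.

    train_step(params) -> updated params      (hypothetical callback)
    validation_error(params) -> float         (hypothetical callback)
    Returns (best_params, best_step, best_error).
    """
    best_params = copy.deepcopy(params)
    best_error = float("inf")
    best_step = 0
    bad_evals = 0  # evaluations since the last improvement
    step = 0
    while bad_evals < patience and step < max_steps:
        for _ in range(eval_every):
            params = train_step(params)
            step += 1
        err = validation_error(params)
        if err < best_error:
            # Validation error improved: store a copy of the parameters.
            best_error = err
            best_params = copy.deepcopy(params)
            best_step = step
            bad_evals = 0
        else:
            bad_evals += 1
    # Return the best recorded parameters rather than the latest ones.
    return best_params, best_step, best_error

# Toy usage: the "model" is just a step counter, and the validation
# error follows a scripted U-shaped curve (an assumption for illustration).
curve = [5.0, 4.0, 3.0, 2.5, 2.7, 3.0, 3.5, 4.0]

def toy_step(p):
    return p + 1

def toy_validation_error(p):
    return curve[min(p, len(curve) - 1)]

best_params, best_step, best_error = early_stopping(
    0, toy_step, toy_validation_error, patience=2)
```

With `patience=2`, the loop stops two evaluations after the minimum of the scripted curve and returns the parameters recorded at that minimum.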
7.8 Early Stopping
- This strategy is known as early stopping.
- It can be viewed as a hyperparameter selection algorithm, where the number of training steps is the hyperparameter.
- It controls the effective capacity of the model by determining how many steps it can take to fit the training set.
7.8 Early Stopping
- Costs of choosing the “training time” hyperparameter this way:
  - The validation set must be evaluated periodically during training.
    - Ideally, evaluation is done in parallel with training on a separate machine, CPU, or GPU.
    - If no such resources are available, use a small validation set or evaluate the validation set error less frequently.
  - A copy of the best parameters must be maintained.
    - This cost is generally negligible, since the best parameters are written infrequently and never read during training.
    - ex) Train in GPU memory and store the best parameters in host memory or on a disk.
- Even so, early stopping is a very unobtrusive form of regularization.
  - It is easy to use without damaging the learning dynamics, in contrast to weight decay.
7.8 Early Stopping
- Early stopping requires a validation set, so some training data is not fed to the model.
- To exploit this extra data, perform extra training after the initial training with early stopping has completed.
  - In this second training step, all of the training data is included.
- There are two basic strategies for this second training procedure.
  - [1] Initialize the model again and retrain on all the data.
    - First train the model with early stopping (the training data is divided into subtrain and validation data).
    - In the second pass, train for the same number of steps that early stopping determined in the first pass.
  - [2] Keep the parameters obtained from the first pass and continue training on all the data.
    - First train the model with early stopping (the training data is divided into subtrain and validation data).
    - In the second pass, do not reinitialize the model; continue training using all the data.
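The two strategies can be sketched on a toy one-parameter model. Everything here is an assumption for illustration: the model `y = w * x`, the data, the learning rate, the value `best_step = 5` standing in for the result of the first early-stopping pass, and `extra_steps`, since the slide does not say how long the continued training in strategy [2] should run.

```python
def gd_steps(w, data, n_steps, lr=0.05):
    """Run n_steps of full-batch gradient descent on the toy model y = w * x."""
    for _ in range(n_steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w = w - lr * grad
    return w

def mse(w, data):
    """Mean squared error of the toy model on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Toy data (made up for illustration), split into subtrain and validation.
all_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
subtrain, valid = all_data[:3], all_data[3:]  # valid would drive early stopping

# Assumed outcome of the first pass: early stopping chose 5 steps.
best_step = 5
best_w = gd_steps(0.0, subtrain, best_step)

# Strategy [1]: reinitialize and retrain on all data for the same number of steps.
w_retrained = gd_steps(0.0, all_data, best_step)

# Strategy [2]: keep the first-pass parameters and continue on all data.
extra_steps = 5  # hypothetical; a stopping rule for this pass must be chosen
w_continued = gd_steps(best_w, all_data, extra_steps)
```

Strategy [1] reuses only the step count from the first pass, while strategy [2] reuses the parameters themselves and therefore needs its own stopping rule.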
7.9 Parameter Tying and Parameter Sharing
- So far in this chapter, we discussed adding constraints or penalties to the parameters with respect to a fixed region or point.
  - Ex) $L^2$ regularization penalizes model parameters for deviating from the fixed value of zero.
- Sometimes we know, from the domain and the model architecture, that there should be some dependencies between the model parameters.
- If two models A and B perform similar tasks, their parameters should be close to each other.
  - $w_i^{(A)}$ should be close to $w_i^{(B)}$.
- We can use a parameter norm penalty of the form:
  $\Omega(w^{(A)}, w^{(B)}) = \lVert w^{(A)} - w^{(B)} \rVert_2^2$
  - Other choices are also possible.
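As a small worked example, the penalty above and its gradient with respect to $w^{(A)}$ can be computed directly. This is a sketch with hypothetical function names; the parameter vectors are plain Python lists chosen for illustration.

```python
def tying_penalty(w_a, w_b):
    """Squared L2 distance between two parameter vectors of equal length."""
    if len(w_a) != len(w_b):
        raise ValueError("parameter vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(w_a, w_b))

def tying_penalty_grad_a(w_a, w_b):
    """Gradient of the penalty with respect to w_a: 2 * (w_a - w_b)."""
    return [2.0 * (a - b) for a, b in zip(w_a, w_b)]

w_a = [1.0, 2.0]
w_b = [1.0, 4.0]
penalty = tying_penalty(w_a, w_b)        # (1-1)^2 + (2-4)^2
grad_a = tying_penalty_grad_a(w_a, w_b)  # pulls w_a toward w_b
```

The gradient is zero in coordinates where the two models already agree, so the penalty only acts on parameters that differ.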
7.9 Parameter Tying and Parameter Sharing
- A parameter norm penalty is one way to regularize parameters to be close to one another, but the more popular way is to use constraints that force sets of parameters to be equal.
  - This is called parameter sharing.
- Only a subset of the parameters (the unique set) needs to be stored in memory.
  - In some models, such as convolutional neural networks, this significantly reduces the memory footprint of the model.
- In some cases, the parameter sharing scheme should be relaxed.
  - For example, when we expect completely different features to be learned at different spatial locations.
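The memory saving from parameter sharing can be illustrated with a one-dimensional convolution, where the same small kernel is reused at every position. This is a minimal sketch; the input, the kernel, and the comparison against a fully connected layer of the same input and output size are assumptions chosen for illustration.

```python
def conv1d_valid(x, kernel):
    """Slide one shared kernel over x (cross-correlation, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]  # the only parameters stored

y = conv1d_valid(x, kernel)

shared_params = len(kernel)     # one kernel reused at every position
dense_params = len(x) * len(y)  # a dense layer mapping x to y without sharing
```

Here every output position uses the same three weights, whereas an unshared dense layer between the same input and output would need one weight per input-output pair.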
7.10 Sparse Representations
- Weight decay places a penalty directly on the model parameters.
- Another strategy is to place a penalty on the activations of the units in a neural network.
- $L^1$ penalization of the parameters induces a sparse parametrization.
  - Many of the parameters become zero (or close to zero).
- Representational sparsity is different.
  - Many of the elements of the representation become zero (or close to zero).
7.10 Sparse Representations
- Representational regularization is accomplished by the same sorts of mechanisms used in parameter regularization.
- Norm penalty regularization of representations is performed by adding to the loss function $J$ a norm penalty on the representation $h$ ($\alpha$ is a hyperparameter, $\alpha \in [0, \infty)$).
  - Parameter regularization: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$
  - Representational regularization: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(h)$
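The difference between the two objectives can be made concrete with an $L^1$ penalty on a tiny ReLU layer. This is a sketch under assumptions: the layer, the weights, the input, the task loss value of 1.0, and $\alpha = 0.5$ are all made up for illustration.

```python
def relu(z):
    return max(0.0, z)

def hidden(x, W, b):
    """Representation h = relu(W x + b) for one input vector x."""
    return [relu(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def l1(values):
    return sum(abs(v) for v in values)

def parameter_regularized(task_loss, W, alpha):
    """J + alpha * Omega(theta), with Omega the L1 norm of the weights."""
    return task_loss + alpha * sum(l1(row) for row in W)

def representation_regularized(task_loss, h, alpha):
    """J + alpha * Omega(h), with Omega the L1 norm of the activations."""
    return task_loss + alpha * l1(h)

x = [1.0, -2.0]
W = [[1.0, 0.5], [0.5, 1.0], [2.0, 0.0]]
b = [0.0, 0.0, 0.0]

h = hidden(x, W, b)  # two of the three units are inactive for this input
j_param = parameter_regularized(1.0, W, alpha=0.5)
j_repr = representation_regularized(1.0, h, alpha=0.5)
```

The weight matrix here has only one zero entry, yet the representation of this particular input is mostly zero, which is the distinction between a sparse parametrization and a sparse representation.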
7.11 Bagging and Other Ensemble Methods
- Bagging (bootstrap aggregating) is a technique for reducing generalization error by combining several models.
  - Train several different models separately.
  - Then have all of the models vote on the output for test examples.
- It is an example of a general strategy in machine learning called model averaging.
  - Techniques employing this strategy are known as ensemble methods.
- Why does it work?
  - Different models will usually not make all the same errors on the test set.
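The voting step can be sketched with three hand-written prediction lists. The labels and predictions below are made up for illustration; each hypothetical model is wrong on a different test example, so the majority vote recovers every true label.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common prediction among the ensemble members."""
    return Counter(votes).most_common(1)[0][0]

true_labels = [0, 1, 1, 0, 1]

# Predictions of three hypothetical models on five test examples.
model_outputs = [
    [0, 1, 1, 0, 0],  # wrong on example 4
    [1, 1, 1, 0, 1],  # wrong on example 0
    [0, 1, 0, 0, 1],  # wrong on example 2
]

# zip(*...) groups the three votes for each test example.
ensemble = [majority_vote(votes) for votes in zip(*model_outputs)]
```

Each individual model has 80% accuracy on this toy set, while the ensemble is correct on all five examples because no two models err on the same example.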
7.11 Bagging and Other Ensemble Methods
- Consider a set of $k$ regression models.
  - Each model makes an error $\epsilon_i$ on each example.
  - The errors are drawn from a zero-mean multivariate normal distribution with variances $E[\epsilon_i^2] = v$ and covariances $E[\epsilon_i \epsilon_j] = c$.
- The error made by the average prediction of all the ensemble models is $\frac{1}{k} \sum_i \epsilon_i$.
- The expected squared error of the ensemble predictor is
  $E\left[\left(\frac{1}{k} \sum_i \epsilon_i\right)^2\right] = \frac{1}{k^2} E\left[\sum_i \left(\epsilon_i^2 + \sum_{j \neq i} \epsilon_i \epsilon_j\right)\right] = \frac{1}{k} v + \frac{k-1}{k} c$
- If the errors are perfectly correlated ($c = v$), the mean squared error reduces to $v$, so model averaging does not help.
- If the errors are perfectly uncorrelated ($c = 0$), the mean squared error of the ensemble is only $\frac{1}{k} v$.
  - The expected squared error of the ensemble is then inversely proportional to the ensemble size $k$.
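The closed form can be checked numerically by summing the $k \times k$ covariance matrix of the errors directly. This is a small sketch; the function names and the values of $k$, $v$, and $c$ are chosen for illustration.

```python
def ensemble_mse_formula(k, v, c):
    """Closed form: v / k + (k - 1) * c / k."""
    return v / k + (k - 1) * c / k

def ensemble_mse_from_covariance(k, v, c):
    """(1 / k^2) * sum over i, j of E[eps_i * eps_j]."""
    total = 0.0
    for i in range(k):
        for j in range(k):
            total += v if i == j else c  # variance on the diagonal
    return total / k ** 2

k, v = 5, 2.0
partially_correlated = ensemble_mse_formula(k, v, c=0.5)
perfectly_correlated = ensemble_mse_formula(k, v, c=v)  # no gain over one model
uncorrelated = ensemble_mse_formula(k, v, c=0.0)        # v / k
```

The double sum has $k$ diagonal terms equal to $v$ and $k(k-1)$ off-diagonal terms equal to $c$, which is where the two terms of the closed form come from.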
7.11 Bagging and Other Ensemble Methods
- Different ensemble methods construct the ensemble of models in different ways.
- Bagging is a method that allows the same kind of model, training algorithm, and objective function to be reused several times.
  - Construct $k$ different datasets.
  - Each dataset has the same number of examples as the original dataset.
  - Each dataset is constructed by sampling with replacement from the original dataset.
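The dataset construction step can be sketched with the standard library. The function name `bootstrap_datasets`, the `seed` argument, and the toy dataset are assumptions for illustration.

```python
import random

def bootstrap_datasets(dataset, k, seed=0):
    """Build k datasets, each sampled with replacement from `dataset`
    and each with the same number of examples as the original."""
    rng = random.Random(seed)
    n = len(dataset)
    return [[dataset[rng.randrange(n)] for _ in range(n)]
            for _ in range(k)]

original = ["a", "b", "c", "d", "e", "f"]
resampled = bootstrap_datasets(original, k=3)
```

Because sampling is with replacement, each resampled dataset will typically repeat some examples and miss others, which is what makes the models trained on them differ.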
7.12 Dropout
- Bagging involves training multiple models and evaluating multiple models on each test example.
  - This is impractical when each model is a large neural network.
  - It is common to use ensembles of five to ten neural networks, but more than this rapidly becomes unwieldy.
- Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
7.12 Dropout
- Dropout trains an ensemble consisting of all sub-networks of a base network.
  - Sub-networks are formed by dropping out different subsets of the non-output units (input and hidden units); output units are never dropped.
- With wider layers, the probability of dropping all possible paths from inputs to outputs becomes smaller.
7.12 Dropout
- A mask vector $\mu$ has one entry for each input or hidden unit in the network.
- The entries of $\mu$ are binary (0 or 1) and are sampled independently of each other.
- The probability that an entry is 1 (the unit is kept) is usually 0.5 for hidden units and 0.8 for input units.
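Mask sampling can be sketched as follows. This is a minimal sketch: the function names, the layer sizes, and the way the mask is split into an input part and a hidden part are assumptions; the keep probabilities 0.8 and 0.5 are the usual values quoted on the slide.

```python
import random

def sample_mask(n_input, n_hidden, p_input=0.8, p_hidden=0.5, rng=random):
    """Sample one binary mask entry per input unit and per hidden unit.
    An entry of 1 keeps the unit; an entry of 0 drops it."""
    mask_input = [1 if rng.random() < p_input else 0 for _ in range(n_input)]
    mask_hidden = [1 if rng.random() < p_hidden else 0 for _ in range(n_hidden)]
    return mask_input, mask_hidden

def apply_mask(values, mask):
    """Multiply each unit's value by its mask entry."""
    return [v * m for v, m in zip(values, mask)]

rng = random.Random(0)
mask_input, mask_hidden = sample_mask(n_input=4, n_hidden=6, rng=rng)
masked_x = apply_mask([0.3, -1.2, 0.7, 2.0], mask_input)
```

A fresh mask is sampled for each training example or minibatch, so each update trains a different thinned sub-network.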
7.12 Dropout
- In the case of bagging, the models are all independent.
- In the case of dropout, the models share parameters.
- So a neural network with $n$ units can be seen as a collection of $2^n$ thinned networks with extensive weight sharing.
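7.12 Dropout
- To make a prediction, a bagged ensemble must accumulate votes from all of its members. In this context the process is referred to as inference.
- In the case of bagging, each model $i$ produces a probability distribution $p^{(i)}(y \mid x)$.
  - The ensemble prediction is the arithmetic mean of all of these distributions: $\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x)$
- In the case of dropout, each sub-model defined by a mask vector $\mu$ produces a distribution $p(y \mid x, \mu)$.
  - The arithmetic mean over all masks is $\sum_{\mu} p(\mu)\, p(y \mid x, \mu)$, where $p(\mu)$ is the distribution used to sample $\mu$ at training time.
- This sum includes an exponential number of terms, so it is not feasible to average the predictions explicitly.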
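For a very small model the sum over masks can still be written out, which makes the exponential cost visible. The sketch below is an illustration under assumptions: the model is a softmax regression with dropout applied to its three inputs, and the weights, input, and keep probability are made up.

```python
import itertools
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(x, W, b, mask):
    """p(y | x, mu) for a softmax regression with masked inputs."""
    masked = [xi * mi for xi, mi in zip(x, mask)]
    logits = [sum(w * xj for w, xj in zip(row, masked)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

def mask_probability(mask, p_keep):
    """p(mu) for independently sampled binary entries."""
    prob = 1.0
    for m in mask:
        prob *= p_keep if m else 1.0 - p_keep
    return prob

def ensemble_predict(x, W, b, p_keep):
    """Arithmetic mean over all 2^n masks: sum_mu p(mu) p(y | x, mu)."""
    avg = [0.0] * len(b)
    n_terms = 0
    for mask in itertools.product([0, 1], repeat=len(x)):
        p_mu = mask_probability(mask, p_keep)
        p_y = predict(x, W, b, mask)
        avg = [a + p_mu * q for a, q in zip(avg, p_y)]
        n_terms += 1
    return avg, n_terms

x = [1.0, -0.5, 2.0]
W = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
b = [0.0, 0.1]
avg, n_terms = ensemble_predict(x, W, b, p_keep=0.8)
```

With three maskable units the loop already visits eight sub-models; the count doubles with every additional unit, which is why this explicit average is only workable for toy sizes.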
7.12 Dropout
- A very simple approximate averaging method works well in practice.
- The idea is to use a single neural network at test time, without dropout.
- The weights of this network are scaled-down versions of the trained weights.
  - If a unit is retained with probability $p$ during training, the outgoing weights of that unit are multiplied by $p$ at test time.
- This ensures that, for any hidden unit, the expected output under the mask distribution used during training is the same as the actual output at test time.
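For a single linear unit the scaling rule can be verified exactly by enumerating all masks. This is a sketch with made-up weights and inputs; the equality shown holds for this linear case, while for deep networks with nonlinearities the rule is only an approximation.

```python
import itertools

def masked_output(x, w, mask):
    """Output of one linear unit when its inputs are masked."""
    return sum(wi * mi * xi for wi, mi, xi in zip(w, mask, x))

def expected_output_over_masks(x, w, p_keep):
    """Average the masked output over all 2^n masks, weighted by p(mu)."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(x)):
        p_mu = 1.0
        for m in mask:
            p_mu *= p_keep if m else 1.0 - p_keep
        total += p_mu * masked_output(x, w, mask)
    return total

def weight_scaled_output(x, w, p_keep):
    """Test-time output: no dropout, weights multiplied by p_keep."""
    return sum(p_keep * wi * xi for wi, xi in zip(w, x))

x = [1.0, 2.0, -3.0]
w = [0.5, -1.0, 2.0]
p_keep = 0.5

expected = expected_output_over_masks(x, w, p_keep)
scaled = weight_scaled_output(x, w, p_keep)
```

Both quantities equal `p_keep` times the unmasked output, since each mask entry has expectation `p_keep` and the unit is linear in its inputs.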