1
CS294-129: Designing, Visualizing and Understanding Deep Neural Networks
John Canny, Fall 2016. Lecture 8: Training Neural Networks II. Based on notes from Andrej Karpathy, Fei-Fei Li, and Justin Johnson.
2
Last time: Convnets, with Alyosha Efros
3
This lecture: wrap up of training deep networks
Last Wednesday:
- Activation Functions
- Data Preprocessing
- Weight Initialization
- Batch Normalization
- Babysitting the Learning Process
4
This lecture: wrap up of training deep networks
- Hyperparameter optimization
- Ensembles
- Dropout
- One-bit gradients
- Gradient noise
5
Admin
- Project team formation and proposal due Monday 9/26. Please create a bCourses student group for your team.
- Assignment 2 is out, due Friday 9/30.
- John Canny office hours 2:30-3:30 today.
6
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get the loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradients
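A minimal, self-contained sketch of this loop (a toy linear-regression model and synthetic data, purely illustrative; the lecture does not prescribe this particular model):

```python
import numpy as np

# Toy data and a linear model, just to make the four steps concrete.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=(1000,))
W, b = np.zeros(20), 0.0
learning_rate, batch_size = 1e-2, 64

for step in range(500):
    # 1. Sample a batch of data
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    # 2. Forward prop: predictions and (mean squared error) loss
    pred = Xb @ W + b
    loss = np.mean((pred - yb) ** 2)
    # 3. Backprop: gradients of the loss w.r.t. W and b
    dpred = 2 * (pred - yb) / batch_size
    dW, db = Xb.T @ dpred, dpred.sum()
    # 4. Update the parameters using the gradients
    W -= learning_rate * dW
    b -= learning_rate * db
```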
7
Activation Functions
- Sigmoid
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- ELU
- Maxout
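For reference, a NumPy sketch of these activations using their standard definitions (the slope in Leaky ReLU, the ELU alpha, and the two-piece Maxout are illustrative choices, not taken from the slide):

```python
import numpy as np

def sigmoid(x):                 # 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # tanh(x)
    return np.tanh(x)

def relu(x):                    # max(0, x)
    return np.maximum(0, x)

def leaky_relu(x, a=0.1):       # max(a*x, x)
    return np.maximum(a * x, x)

def elu(x, alpha=1.0):          # x if x > 0, else alpha*(e^x - 1)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x1, x2):             # max over (here, two) linear pieces
    return np.maximum(x1, x2)
```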
8
Data Preprocessing
9
Weight Initialization
“Xavier initialization” [Glorot et al., 2010] Reasonable initialization. (Mathematical derivation assumes linear activations)
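A sketch of Xavier initialization for one fully-connected layer (the layer sizes are illustrative):

```python
import numpy as np

fan_in, fan_out = 512, 256   # example layer sizes
# Scale by 1/sqrt(fan_in) so the activation variance stays roughly constant
# across layers (the derivation assumes linear activations).
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
b = np.zeros(fan_out)
```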
10
Batch Normalization [Ioffe and Szegedy, 2015]
Normalize: x_hat = (x − E[x]) / sqrt(Var[x]), and then allow the network to squash the range if it wants to: y = γ·x_hat + β.
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
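A minimal sketch of the training-time forward pass (per-feature statistics over the mini-batch; at test time running averages of the batch statistics would be used instead, which is omitted here):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch; gamma, beta: (D,) learned scale and shift.
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # let the network rescale/shift
```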
11
Hyperparameter Optimization
12
Cross-validation strategy
I like to do coarse -> fine cross-validation in stages:
- First stage: only a few epochs to get a rough idea of which params work
- Second stage: longer running time, finer search
- … (repeat as necessary)
Tip for detecting explosions in the solver: if the cost is ever > 3 * original cost, break out early.
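A hedged sketch of the early-abort tip (the loss trace here is made up for illustration):

```python
# Abort a hyperparameter trial if the cost explodes past 3x its original value.
losses = [2.3, 2.1, 1.9, 8.5, 40.0]   # hypothetical per-epoch costs
initial_cost = losses[0]
for epoch, cost in enumerate(losses):
    if cost > 3 * initial_cost:
        print(f"epoch {epoch}: cost {cost:.1f} > 3x original, breaking out early")
        break
```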
13
For example: run coarse search for 5 epochs
Note: it's best to optimize in log space!
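A sketch of sampling hyperparameters in log space: draw the exponent uniformly, then exponentiate (the ranges below are illustrative, not prescribed by the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
for trial in range(5):
    lr  = 10 ** rng.uniform(-6, -3)   # learning rate sampled in log space
    reg = 10 ** rng.uniform(-5, 5)    # regularization strength in log space
    print(f"trial {trial}: lr={lr:.2e}  reg={reg:.2e}")
```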
14
Now run finer search... adjust range
53% - relatively good for a 2-layer neural net with 50 hidden neurons.
15
Now run finer search... adjust range
53% - relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is worrying. Why?
16
Random Search vs. Grid Search
Random Search for Hyper-Parameter Optimization Bergstra and Bengio, 2012
17
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / dropout strength)
(Cartoon: the neural networks practitioner as a DJ; music = loss function.)
18
A cross-validation “command center”
19
Optimal Hyper-parameters
From “Gradient-based Hyperparameter Optimization through Reversible Learning”, Maclaurin et al., 2015
20
Monitor and visualize the loss curve
21
(Plot: loss vs. time.)
22
(Plot: loss vs. time. Annotation: bad initialization a prime suspect.)
23
Loss function specimens (lossfunctions.tumblr.com)
24
lossfunctions.tumblr.com
25
lossfunctions.tumblr.com
26
Other Features to capture and plot
- Per-layer activations: magnitude, center (mean or median), breadth (sdev or quartiles), spatial/feature-rank variations
- Gradients
- Learning trajectories: plot parameter values in a low-dimensional space
27
Monitor and visualize the accuracy:
- big gap between training and validation accuracy = overfitting => increase regularization strength?
- no gap => increase model capacity?
28
Track the ratio of weight updates / weight magnitudes:
ratio between the updates and the values: ~0.0002 / 0.02 = 0.01 (about okay); you want this to be somewhere around 0.001 or so.
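A sketch of how this ratio might be tracked for one parameter tensor (the shapes, scales, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(512, 256))    # example weight tensor
dW = rng.normal(scale=0.02, size=W.shape)      # example gradient
learning_rate = 1e-3

update = -learning_rate * dW                   # the actual SGD update applied
ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(f"update/weight ratio: {ratio:.1e}")     # roughly 1e-3 here, about where you want it
```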
29
Ensemble Learning Ensemble: A model built from many simpler models.
Two main methods:
- Bagging
- Boosting
30
Ensemble Learning Ensemble: A model built from many simpler models
Two main methods:
- Bagging (Bootstrap AGgregation): train base models on bootstrap samples of the data. Take a majority vote for classification tasks, or average the outputs for regression.
- Boosting
31
Ensemble Learning Ensemble: A model built from many simpler models
Two main methods:
- Bagging (Bootstrap AGgregation): train base models on bootstrap samples of the data. Take a majority vote for classification tasks, or average the outputs for regression.
- Boosting: learners are ordered; each learner tries to reduce the error (residual) on “hard” examples (those misclassified by earlier learners).
32
Ensemble Learning Ensemble: A model built from many simpler models
Two main methods:
- Bagging (Bootstrap AGgregation): train base models on bootstrap samples of the data. Take a majority vote for classification tasks, or average the outputs for regression. Reduces variance in the prediction, not bias. Bootstrap sampling is not always used – but some method is needed to generate diverse base models, e.g. random forests.
33
Ensemble Learning Ensemble: A model built from many simpler models
Two main methods:
- Bagging (Bootstrap AGgregation): train base models on bootstrap samples of the data. Take a majority vote for classification tasks, or average the outputs for regression.
- Boosting: learners are ordered; each learner tries to reduce the error (residual) on “hard” examples (those misclassified by earlier learners). ADABOOST: weight hard samples more; GRADIENT BOOST: use the residual to train later models. Reduces bias and possibly variance compared to the base learners.
34
Ensemble Learning Ensemble: A model built from many simpler models
Two main methods:
- Bagging (Bootstrap AGgregation): train base models on bootstrap samples of the data. Take a majority vote for classification tasks, or average the outputs for regression.
- Boosting: learners are ordered; each learner tries to reduce the error (residual) on “hard” examples (those misclassified by earlier learners).
In both cases, the ensemble prediction is an evenly-weighted sum (or vote) of the base learner predictions. Aside: stacking uses non-uniform base learner weighting.
35
Bagging on image classification tasks
Enjoy 2% extra accuracy
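A sketch of the evenly-weighted averaging used at test time (the model outputs below are faked with random softmax probabilities, purely to illustrate the bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(0)
num_models, num_examples, num_classes = 5, 8, 10
# Stand-ins for the class probabilities of 5 independently trained models.
logits = rng.normal(size=(num_models, num_examples, num_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

ensemble_probs = probs.mean(axis=0)             # evenly-weighted average
ensemble_pred = ensemble_probs.argmax(axis=-1)  # final class predictions
print(ensemble_pred)
```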
36
Ensemble Learning Gradient-boosted decision trees (GBDT) often give state-of-the-art performance on simple classification tasks, e.g. XGBoost. Neural networks are used fairly often with bagging, but rarely with boosting. Why do you think that is? Hint: decision trees work well in both bagging and boosting.
37
Ensemble Learning Contrast: “Wide” vs “Deep” models (see “Wide & Deep Learning for Recommender Systems” by Cheng et al.)
38
Ensemble Learning Contrast: “Wide” vs “Deep” models (see “Wide & Deep Learning for Recommender Systems” by Cheng et al.). Deep networks are good for global effects (in input feature space); wide networks are better for local effects.
39
Gradient Boosting
Models progress from global → local.
(Figure: the input signal; the first model; the first residual (input − model 1); the second model, fit to the residual.)
40
Ensemble Learning Contrast: “Wide” vs “Deep” models (see “Wide & Deep Learning for Recommender Systems” by Cheng et al.). Deep networks are good for global effects (in input feature space); wide networks are better for local effects. In boosting, the base learners model global → local effects. So boost with deep → wide networks? Open question!
41
Fun Tips/Tricks: can also get a small boost from averaging multiple model checkpoints of a single model.
42
Fun Tips/Tricks: can also get a small boost from averaging multiple model checkpoints of a single model. Also keep track of (and use at test time) a running average parameter vector:
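A sketch of the running-average trick (the decay of 0.995 and the stand-in "gradient" are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)       # current parameter vector
x_test = x.copy()              # smoothed copy, used at test time
learning_rate = 1e-2

for step in range(1000):
    dx = rng.normal(size=100)              # stand-in for a minibatch gradient
    x += -learning_rate * dx               # normal parameter update
    x_test = 0.995 * x_test + 0.005 * x    # exponential running average
```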
43
Regularization (dropout)
44
Regularization by Dropout
“randomly set some neurons to zero in the forward pass” [Srivastava et al., 2014]
45
Example forward pass with a 3-layer network using dropout
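The slide's code does not survive in this transcript; below is a sketch of what such a forward pass might look like (the weight/bias names W1..W3, b1..b3 and the drop probability p = 0.5 are placeholders):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X, W1, b1, W2, b2, W3, b3):
    """Forward pass of a 3-layer net, dropping hidden units at random."""
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p      # first dropout mask
    H1 *= U1                                # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p      # second dropout mask
    H2 *= U2                                # drop!
    out = np.dot(W3, H2) + b3
    return out
    # Backward pass: backprop through the same masks (not shown).
```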
46
Waaaait a second… How could this possibly be a good idea?
47
How could this possibly be a good idea?
Waaaait a second… How could this possibly be a good idea?
Forces the network to have a redundant representation.
(Figure: features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look” feeding a cat score, with some features crossed out by dropout.)
48
How could this possibly be a good idea?
Waaaait a second… How could this possibly be a good idea? Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model, gets trained on only ~one datapoint.
49
At test time…. Ideally: want to integrate out all the noise
Monte Carlo approximation: do many forward passes with different dropout masks, average all predictions
50
At test time…. Can in fact do this with a single forward pass! (approximately) Leave all input neurons turned on (no dropout). (this can be shown to be an approximation to evaluating the whole ensemble)
51
At test time…. Can in fact do this with a single forward pass! (approximately) Leave all input neurons turned on (no dropout). Q: Suppose that with all inputs present at test time the output of this neuron is x. What would its output be during training time, in expectation? (e.g. if p = 0.5)
52
At test time…. Can in fact do this with a single forward pass! (approximately) Leave all input neurons turned on (no dropout).
during test: a = w0*x + w1*y
during train: E[a] = ¼ * ((w0*0 + w1*0) + (w0*0 + w1*y) + (w0*x + w1*0) + (w0*x + w1*y))
            = ¼ * (2*w0*x + 2*w1*y)
            = ½ * (w0*x + w1*y)
(Diagram: neuron a with inputs x, y and weights w0, w1.)
53
At test time…. Can in fact do this with a single forward pass! (approximately) Leave all input neurons turned on (no dropout).
during test: a = w0*x + w1*y
during train: E[a] = ¼ * ((w0*0 + w1*0) + (w0*0 + w1*y) + (w0*x + w1*0) + (w0*x + w1*y))
            = ¼ * (2*w0*x + 2*w1*y)
            = ½ * (w0*x + w1*y)
With p=0.5, using all inputs in the forward pass would inflate the activations by 2x from what the network was “used to” during training! => Have to compensate by scaling the activations back down by ½.
(Diagram: neuron a with inputs x, y and weights w0, w1.)
54
We can do something approximate analytically
At test time all neurons are active always => We must scale the activations so that for each neuron: output at test time = expected output at training time
55
Dropout Summary: drop in the forward pass; scale at test time.
56
More common: “Inverted dropout”
test time is unchanged!
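The slide's code is not in the transcript; a minimal sketch of inverted dropout for one hidden layer (the shapes and the keep probability p are placeholders):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_hidden(x, W, b):
    """Training time: drop units AND scale by 1/p right away."""
    h = np.maximum(0, np.dot(W, x) + b)
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask

def test_hidden(x, W, b):
    """Test time is unchanged: no dropout, no extra scaling."""
    return np.maximum(0, np.dot(W, x) + b)
```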
57
Back to gradients
(Figure: loss contours over parameters W_1, W_2. Standard gradient – easy to compute; natural gradient – hard to compute.)
58
Gradient Magnitudes:
- Gradients too big → divergence
- Gradients too small → slow convergence
59
Gradient Magnitudes:
- Gradients too big → divergence
- Gradients too small → slow convergence
Divergence is much worse!
60
Gradient Magnitudes: What’s the simplest way to ensure gradients stay bounded?
61
Gradient clipping: Simply limit the magnitude of each gradient:
g_i = min(g_max, max(−g_max, g_i)), so |g_i| ≤ g_max. Then use a decreasing learning rate to converge to an optimum.
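In NumPy this element-wise clipping is a single call (the threshold below is illustrative):

```python
import numpy as np

g_max = 1.0                                # clipping threshold
g = np.array([0.3, -5.0, 2.4, -0.1])       # example gradient
g_clipped = np.clip(g, -g_max, g_max)      # -> [ 0.3, -1.0, 1.0, -0.1]
```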
62
Extreme Gradient clipping:
(Figure: standard gradient vs. clipped gradient.) Gradient clipping limits the largest gradient dimensions, while others may be very small. ADAGRAD and RMSprop scale gradient dimensions by the inverse std deviation, so all dimensions have unit sdev. What if we scale gradients up before clipping, so all dimensions are clipped?
63
One-bit gradients
(Figure: standard gradient vs. one-bit gradient.)
If we clip all gradient dimensions, we are left only with their sign: g = g_max · (−1, 1, 1, −1, 1, …). This actually works on some problems with little or no loss of accuracy (see “1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs” by Seide et al., 2014).
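A sketch of keeping only the sign of each gradient dimension (g_max is illustrative; the Seide et al. scheme also feeds the quantization error back into later steps, which is omitted here):

```python
import numpy as np

g_max = 0.01
g = np.array([0.3, -5.0, 2.4, -0.1])
g_onebit = g_max * np.sign(g)   # -> [ 0.01, -0.01, 0.01, -0.01]
```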
64
Gradient Clipping again
(Figure: standard gradient vs. one-bit gradient.) Though more commonly, gradient clipping is used less aggressively (don't saturate all dimensions) as an option alongside other algorithms like ADAGRAD and RMSprop.
65
Stochastic Gradient Descent
(Figure: loss contours over W_1, W_2, starting from the original W; true gradients in blue, minibatch gradients in red.) Gradients are noisy but still make good progress on average.
66
Gradient Noise If a little noise is good, what about adding noise to gradients?
67
Gradient Noise If a little noise is good, what about adding noise to gradients? A: Works great for many models! It is especially valuable for complex models that would otherwise overfit. “Adding Gradient Noise Improves Learning for Very Deep Networks”, Neelakantan et al., 2016
68
Gradient Noise Schedule: add Gaussian noise to each gradient at step t, where the noise variance decays as σ_t² = η / (1 + t)^γ (the paper uses γ = 0.55 and η ∈ {0.01, 0.3, 1.0}).
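A sketch of this annealed-noise update (parameter values as reported in Neelakantan et al.; the gradient here is just a placeholder array):

```python
import numpy as np

eta, gamma = 0.3, 0.55          # eta from {0.01, 0.3, 1.0}, gamma = 0.55
rng = np.random.default_rng(0)

def noisy_gradient(g, t):
    """Add annealed Gaussian noise with variance eta / (1 + t)^gamma."""
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    return g + rng.normal(scale=sigma, size=g.shape)

g = np.zeros(10)                # placeholder gradient
print(noisy_gradient(g, t=0))
```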
69
Gradient Noise Results on MNIST with a 20-layer ReLU network:
70
Gradient Noise: results on Neural Programmer, Neural RAM (learning a search task), and Neural GPUs (learning arithmetic operations).
71
Gradient Noise Very interesting questions here:
Gradient noise turns model parameter estimation into a Bayesian inference task: the noise magnitude controls the “flatness” of the distribution, and adding momentum controls the level of detail.
72
Gradient Noise + Momentum
(Figure: model likelihood vs. model parameters, two panels. Left: increasing noise (higher temperature). Right: increasing momentum.)
73
Summary
- Hyperparameter optimization
- Ensembles
- Dropout
- One-bit gradients
- Gradient noise