CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 9: Ways of speeding up the learning and preventing overfitting Geoffrey Hinton.

Five ways to speed up learning
Use an adaptive global learning rate (a minimal sketch follows this list).
–Increase the rate slowly if the learning is not diverging.
–Decrease the rate quickly if it starts diverging.
Use a separate adaptive learning rate on each connection.
–Adjust it using the consistency of the gradient on that weight axis.
Use momentum.
–Instead of using the gradient to change the position of the weight "particle", use it to change the velocity.
Use a stochastic estimate of the gradient computed from a few cases.
–This works very well on large, redundant datasets.
Don't go in the direction of steepest descent.
–The gradient does not point at the minimum. Can we preprocess the data, or do something to the gradient, so that we move directly towards the minimum?
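To make the first item concrete, here is a minimal sketch of an adaptive global learning rate in Python; the function names and the grow/shrink factors (1.05 and 0.5) are illustrative assumptions, not values from the lecture.

    import numpy as np

    def train_with_adaptive_rate(w, grad_fn, loss_fn, lr=0.01, steps=100,
                                 grow=1.05, shrink=0.5):
        # Increase the rate slowly while the loss keeps falling; cut it quickly
        # and reject the step when the loss rises (learning starts diverging).
        prev_loss = loss_fn(w)
        for _ in range(steps):
            w_new = w - lr * grad_fn(w)       # plain gradient step
            loss = loss_fn(w_new)
            if loss < prev_loss:              # still converging
                lr *= grow                    # increase the rate slowly
                w, prev_loss = w_new, loss
            else:                             # diverging: back off
                lr *= shrink                  # decrease the rate quickly
        return w, lr

    # Toy usage on the quadratic bowl 0.5 * ||w||^2.
    w_final, lr_final = train_with_adaptive_rate(
        np.array([3.0, -2.0]),
        grad_fn=lambda w: w,
        loss_fn=lambda w: 0.5 * np.dot(w, w))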

The momentum method
Imagine a ball on the error surface with velocity v.
–It starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
It damps oscillations by combining gradients with opposite signs.
It builds up speed in directions with a gentle but consistent gradient.
On an inclined plane it reaches a terminal velocity.
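The ball analogy corresponds to the standard momentum update. A minimal sketch, assuming the usual form v = alpha*v - eps*grad followed by w = w + v; the values of alpha and eps are illustrative assumptions.

    import numpy as np

    def momentum_step(w, v, grad, eps=0.01, alpha=0.9):
        v = alpha * v - eps * grad   # velocity blends old velocity with the new gradient
        w = w + v                    # move the weight "particle" by its velocity
        return w, v

    # On an inclined plane (constant gradient g) the velocity approaches the
    # terminal value -eps * g / (1 - alpha):
    g = np.array([1.0])
    w, v = np.zeros(1), np.zeros(1)
    for _ in range(500):
        w, v = momentum_step(w, v, g)
    print(v, -0.01 * g / (1 - 0.9))  # both approximately [-0.1]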

Adaptive learning rates on each connection
Use a global learning rate multiplied by a local gain on each connection.
Increase the local gains if the gradient does not change sign.
Use additive increases and multiplicative decreases.
–This ensures that big learning rates decay rapidly when oscillations start.
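A minimal sketch of per-connection gains with additive increases and multiplicative decreases, as described above; the particular constants (0.05 and 0.95) and the clipping range are illustrative assumptions.

    import numpy as np

    def update_with_gains(w, gains, grad, prev_grad, global_lr=0.01):
        agree = grad * prev_grad > 0              # did the gradient keep its sign on this axis?
        gains = np.where(agree, gains + 0.05,     # additive increase while consistent
                                gains * 0.95)     # multiplicative decrease when it flips
        gains = np.clip(gains, 0.1, 10.0)         # keep the gains in a sensible range
        w = w - global_lr * gains * grad          # global rate times local gain
        return w, gains

    # Usage: carry `gains` (initialised to ones) and the previous gradient across steps.
    w = np.array([1.0, -2.0])
    gains = np.ones_like(w)
    prev_grad = np.zeros_like(w)
    for _ in range(10):
        grad = w                                  # toy quadratic loss 0.5 * ||w||^2
        w, gains = update_with_gains(w, gains, grad, prev_grad)
        prev_grad = grad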

Online versus batch learning
Batch learning does steepest descent on the error surface.
Online learning updates the weights after each training case.
–It zig-zags around the direction of steepest descent.
[Figure: trajectories in weight space (w1, w2), showing the batch path versus the online path zig-zagging between the constraint lines from training case 1 and training case 2.]

Stochastic gradient descent
If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
–So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
–The extreme version is to update the weights after each example, but balanced mini-batches are just as good and faster in MATLAB.
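A minimal sketch of mini-batch stochastic gradient descent; the batch size, learning rate, and the toy regression problem are illustrative assumptions.

    import numpy as np

    def sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
        n = X.shape[0]
        for _ in range(epochs):
            order = np.random.permutation(n)       # shuffle so batches stay roughly balanced
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                g = grad_fn(w, X[idx], y[idx])     # gradient estimated from a few cases only
                w = w - lr * g
        return w

    # Toy usage: linear regression with a squared-error gradient.
    X = np.random.randn(1000, 3)
    y = X @ np.array([1.0, -2.0, 0.5])
    grad = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
    w = sgd(np.zeros(3), X, y, grad)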

Extra problems that occur in multilayer non-linear networks
If we start with a big learning rate, the bias and all of the weights for one of the output units may become very negative.
–The output unit is then very firmly off and it will never produce a significant error derivative.
–So it will never recover (unless we have weight-decay).
In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit produce an output equal to the proportion of the time it should be a 1.
–The network finds this strategy quickly and then takes a long time to improve on it, so it looks like a local minimum.

Overfitting
The training data contains information about the regularities in the mapping from input to output. But it also contains noise:
–The target values may be unreliable.
–There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
–So it fits both kinds of regularity.
–If the model is very flexible, it can model the sampling error really well. This is a disaster.

Preventing overfitting
Use a model that has the right capacity:
–enough to model the true regularities;
–not enough to also model the spurious regularities (assuming they are weaker).
Standard ways to limit the capacity of a neural net:
–Limit the number of hidden units.
–Limit the size of the weights.
–Stop the learning before it has time to overfit.

Limiting the size of the weights
Weight-decay involves adding an extra term to the cost function that penalizes the squared weights.
–It keeps the weights small unless they have big error derivatives.
[Figure: the quadratic weight-decay penalty, cost C plotted against a weight w.]
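A minimal sketch of the squared-weight (L2) penalty, assuming the common form C = E + (lam/2) * sum(w**2), so the gradient gains an extra lam * w term; lam is an illustrative value.

    import numpy as np

    def cost_with_decay(w, data_error, lam=1e-3):
        return data_error + 0.5 * lam * np.sum(w ** 2)

    def grad_with_decay(w, data_grad, lam=1e-3):
        # Each weight is pulled towards zero in proportion to its size,
        # so it only stays large if it has a big error derivative.
        return data_grad + lam * w

    # One gradient step with weight-decay on a toy weight vector.
    w = np.array([2.0, -0.5])
    w = w - 0.1 * grad_with_decay(w, data_grad=np.array([0.3, 0.0]))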

Weight-decay via noisy inputs
Weight-decay reduces the effect of noise in the inputs.
–The noise variance is amplified by the squared weight.
–The amplified noise makes an additive contribution to the squared error.
–So minimizing the squared error tends to minimize the squared weights when the inputs are noisy.
It gets more complicated for non-linear networks.
[Figure: an input unit i connected to an output unit j.]
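A small numerical check of this argument for a single linear unit, assuming independent zero-mean input noise: the expected squared error gains an extra sum(w**2) * sigma2 term, which acts exactly like an L2 weight penalty. The data and the noise variance are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50_000, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true
    w = np.array([0.8, -1.5, 0.9])       # some candidate weights
    sigma2 = 0.25                        # variance of the added input noise

    clean_err = np.mean((X @ w - y) ** 2)
    X_noisy = X + np.sqrt(sigma2) * rng.standard_normal(X.shape)
    noisy_err = np.mean((X_noisy @ w - y) ** 2)

    # noisy_err is approximately clean_err + sigma2 * sum(w**2)
    print(noisy_err, clean_err + sigma2 * np.sum(w ** 2))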

Other kinds of weight penalty
Sometimes it works better to penalize the absolute values of the weights.
–This makes some weights exactly zero, which helps interpretation.
Sometimes it works better to use a weight penalty that has negligible effect on large weights.
[Figure: the two penalty curves plotted against a weight, both centred on 0.]
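A minimal sketch of these two alternatives, assuming the usual L1 subgradient lam * sign(w) and, for the second kind, a saturating penalty of the form lam * w**2 / (1 + w**2), which is one common choice rather than the specific penalty used in the lecture.

    import numpy as np

    def l1_penalty(w, lam=1e-3):
        return lam * np.sum(np.abs(w))

    def l1_penalty_grad(w, lam=1e-3):
        # sign(w) pushes every weight towards zero by a constant amount,
        # which is what drives small weights to exactly zero.
        return lam * np.sign(w)

    def saturating_penalty(w, lam=1e-3):
        # Grows like w**2 near zero but flattens out, so it has a
        # negligible effect on large weights.
        return lam * np.sum(w ** 2 / (1.0 + w ** 2))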

The effect of weight-decay
It prevents the network from using weights that it does not need.
–This can often improve generalization a lot.
–It helps to stop it from fitting the sampling error.
–It makes a smoother model in which the output changes more slowly as the input changes.
If the network has two very similar inputs, it prefers to put half the weight on each (w/2 and w/2) rather than all the weight on one (w and 0).

Deciding how much to restrict the capacity
How do we decide which limit to use and how strong to make the limit?
–If we use the test data, we get an unfair prediction of the error rate we would get on new test data.
–Suppose we compared a set of models that gave random results: the best one on a particular dataset would do better than chance, but it won't do better than chance on another test set.
So use a separate validation set to do model selection.

Using a validation set
Divide the total dataset into three subsets:
–Training data is used for learning the parameters of the model.
–Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
–Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.
We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
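A minimal sketch of such a three-way split; the 60/20/20 proportions and the shuffling seed are illustrative assumptions.

    import numpy as np

    def split_dataset(X, y, frac_train=0.6, frac_valid=0.2, seed=0):
        n = X.shape[0]
        idx = np.random.default_rng(seed).permutation(n)
        n_train = int(frac_train * n)
        n_valid = int(frac_valid * n)
        train = idx[:n_train]                    # used to learn the parameters
        valid = idx[n_train:n_train + n_valid]   # used to choose model / regularization
        test = idx[n_train + n_valid:]           # used once, for the final estimate
        return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])

    # Usage: pick the weight-decay strength with the lowest validation error,
    # then report the error of that single chosen model on the test set.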

Preventing overfitting by early stopping
If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight-decay.
It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse (but don't get fooled by noise!).
The capacity of the model is limited because the weights have not had time to grow big.
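A minimal sketch of early stopping driven by validation error; train_step and valid_error are hypothetical callables supplied by the caller, w is assumed to be a NumPy array, and the patience threshold is an illustrative guard against being fooled by noise.

    def train_with_early_stopping(w, train_step, valid_error, max_steps=10_000,
                                  patience=10):
        best_w, best_err, bad_checks = w.copy(), valid_error(w), 0
        for _ in range(max_steps):
            w = train_step(w)               # one small gradient step; weights grow slowly
            err = valid_error(w)
            if err < best_err:
                best_w, best_err, bad_checks = w.copy(), err, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:  # validation error has kept getting worse
                    break
        return best_w                       # the weights from the best validation point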

Why early stopping works
When the weights are very small, every hidden unit is in its linear range.
–So a net with a large layer of hidden units is linear.
–It has no more capacity than a linear net in which the inputs are directly connected to the outputs!
As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows.
[Figure: a network with a layer of hidden units between the inputs and the outputs.]