
CSC321 Introduction to Neural Networks and Machine Learning Lecture 3: Learning in multi-layer networks Geoffrey Hinton

Preprocessing the input vectors
Instead of trying to predict the answer directly from the raw inputs, we could start by extracting a layer of "features".
– Sensible if we already know that certain combinations of input values would be useful (e.g. edges or corners in an image).
Instead of learning the features, we could design them by hand.
– The hand-coded features are equivalent to a layer of non-linear neurons that do not need to be learned.
– So far as the learning algorithm is concerned, the activities of the hand-coded features are the input vector.

The connectivity of a perceptron
The input is recoded using hand-picked features that do not adapt.
Only the last layer of weights is learned.
The output units are binary threshold neurons and are each learned independently.
[Figure: input units → non-adaptive hand-coded features → output units]
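A minimal sketch of this architecture, under stated assumptions: a hypothetical binary task, a fixed hand-coded feature layer (the choice of features below is purely illustrative), and only the last layer of weights trained with the perceptron learning rule. Names such as `hand_coded_features` are not from the slides.

```python
import numpy as np

def hand_coded_features(x):
    """Fixed, non-adaptive feature layer (illustrative choice): raw bits plus all pairwise ANDs."""
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return np.concatenate([x, pairs, [1.0]])  # append a bias feature

def train_perceptron(inputs, targets, epochs=100):
    """Learn only the last layer of weights; the feature layer never changes."""
    feats = np.array([hand_coded_features(x) for x in inputs])
    w = np.zeros(feats.shape[1])
    for _ in range(epochs):
        for f, t in zip(feats, targets):
            y = 1.0 if f @ w > 0 else 0.0   # binary threshold output unit
            w += (t - y) * f                # perceptron learning rule
    return w
```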

Is preprocessing cheating?
It seems like cheating if the aim is to show how powerful learning is. The really hard bit is done by the preprocessing.
It's not cheating if we learn the non-linear preprocessing.
– This makes learning much more difficult and much more interesting.
It's not cheating if we use a very big set of non-linear features that is task-independent.
– Support Vector Machines make it possible to use a huge number of features without requiring much computation or data.

What can perceptrons do?
They can only solve tasks if the hand-coded features convert the original task into a linearly separable one. How difficult is this?
The N-bit parity task:
– Requires N features of the form: are at least m bits on?
– Each feature must look at all the components of the input.
The 2-D connectedness task:
– Requires an exponential number of features!
[Figure: a network computing the 7-bit parity task from "at least m bits on" features]
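A small sketch (not from the slides) verifying the parity construction: with features f_m = "at least m bits are on", parity becomes a linear function of the features with alternating weights +1, −1, +1, ..., so a single threshold unit can compute it.

```python
from itertools import product

N = 7  # illustrative size

def features(x):
    """f_m = 1 if at least m of the N bits are on, for m = 1..N."""
    return [1 if sum(x) >= m else 0 for m in range(1, N + 1)]

def linear_parity(x):
    """A single threshold unit on the features, with alternating weights +1, -1, +1, ..."""
    w = [(-1) ** m for m in range(N)]            # +1, -1, +1, ...
    s = sum(wi * fi for wi, fi in zip(w, features(x)))
    return 1 if s >= 1 else 0                    # binary threshold output

# The linear unit on these features reproduces parity exactly for all 2^N inputs.
assert all(linear_parity(x) == sum(x) % 2 for x in product([0, 1], repeat=N))
```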

Why connectedness is hard to compute
Even for simple line drawings, there are exponentially many cases.
Removing one segment can break connectedness.
– But this depends on the precise arrangement of the other pieces.
– Unlike parity, there are no simple summaries of the other pieces that tell us what will happen.
Connectedness is easy to compute with a serial algorithm:
– Start anywhere in the ink.
– Propagate a marker.
– See if all the ink gets marked.
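The serial marker-propagation idea is essentially flood fill. A minimal sketch, assuming the drawing is given as a 2-D grid of 0/1 "ink" pixels (names are illustrative):

```python
from collections import deque

def is_connected(image):
    """Return True if all the ink (1-pixels) in a 2-D 0/1 grid forms one connected piece."""
    ink = {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}
    if not ink:
        return True
    # Start anywhere in the ink and propagate a marker to 4-connected neighbours.
    start = next(iter(ink))
    marked, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        for nbr in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nbr in ink and nbr not in marked:
                marked.add(nbr)
                frontier.append(nbr)
    # Connected iff every ink pixel got marked.
    return marked == ink

print(is_connected([[1, 1, 0],
                    [0, 1, 0],
                    [0, 1, 1]]))  # True: one connected stroke
```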

Learning with hidden units
Networks without hidden units are very limited in the input-output mappings they can model.
– More layers of linear units do not help: the result is still linear (see the sketch below).
– Fixed output non-linearities are not enough.
We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?
– We need an efficient way of adapting all the weights, not just the last layer. This is hard.
Learning the weights going into hidden units is equivalent to learning features.
– Nobody is telling us directly what hidden units should do. (That's why they are called hidden units.)
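A quick check of the "still linear" point (a sketch, not from the slides): composing two linear layers is exactly equivalent to one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first linear layer: 3 inputs -> 4 hidden
W2 = rng.standard_normal((2, 4))   # second linear layer: 4 hidden -> 2 outputs
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # two layers of linear units
one_layer = (W2 @ W1) @ x          # a single equivalent linear layer
assert np.allclose(two_layers, one_layer)
```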

Learning by perturbing weights
Randomly perturb one weight and see if it improves performance. If so, save the change.
– Very inefficient: we need to do multiple forward passes on a representative set of training data just to change one weight.
– Towards the end of learning, large weight perturbations will nearly always make things worse.
We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
– Not any better, because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others.
Learning the hidden-to-output weights is easy. Learning the input-to-hidden weights is hard.
[Figure: input units → hidden units → output units]
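A sketch of the single-weight perturbation scheme described above (illustrative names; it assumes some loss function measured over the training set): every candidate change costs a full evaluation of the loss, which is what makes the method so inefficient.

```python
import numpy as np

def perturbation_learning(loss, weights, steps=1000, scale=0.01, seed=0):
    """loss(weights) -> scalar error on a representative set of training data."""
    rng = np.random.default_rng(seed)
    best = loss(weights)
    for _ in range(steps):
        i = rng.integers(weights.size)            # pick one weight at random
        old = weights.flat[i]
        weights.flat[i] = old + scale * rng.standard_normal()
        new = loss(weights)                       # full pass over the data just for this one weight
        if new < best:
            best = new                            # improvement: save the change
        else:
            weights.flat[i] = old                 # otherwise undo it
    return weights, best
```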

The idea behind backpropagation
We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
– Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
– Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
– We can compute error derivatives for all the hidden units efficiently.
– Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.

A change of notation
For simple networks we use the notation:
– x for activities of input units
– y for activities of output units
– z for the summed input to an output unit
For networks with multiple hidden layers:
– y is used for the output of a unit in any layer
– x is the summed input to a unit in any layer
– The index indicates which layer a unit is in.

Non-linear neurons with smooth derivatives
For backpropagation, we need neurons that have well-behaved derivatives.
– Typically they use the logistic function: x_j = b_j + Σ_i y_i w_ij and y_j = 1 / (1 + e^(−x_j)).
– The output is a smooth function of the inputs and the weights.
– Its derivative is dy_j/dx_j = y_j (1 − y_j). It's odd to express it in terms of y.
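A small numerical sketch (not from the slides) of the logistic neuron and of the fact that its derivative can be written in terms of the output, dy/dx = y(1 − y):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.3
y = logistic(x)
analytic = y * (1.0 - y)                                   # derivative expressed in terms of y
eps = 1e-6
numeric = (logistic(x + eps) - logistic(x - eps)) / (2 * eps)  # central difference check
assert abs(analytic - numeric) < 1e-8
```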

Sketch of the backpropagation algorithm on a single training case
– First convert the discrepancy between each output and its target value into an error derivative.
– Then compute error derivatives in each hidden layer from the error derivatives in the layer above.
– Then use the error derivatives w.r.t. activities to get error derivatives w.r.t. the weights.

The derivatives (for a unit j receiving input from a unit i in the layer below):
– ∂E/∂x_j = (dy_j/dx_j) ∂E/∂y_j = y_j (1 − y_j) ∂E/∂y_j
– ∂E/∂y_i = Σ_j (dx_j/dy_i) ∂E/∂x_j = Σ_j w_ij ∂E/∂x_j
– ∂E/∂w_ij = (∂x_j/∂w_ij) ∂E/∂x_j = y_i ∂E/∂x_j
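A minimal sketch that implements exactly these three derivative equations for one hidden layer on a single training case. Assumptions not in the slides: logistic units throughout, squared error E = 0.5 Σ (y − t)^2, and illustrative array names.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_single_case(inp, target, W_ih, W_ho):
    """One forward/backward pass; W_ih[i, j] connects input i to hidden j, W_ho[j, k] hidden j to output k."""
    # Forward pass.
    x_hid = inp @ W_ih                 # summed inputs to hidden units
    y_hid = logistic(x_hid)            # hidden activities
    x_out = y_hid @ W_ho               # summed inputs to output units
    y_out = logistic(x_out)            # output activities

    # Backward pass: the three derivative equations from the slide.
    dE_dy_out = y_out - target                      # dE/dy for E = 0.5 * sum((y - t)^2)
    dE_dx_out = y_out * (1 - y_out) * dE_dy_out     # dE/dx_j = y_j (1 - y_j) dE/dy_j
    dE_dy_hid = dE_dx_out @ W_ho.T                  # dE/dy_i = sum_j w_ij dE/dx_j
    dE_dx_hid = y_hid * (1 - y_hid) * dE_dy_hid
    dE_dW_ho = np.outer(y_hid, dE_dx_out)           # dE/dw_ij = y_i dE/dx_j
    dE_dW_ih = np.outer(inp, dE_dx_hid)
    return dE_dW_ih, dE_dW_ho
```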

Ways to use weight derivatives
How often to update:
– after each training case?
– after a full sweep through the training data?
– after a "mini-batch" of training cases?
How much to update:
– Use a fixed learning rate?
– Adapt the learning rate?
– Add momentum?
– Don't use steepest descent?
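One common combination of these choices, as a sketch (not prescribed by the slides): update after each mini-batch, with a fixed learning rate and momentum. The `compute_gradient` argument is a stand-in for something like the backpropagation routine above.

```python
import numpy as np

def sgd_with_momentum(weights, data, compute_gradient,
                      learning_rate=0.1, momentum=0.9, batch_size=32, epochs=10):
    """Mini-batch gradient descent; compute_gradient(weights, batch) returns dE/dweights."""
    velocity = np.zeros_like(weights)
    for _ in range(epochs):
        np.random.shuffle(data)                       # visit the cases in a new order each sweep
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = compute_gradient(weights, batch)
            # Momentum accumulates a decaying average of past gradients.
            velocity = momentum * velocity - learning_rate * grad
            weights = weights + velocity
    return weights
```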

Overfitting
The training data contains information about the regularities in the mapping from input to output. But it also contains noise:
– The target values may be unreliable.
– There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible, it can model the sampling error really well. This is a disaster.

A simple example of overfitting
Which model do you believe?
– The complicated model fits the data better.
– But it is not economical.
A model is convincing when it fits a lot of data surprisingly well.
– It is not surprising that a complicated model can fit a small amount of data.
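A concrete sketch of this kind of example (illustrative, not taken from the slide's figure): fit a straight line and a high-degree polynomial to six noisy points generated from a linear rule. The flexible model drives the training error to nearly zero but tends to generalize worse.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 6)
y_train = 2 * x_train + 0.3 * rng.standard_normal(6)   # linear rule plus noise

simple = np.polyfit(x_train, y_train, 1)      # economical model: a straight line
flexible = np.polyfit(x_train, y_train, 5)    # complicated model: degree-5 polynomial

x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test                            # the true regularity, without noise

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print("train error:", mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
print("test  error:", mse(simple, x_test, y_test), mse(flexible, x_test, y_test))
# The degree-5 fit passes (almost) exactly through the 6 training points,
# but typically does worse than the straight line on the noise-free test curve.
```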