CSC321: Neural Networks Lecture 13: Learning without a teacher: Autoencoders and Principal Components Analysis Geoffrey Hinton

Three problems with backpropagation
- Where does the supervision come from? Most data is unlabelled (the vestibulo-ocular reflex is an exception).
- How well does the learning time scale? It is impossible to learn features for different parts of an image independently if they all use the same error signal.
- Can neurons implement backpropagation? Not in the obvious way, but getting derivatives from later layers is so important that evolution may have found a way.

Three kinds of learning
- Supervised learning: this models p(y|x). Learn to predict a real-valued output or a class label from an input.
- Reinforcement learning: this just tries to have a good time. Choose actions that maximize payoff.
- Unsupervised learning: this models p(x). Either build a causal generative model that explains why some data vectors occur and not others, or learn an energy function that gives low energy to data and high energy to non-data. Discover interesting features; separate sources that have been mixed together; find temporal invariants; and so on.

The Goals of Unsupervised Learning
- The general goal of unsupervised learning is to discover useful structure in large data sets without requiring a desired target output or a reinforcement signal.
- It is not obvious how to turn this general goal into a specific objective function that can be used to drive the learning.
- A more specific goal: create representations that are better for subsequent supervised or reinforcement learning.

Why unsupervised pre-training makes sense
[Figure: two generative scenarios. In the first, hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway. In the second, the label is computed directly from the image.]
If image-label pairs are generated in the first way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway. If image-label pairs were generated in the second way, it would make sense to try to go straight from images to labels (for example, do the pixels have even parity?).

Another Goal of Unsupervised Learning
- Improve learning speed for high-dimensional inputs:
  - Allow features within a layer to learn independently.
  - Allow multiple layers to be learned greedily.
- Improve the speed of supervised learning by removing directions with high curvature of the error surface. This can be done by using Principal Components Analysis to pre-process the input vectors.
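
A minimal NumPy sketch of that kind of PCA pre-processing (not from the lecture; the data and the number of retained components are invented): keep the highest-variance directions and rescale them so that no single direction dominates the curvature of the error surface.

```python
# Hypothetical illustration of PCA pre-processing before supervised learning.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated fake inputs

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)   # rows of Vt = principal directions

M = 10                                    # keep the M highest-variance directions
Z = X_centered @ Vt[:M].T                 # project the inputs onto those directions
Z_white = Z / (s[:M] / np.sqrt(len(X)))   # rescale each kept direction to unit variance

print(Z_white.std(axis=0))                # approximately all ones
```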

Another Goal of Unsupervised Learning
- Build a density model of the data vectors. This assigns a "score" or probability to each possible data vector.
- There are many ways to use a good density model:
  - Classify by seeing which model likes the test case most.
  - Monitor a complex system by noticing improbable states.
  - Extract interpretable factors (causes or constraints).
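
A toy sketch of the first two uses, with a multivariate Gaussian standing in for the density model (the lecture does not commit to a particular model; the data, classes and threshold below are invented):

```python
# One simple density model per class; classify by log-probability, and flag improbable states.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
class_A = rng.normal(loc=0.0, size=(200, 5))
class_B = rng.normal(loc=3.0, size=(200, 5))

# Fit a Gaussian density model to each class's data vectors.
models = {name: multivariate_normal(data.mean(axis=0), np.cov(data, rowvar=False))
          for name, data in {"A": class_A, "B": class_B}.items()}

x = rng.normal(loc=3.0, size=5)                      # a test case
scores = {name: m.logpdf(x) for name, m in models.items()}
print("classified as", max(scores, key=scores.get))  # the model that "likes" x most

# Monitoring: flag states that the model of normal behaviour finds very improbable.
threshold = -25.0                                    # arbitrary log-probability cutoff
print("improbable under model A:", models["A"].logpdf(x) < threshold)
```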

Using backprop for unsupervised learning
- Try to make the output be the same as the input in a network with a central bottleneck.
- The activities of the hidden units in the bottleneck form an efficient code: the bottleneck does not have room for redundant features.
- Good for extracting independent features (as in the family trees example).
[Figure: a network that maps the input vector through a small central code layer back to an output vector of the same size.]
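
A minimal sketch of such a bottleneck autoencoder, assuming PyTorch and synthetic low-rank data (none of this comes from the lecture; the layer sizes and training settings are illustrative):

```python
# Train a network by backprop to reproduce its input through a narrow central code layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 5) @ torch.randn(5, 20)   # fake 20-D inputs that really live in a 5-D subspace

encoder = nn.Sequential(nn.Linear(20, 12), nn.Sigmoid(), nn.Linear(12, 5))   # input -> code
decoder = nn.Sequential(nn.Linear(5, 12), nn.Sigmoid(), nn.Linear(12, 20))   # code -> output
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    recon = decoder(encoder(x))     # the output should equal the input
    loss = loss_fn(recon, x)        # squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    code = encoder(x)               # bottleneck activities: the learned code
print(float(loss), code.shape)      # each 20-D input is summarized by 5 code values
```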

Self-supervised backprop in a linear network
- If the hidden and output layers are linear, the network will learn hidden units that are a linear function of the data and that minimize the squared reconstruction error. This is exactly what Principal Components Analysis does.
- The M hidden units will span the same space as the first M principal components found by PCA:
  - Their weight vectors may not be orthogonal.
  - They will tend to have equal variances.
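
A NumPy sketch that checks the subspace claim numerically on invented data: train an untied linear autoencoder by gradient descent, then compare the span of its weight vectors with the top-M PCA subspace via the cosines of the principal angles (the learning rate and step count are guesses and may need tuning):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 8)) @ rng.normal(size=(8, 8))
X = X - X.mean(axis=0)                    # centred 8-dimensional data
n, N, M = len(X), X.shape[1], 3           # M linear hidden units

W_enc = 0.1 * rng.normal(size=(N, M))     # input -> hidden weights
W_dec = 0.1 * rng.normal(size=(M, N))     # hidden -> output weights

lr = 2e-3
for step in range(20000):
    H = X @ W_enc                         # linear hidden activities (the code)
    E = H @ W_dec - X                     # reconstruction error
    W_dec -= lr * (H.T @ E) / n           # gradient of mean squared reconstruction error
    W_enc -= lr * (X.T @ (E @ W_dec.T)) / n

V = np.linalg.svd(X, full_matrices=False)[2][:M].T   # top-M principal directions (N x M)

# Singular values of Q1^T Q2 are the cosines of the principal angles between
# the two subspaces; values near 1 mean the spans coincide.
Q = np.linalg.qr(W_dec.T)[0]
print(np.linalg.svd(Q.T @ V, compute_uv=False))      # should be close to 1 once converged
```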

Principal Components Analysis
- PCA takes N-dimensional data and finds the M orthogonal directions in which the data have the most variance. These M principal directions form a subspace.
- We can represent an N-dimensional datapoint by its projections onto the M principal directions. This loses all information about where the datapoint is located in the remaining orthogonal directions.
- We reconstruct by using the mean value (over all the data) on the N-M directions that are not represented.
- The reconstruction error is the sum, over all these unrepresented directions, of the squared differences from the mean.
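
A short NumPy sketch of exactly this procedure on invented data; the final line confirms that the reconstruction error equals the summed squared deviations from the mean along the unrepresented directions:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # N = 6 dimensional data
mean = X.mean(axis=0)
V = np.linalg.svd(X - mean, full_matrices=False)[2]       # rows = orthogonal directions of most variance
M = 2

x = X[0]                                                  # one datapoint
code = (x - mean) @ V[:M].T                               # its M projections (the representation)
reconstruction = mean + code @ V[:M]                      # mean value used on the other N-M directions

unrepresented = (x - mean) @ V[M:].T                      # the projections we threw away
print(np.sum((x - reconstruction) ** 2),                  # reconstruction error ...
      np.sum(unrepresented ** 2))                         # ... equals this sum (same number)
```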

A picture of PCA with N=2 and M=1
[Figure: 2-D data with the first principal component drawn as the direction of greatest variance; a red datapoint and its projection onto that direction, shown as a green point.]
The red point is represented by the green point. Our "reconstruction" of the red point has an error equal to the squared distance between the red and green points.

Self-supervised backprop and clustering
- If we force the hidden unit whose weight vector is closest to the input vector to have an activity of 1, and the rest to have activities of 0, we get clustering.
- The weight vector of each hidden unit represents the center of a cluster.
- Input vectors are reconstructed as the nearest cluster center.
[Figure: a 2-D datapoint data = (x, y) and its reconstruction as the nearest cluster center.]

Clustering and backpropagation
- We need to tie the input->hidden weights to be the same as the hidden->output weights.
- Usually we cannot backpropagate through binary hidden units, but in this case the derivatives for the input->hidden weights all become zero:
  - If the winner doesn't change, there is no derivative.
  - The winner only changes when two hidden units give exactly the same error, so again there is no derivative.
- So the only error derivative is for the output weights. This derivative pulls the weight vector of the winning cluster towards the data point. When the weight vector is at the center of gravity of a cluster, the derivatives all balance out, because the center of gravity minimizes the squared error.
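
The resulting update is the familiar online competitive-learning / k-means rule. A small NumPy sketch on invented blob data (the cluster count, learning rate and number of passes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
# Fake 2-D data drawn from three blobs.
data = np.concatenate([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])
rng.shuffle(data)

K = 3
centers = data[rng.choice(len(data), K, replace=False)].copy()  # each hidden unit's weight vector
lr = 0.05

for epoch in range(10):                  # a few passes of online updates
    for x in data:
        # The winner is the hidden unit whose weight vector is closest to the input.
        winner = np.argmin(np.sum((centers - x) ** 2, axis=1))
        # The only non-zero derivative pulls the winner's weight vector towards the datapoint.
        centers[winner] += lr * (x - centers[winner])

# Typically one center per blob, although (like k-means) a bad initialization
# can leave the algorithm stuck in a local optimum.
print(np.round(centers, 2))
```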

A spectrum of representations
- PCA is powerful because it uses distributed representations, but limited because its representations are linearly related to the data. Autoencoders with more hidden layers are not limited in this way.
- Clustering is powerful because it uses very non-linear representations, but limited because its representations are local (not componential).
- We need representations that are both distributed and non-linear. Unfortunately, these are typically very hard to learn.
[Figure: a local-vs-distributed by linear-vs-non-linear grid: PCA is distributed and linear, clustering is local and non-linear, and what we need is distributed and non-linear.]