CSC2535: Lecture 4: Autoencoders, Free energy, and Minimum Description Length Geoffrey Hinton

Three problems with backpropagation
Where does the supervision come from?
–Most data is unlabelled. (The vestibular-ocular reflex is an exception.)
How well does the learning time scale?
–It is impossible to learn features for different parts of an image independently if they all use the same error signal.
Can neurons implement backpropagation?
–Not in the obvious way, but getting derivatives from later layers is so important that evolution may have found a way.

The Goals of Unsupervised Learning
Without a desired output or reinforcement signal it is much less obvious what the goal is.
Discover useful structure in large data sets without requiring a supervisory signal:
–Create representations that are better for subsequent supervised or reinforcement learning.
–Build a density model: classify by seeing which model likes the test case most, or monitor a complex system by noticing improbable states.
–Extract interpretable factors (causes or constraints).
Improve learning speed for high-dimensional inputs:
–Allow features within a layer to learn independently.
–Allow multiple layers to be learned greedily.

Self-supervised backpropagation
Autoencoders define the desired output to be the same as the input.
–This is trivial to achieve with direct connections: the identity is easy to compute!
It is useful if we can squeeze the information through some kind of bottleneck:
–Only a few hidden units? Very similar to Principal Components Analysis.
–Only a single active hidden unit? This is clustering (mixture models), and it is easy to consider all possible hidden configurations.
–Minimize the information in the hidden unit activities? But how do we measure the information?

Self-supervised backprop and PCA
If the hidden and output layers are linear, the network will learn hidden units that are a linear function of the data and minimize the squared reconstruction error.
The m hidden units will span the same space as the first m principal components:
–Their weight vectors may not be orthogonal.
–They will tend to have equal variances.
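Below is a minimal sketch of this behaviour (not from the lecture; the data, learning rate, and number of steps are illustrative): a linear autoencoder trained by gradient descent on the squared reconstruction error, followed by a check that its decoder spans roughly the same subspace as the first m principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 points in 10-D with most of the variance in a few directions.
X = rng.normal(size=(500, 10)) * np.array([5, 4, 3, 1, 1, 0.5, 0.5, 0.2, 0.2, 0.1])
X -= X.mean(axis=0)

m = 3                                    # size of the linear bottleneck
W_enc = rng.normal(scale=0.1, size=(10, m))
W_dec = rng.normal(scale=0.1, size=(m, 10))

lr, n = 0.01, len(X)
for step in range(5000):
    H = X @ W_enc                        # linear hidden activities
    err = H @ W_dec - X                  # reconstruction minus target
    W_dec -= lr * (H.T @ err) / n        # gradient of the mean squared reconstruction error
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / n

# The decoder's weight vectors should span the space of the first m principal components,
# even though they need not be orthogonal.
pcs = np.linalg.svd(X, full_matrices=False)[2][:m]   # top-m principal directions
Q = np.linalg.qr(W_dec.T)[0]                         # orthonormal basis of the decoder subspace
overlap = np.linalg.svd(pcs @ Q, compute_uv=False)   # cosines of the principal angles
print("subspace overlap (all near 1 means the same subspace):", np.round(overlap, 3))
```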

Self-supervised backprop and clustering
If we force the hidden unit whose weight vector is closest to the input vector to have an activity of 1 and the rest to have activities of 0, we get clustering.
–The weight vector of each hidden unit represents the center of a cluster.
–Input vectors are reconstructed as the nearest cluster center.
We need to tie the input->hidden weights to be the same as the hidden->output weights.
–Usually we cannot backpropagate through binary hidden units, but in this case the derivatives for the input->hidden weights all become zero!
 If the winner doesn't change – no derivative.
 The winner changes when two hidden units give exactly the same error – no derivative.
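A minimal sketch of this winner-take-all scheme (not from the lecture; the data and learning rate are made up): with tied weights, picking the closest weight vector and moving it toward the input to reduce the reconstruction error is just online k-means.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three well-separated blobs in 2-D.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])

k = 3
centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # tied weight vectors = cluster centers

lr = 0.05
for epoch in range(20):
    for x in rng.permutation(X):
        winner = np.argmin(((centers - x) ** 2).sum(axis=1))  # the hidden unit forced to activity 1
        centers[winner] += lr * (x - centers[winner])          # move its weight vector toward the input

print(np.round(centers, 2))  # approximately the three blob means
```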

An MDL approach to clustering
[Diagram: the sender, who has the quantized data, transmits the cluster parameters (the cluster centers), a code identifying the cluster for each datapoint, and the data-misfit of each datapoint from its cluster center; the receiver uses these to produce perfectly reconstructed data.]

How many bits must we send?
Model parameters:
–It depends on the priors and how accurately they are sent.
–Let's ignore these details for now.
Codes:
–If all n clusters are equiprobable, log n bits. This is extremely plausible, but wrong!
–We can do it in fewer bits. This is extremely implausible, but right.
Data misfits:
–If sender and receiver assume a Gaussian distribution within the cluster, -log p(d | cluster) bits, which depends on the squared distance of d from the cluster center.

Using a Gaussian agreed distribution (again!)
Assume we need to send a value, x, with a quantization width of t.
This requires a number of bits that depends on the density of x under the agreed Gaussian: approximately -log( t p(x) ) = -log t + log(σ√(2π)) + (x-μ)²/(2σ²)  (in nats; divide by log 2 for bits).
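A minimal sketch of this cost in code (the function name and the numbers are illustrative; it uses log base 2 so the answer comes out in bits):

```python
import numpy as np

def gaussian_code_cost_bits(x, mu, sigma, t):
    """Approximate bits needed to send x, quantized to width t, under an agreed Gaussian."""
    density = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return -np.log2(t * density)

print(gaussian_code_cost_bits(x=1.0, mu=0.0, sigma=1.0, t=0.01))  # about 8.7 bits
```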

What is the best variance to use?
If we code many residuals with the same agreed Gaussian, the expected cost is minimized by setting the variance of the Gaussian to be the variance of the residuals.
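A short derivation of that claim (a sketch in the notation above, working in nats; the -log t term does not depend on σ and is dropped):

```latex
\mathbb{E}\!\left[-\log\big(t\,p(x)\big)\right]
  = -\log t + \tfrac{1}{2}\log(2\pi\sigma^{2})
    + \frac{\mathbb{E}\big[(x-\mu)^{2}\big]}{2\sigma^{2}},
\qquad
\frac{\partial}{\partial \sigma^{2}}
  \left[\tfrac{1}{2}\log(2\pi\sigma^{2})
        + \frac{\mathbb{E}\big[(x-\mu)^{2}\big]}{2\sigma^{2}}\right]
  = \frac{1}{2\sigma^{2}} - \frac{\mathbb{E}\big[(x-\mu)^{2}\big]}{2\sigma^{4}} = 0
\;\Longrightarrow\;
\sigma^{2} = \mathbb{E}\big[(x-\mu)^{2}\big].
```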

Sending a value assuming a mixture of two equal Gaussians
The point halfway between the two Gaussians should cost -log p(x) bits, where p(x) is its density under one of the Gaussians.
How can we make this compatible with the MDL story?

The bits-back argument
Consider a datapoint that is equidistant from two cluster centers, Gaussian 0 and Gaussian 1.
–The sender could code it relative to cluster 0 or relative to cluster 1.
–Either way, the sender has to send one bit to say which cluster is being used.
It seems like a waste to have to send a bit when you don't care which cluster you use.
It must be inefficient to have two different ways of encoding the same point.

Using another message to make random decisions
Suppose the sender is also trying to communicate another message:
–The other message is completely independent.
–It looks like a random bit stream.
Whenever the sender has to choose between two equally good ways of encoding the data, he uses a bit from the other message to make the decision.
After the receiver has losslessly reconstructed the original data, the receiver can pretend to be the sender.
–This enables the receiver to figure out the random bit in the other message.
So the original message cost one bit less than we thought, because we also communicated a bit from another message.
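A short worked check of the argument (assuming, as in the picture two slides back, that the datapoint has the same density p(x) under both Gaussians and that the two clusters are chosen with equal probability):

```latex
\underbrace{1}_{\text{bit for the cluster identity}}
\;+\;
\underbrace{\big(-\log_{2} p(x)\big)}_{\text{bits for the misfit}}
\;-\;
\underbrace{1}_{\text{bit of the other message we get back}}
\;=\; -\log_{2} p(x)
\;=\; -\log_{2}\!\Big(\tfrac{1}{2}p(x) + \tfrac{1}{2}p(x)\Big),
```

which is exactly the cost of the optimal code for the mixture density.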

The general case
[Diagram: a datapoint with three possible clusters, Gaussian 0, Gaussian 1, and Gaussian 2.]
Expected message length = Σ_i p_i E_i - H(p), where E_i is the number of bits required to send the cluster identity plus the data relative to that cluster center, H(p) = -Σ_i p_i log p_i is the number of random bits required to pick which cluster, and p_i is the probability of picking cluster i.

What is the best distribution?
The sender and receiver can use any distribution they like.
–But what distribution minimizes the expected message length?
The minimum occurs when we pick codes using a Boltzmann distribution: p_i = exp(-E_i) / Σ_j exp(-E_j).
This gives the best trade-off between entropy and expected energy.
–It is how physics behaves when there is a system that has many alternative configurations, each of which has a particular energy (at a temperature of 1).
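A small numeric check of this claim (the energies are made up for illustration): the expected message length F = Σ_i p_i E_i - H(p) from the previous slide is smallest when p is the Boltzmann distribution, and its minimum value is -log Σ_i exp(-E_i).

```python
import numpy as np

rng = np.random.default_rng(2)
E = np.array([1.0, 2.5, 4.0])            # cost (in nats) of coding the data via each cluster

def free_energy(p, E):
    return float(p @ E + p @ np.log(p))  # expected energy minus entropy

boltzmann = np.exp(-E) / np.exp(-E).sum()
print("F at the Boltzmann distribution:", free_energy(boltzmann, E))
print("-log(sum of exp(-E)):           ", -np.log(np.exp(-E).sum()))  # the same number

# Any other distribution over the clusters gives a longer expected message.
for _ in range(5):
    q = rng.dirichlet(np.ones(len(E)))
    assert free_energy(q, E) >= free_energy(boltzmann, E) - 1e-12
```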

Free Energy
F = Σ_i p_i E_i - T H(p), where E_i is the energy of configuration i, p_i is the probability of finding the system in configuration i, H(p) is the entropy of the distribution over configurations, and T is the temperature.
The free energy of a set of configurations is the energy that a single configuration would have to have in order to have as much probability as that entire set.
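In symbols (a reconstruction of the relationship described above, not taken verbatim from the slide): a single configuration with energy F must have the same total probability as the whole set, and at equilibrium this agrees with the expected-energy-minus-entropy form.

```latex
e^{-F/T} \;=\; \sum_i e^{-E_i/T}
\quad\Longleftrightarrow\quad
F \;=\; -T \log \sum_i e^{-E_i/T}
\;=\; \sum_i p_i E_i \;-\; T\,H(p)
\quad\text{when } p_i = \frac{e^{-E_i/T}}{\sum_j e^{-E_j/T}} .
```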

A Canadian example
Ice is a more regular and lower-energy packing of water molecules than liquid water.
–Let's assume all ice configurations have the same energy.
But there are vastly more configurations called water.

An MDL view of PCA
PCA is a degenerate form of MDL:
–We ignore the cost of coding the model (the directions of the principal components).
–We ignore the cost of coding the projections onto the principal components for each data point.
–We assume that the reconstruction errors are all coded using the same width of Gaussian.
If we include the cost of coding the projections (using a Gaussian assumption) and we use the optimal variance of Gaussian for coding the residuals on each dimension in the reconstructions, we get Factor Analysis.
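As a quick illustration of that last point (a sketch with made-up data, using scikit-learn's PCA and FactorAnalysis rather than anything from the lecture): Factor Analysis fits a separate residual variance for each input dimension, whereas PCA effectively assumes a single shared one.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(3)
Z = rng.normal(size=(1000, 2))                        # 2 latent factors
W = rng.normal(size=(2, 6))                           # loading matrix
noise_std = np.array([0.1, 0.2, 0.4, 0.8, 1.0, 1.5])  # different noise level per dimension
X = Z @ W + rng.normal(size=(1000, 6)) * noise_std

fa = FactorAnalysis(n_components=2).fit(X)
pca = PCA(n_components=2).fit(X)
print("FA noise variances :", np.round(fa.noise_variance_, 2))   # roughly noise_std**2, per dimension
print("PCA noise variance :", np.round(pca.noise_variance_, 2))  # a single shared value
```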

A spectrum of representations
PCA is powerful because it uses distributed representations, but limited because its representations are linearly related to the data.
–Autoencoders with more hidden layers are not limited in this way.
Clustering is powerful because it uses very non-linear representations, but limited because its representations are local (not componential).
We need representations that are both distributed and non-linear.
–Unfortunately, these are typically very hard to learn.
[Diagram: a 2-by-2 chart with axes local vs. distributed and linear vs. non-linear; clustering is local and non-linear, PCA is distributed and linear, and what we need is distributed and non-linear.]