CSC2535: Computation in Neural Networks Lecture 11 Extracting coherent properties by maximizing mutual information across space or time Geoffrey Hinton.

The aims of unsupervised learning We would like to extract a representation of the sensory input that is useful for later processing. We want to do this without requiring labeled data. Prior ideas about what the internal representation should look like ought to be helpful. So what would we like in a representation? Hidden causes that explain high-order correlations? Constraints that often hold? A low-dimensional manifold that contains all the data? Properties that are invariant across space or time?

Temporally invariant properties Consider a rigid object that is moving relative to the retina: Its retinal image changes in predictable ways. Its true 3-D shape stays exactly the same; it is invariant over time. Its angular momentum also stays the same if it is in free fall. Properties that are invariant over time are usually interesting.

Spatially invariant properties Consider a smooth surface covered in random dots that is viewed from two different directions: Each image is just a set of random dots. A stereo pair of images has disparity that changes smoothly over space. Nearby regions of the image pair have very similar disparities. [Diagram: the surface viewed by the left eye and the right eye, with the plane of fixation marked.]
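
To make the setup concrete, here is a minimal NumPy sketch (my own illustration, with made-up sizes and a hypothetical smooth_disparity helper) of a one-dimensional random-dot stereo pair whose disparity varies smoothly across space. Each image on its own is noise; only the pair carries the depth information.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_disparity(n, max_shift=4):
    """A slowly varying integer disparity profile (illustrative only)."""
    coarse = rng.uniform(0, max_shift, size=n // 16 + 2)
    return np.round(np.interp(np.arange(n), np.arange(len(coarse)) * 16, coarse)).astype(int)

n = 256
right = (rng.random(n) > 0.5).astype(float)            # random dots seen by the right eye
disparity = smooth_disparity(n)                        # depth-dependent horizontal shift
left = np.array([right[(i + disparity[i]) % n] for i in range(n)])

# Nearby positions have very similar disparities, so a module looking at one
# patch of the pair can predict the disparity seen by a neighbouring module.
print(disparity[:16])
```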

Learning temporal invariances [Diagram: two copies of the same network, one receiving the image at time t and one receiving the image at time t+1; each maps its image through hidden layers to a set of non-linear features, and the objective is to maximize agreement between the two feature outputs.]

A new way to get a teaching signal Each module uses the output of the other module as the teaching signal. This does not work if the two modules can see the same data. They just report one component of the data and agree perfectly. It also fails if a module always outputs a constant. The modules can just ignore the data and agree on what constant to output. We need a sensible definition of the amount of agreement between the outputs.

Mutual information Two variables, a and b, have high mutual information if you can predict a lot about one from the other: $I(a;b) = H(a) + H(b) - H(a,b)$, the sum of the individual entropies minus the joint entropy. There is also an asymmetric way to write the same quantity: $I(a;b) = H(a) - H(a|b)$. Compute derivatives of I w.r.t. the feature activities. Then backpropagate to get derivatives for all the weights in the network. The network at time t is using the network at time t+1 as its teacher (and vice versa).
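
As a quick numerical check (my own illustration, not from the slides), the following NumPy snippet computes the mutual information of two jointly Gaussian scalars from the three entropies and compares it with the closed form -0.5*log(1 - rho^2):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a (possibly multivariate) Gaussian."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

rho = 0.8
joint_cov = np.array([[1.0, rho],
                      [rho, 1.0]])

# I(a;b) = H(a) + H(b) - H(a,b)
mi = (gaussian_entropy(joint_cov[:1, :1])
      + gaussian_entropy(joint_cov[1:, 1:])
      - gaussian_entropy(joint_cov))

print(mi, -0.5 * np.log(1 - rho ** 2))   # the two values agree
```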

Some advantages of mutual information If the modules output constants the mutual information is zero. If the modules each output a vector, the mutual information is maximized by making the components of each vector be as independent as possible. Mutual information exactly captures what we mean by “agreeing”.

A problem We can never have more mutual information between the two output vectors than there is between the two input vectors. So why not just use the input vector as the output? We want to preserve as much mutual information as possible whilst also achieving something else: Dimensionality reduction? A simple form for the prediction of one output from the other?

A simple form for the relationship Assume the output of module a equals the output of module b plus noise. For Gaussian variables this gives $I(a;b) = H(a) - H(a|b) = \tfrac{1}{2}\log\frac{V(a)}{V(a-b)}$, where V denotes variance. If we assume that a and b are both noisy versions of the same underlying signal, we can use the symmetric form $I \approx \tfrac{1}{2}\log\frac{V(a+b)}{V(a-b)}$.
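
A minimal NumPy sketch of this symmetric objective (my own illustration, assuming scalar module outputs collected over a batch):

```python
import numpy as np

def symmetric_mi(a, b, eps=1e-8):
    """0.5 * log V(a+b)/V(a-b): large when a and b agree on a variable signal,
    roughly zero when they are unrelated, and zero if both outputs are constant
    (both variances vanish and the regularized ratio is 1), so constants don't win."""
    return 0.5 * np.log((np.var(a + b) + eps) / (np.var(a - b) + eps))

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + 0.3 * rng.normal(size=1000)             # b is a noisy copy of a
print(symmetric_mi(a, b))                       # clearly positive
print(symmetric_mi(a, rng.normal(size=1000)))   # near zero for unrelated outputs
```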

Learning temporal invariances [Diagram: the same two-network architecture, with the image at time t and the image at time t+1 each mapped through hidden layers to non-linear features; the objective is now to maximize the mutual information between the two feature outputs, and the derivatives of the mutual information are backpropagated through both networks.]
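
For concreteness, here is a small PyTorch sketch (my own, not from the lecture) of the training loop the diagram implies: a single weight-shared network is applied to the input at time t and at time t+1, and gradient ascent is performed on the variance-ratio approximation to the mutual information introduced above. The synthetic data and network sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))  # shared between t and t+1
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def symmetric_mi(a, b, eps=1e-8):
    return 0.5 * torch.log((a + b).var() + eps) - 0.5 * torch.log((a - b).var() + eps)

W_true = torch.randn(1, 10)                         # how the invariant property appears in the image
for step in range(2000):
    z = torch.randn(256, 1)                         # the temporally invariant property
    x_t  = z @ W_true + 0.5 * torch.randn(256, 10)
    x_t1 = z @ W_true + 0.5 * torch.randn(256, 10)  # same property, fresh noise at time t+1
    a, b = net(x_t).squeeze(), net(x_t1).squeeze()
    loss = -symmetric_mi(a, b)                      # maximize agreement between the two outputs
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(symmetric_mi(net(x_t).squeeze(), net(x_t1).squeeze())))
```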

Maximizing mutual information between a local region and a larger context [Diagram: five modules, each with its own hidden layer, look at neighbouring patches of the left-eye and right-eye images of a surface; the outputs of the four context modules are combined with weights w1, w2, w3, w4 to form a contextual prediction, and the objective is to maximize the mutual information between that prediction and the output of the middle module.]

How well does it work? If we use weight sharing between modules and plenty of hidden units, it works really well. It extracts the depth of the surface fairly accurately. It simultaneously learns the optimal weights of -1/6, +4/6, +4/6, -1/6 for interpolating the depths of the context to predict the depth at the middle module. If the data is noisy or the modules are unreliable it learns a more robust interpolator that uses smaller weights in order not to amplify noise.
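
Those context weights are exactly the Lagrange weights for interpolating the centre value from four equally spaced neighbours at offsets -2, -1, +1, +2, so they are exact for any cubic depth profile. A quick NumPy check (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([-1/6, 4/6, 4/6, -1/6])        # weights for neighbours at offsets -2, -1, +1, +2
offsets = np.array([-2.0, -1.0, 1.0, 2.0])

coeffs = rng.normal(size=4)                  # a random cubic c0 + c1*x + c2*x^2 + c3*x^3
cubic = lambda x: np.polyval(coeffs[::-1], x)

# The weighted combination of the four neighbours reproduces the centre value exactly.
print(np.dot(w, cubic(offsets)), cubic(0.0))
```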

But what about discontinuities? Real surfaces are mostly smooth but also have sharp discontinuities in depth. How can we preserve the high mutual information between local depth and contextual depth? Discontinuities cause occasional high residual errors. The Gaussian model of residuals requires high variance to accommodate these large errors.

A simple mixture approach We assume that there are “continuity” cases in which there is high MI and “discontinuity” cases in which there is no MI. The variance of the residual is only computed on the continuity cases so it can stay small. The residual can be used to compute the posterior probability of each type of case. Aim to maximize the mixing proportion of the continuity cases times the MI in those cases.
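
A sketch of the bookkeeping this implies (my own illustration; the slide does not specify the discontinuity-case model, so a broad uniform density is assumed for it here):

```python
import numpy as np

def continuity_responsibilities(residuals, pi_cont, sigma_cont, uniform_density=1e-2):
    """Posterior probability that each residual came from the 'continuity' case."""
    p_cont = pi_cont * np.exp(-0.5 * (residuals / sigma_cont) ** 2) \
             / (np.sqrt(2 * np.pi) * sigma_cont)
    p_disc = (1.0 - pi_cont) * uniform_density
    return p_cont / (p_cont + p_disc)

residuals = np.array([0.05, -0.10, 3.00, 0.02])   # one big error at a depth discontinuity
r = continuity_responsibilities(residuals, pi_cont=0.9, sigma_cont=0.2)
print(r)                                           # the big residual gets responsibility near 0

# The continuity-case variance is re-estimated using only these responsibilities,
# so it stays small; the training objective is the continuity mixing proportion
# times the mutual information measured on the continuity cases.
sigma_cont_new = np.sqrt(np.sum(r * residuals ** 2) / np.sum(r))
```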

Mixtures of expert interpolators Instead of just giving up on discontinuity cases, we can use a different interpolator that ignores the surface beyond the discontinuity. With five adjacent locations a, b, c, d, e and a discontinuity between c and d, predict the depth at c from the left side alone using -a + 2b. To choose this interpolator, find the location of the discontinuity.
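
As a worked check (not from the slides): if the depth varies linearly across the equally spaced locations a, b, c, so that $d_a = d_0$, $d_b = d_0 + s$, $d_c = d_0 + 2s$, then

$$-d_a + 2 d_b = -d_0 + 2(d_0 + s) = d_0 + 2s = d_c,$$

so the one-sided interpolator is exact for locally linear surfaces while ignoring everything beyond the discontinuity.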

The mixture of interpolators net There are five interpolators, each with its own controller. Each controller is a neural net that looks at the outputs of all five modules and learns to detect a discontinuity at a particular location, except for the controller of the full interpolator, which checks that there is no discontinuity. The mixture of expert interpolators trains the controllers, the interpolators, and the local depth modules all together.
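
A minimal NumPy sketch of the gating arithmetic (my own illustration: the expert weight patterns and the hand-set gating logits are placeholders, and in the real system each controller is a learned neural net looking at all five module outputs):

```python
import numpy as np

# Five candidate interpolators over the context outputs (a, b, d, e): the full
# cubic interpolator plus partial ones that ignore the surface on one side of
# a presumed discontinuity (the partial patterns below are illustrative only).
EXPERTS = np.array([
    [-1/6, 4/6, 4/6, -1/6],   # no discontinuity: the full interpolator
    [-1.0, 2.0, 0.0,  0.0],   # use a, b only (discontinuity to the right of the middle)
    [ 0.0, 0.0, 2.0, -1.0],   # use d, e only (discontinuity to the left of the middle)
    [ 0.0, 1.0, 0.0,  0.0],   # placeholder pattern: nearest neighbour on the left
    [ 0.0, 0.0, 1.0,  0.0],   # placeholder pattern: nearest neighbour on the right
])

def mixture_prediction(context, gating_logits):
    """Softmax-weighted mixture of the experts' predictions for the middle depth."""
    predictions = EXPERTS @ context
    gates = np.exp(gating_logits - gating_logits.max())
    gates /= gates.sum()
    return gates @ predictions, gates

context = np.array([1.0, 1.1, 1.3, 1.4])   # depths at a, b, d, e
pred, gates = mixture_prediction(context, gating_logits=np.array([3.0, 0, 0, 0, 0]))
print(pred, gates)
```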

Mutual information with multidimensional output For a multidimensional Gaussian, the entropy is given by the log determinant of the covariance matrix: $H = \tfrac{1}{2}\log\det\Sigma + \text{const}$. If we use the identity-plus-noise model of the relationship between the outputs of the two modules we get $I \approx \tfrac{1}{2}\log\frac{\det\Sigma_{a+b}}{\det\Sigma_{a-b}}$. If we assume the outputs are jointly Gaussian we get $I(a;b) = \tfrac{1}{2}\log\frac{\det\Sigma_a\,\det\Sigma_b}{\det\Sigma_{ab}}$, where $\Sigma_{ab}$ is the covariance of the concatenated outputs.
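
A small NumPy check of the jointly Gaussian formula (my own illustration with synthetic vector-valued outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 3-dimensional module outputs that share a common underlying cause.
z = rng.normal(size=(5000, 3))
a = z + 0.5 * rng.normal(size=(5000, 3))
b = z @ rng.normal(size=(3, 3)) + 0.5 * rng.normal(size=(5000, 3))

cov = np.cov(np.hstack([a, b]).T)        # 6x6 covariance of the concatenated outputs
cov_a, cov_b = cov[:3, :3], cov[3:, 3:]

mi = 0.5 * np.log(np.linalg.det(cov_a) * np.linalg.det(cov_b) / np.linalg.det(cov))
print(mi)                                # mutual information (in nats) under the Gaussian assumption
```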

Relationship to a linear dynamical system [Diagram: the image at time t and the image at time t+1 are each mapped to features; a linear model, which could be the identity plus noise, relates the features of the past to the features at time t+1, so the prediction is made in the feature domain, as in a linear dynamical system.]
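
A minimal sketch of this view (my own, in NumPy): given feature trajectories, fit the linear transition by least squares and score how predictable the next features are; this predictability in the feature domain is what the mutual-information objective rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature trajectory from a true linear dynamical system.
A_true = np.array([[0.9, 0.1], [-0.1, 0.9]])
z = np.zeros((500, 2))
for t in range(1, 500):
    z[t] = z[t - 1] @ A_true.T + 0.1 * rng.normal(size=2)

past, future = z[:-1], z[1:]
A_hat, *_ = np.linalg.lstsq(past, future, rcond=None)   # least-squares linear model: past -> future

residual_var = np.var(future - past @ A_hat)
total_var = np.var(future)
print(0.5 * np.log(total_var / residual_var))   # crude predictability score in the feature domain
```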