CSC2535: Computation in Neural Networks
Lecture 11: Extracting coherent properties by maximizing mutual information across space or time
Geoffrey Hinton
The aims of unsupervised learning
We would like to extract a representation of the sensory input that is useful for later processing. We want to do this without requiring labeled data. Prior ideas about what the internal representation should look like ought to be helpful. So what would we like in a representation? Hidden causes that explain high-order correlations? Constraints that often hold? A low-dimensional manifold that contains all the data? Properties that are invariant across space or time?
Temporally invariant properties
Consider a rigid object that is moving relative to the retina. Its retinal image changes in predictable ways. Its true 3-D shape stays exactly the same: it is invariant over time. Its angular momentum also stays the same if it is in free fall. Properties that are invariant over time are usually interesting.
Spatially invariant properties
Consider a smooth surface covered in random dots that is viewed from two different directions. Each image is just a set of random dots. A stereo pair of images has disparity that changes smoothly over space. Nearby regions of the image pair have very similar disparities.
[Figure: stereo viewing geometry showing the left eye, the right eye, the plane of fixation, and the surface.]
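Learning temporal invariances
[Figure: two modules, one viewing the image at time t and the other the image at time t+1. Each passes its image through hidden layers to produce non-linear features, and the two feature outputs are trained to maximize agreement.]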
A new way to get a teaching signal
Each module uses the output of the other module as the teaching signal. This does not work if the two modules can see the same data: they just report one component of the data and agree perfectly. It also fails if a module always outputs a constant: the modules can just ignore the data and agree on what constant to output. We need a sensible definition of the amount of agreement between the outputs.
Mutual information
Two variables, a and b, have high mutual information if you can predict a lot about one from the other. The mutual information is the sum of the individual entropies minus the joint entropy:
$$I(a;b) = H(a) + H(b) - H(a,b)$$
There is also an asymmetric way to define mutual information:
$$I(a;b) = H(a) - H(a \mid b)$$
Compute derivatives of I w.r.t. the feature activities, then backpropagate to get derivatives for all the weights in the network. The network at time t is using the network at time t+1 as its teacher (and vice versa).
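As a small illustration of the entropy-based definition, the following sketch estimates I(a;b) = H(a) + H(b) - H(a,b) from paired discrete samples. The toy data and the function names are assumptions made for this example, not part of the lecture; the cases are chosen so that the answers are exact (a perfect copy, an independent variable, and a constant output).

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy, in bits, of a sequence of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(a, b):
    """I(a;b) = H(a) + H(b) - H(a,b), estimated from paired samples."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Hypothetical paired outputs of two modules.
a = [0, 0, 1, 1]
copy_of_a = [0, 0, 1, 1]    # perfectly predictable from a
independent = [0, 1, 0, 1]  # tells us nothing about a
constant = [7, 7, 7, 7]     # a module that ignores its input

mi_copy = mutual_information(a, copy_of_a)
mi_indep = mutual_information(a, independent)
mi_const = mutual_information(a, constant)
```

In this toy case the copy shares one full bit with a, while the independent and constant outputs share none, which is why agreeing on a constant earns nothing under this measure.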
Some advantages of mutual information
If the modules output constants, the mutual information is zero. If the modules each output a vector, the mutual information is maximized by making the components of each vector as independent as possible. Mutual information exactly captures what we mean by “agreeing”.
A problem
We can never have more mutual information between the two output vectors than there is between the two input vectors. So why not just use the input vector as the output? We want to preserve as much mutual information as possible whilst also achieving something else: Dimensionality reduction? A simple form for the prediction of one output from the other?
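A simple form for the relationship
Assume the output of module a equals the output of module b plus Gaussian noise:
$$I(a;b) = \frac{1}{2}\log\frac{V(a)}{V(a-b)}$$
If we assume that a and b are both noisy versions of the same underlying signal we can use
$$I = \frac{1}{2}\log\frac{V(a+b)}{V(a-b)}$$
where $V(\cdot)$ denotes the variance over the training cases.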
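The sketch below shows how an objective of this kind could be evaluated on data. It assumes the Gaussian form I = 0.5 log(V(a+b) / V(a-b)), where V is the variance over cases, and it simulates two module outputs as a shared unit-variance signal plus independent noise. The simulation setup, noise levels, and function names are illustrative assumptions only.

```python
import math
import random

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def agreement_objective(a, b):
    """0.5 * log(V(a+b) / V(a-b)), in nats (assumed Gaussian form)."""
    v_sum = variance([x + y for x, y in zip(a, b)])
    v_diff = variance([x - y for x, y in zip(a, b)])
    return 0.5 * math.log(v_sum / v_diff)

def simulate(noise_std, n=20000, seed=0):
    """Two outputs that are noisy copies of one unit-variance signal."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    a = [s + rng.gauss(0.0, noise_std) for s in signal]
    b = [s + rng.gauss(0.0, noise_std) for s in signal]
    return agreement_objective(a, b)

low_noise = simulate(noise_std=0.3)
high_noise = simulate(noise_std=1.0)
```

With unit signal variance and noise variance 1 in each output, V(a+b) = 6 and V(a-b) = 2 in expectation, so the high-noise value should be close to 0.5 log 3; the low-noise setting should score higher.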
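Learning temporal invariances
[Figure: two modules, one viewing the image at time t and the other the image at time t+1. Each passes its image through hidden layers to produce non-linear features. The objective is to maximize the mutual information between the two feature outputs, and derivatives are backpropagated through both modules.]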
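Maximizing mutual information between a local region and a larger context
[Figure: five hidden modules looking at neighboring patches of the left-eye and right-eye images of a surface. The outputs of the four context modules are combined with weights w1, w2, w3, w4 to form a contextual prediction, and the mutual information (MI) between this prediction and the output of the remaining module is maximized.]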
How well does it work?
If we use weight sharing between modules and plenty of hidden units, it works really well. It extracts the depth of the surface fairly accurately. It simultaneously learns the optimal weights of -1/6, +4/6, +4/6, -1/6 for interpolating the depths of the context to predict the depth at the middle module. If the data is noisy or the modules are unreliable, it learns a more robust interpolator that uses smaller weights in order not to amplify noise.
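The weights -1/6, +4/6, +4/6, -1/6 are what cubic (Lagrange) interpolation gives for predicting the value at the center of four equally spaced neighbors. The sketch below checks this with exact rational arithmetic. It assumes the context modules sit at positions -2, -1, +1, +2 relative to the middle module at 0; that layout and the test polynomial are assumptions for illustration.

```python
from fractions import Fraction

def lagrange_weights(xs, x0):
    """Weights w_i such that sum_i w_i * f(xs[i]) equals, at x0, the value of
    the unique polynomial of degree < len(xs) passing through the points."""
    weights = []
    for i, xi in enumerate(xs):
        w = Fraction(1)
        for j, xj in enumerate(xs):
            if j != i:
                w *= Fraction(x0 - xj, xi - xj)
        weights.append(w)
    return weights

# Assumed layout: context modules at -2, -1, +1, +2; middle module at 0.
context_positions = [-2, -1, 1, 2]
weights = lagrange_weights(context_positions, 0)

# Sanity check on an arbitrary cubic depth profile (hypothetical example).
def depth(x):
    return 3 + 2 * x - x**2 + 5 * x**3

predicted = sum(w * depth(x) for w, x in zip(weights, context_positions))
```

Because four points determine a cubic, the prediction is exact for any cubic depth profile, which is the sense in which these weights are optimal for smooth, noise-free surfaces.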
But what about discontinuities?
Real surfaces are mostly smooth but also have sharp discontinuities in depth. How can we preserve the high mutual information between local depth and contextual depth? Discontinuities cause occasional high residual errors. The Gaussian model of residuals requires high variance to accommodate these large errors.
A simple mixture approach
We assume that there are “continuity” cases in which there is high MI and “discontinuity” cases in which there is no MI. The variance of the residual is only computed on the continuity cases so it can stay small. The residual can be used to compute the posterior probability of each type of case. Aim to maximize the mixing proportion of the continuity cases times the MI in those cases.
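One way to picture how the residual yields a posterior over the two types of case is a two-component mixture evaluated with Bayes' rule. The sketch below assumes a narrow zero-mean Gaussian for continuity residuals and a much broader zero-mean Gaussian for discontinuity residuals. The broad Gaussian, the mixing proportion, and the standard deviations are all assumptions chosen for illustration, not values from the lecture.

```python
import math

def gaussian_pdf(x, std):
    """Density of a zero-mean Gaussian with the given standard deviation."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2 * math.pi))

def continuity_posterior(residual, mix_continuity, std_continuity, std_discontinuity):
    """Posterior probability that a case is a continuity case, given its residual."""
    p_cont = mix_continuity * gaussian_pdf(residual, std_continuity)
    p_disc = (1 - mix_continuity) * gaussian_pdf(residual, std_discontinuity)
    return p_cont / (p_cont + p_disc)

# Hypothetical settings: 80% continuity cases, narrow vs. broad residuals.
small_residual = continuity_posterior(0.0, 0.8, 0.1, 2.0)
large_residual = continuity_posterior(1.0, 0.8, 0.1, 2.0)
```

A small residual is attributed almost entirely to the continuity component, and a large one almost entirely to the discontinuity component, so large errors no longer inflate the continuity variance.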
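Mixtures of expert interpolators
Instead of just giving up on discontinuity cases, we can use a different interpolator that ignores the surface beyond the discontinuity. For example, with five adjacent locations a, b, c, d, e and a discontinuity between c and d, predict the depth at c using -a + 2b. To choose this interpolator, find the location of the discontinuity.
[Figure: five adjacent locations on the surface labeled a, b, c, d, e.]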
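To make the benefit concrete, the sketch below compares the full four-point interpolator with the one-sided predictor -a + 2b on a made-up depth profile that is planar through a, b, c and then jumps before d and e. The depth values, and the assumption that c is predicted from equally spaced neighbors, are hypothetical.

```python
def full_interpolator(a, b, d, e):
    """Four-point interpolator with weights -1/6, +4/6, +4/6, -1/6."""
    return (-a + 4 * b + 4 * d - e) / 6

def left_extrapolator(a, b):
    """Linear extrapolation from the two neighbors on the left: -a + 2b."""
    return -a + 2 * b

# Hypothetical depths: a planar ramp through a, b, c, then a jump before d, e.
a, b, c, d, e = 1.0, 1.5, 2.0, 7.0, 7.5

full_error = abs(full_interpolator(a, b, d, e) - c)
one_sided_error = abs(left_extrapolator(a, b) - c)
```

Here the one-sided predictor recovers c exactly, while the full interpolator is pulled toward the far side of the jump.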
The mixture of interpolators net
There are five interpolators, each with its own controller. Each controller is a neural net that looks at the outputs of all five modules and learns to detect a discontinuity at a particular location, except for the controller for the full interpolator, which checks that there is no discontinuity. The mixture of expert interpolators trains the controllers, the interpolators, and the local depth modules all together.
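Mutual Information with multidimensional output
For an n-dimensional Gaussian, the entropy is determined by the log determinant of the covariance matrix $\Sigma$:
$$H = \frac{1}{2}\log\left((2\pi e)^{n}\,|\Sigma|\right)$$
If we use the identity model of the relationship between the outputs of two modules we get
$$I = \frac{1}{2}\log\frac{|\Sigma_{a+b}|}{|\Sigma_{a-b}|}$$
If we assume the outputs are jointly Gaussian we get
$$I = \frac{1}{2}\log\frac{|\Sigma_{a}|\,|\Sigma_{b}|}{|\Sigma_{ab}|}$$
where $\Sigma_{ab}$ is the covariance matrix of the concatenated output vector.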
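For jointly Gaussian outputs, the mutual information follows from I = H(a) + H(b) - H(a,b) and reduces to a ratio of covariance determinants, 0.5 log(|Σ_a| |Σ_b| / |Σ_ab|). The sketch below evaluates that expression for small covariance matrices using a plain-Python determinant. The helper names and example matrices are assumptions for illustration; for one-dimensional a and b with correlation ρ the result should equal -0.5 log(1 - ρ²).

```python
import math

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]  # work on a copy
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def gaussian_mi(joint_cov, dim_a):
    """0.5 * log(|S_a| |S_b| / |S_ab|), in nats, for jointly Gaussian (a, b)."""
    cov_a = [row[:dim_a] for row in joint_cov[:dim_a]]
    cov_b = [row[dim_a:] for row in joint_cov[dim_a:]]
    return 0.5 * math.log(det(cov_a) * det(cov_b) / det(joint_cov))

rho = 0.8
correlated = [[1.0, rho], [rho, 1.0]]    # hypothetical joint covariance
uncorrelated = [[2.0, 0.0], [0.0, 3.0]]  # independent outputs
mi_correlated = gaussian_mi(correlated, dim_a=1)
mi_uncorrelated = gaussian_mi(uncorrelated, dim_a=1)
```

The same function accepts larger joint covariance matrices, with `dim_a` marking where the output of module a ends and the output of module b begins.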
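Relationship to linear dynamical system
[Figure: the images at time t and time t+1 are each mapped to features (panel labels: "features", "linear features"). A linear model (could be the identity plus noise) predicts the features at time t+1 from the past. Annotation on the feature layer: "We predict in this domain".]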