1
CSC2535: Lecture 4: Autoencoders, Free energy, and Minimum Description Length
Geoffrey Hinton
2
Three problems with backpropagation
Where does the supervision come from?
 –Most data is unlabelled. The vestibulo-ocular reflex is an exception.
How well does the learning time scale?
 –It is impossible to learn features for different parts of an image independently if they all use the same error signal.
Can neurons implement backpropagation?
 –Not in the obvious way, but getting derivatives from later layers is so important that evolution may have found a way.
[Figure: a single unit with weights w1, w2 and output y.]
3
The Goals of Unsupervised Learning
Without a desired output or reinforcement signal it is much less obvious what the goal is.
Discover useful structure in large data sets without requiring a supervisory signal:
 –Create representations that are better for subsequent supervised or reinforcement learning.
 –Build a density model: classify by seeing which model likes the test case most, or monitor a complex system by noticing improbable states.
 –Extract interpretable factors (causes or constraints).
Improve learning speed for high-dimensional inputs:
 –Allow features within a layer to learn independently.
 –Allow multiple layers to be learned greedily.
4
Self-supervised backpropagation
Autoencoders define the desired output to be the same as the input.
 –This is trivial to achieve with direct connections: the identity is easy to compute!
It becomes useful if we squeeze the information through some kind of bottleneck:
 –Only a few hidden units? Very similar to Principal Components Analysis.
 –Only a single active hidden unit? This is clustering (mixture models), and it is easy to consider all possible hidden configurations.
 –Minimize the information in the hidden unit activities? How do we measure the information?
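A minimal sketch (not from the lecture) of self-supervised backpropagation through a bottleneck, assuming numpy and made-up toy data; the network is kept linear for simplicity, which connects to the PCA point on the next slide. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10-D inputs generated from 3 underlying factors plus a little noise.
Z = rng.normal(size=(200, 3))
X = Z @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X -= X.mean(axis=0)

n_hidden = 3                                      # the bottleneck
W1 = rng.normal(scale=0.1, size=(10, n_hidden))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(n_hidden, 10))   # hidden -> output weights
lr = 0.005

def mse(A, B):
    return np.mean((A - B) ** 2)

print("initial reconstruction error:", round(mse(X @ W1 @ W2, X), 4))

for step in range(2000):
    H = X @ W1          # hidden code
    R = H @ W2          # reconstruction
    err = R - X         # the "desired output" is the input itself
    # Backpropagate the squared reconstruction error to both weight matrices.
    W2 -= lr * H.T @ err / len(X)
    W1 -= lr * X.T @ (err @ W2.T) / len(X)

print("final reconstruction error:  ", round(mse(X @ W1 @ W2, X), 4))
```

The final error drops to roughly the noise level, because the 3-unit bottleneck is enough to carry the 3 underlying factors.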
5
Self-supervised backprop and PCA
If the hidden and output layers are linear, the network learns hidden units that are a linear function of the data and minimize the squared reconstruction error.
The m hidden units will span the same space as the first m principal components:
 –Their weight vectors may not be orthogonal.
 –They will tend to have equal variances.
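A quick numerical check of this claim (illustrative only, assuming numpy; the data and step sizes are made up): after training a linear autoencoder, the subspace spanned by its decoder weight vectors matches the span of the first m principal components, even though the individual weight vectors need not be orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated 8-D data: the first 3 directions carry most of the variance.
X = rng.normal(size=(500, 8))
X[:, :3] *= 3.0
X -= X.mean(axis=0)
m = 3

# Projector onto the span of the first m principal components (via SVD).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:m].T @ Vt[:m]

# Train a linear autoencoder with m hidden units by gradient descent.
W1 = rng.normal(scale=0.1, size=(8, m))
W2 = rng.normal(scale=0.1, size=(m, 8))
for _ in range(5000):
    H = X @ W1
    err = H @ W2 - X
    W2 -= 0.01 * H.T @ err / len(X)
    W1 -= 0.01 * X.T @ (err @ W2.T) / len(X)

# Projector onto the subspace spanned by the decoder's weight vectors.
Q, _ = np.linalg.qr(W2.T)
P_ae = Q @ Q.T

print("distance between subspaces:", np.linalg.norm(P_pca - P_ae))  # close to 0
print("decoder Gram matrix (need not be diagonal):\n", np.round(W2 @ W2.T, 2))
```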
6
Self-supervised backprop and clustering
If we force the hidden unit whose weight vector is closest to the input vector to have an activity of 1 and the rest to have activities of 0, we get clustering.
 –The weight vector of each hidden unit represents the center of a cluster.
 –Input vectors are reconstructed as the nearest cluster center.
We need to tie the input->hidden weights to be the same as the hidden->output weights.
 –Usually we cannot backpropagate through binary hidden units, but in this case the derivatives for the input->hidden weights all become zero:
   If the winner doesn't change – no derivative.
   The winner changes only when two hidden units give exactly the same error – no derivative.
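A minimal sketch (assuming numpy and made-up 2-D data) of the learning rule this produces: with tied weights and a winner-take-all hidden layer, the squared-error gradient is non-zero only for the winning weight vector, which it moves toward the input. This is just online k-means / competitive learning; the initialization trick below is only there to keep the toy example simple.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: three well-separated 2-D clusters.
data_by_cluster = [rng.normal(loc=c, scale=0.3, size=(100, 2))
                   for c in [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]]
# Tied weight vectors = cluster centers; start each one inside a different cluster.
centers = np.array([d[0] for d in data_by_cluster])
X = np.concatenate(data_by_cluster)
rng.shuffle(X)

lr = 0.05
for epoch in range(20):
    for x in X:
        # The hidden unit whose weight vector is closest to x wins (activity 1, rest 0).
        winner = np.argmin(((centers - x) ** 2).sum(axis=1))
        # The reconstruction is the winning center, so the squared-error gradient
        # is non-zero only for the winner and moves its weight vector toward the input.
        centers[winner] += lr * (x - centers[winner])

print("learned cluster centers (close to (0,0), (3,0), (0,3)):\n", np.round(centers, 2))
```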
7
An MDL approach to clustering
[Diagram: the sender communicates quantized data to the receiver, who reconstructs it perfectly. The message has three parts: the cluster parameters (the centers of the clusters), a code identifying the cluster for each datapoint, and the data-misfit of each datapoint from its cluster center.]
8
How many bits must we send?
Model parameters:
 –It depends on the priors and on how accurately they are sent.
 –Let's ignore these details for now.
Codes:
 –If all n clusters are equiprobable, the code takes log n bits. This is extremely plausible, but wrong!
 –We can do it in fewer bits. This is extremely implausible, but right.
Data misfits:
 –If the sender and receiver assume a Gaussian distribution within the cluster, the misfit costs −log p(d | cluster) bits, which depends on the squared distance of d from the cluster center.
9
Using a Gaussian agreed distribution (again!)
Assume we need to send a value, x, with a quantization width of t.
The number of bits this requires depends on how probable x is under the agreed Gaussian and on the width t, as sketched below.
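A reconstruction of the coding cost (stated here in nats; divide by ln 2 for bits), under the usual MDL assumption that a value sent to precision t has probability mass roughly t times its density p(x) under the agreed Gaussian with mean mu and variance sigma^2:

```latex
% Cost of sending x to quantization width t under an agreed Gaussian N(\mu, \sigma^2):
\mathrm{cost}(x) \;\approx\; -\log\!\big[\,t\,p(x)\,\big]
  \;=\; -\log t \;+\; \log\!\big(\sqrt{2\pi}\,\sigma\big) \;+\; \frac{(x-\mu)^{2}}{2\sigma^{2}}
```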
10
What is the best variance to use?
The expected coding cost is minimized by setting the variance of the Gaussian to be the variance of the residuals, as the short derivation below shows.
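A short derivation (a reconstruction, not copied from the slide), dropping the constant −log t term from the cost above:

```latex
% Expected coding cost per residual under N(\mu, \sigma^2):
\mathbb{E}[\mathrm{cost}] \;=\; \log\!\big(\sqrt{2\pi}\,\sigma\big)
  \;+\; \frac{\mathbb{E}\big[(x-\mu)^{2}\big]}{2\sigma^{2}}
% Set the derivative with respect to sigma to zero:
\frac{d\,\mathbb{E}[\mathrm{cost}]}{d\sigma}
  \;=\; \frac{1}{\sigma} \;-\; \frac{\mathbb{E}\big[(x-\mu)^{2}\big]}{\sigma^{3}} \;=\; 0
\quad\Longrightarrow\quad \sigma^{2} \;=\; \mathbb{E}\big[(x-\mu)^{2}\big]
```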
11
Sending a value assuming a mixture of two equal Gaussians
The point halfway between the two Gaussians should cost −log p(x) bits, where p(x) is its density under one of the Gaussians (which, at the halfway point, equals its density under the mixture).
How can we make this compatible with the MDL story?
[Figure: two overlapping Gaussians with the point x midway between their means.]
12
The bits-back argument
Consider a datapoint that is equidistant from two cluster centers.
 –The sender could code it relative to cluster 0 or relative to cluster 1.
 –Either way, the sender has to send one bit to say which cluster is being used.
It seems like a waste to have to send a bit when you don't care which cluster you use. It must be inefficient to have two different ways of encoding the same point.
[Figure: Gaussian 0 and Gaussian 1 with the datapoint midway between them.]
13
Using another message to make random decisions
Suppose the sender is also trying to communicate another message:
 –The other message is completely independent.
 –It looks like a random bit stream.
Whenever the sender has to choose between two equally good ways of encoding the data, a bit from the other message is used to make the decision.
After the receiver has losslessly reconstructed the original data, the receiver can pretend to be the sender:
 –This lets the receiver figure out which choice was made, and hence recover the random bit of the other message.
So the original message cost one bit less than we thought, because we also communicated a bit from another message.
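A worked check (an illustration, not from the slides) for the halfway point of the two-Gaussian example, where the densities p_0(x) and p_1(x) under the two Gaussians are equal:

```latex
% Naive cost of coding x relative to cluster 0: one bit for the choice, then the misfit.
\mathrm{naive\ cost} \;=\; 1 \;+\; \big(\!-\log_{2} p_{0}(x)\big)
% The choice bit carries one bit of the other message, so we get it back:
\mathrm{net\ cost} \;=\; -\log_{2} p_{0}(x)
  \;=\; -\log_{2}\!\Big[\tfrac{1}{2}\,p_{0}(x) + \tfrac{1}{2}\,p_{1}(x)\Big]
  \;=\; -\log_{2} p(x)
```

This is exactly the cost the mixture density says the point should have.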
14
The general case
[Figure: Gaussians 0, 1 and 2, with a datapoint that could be coded relative to any of them.]
Expected cost = Σ_i p_i E_i − H(p), where
 –E_i is the number of bits required to send the cluster identity plus the data relative to cluster center i,
 –H(p) = −Σ_i p_i log p_i is the number of random bits required to pick which cluster (the bits we get back), and
 –p_i is the probability of picking cluster i.
15
What is the best distribution?
The sender and receiver can use any distribution they like for picking clusters.
 –But what distribution minimizes the expected message length?
The minimum occurs when we pick codes using a Boltzmann distribution:
 p_i = exp(−E_i) / Σ_j exp(−E_j)
This gives the best trade-off between entropy and expected energy.
 –It is how physics behaves when a system has many alternative configurations, each of which has a particular energy (at a temperature of 1).
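A quick numerical check (illustrative; the energies are made up, assuming numpy) that the Boltzmann distribution minimizes expected energy minus entropy, i.e. the expected message length in nats:

```python
import numpy as np

rng = np.random.default_rng(3)
E = np.array([1.0, 2.5, 0.3, 4.0])   # made-up energies (code costs in nats) of the alternatives

def expected_message_length(p, E):
    # expected energy minus entropy (in nats): sum_i p_i E_i + sum_i p_i log p_i
    return np.sum(p * E) + np.sum(p * np.log(p))

boltzmann = np.exp(-E) / np.exp(-E).sum()
best = expected_message_length(boltzmann, E)

# Any other distribution over the alternatives does at least as badly.
for _ in range(5):
    q = rng.dirichlet(np.ones(len(E)))
    assert expected_message_length(q, E) >= best - 1e-12
    print(round(expected_message_length(q, E), 4), ">=", round(best, 4))
```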
16
Free Energy
F = Σ_i p_i E_i − T H(p), where
 –E_i is the energy of configuration i,
 –p_i is the probability of finding the system in configuration i,
 –H(p) = −Σ_i p_i log p_i is the entropy of the distribution over configurations, and
 –T is the temperature.
The free energy of a set of configurations is the energy that a single configuration would have to have in order to have as much probability as that entire set.
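Making the last sentence precise (a standard identity, not copied from the slide): minimizing F over distributions p gives the Boltzmann distribution, and the minimum value is the energy a single configuration would need in order to carry the whole set's probability.

```latex
% Minimizing F(p) = \sum_i p_i E_i - T\,H(p) over distributions p gives
p_i \;=\; \frac{e^{-E_i/T}}{\sum_j e^{-E_j/T}},
\qquad
F^{\ast} \;=\; -\,T \log \sum_i e^{-E_i/T}
% so a single configuration with energy F^* has unnormalized probability
% e^{-F^*/T} = \sum_i e^{-E_i/T}, the same as the whole set of configurations.
```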
17
A Canadian example
Ice is a more regular and lower-energy packing of water molecules than liquid water.
 –Let's assume all ice configurations have the same energy.
But there are vastly more configurations called water, so even though each water configuration has higher energy, the set of water configurations can have the lower free energy.
18
An MDL view of PCA
PCA is a degenerate form of MDL:
 –We ignore the cost of coding the model (the directions of the principal components).
 –We ignore the cost of coding the projections onto the principal components for each data point.
 –We assume that the reconstruction errors are all coded using the same width of Gaussian.
If we include the cost of coding the projections (using a Gaussian assumption) and we use the optimal variance of Gaussian for coding the residuals on each dimension of the reconstruction, we get Factor Analysis.
19
A spectrum of representations
PCA is powerful because it uses distributed representations, but limited because its representations are linearly related to the data.
 –Autoencoders with more hidden layers are not limited in this way.
Clustering is powerful because it uses very non-linear representations, but limited because its representations are local (not componential).
We need representations that are both distributed and non-linear.
 –Unfortunately, these are typically very hard to learn.
[Diagram: a 2-D chart with axes local vs. distributed and linear vs. non-linear; clustering sits at local/non-linear, PCA at distributed/linear, and "what we need" at distributed/non-linear.]