CSC2535: Computation in Neural Networks Lecture 7: Independent Components Analysis Geoffrey Hinton

Factor Analysis

The generative model for factor analysis assumes that the data was produced in three stages:
–Pick values independently for some hidden factors that have Gaussian priors.
–Linearly combine the factors using a factor loading matrix. Use more linear combinations than factors.
–Add Gaussian noise that is different for each input.
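
A minimal numpy sketch of this three-stage generative process (the dimensions, loading matrix, and noise levels below are illustrative assumptions, not values from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    n_factors, n_inputs, n_cases = 2, 5, 1000         # more inputs than factors

    loading = rng.normal(size=(n_inputs, n_factors))   # factor loading matrix
    noise_std = rng.uniform(0.2, 0.7, size=n_inputs)   # a different noise level for each input

    factors = rng.normal(size=(n_cases, n_factors))    # stage 1: Gaussian factor values
    data = factors @ loading.T                         # stage 2: linear combinations of the factors
    data += rng.normal(size=(n_cases, n_inputs)) * noise_std   # stage 3: per-input Gaussian noise

    # The implied covariance of the data is loading @ loading.T + np.diag(noise_std**2).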

A degeneracy in Factor Analysis

We can always make an equivalent model by applying a rotation to the factors and then applying the inverse rotation to the factor loading matrix.
–The data does not prefer any particular orientation of the factors.
This is a problem if we want to discover the true causal factors.
–Psychologists wanted to use scores on intelligence tests to find the independent factors of intelligence.
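
A one-line check of this degeneracy, writing the model as x = Λz + ε with z ~ N(0, I) and diagonal noise covariance Ψ (standard factor-analysis notation, assumed here since the slide does not name the symbols): the data covariance is ΛΛᵀ + Ψ, and for any rotation matrix R,

    (ΛR)(ΛR)ᵀ + Ψ = Λ R Rᵀ Λᵀ + Ψ = ΛΛᵀ + Ψ,

so the loading matrices Λ and ΛR assign exactly the same Gaussian density to every possible dataset.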

What structure does FA capture?

Factor analysis only captures pairwise correlations between components of the data.
–It only depends on the covariance matrix of the data.
–It completely ignores higher-order statistics.
Consider the dataset: 111, 100, 010, 001.
This has no pairwise correlations, but it does have strong third-order structure.
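
A quick numerical check of this example (a small numpy sketch using the four equally likely vectors from the slide):

    import numpy as np

    X = np.array([[1, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]], dtype=float)

    centered = X - X.mean(axis=0)          # every component has mean 0.5
    print(np.cov(X.T, bias=True))          # all off-diagonal (pairwise) covariances are 0
    print(centered.prod(axis=1).mean())    # third central moment E[(x1-.5)(x2-.5)(x3-.5)] = 0.125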

Using a non-Gaussian prior

If the prior distributions on the factors are not Gaussian, some orientations will be better than others.
–It is better to generate the data from factor values that have high probability under the prior.
–One big value and one small value is more likely than two medium values that have the same sum of squares.
If the prior for each hidden activity is Laplacian, p(s) ∝ e^(-|s|), the iso-probability contours are straight lines at 45 degrees.
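
A worked comparison under the Laplacian prior p(s) ∝ e^(-|s|) mentioned above: take two factor vectors with the same sum of squares,

    (2, 0):    |2| + |0|   = 2       ⇒ prior ∝ e^(-2)
    (√2, √2):  |√2| + |√2| ≈ 2.83    ⇒ prior ∝ e^(-2.83)

so "one big value and one small value" is about e^(0.83) ≈ 2.3 times more probable than "two medium values", whereas a Gaussian prior would rate the two vectors equally. The 45-degree contours are just the level sets |s1| + |s2| = constant.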

The square, noise-free case

We eliminate the noise model for each data component, and we use the same number of factors as data components.
Given the weight matrix, there is now a one-to-one mapping between data vectors and hidden activity vectors.
To make the data probable we want two things:
–The hidden activity vectors that correspond to data vectors should have high prior probabilities.
–The mapping from hidden activities to data vectors should compress the hidden density to get high density in the data space, i.e. the matrix that maps hidden activities to data vectors should have a small determinant. Its inverse should have a big determinant.
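
In symbols (a short derivation; A denotes the matrix that maps hidden activities to data vectors and W = A⁻¹ its inverse, the notation used on the next slide): if x = A s, the change-of-variables formula gives

    p(x) = p_s(A⁻¹ x) / |det A| = p_s(W x) · |det W|,

so the data density is high when W x lands where the prior density is high (the first requirement) and when |det A| is small, i.e. |det W| is big (the second requirement).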

The ICA density model

Assume the data is obtained by linearly mixing the sources: x = A s, where A is the mixing matrix and s is the source vector.
The filter matrix W is the inverse of the mixing matrix: W = A⁻¹.
The sources have independent non-Gaussian priors.
The density of the data is a product of the source priors and the determinant of the filter matrix:

    p(x) = |det W| ∏_i p_i(w_iᵀ x),   where w_iᵀ is the i-th row of W.
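
A hedged sketch of fitting this density model by maximum likelihood with a Laplacian source prior p(s) ∝ e^(-|s|), so that φ(s) = d log p(s)/ds = -sign(s). It uses the natural-gradient form of the likelihood gradient, ΔW ∝ (I + ⟨φ(s) sᵀ⟩) W, which is not spelled out in the lecture; the data generation, learning rate, and iteration count are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    n_sources, n_cases = 2, 5000

    sources = rng.laplace(size=(n_cases, n_sources))   # independent non-Gaussian sources
    mixing = rng.normal(size=(n_sources, n_sources))   # mixing matrix A (square, noise-free case)
    data = sources @ mixing.T                          # observed data vectors x = A s

    W = np.eye(n_sources)                              # filter matrix, initialised to the identity
    lr = 0.01
    for _ in range(2000):
        s_hat = data @ W.T                             # current source estimates s = W x
        phi = -np.sign(s_hat)                          # d log p(s)/ds for the Laplace prior
        W += lr * (np.eye(n_sources) + phi.T @ s_hat / n_cases) @ W   # natural-gradient ascent step

    print(W @ mixing)   # should be close to a scaled permutation matrix (sources recovered up to order and scale)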

The information maximization view of ICA

Filter the data linearly and then apply a non-linear "squashing" function.
The aim is to maximize the information that the outputs convey about the input.
–Since the outputs are a deterministic function of the inputs, information is maximized by maximizing the entropy of the output distribution.
This involves maximizing the individual entropies of the outputs and minimizing the mutual information between outputs.
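
The last point is the standard decomposition of the joint entropy of the outputs (written out here for completeness):

    H(y1, …, yn) = Σ_i H(y_i) − I(y1; …; yn),

so pushing up the joint entropy simultaneously pushes up the individual output entropies and pushes down the mutual information (redundancy) between the outputs.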

The "outputs" are squashed linear combinations of the inputs: y_i = σ(w_iᵀ x).
The entropy of the outputs can be re-expressed in the input space:

    H(y) = H(x) + ⟨ log |det J| ⟩,

where J = ∂y/∂x is the Jacobian of the filtering and squashing (computed just like in backprop).
Since H(x) is fixed by the data, maximizing H(y) is the same as minimizing the KL divergence between the empirical distribution of the data and the model's distribution p(x) = |det W| ∏_i σ′(w_iᵀ x).
–Maximizing entropy is minimizing this KL divergence!

How the squashing function relates to the non-Gaussian prior density for the sources

We want the entropy-maximization view to be equivalent to maximizing the likelihood of a linear generative model.
So treat the derivative of the squashing function as the prior density for the sources.
–This works nicely for the logistic function. It even integrates to 1.
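
Spelled out for the logistic squashing function σ(u) = 1/(1 + e^(-u)) (a short check, not from the original slide):

    p(s) = σ′(s) = σ(s)(1 − σ(s)) = e^(-s) / (1 + e^(-s))²,    ∫ σ′(s) ds = σ(∞) − σ(−∞) = 1,

so the implied source prior is the standard logistic density: symmetric, heavier-tailed than a Gaussian, and properly normalized.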

Overcomplete ICA

What if we have more independent sources than data components? (independent ≠ orthogonal)
–The data no longer specifies a unique vector of source activities. It specifies a distribution. This also happens if we have sensor noise in the square case.
–The posterior over sources is non-Gaussian because the prior is non-Gaussian.
So we need to approximate the posterior:
–MCMC samples
–MAP (plus a Gaussian around the MAP value?); a sketch of this option follows below
–Variational
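
A hedged sketch of the MAP option: with a Laplacian source prior and Gaussian sensor noise, the MAP source vector minimizes ‖x − A s‖² / (2σ²) + Σ_i |s_i|, which can be solved by proximal gradient descent (ISTA, a technique not named in the lecture). The mixing matrix, noise level, and iteration count here are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    n_data, n_sources = 4, 8                        # more sources than data components
    A = rng.normal(size=(n_data, n_sources))        # overcomplete mixing matrix (assumed known)
    sigma = 0.2                                     # sensor-noise standard deviation
    x = A @ rng.laplace(size=n_sources) + sigma * rng.normal(size=n_data)

    s = np.zeros(n_sources)                         # MAP estimate of the sources
    step = sigma ** 2 / np.linalg.norm(A, 2) ** 2   # safe step size for the quadratic term
    for _ in range(5000):                           # proximal-gradient (ISTA) iterations
        grad = A.T @ (A @ s - x) / sigma ** 2       # gradient of the reconstruction term
        z = s - step * grad                         # gradient step on the smooth part
        s = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)   # soft threshold from the |s_i| prior terms

    print(np.round(s, 3))   # approximate MAP source vector: typically sparse

A Gaussian fitted around this MAP value, MCMC samples, or a variational fit would then turn the point estimate into an approximate posterior, as listed on the slide.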