Gaussian Information Bottleneck
Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Preview
– The Information Bottleneck / distortion framework was mainly studied in the discrete case (categorical variables).
– Its solutions are characterized analytically by self-consistent equations, but are obtained numerically (local maxima).
– We describe a complete analytic solution for the Gaussian case:
  – it reveals the connection with known statistical methods, and
  – gives an analytic characterization of the compression-information tradeoff curve.

IB with continuous variables
Extracting relevant features of continuous variables:
– results of analogue measurements, e.g. gene expression vs. heat or chemical conditions
– continuous low-dimensional manifolds, e.g. facial expressions, postures
The IB formulation is not limited to discrete variables:
– use continuous (differential) mutual information and entropies (see the formula below)
– in our case the problem contains an inherent scale, which makes all quantities well defined
The general continuous solutions are characterized by the same self-consistent equations, but the general case is very difficult to solve.
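For reference (not part of the original slide), the differential entropy of a d-dimensional Gaussian, which is the quantity used throughout:

```latex
\[
h(X) = \tfrac{1}{2}\log\!\left((2\pi e)^{d}\,\lvert\Sigma_x\rvert\right),
\qquad
I(X;T) = h(T) - h(T\mid X).
\]
```

Mutual information is a difference of entropies, so it is invariant to rescaling of the variables; in the Gaussian IB the additive noise ξ then provides the inherent scale that keeps I(T;X) finite.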

Gaussian IB
Definition: let X and Y be jointly Gaussian (multivariate). Search for another variable T that minimizes

    min_T  L = I(X;T) − β I(T;Y)

The optimal T is jointly Gaussian with X and Y.
Equivalent formulation: T can always be represented as T = AX + ξ, with ξ ~ N(0, Σ_ξ) and A = Σ_tx Σ_x^{-1}; minimize L over A and ξ.
The goal: find the optimum for all values of β (a numeric sketch of the objective follows).
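A minimal numeric sketch (not the authors' code) of this formulation: it builds a random jointly Gaussian pair (X, Y) and evaluates L = I(X;T) − βI(T;Y) for T = AX + ξ using the closed-form mutual information of Gaussians. All dimensions, covariances, and the β value are illustrative assumptions.

```python
# Toy Gaussian IB objective: L = I(X;T) - beta * I(T;Y) for T = A X + xi.
import numpy as np

rng = np.random.default_rng(0)
dx, dy, dt = 4, 3, 2                       # assumed toy dimensions of X, Y, T

# Build a random but valid joint covariance of (X, Y)
Z = rng.standard_normal((dx + dy, dx + dy))
joint = Z @ Z.T + (dx + dy) * np.eye(dx + dy)
Sx  = joint[:dx, :dx]
Sy  = joint[dx:, dx:]
Sxy = joint[:dx, dx:]
Sx_given_y = Sx - Sxy @ np.linalg.solve(Sy, Sxy.T)   # cov(X | Y)

def gib_objective(A, beta, noise=None):
    """L = I(X;T) - beta * I(T;Y) for T = A X + xi, xi ~ N(0, noise)."""
    noise = np.eye(A.shape[0]) if noise is None else noise
    St         = A @ Sx         @ A.T + noise        # cov(T)
    St_given_y = A @ Sx_given_y @ A.T + noise        # cov(T | Y)
    I_xt = 0.5 * (np.linalg.slogdet(St)[1] - np.linalg.slogdet(noise)[1])
    I_ty = 0.5 * (np.linalg.slogdet(St)[1] - np.linalg.slogdet(St_given_y)[1])
    return I_xt - beta * I_ty

A0 = rng.standard_normal((dt, dx))
print("L(A0, beta=10) =", gib_objective(A0, 10.0))
```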

Before we start: what types of solutions do we expect?
– Only second-order correlations matter, so probably eigenvectors of some correlation matrices... but which?
– The parameter β affects the model complexity, so it probably determines the number of eigenvectors and their scale... but how?

Deriving the solution
– Using the entropy of a Gaussian, we write the target function explicitly (see below).
– Although L is a function of both A and Σ_ξ, there is always an equivalent solution A' with spherized noise Σ_ξ = I that leads to the same value of L.
– Differentiate L with respect to A (matrix derivatives).
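A sketch of that target function, written with the Gaussian entropy formula above and with Σ_x|y = Σ_x − Σ_xy Σ_y^{-1} Σ_yx (consistent with the definitions on the previous slides):

```latex
\[
\begin{aligned}
I(X;T) &= \tfrac{1}{2}\log\frac{\lvert A\Sigma_x A^{T} + \Sigma_\xi\rvert}{\lvert\Sigma_\xi\rvert},
\qquad
I(T;Y) = \tfrac{1}{2}\log\frac{\lvert A\Sigma_x A^{T} + \Sigma_\xi\rvert}{\lvert A\Sigma_{x|y} A^{T} + \Sigma_\xi\rvert},\\[4pt]
L &= \frac{1-\beta}{2}\,\log\bigl\lvert A\Sigma_x A^{T} + \Sigma_\xi\bigr\rvert
   \;-\; \frac{1}{2}\,\log\lvert\Sigma_\xi\rvert
   \;+\; \frac{\beta}{2}\,\log\bigl\lvert A\Sigma_{x|y} A^{T} + \Sigma_\xi\bigr\rvert.
\end{aligned}
\]
```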

The scalar T case
When A is a single row vector, the stationarity condition can be written as

    λ A = A M,    with scalar λ and M = Σ_x|y Σ_x^{-1}.

This has two types of solution:
– A degenerates to zero, or
– A is a (left) eigenvector of M = Σ_x|y Σ_x^{-1}.
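A sketch of how that eigenvector equation arises in the scalar case: with Σ_ξ = 1, write σ_t² = AΣ_xA^T + 1 and σ_{t|y}² = AΣ_x|yA^T + 1, and set the derivative of L to zero:

```latex
\[
\frac{\partial L}{\partial A}
= (1-\beta)\,\frac{A\Sigma_x}{\sigma_t^{2}} + \beta\,\frac{A\Sigma_{x|y}}{\sigma_{t|y}^{2}} = 0
\;\Longrightarrow\;
A\,\Sigma_{x|y}\Sigma_x^{-1} = \lambda A,
\qquad
\lambda = \frac{(\beta-1)\,\sigma_{t|y}^{2}}{\beta\,\sigma_t^{2}},
\]
```

so either A = 0 or A is a left eigenvector of M.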

The eigenvector solution (with eigenvalue λ)
1) is feasible only if β ≥ 1 / (1 − λ), and
2) has squared norm ||A||² = (β(1 − λ) − 1) / (λ r), where r = v^T Σ_x v for the unit eigenvector v.
Conclusion: the optimum is obtained with the smallest eigenvalues.
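Where the feasibility condition and the norm come from (a sketch): write A = α v^T for a unit left eigenvector v of M with eigenvalue λ, let r = v^T Σ_x v, and substitute into the fixed-point condition above:

```latex
\[
\sigma_t^{2} = \alpha^{2} r + 1,\qquad
\sigma_{t|y}^{2} = \alpha^{2}\lambda r + 1
\;\Longrightarrow\;
\alpha^{2} = \frac{\beta(1-\lambda)-1}{\lambda\,r},
\]
```

which is non-negative only when β ≥ 1/(1 − λ); below that critical β the only solution is A = 0, and the smallest eigenvalues switch on first.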

The effect of β in the scalar case
Plot the surface of the target L as a function of A, when A is a 1×2 vector: below the critical β the minimum sits at A = 0, and above it the minima move to a non-zero norm along the eigenvector direction (see the numeric illustration below).
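A numeric illustration of this transition, continuing the toy example above (it assumes Sx, Sx_given_y and gib_objective from that sketch are already defined): scan L along the eigenvector with the smallest eigenvalue λ of Σ_x|y Σ_x^{-1} for β below and above the critical value 1/(1 − λ); the minimizer jumps from A = 0 to a non-zero norm.

```python
# Continues the toy example above (requires np, Sx, Sx_given_y, gib_objective).
M = Sx_given_y @ np.linalg.inv(Sx)
eigvals, eigvecs = np.linalg.eig(M.T)     # right eigenvectors of M.T = left eigenvectors of M
idx = int(np.argmin(eigvals.real))
lam = float(eigvals.real[idx])
v = eigvecs[:, idx].real
beta_c = 1.0 / (1.0 - lam)                # critical beta for this eigenvalue

alphas = np.linspace(0.0, 10.0, 2001)
for beta in (0.5 * beta_c, 2.0 * beta_c):
    Ls = [gib_objective(alpha * v[None, :], beta) for alpha in alphas]
    best = alphas[int(np.argmin(Ls))]
    print(f"beta = {beta:7.2f} (beta_c = {beta_c:.2f}): argmin over ||A|| = {best:.3f}")
```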

The multivariate case
Back to the general case of a matrix A: the rows of the optimal A lie in the span of several eigenvectors of M.
– An optimal solution is achieved with the eigenvectors of the smallest eigenvalues.
– As β increases, A goes through a series of transitions, each adding another eigenvector (see the sketch below).
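A sketch of the analytic multivariate solution on the same toy example (again assuming np, Sx and Sx_given_y from the first sketch): for a given β, keep the left eigenvectors of M = Σ_x|y Σ_x^{-1} whose eigenvalues satisfy β(1 − λ_i) > 1 and scale them as derived above.

```python
# Continues the toy example (requires np, Sx, Sx_given_y).
def gib_projection(beta):
    """Rows of the optimal A for a given beta: scaled left eigenvectors of M."""
    M = Sx_given_y @ np.linalg.inv(Sx)
    eigvals, eigvecs = np.linalg.eig(M.T)            # columns = left eigenvectors of M
    order = np.argsort(eigvals.real)                 # smallest eigenvalues switch on first
    rows = []
    for i in order:
        lam = float(eigvals.real[i])
        if beta * (1.0 - lam) <= 1.0:                # component not yet "switched on"
            continue
        v = eigvecs[:, i].real
        r = float(v @ Sx @ v)
        alpha = np.sqrt((beta * (1.0 - lam) - 1.0) / (lam * r))
        rows.append(alpha * v)
    return np.vstack(rows) if rows else np.zeros((0, Sx.shape[0]))

for beta in (1.0, 3.0, 10.0, 100.0):
    print(f"beta = {beta:6.1f}: {gib_projection(beta).shape[0]} active eigenvector(s)")
```

With the covariances above, the number of active rows grows with β, one phase transition per eigenvalue.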

The multivariate case
Reverse water-filling effect: increasing complexity causes a series of phase transitions.
[Figure: components switch on at critical values of β^{-1}, determined by the levels 1 − λ_i.]

The information curve
– Can be calculated analytically, as a function of the eigenvalue spectrum; n_I is the number of components required to obtain a given I(T;X).
– The curve is made of segments, one for each number of active components.
– The tangent at the critical points equals 1 − λ.
(A numeric trace of the curve follows.)
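A numeric trace of that curve on the toy example (assuming the sketches above, in particular gib_projection, Sx and Sx_given_y): sweep β, build A(β) and evaluate I(T;X) and I(T;Y) from the Gaussian log-determinants.

```python
# Continues the toy example (requires np, gib_projection, Sx, Sx_given_y).
def mutual_informations(A):
    """(I(T;X), I(T;Y)) for T = A X + xi with xi ~ N(0, I)."""
    if A.shape[0] == 0:
        return 0.0, 0.0
    St   = A @ Sx         @ A.T + np.eye(A.shape[0])
    St_y = A @ Sx_given_y @ A.T + np.eye(A.shape[0])
    I_tx = 0.5 * np.linalg.slogdet(St)[1]
    I_ty = I_tx - 0.5 * np.linalg.slogdet(St_y)[1]
    return I_tx, I_ty

for beta in np.logspace(0.1, 3, 7):
    I_tx, I_ty = mutual_informations(gib_projection(beta))
    print(f"beta = {beta:8.2f}   I(T;X) = {I_tx:7.3f}   I(T;Y) = {I_ty:7.3f}")
```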

Relation to canonical correlation analysis
The eigenvectors used in GIB are also used in CCA [Hotelling 1935]. Given two Gaussian variables {X, Y}, CCA finds basis vectors for both X and Y that maximize the correlation of their projections (i.e. bases for which the cross-correlation matrix is diagonal, with maximal correlations on the diagonal).
– GIB controls the level of compression, providing both the number and the scale of the vectors (per β).
– CCA is a normalized measure, invariant to rescaling of the projections.
(A numeric check of the relation follows.)
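A quick numeric check of this relation on the toy example (assuming Sx, Sy, Sxy and Sx_given_y from the first sketch): the squared canonical correlations ρ_i² of (X, Y), i.e. the eigenvalues of Σ_x^{-1} Σ_xy Σ_y^{-1} Σ_yx, should match 1 − λ_i for the eigenvalues λ_i of Σ_x|y Σ_x^{-1}.

```python
# Continues the toy example (requires np, Sx, Sy, Sxy, Sx_given_y).
cca_mat = np.linalg.solve(Sx, Sxy) @ np.linalg.solve(Sy, Sxy.T)   # Sx^-1 Sxy Sy^-1 Syx
rho_sq  = np.sort(np.linalg.eigvals(cca_mat).real)[::-1]          # squared canonical correlations
lam     = np.sort(np.linalg.eigvals(Sx_given_y @ np.linalg.inv(Sx)).real)
print("1 - lambda_i:", np.round(1.0 - lam, 6))
print("rho_i^2     :", np.round(rho_sq, 6))
```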

What did we gain?
Specific cases coincide with known problems, and a unified approach allows algorithms and proofs to be reused:
– categorical variables: K-means
– continuous variables: ML for mixtures?
– Gaussian variables: CCA

What did we gain?
The revealed connection lets each field gain from the other.
CCA => GIB:
– statistical significance for sampled distributions; Slonim and Weiss showed a connection between β and the number of samples. What will the relation be here?
GIB => CCA:
– CCA as a special case of a generic optimization principle
– generalizations of IB lead to generalizations of CCA:
  – multivariate IB => multivariate CCA
  – IB with side information => CCA with side information (as in oriented PCA), via generalized eigenvalue problems
– iterative algorithms (avoiding the costly calculation of covariance matrices)

Summary
– We analytically solve the IB problem for Gaussian variables.
– The solutions are described in terms of eigenvectors of a normalized cross-correlation matrix, with norms that are a function of the regularization parameter β.
– The solutions are related to canonical correlation analysis.
– Possible extensions: general exponential families and multivariate CCA.