Factorial Mixture of Gaussians and the Marginal Independence Model
Ricardo Silva
Joint work-in-progress with Zoubin Ghahramani

Goal: to model sparse distributions subject to marginal independence constraints

Why? [Figure: two latent variables X1 and X2 with observed children Y1-Y5.]

Why? [Figure: a model with X1 → Y1, X2 → Y2, X3 → Y3 and additional hidden variables H12, H23.]

Why?

How? [Figure: X1 → Y1, X2 → Y2, X3 → Y3, with edges linking Y1, Y2, Y3.]

How?

Context
Y_i = f_i(X, Y) + E_i, where E_i is an error term
E is not a vector of independent variables
Assumed: sparse structure of marginally dependent/independent variables
Goal: estimating E-like distributions
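As a toy illustration of this setting, here is a hedged Python sketch (my own example, not from the slides): a linear structural equation model whose error vector E has a sparse covariance, with a zero exactly where two errors are marginally independent. Note that the covariance of Y itself is dense, while that of E is not.

```python
# Minimal sketch (illustrative assumption, not the talk's model): the errors E
# are marginally dependent only along a sparse structure, so their covariance
# has off-diagonal zeros even though the covariance of Y does not.
import numpy as np

rng = np.random.default_rng(0)

omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])   # zero entry encodes E1 independent of E3

n = 500
E = rng.multivariate_normal(np.zeros(3), omega, size=n)   # correlated errors
X = rng.normal(size=(n, 1))                               # exogenous input
B = np.array([[0.5, -1.0, 2.0]])
Y = X @ B + E                                             # Y_i = f_i(X) + E_i

print(np.cov(Y, rowvar=False).round(2))  # dense: shared X induces dependence
print(np.cov(E, rowvar=False).round(2))  # roughly recovers the sparse omega
```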

Why not latent variable models?
Requires further decisions: how many latents? which children?
Silly when the marginal structure is sparse
In the Bayesian case:
– Drags down MCMC methods with (sometimes much) extra autocorrelation
– Requires priors over parameters that you didn't even care about in the first place
(Note: this talk is not about Bayesian inference)

Example

Bi-directed models: the story so far
Gaussian models
– Maximum likelihood (Drton and Richardson, 2003)
– Bayesian inference (Silva and Ghahramani, 2006, 2008)
Binary models
– Maximum likelihood (Drton and Richardson, 2008)

New model: mixture of Gaussians
Latent variables: mixture indicators
– Assumed #levels is decided somewhere else
No “real” latent variables

Caveat emptor: I think you should buy this, but be warned that speed of computation is not the primary concern of this talk.

Simple?
[Figure: a single indicator C pointing into Y1, Y2, Y3.]
Y1, Y2, Y3 jointly Gaussian with sparse covariance matrix Σ_c indexed by C
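A sketch of this "simple" construction (my own toy numbers): one indicator C and a sparse covariance Σ_c per component, with σ_13 = 0 in both. Note that after marginalizing over C the (1, 3) entry is generally nonzero once the component means differ, which is plausibly the point of the "Not really" slide below.

```python
# Toy one-indicator mixture with per-component sparse covariances.
import numpy as np

rng = np.random.default_rng(5)
means = {0: np.zeros(3), 1: np.array([2.0, -1.0, 1.0])}
covs  = {0: np.array([[1.0,  0.5, 0.0],
                      [0.5,  1.0, 0.3],
                      [0.0,  0.3, 1.0]]),
         1: np.array([[2.0, -0.4, 0.0],
                      [-0.4, 1.0, 0.6],
                      [0.0,  0.6, 1.5]])}

C = rng.choice([0, 1], size=5000)
Y = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in C])
print(np.cov(Y, rowvar=False).round(2))   # entry (0, 2) is no longer ~0
```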

Not really. [Figure: the same C → Y1, Y2, Y3 structure as above.]

Required: a factorial mixture of Gaussians. [Figure: separate mixture indicators c1, c2, c3.]

A parameterization under latent variables. Assume the Z variables are zero-mean Gaussians and the C variables are binary.
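The slide's equations did not survive the transcript, so here is a hedged sketch of one standard latent-variable construction of this kind (the specific loadings and the per-edge structure below are illustrative assumptions, not necessarily the talk's parameterization): each edge gets a shared zero-mean Gaussian Z, and a binary indicator C switches its loadings, so Cov(Y_i, Y_j) is nonzero only along edges of the graph.

```python
# Hedged sketch: per-edge shared latents Z with indicator-dependent loadings.
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2)]               # bi-directed graph Y1 <-> Y2 <-> Y3
loadings = {e: {0: (0.8, 0.5), 1: (-0.6, 0.9)} for e in edges}  # per C_e value

def sample_y(n=2000):
    Y = rng.normal(scale=0.5, size=(n, 3))          # independent noise E_i
    for e in edges:
        c = rng.integers(0, 2, size=n)              # binary indicator C_e
        z = rng.normal(size=n)                      # shared latent Z_e
        for i, pos in zip(e, (0, 1)):
            b = np.where(c == 1, loadings[e][1][pos], loadings[e][0][pos])
            Y[:, i] += b * z
    return Y

print(np.cov(sample_y(), rowvar=False).round(2))    # cov(Y1, Y3) ~ 0
```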

A parameterization under latent variables

Implied indexing: σ_ij should be indexed only by those cliques containing both Y_i and Y_j

Implied indexing: μ_i (and σ_ii) should be indexed only by those cliques containing Y_i

Factorial mixture of Gaussians and the marginal independence model
The general case for all latent structures
Constraints in the indexing: σ_ij^c = σ_ij^c' whenever c and c' agree on the clique indicators in the intersection of the cliques containing Y_i and Y_j

Factorial mixture of Gaussians and the marginal independence model
The parameter pool (besides mixture probabilities):
– { μ_i^c[i] }, the mean vector
– { σ_ii^c[i] }, the variance vector
– { σ_ij^c[ij] }, the covariance vector, for Y_i and Y_j linked in the bi-directed graph (the covariance is zero otherwise: marginal independence constraints)
Given c, we assemble the corresponding mean and covariance
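A small sketch of the "assemble" step, using hypothetical data structures of my own (the parameter values and dictionary layout are illustrative, not from the talk): each entry is stored under the restriction of c to the cliques that are allowed to index it.

```python
# Assemble mean vector and covariance matrix for one configuration c.
import numpy as np

p = 3
cliques = [(0, 1), (1, 2)]                 # cliques of the bi-directed graph
mu    = {0: {(0,): -1.0, (1,): 1.0},       # Y1 appears only in clique 0
         1: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): -0.5, (1, 1): 2.0},
         2: {(0,): 0.3, (1,): -0.3}}
var   = {0: {(0,): 1.0, (1,): 2.0},
         1: {(0, 0): 1.0, (0, 1): 1.5, (1, 0): 1.5, (1, 1): 1.0},
         2: {(0,): 1.0, (1,): 0.8}}
covar = {(0, 1): {(0,): 0.4, (1,): -0.4},  # indexed by cliques containing both
         (1, 2): {(0,): 0.2, (1,): 0.6}}

def relevant(c, var_set):
    """Restrict configuration c to cliques whose members include var_set."""
    return tuple(c[k] for k, cl in enumerate(cliques) if var_set <= set(cl))

def assemble(c):
    m = np.array([mu[i][relevant(c, {i})] for i in range(p)])
    S = np.zeros((p, p))
    for i in range(p):
        S[i, i] = var[i][relevant(c, {i})]
    for (i, j), table in covar.items():
        S[i, j] = S[j, i] = table[relevant(c, {i, j})]
    return m, S

m, S = assemble((1, 0))    # one joint setting of the two clique indicators
print(m, S, sep="\n")
```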

Size of the parameter space
Let L[i ∩ j] be the size of the largest clique intersection, and L[i] the largest clique
Let k be the maximum number of values any mixture indicator can take
Let p be the number of variables and e the number of edges in the bi-directed graph
Total number of parameters: O(e k^L[i ∩ j] + p k^L[i])
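A quick toy count under the clique-indexing scheme above, as I read it (the slide's L quantities may be defined slightly differently, so treat this as an illustration): each covariance entry needs one value per joint setting of the indicators of the cliques containing both endpoints, and each mean/variance entry one per setting of the cliques containing that variable.

```python
# Toy parameter count (assumed k = 2 levels per clique indicator).
from itertools import combinations

cliques = [(0, 1, 2), (2, 3), (3, 4)]      # hypothetical bi-directed graph
edges = {e for cl in cliques for e in combinations(cl, 2)}
k, p = 2, 5

cov_params  = sum(k ** sum(i in cl and j in cl for cl in cliques) for i, j in edges)
mean_params = sum(k ** sum(i in cl for cl in cliques) for i in range(p))
var_params  = mean_params                  # same indexing as the means
print(cov_params, mean_params + var_params)
```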

Size of the parameter space
Notice this is not a simple function of sparseness:
– Dependence on the number of clique intersections
– Non-sparse models can have few cliques
In decomposable models, the number of clique intersections is given by the branch factor of the junction tree

Maximum likelihood estimation: an EM framework
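Since the EM equations did not survive the transcript, here is a hedged Python skeleton of how such an EM loop could look. The E-step computes responsibilities over the joint configurations of the clique indicators; the M-step below is only an unconstrained per-configuration placeholder, standing in for the constrained fit described on the following slides. All names and settings are illustrative assumptions.

```python
# EM skeleton for a mixture indexed by joint clique-indicator configurations.
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def em_factorial_mog(Y, cliques, k=2, iters=20, ridge=1e-3):
    n, p = Y.shape
    configs = list(itertools.product(range(k), repeat=len(cliques)))  # all c
    rng = np.random.default_rng(0)
    mus = rng.normal(size=(len(configs), p))
    Sigmas = np.stack([np.eye(p)] * len(configs))
    log_pi = np.full(len(configs), -np.log(len(configs)))

    for _ in range(iters):
        # E-step: responsibilities r[t, c] proportional to pi_c N(y_t; mu_c, Sigma_c)
        log_r = np.column_stack([
            log_pi[c] + multivariate_normal.logpdf(Y, mus[c], Sigmas[c])
            for c in range(len(configs))])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: placeholder unconstrained updates; the real M-step must keep
        # sigma_ij^c = 0 for non-adjacent pairs and tie parameters across
        # configurations that agree on the relevant clique indicators.
        Nc = r.sum(axis=0) + 1e-12
        log_pi = np.log(Nc / n)
        for c in range(len(configs)):
            mus[c] = (r[:, c] @ Y) / Nc[c]
            D = Y - mus[c]
            Sigmas[c] = (D * r[:, [c]]).T @ D / Nc[c] + ridge * np.eye(p)
    return log_pi, mus, Sigmas

# Example call on random data with two cliques (four configurations):
Y = np.random.default_rng(1).normal(size=(200, 3))
log_pi, mus, Sigmas = em_factorial_mog(Y, cliques=[(0, 1), (1, 2)])
```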

Algorithms
First, solving the exact problem (exponential in the number of cliques)
Constraints:
– Positive definiteness constraints
– Marginal independence constraints
Gradient-based methods:
– Moving in all dimensions is quite unstable and violates constraints
– Instead, move over a subset of parameters while keeping the rest fixed

Iterative conditional fitting: Gaussian case
See Drton and Richardson (2003)
Choose some Y_i
Fix the covariance matrix of Y_{\i} (everything except Y_i)
Fit the covariance of Y_i with Y_{\i}, and its variance
Marginal independence constraints are introduced directly
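A sketch of one such sweep, following my reading of Drton and Richardson (2003); treat it as illustrative pseudocode rather than their exact algorithm. For each Y_i, the rest of the covariance matrix is held fixed, Y_i is regressed on pseudo-variables built from the other variables, and the non-spouse covariances are set exactly to zero.

```python
# Iterative conditional fitting sweep for a Gaussian bi-directed model (sketch).
import numpy as np

def icf_sweep(S, Sigma, spouses):
    """S: sample covariance; Sigma: current estimate; spouses[i]: neighbours of i."""
    p = S.shape[0]
    for i in range(p):
        M = [j for j in range(p) if j != i]
        sp = [M.index(j) for j in spouses[i]]          # positions within M
        A = np.linalg.inv(Sigma[np.ix_(M, M)])         # fixed block, inverted
        # Covariances involving the pseudo-variables Z = A Y_M:
        Vzz = (A @ S[np.ix_(M, M)] @ A)[np.ix_(sp, sp)]
        Czy = (A @ S[np.ix_(M, [i])])[sp, :]
        B = np.linalg.solve(Vzz, Czy).ravel()          # new Sigma[i, spouses]
        lam = S[i, i] - B @ Czy.ravel()                # residual variance
        Sigma[i, M] = Sigma[M, i] = 0.0                # marginal independences
        Sigma[i, spouses[i]] = Sigma[spouses[i], i] = B
        Sigma[i, i] = lam + B @ A[np.ix_(sp, sp)] @ B
    return Sigma

# Example: Y1 <-> Y2 <-> Y3 (no Y1 <-> Y3 edge)
rng = np.random.default_rng(2)
Y = rng.normal(size=(500, 3))
S = np.cov(Y, rowvar=False)
Sigma = np.diag(np.diag(S)).copy()
spouses = {0: [1], 1: [0, 2], 2: [1]}
for _ in range(50):
    Sigma = icf_sweep(S, Sigma, spouses)
print(Sigma.round(3))   # Sigma[0, 2] stays exactly zero
```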

 12 =  3,12 Iterative conditional fitting: Gaussian case Y1Y1 Y2Y2 Y3Y3  11  12  12  22 Y1Y1 Y2Y2 Y3Y3 11 2 22 33 3 Y1Y1 Y2Y2 Y3Y =  31  32 = 0   11 +  12 = = f(,  12 ) 23   b b b b b3b3 b b bb b b

Iterative conditional fitting: Gaussian case
[Figure continued:] Y3 = b_23 R_2.1 + ε_3, where R_2.1 is the residual of the regression of Y2 on Y1

How does it change in the mixture of Gaussians case?
[Figure: the same three-variable graphs, now with mixture indicators C_12 and C_23 attached to the edges.]

Parameter expansion
Y_i = b_1^c R_1 + b_2^c R_2 + … + ε_i^c
c does vary over all mixture indicators
That is, we create an exponential number of parameters
– Exponential in the number of cliques
Where do the constraints go?

Parameter constraints
Equality constraints are back (with similar constraints for the variances):
b_1^c σ_R1j^c + b_2^c σ_R2j^c + … + b_k^c σ_Rkj^c = b_1^c' σ_R1j^c' + b_2^c' σ_R2j^c' + … + b_k^c' σ_Rkj^c'

Parameter constraints
The variances γ^c of the error terms ε^c have to be positive
Positive definiteness of Σ^c for all c is then guaranteed (Schur complement)

Constrained EM
Maximize the expected conditional log-likelihood of Y_i given everybody else, subject to:
– An exponential number of constraints
– An exponential number of parameters
– Box constraints on the γs
What does this buy us?

Removing parameters
Covariance equality constraints are linear:
b_1^c σ_R1j^c + b_2^c σ_R2j^c + … + b_k^c σ_Rkj^c = b_1^c' σ_R1j^c' + … + b_k^c' σ_Rkj^c'
Even a naive approach can work:
– Choose a basis for b (e.g., one b_ij^c corresponding to each non-zero σ_ij^c[ij])
– The basis is of tractable size (under sparseness)
– Rewrite the EM objective as a function of the basis only

Quadratic constraints
Equality of variances introduces quadratic constraints tying the γs and the bs
Proposed solution:
– Fix all the variances σ_ii^c[i] first
– Fit the b_ij's with those parameters fixed (inequality constraints, non-convex optimization)
– Then fit the γs given b
– Number of parameters is back to tractable
Still always an exponential number of constraints
Note: the reparameterization also takes an exponential number of steps

Relaxed optimization
Optimization is still expensive. What to do?
Relaxed approach: fit the bs ignoring the variance equalities
– Fix γ, fit b: a quadratic program with linear equality constraints; I just solve it in closed form
– The γs end up inconsistent: project them back to the solution space without changing b (always possible; may decrease the expected log-likelihood)
– Fit γ given the bs: nonlinear programming, trivial constraints
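The "fix γ, fit b" step is a quadratic objective with linear equality constraints, which indeed has a closed-form solution through the KKT system. A generic sketch follows; the Q, c, A, d below are placeholders of my own, not the talk's actual matrices.

```python
# Closed-form equality-constrained quadratic program via the KKT linear system.
import numpy as np

def eq_constrained_qp(Q, c, A, d):
    """Minimize 0.5 b'Qb + c'b subject to A b = d."""
    m = A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-c, d])
    sol = np.linalg.solve(K, rhs)
    return sol[: Q.shape[0]]            # b; sol[Q.shape[0]:] are multipliers

rng = np.random.default_rng(3)
Q = np.eye(4) + 0.1 * np.ones((4, 4))   # placeholder positive definite Hessian
c = rng.normal(size=4)
A = np.array([[1.0, -1.0, 0.0, 0.0]])   # e.g. a covariance-equality constraint
d = np.zeros(1)
b = eq_constrained_qp(Q, c, A, d)
print(b, A @ b)                         # the constraint A b = 0 holds
```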

Recap
Iterative conditional fitting: maximize the expected conditional log-likelihood
Transform to another parameter space
– Exact algorithm: quadratic inequality constraints “instead” of the SD (semidefiniteness) ones
– Relaxed algorithm: no constraints
No constraints?

Approximations
Taking expectations is expensive; what to do?
Standard approximations use a “nice” approximating distribution π'(c)
– E.g., mean-field methods
Not enough!

A simple approach: the Budgeted Variational Approximation
As simple as it gets: maximize a variational bound forcing most combinations of c to give a zero value to π'(c)
– Up to a pre-fixed budget
How to choose which values?
This guarantees positive definiteness only of those Σ(c) with non-zero conditionals π'(c)
– For predictions, project the matrices into the PD cone first
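A sketch of the budgeted idea as I read the slide: keep a nonzero variational posterior only for a pre-fixed budget of configurations, chosen here greedily by unnormalized posterior weight, and force everything else to zero. The scoring function is a placeholder, not the talk's criterion for choosing which values to keep.

```python
# Budgeted sparse posterior over clique-indicator configurations (sketch).
import itertools
import numpy as np

def budgeted_posterior(log_joint, configs, budget):
    """log_joint(c) -> log p(y, c); returns a sparse dict of pi'(c)."""
    scores = np.array([log_joint(c) for c in configs])
    keep = np.argsort(scores)[-budget:]            # top-`budget` configurations
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return {configs[i]: wi for i, wi in zip(keep, w)}

# Toy usage with two binary clique indicators and an arbitrary scoring stub:
configs = list(itertools.product([0, 1], repeat=2))
rng = np.random.default_rng(4)
fake_scores = dict(zip(configs, rng.normal(size=len(configs))))
pi = budgeted_posterior(lambda c: fake_scores[c], configs, budget=2)
print(pi)   # only two configurations carry nonzero mass
```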

Under construction
Implementation and evaluation of algorithms
– (Lesson #1: never, ever, ever use MATLAB quadratic programming methods)
Control for overfitting
– Regularization terms
The first non-Gaussian directed graphical model fully closed under marginalization?

Under construction: Bayesian methods
Prior: product of experts for the covariance and variance entries (times a BIG indicator function)
MCMC method: an M-H proposal based on the relaxed fitting algorithm
– Is that going to work well?
The problem is “doubly intractable”
– Not because of a partition function, but because of the constraints
– Are there analogues to methods such as Murray/Ghahramani/MacKay's?

Thank You