Representation and Learning in Directed Mixed Graph Models
Ricardo Silva
Statistical Science/CSML, University College London
Networks: Processes and Causality, Menorca 2012

Graphical Models
- Graphs provide a language for describing independence constraints
- Applications to causal and probabilistic processes
- The corresponding probabilistic models should obey the constraints encoded in the graph
- Example: P(X_1, X_2, X_3) is Markov with respect to the graph below if X_1 is independent of X_3 given X_2 in P(.)
[Figure: a three-node DAG over X_1, X_2, X_3]

Directed Graphical Models
[Figure: a DAG over X_1, X_2, X_3, X_4 and a latent variable U, together with the independence statements it implies for X_2 and X_4: marginally, given X_3, given {X_3, U}, ...]

Marginalization
[Figure: the DAG over X_1, ..., X_4 with the latent U; which independence constraints among X_1, ..., X_4 (e.g. those involving X_2 and X_4) remain once U is marginalized out?]

Marginalization
[Figure: candidate DAGs over X_1, X_2, X_3, X_4 for the marginal of the previous model]
- No: X_1 ⊥ X_3 | X_2
- No: X_2 ⊥ X_4 | X_3
- OK, but not ideal: X_2 ⊥ X_4

The Acyclic Directed Mixed Graph (ADMG)
- "Mixed" as in directed + bi-directed
- "Directed" for obvious reasons (see also: chain graphs)
- "Acyclic" for the usual reasons
- Independence model is:
  - closed under marginalization (generalizes DAGs)
  - different from chain graphs/undirected graphs
- Analogous inference as in DAGs: m-separation
[Figure: an example ADMG over X_1, X_2, X_3, X_4]
(Richardson and Spirtes, 2002; Richardson, 2003)

Why do we care?
- Difficulty in computing scores or tests
- Identifiability: theoretical issues and implications for optimization
[Figure: two candidate latent-variable structures, Candidate I and Candidate II, over observed variables Y_1, ..., Y_6 and latent variables X_1, X_2 (and X_3)]

Why do we care?
- A set of "target" latent variables X (possibly none), and observations Y
- A set of "nuisance" latent variables X_⊥
- With sparse structure implied over Y
[Figure: latents X_1, ..., X_5 (targets X and nuisance latents X_⊥) with children among Y_1, ..., Y_6]

Why do we care?
[Figure: a structural equation model example]
(Bollen, 1989)

The talk in a nutshell
- The challenge:
  - how to specify families of distributions that respect the ADMG independence model, with no explicit latent variable formulation required
  - how NOT to do it: make everybody independent!
  - needed: rich families. How rich?
- Main results:
  - a new construction that is fairly general, easy to use, and complements the state-of-the-art
  - exploring this in structure learning problems
- First, background: current parameterizations, their good and bad points

The Gaussian bi-directed model

The Gaussian bi-directed case (Drton and Richardson, 2003)

Binary bi-directed case: the constrained Moebius parameterization (Drton and Richardson, 2008)

Binary bi-directed case: the constrained Moebius parameterization
- Disconnected sets are marginally independent. Hence, define q_A for connected sets only.
- Example: P(X_1 = 0, X_4 = 0) = P(X_1 = 0) P(X_4 = 0), i.e. q_14 = q_1 q_4
- However, notice there is a parameter q_1234
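To make the q_A parameterization concrete, here is a small illustration of my own (not from the talk): a latent-variable model whose marginal independence structure is the bi-directed chain X_1 <-> X_2 <-> X_3 <-> X_4. The code computes q_A = P(X_A = 0) for subsets A, checks that the disconnected set {1, 4} factorizes as q_14 = q_1 q_4, and recovers the full PMF from the q's by Moebius inversion (standard inclusion-exclusion).

```python
import itertools
from collections import defaultdict

V = [1, 2, 3, 4]

# Hypothetical generating model: independent binary latents U12, U23, U34,
# with X1 <- U12, X2 <- (U12, U23), X3 <- (U23, U34), X4 <- U34. Its marginal
# independence structure over X is the bi-directed chain X1 <-> X2 <-> X3 <-> X4.
def x_given_u(u12, u23, u34):
    return (u12, int(u12 or u23), int(u23 or u34), u34)

p_u = {0: 0.6, 1: 0.4}
joint = defaultdict(float)
for u12, u23, u34 in itertools.product([0, 1], repeat=3):
    joint[x_given_u(u12, u23, u34)] += p_u[u12] * p_u[u23] * p_u[u34]

def q(A):
    """Moebius parameter q_A = P(X_A = 0)."""
    return sum(p for x, p in joint.items() if all(x[i - 1] == 0 for i in A))

# Disconnected set {1, 4}: its parameter is determined by the singletons.
assert abs(q({1, 4}) - q({1}) * q({4})) < 1e-12
# Connected set {1, 2, 3, 4}: q_1234 is a parameter of its own.
q1234 = q({1, 2, 3, 4})

def pmf_from_q(x):
    """Moebius inversion: P(X = x) = sum over A containing Z(x) of
    (-1)^{|A minus Z(x)|} q_A, where Z(x) = coordinates with x_i = 0."""
    zeros = [i for i in V if x[i - 1] == 0]
    ones = [i for i in V if x[i - 1] == 1]
    total = 0.0
    for r in range(len(ones) + 1):
        for extra in itertools.combinations(ones, r):
            total += (-1) ** r * q(set(zeros) | set(extra))
    return total

for x, p in joint.items():
    assert abs(pmf_from_q(x) - p) < 1e-12
```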

Binary bi-directed case: the constrained Moebius parameterization
- The good: this parameterization is complete; every single binary bi-directed model can be represented with it
- The bad: the Moebius inverse is intractable, and the number of connected sets can grow exponentially even for trees of low connectivity...

The Cumulative Distribution Network (CDN) approach
- Parameterize cumulative distribution functions (CDFs) by a product of functions defined over subsets of variables
- Sufficient condition: each factor is a CDF itself
- Independence model: the "same" as the bi-directed graph... but with extra constraints
- Example: F(X_1234) = F_1(X_12) F_2(X_24) F_3(X_34) F_4(X_13), which implies X_1 ⊥ X_4, X_1 ⊥ X_4 | X_2, etc.
(Huang and Frey, 2008)
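As a concrete sketch of this construction (my own toy example, not code from the paper), the snippet below builds the CDN F(X_1234) = F_1(X_12) F_2(X_24) F_3(X_34) F_4(X_13) for binary variables out of arbitrary valid bivariate CDF factors, recovers the PMF by differencing, and checks that the result is a proper PMF in which X_1 and X_4 are marginally independent.

```python
import itertools

def biv_binary_cdf(p_a0, p_b0, p_00):
    """A bivariate CDF for two binary variables A, B, specified by
    P(A = 0) = p_a0, P(B = 0) = p_b0 and P(A = 0, B = 0) = p_00
    (p_00 must respect the Frechet bounds for this to be a valid CDF)."""
    def F(a, b):
        if a < 0 or b < 0:
            return 0.0
        if a >= 1 and b >= 1:
            return 1.0
        if a >= 1:
            return p_b0
        if b >= 1:
            return p_a0
        return p_00
    return F

# Clique factors of F(X_1234) = F1(X_12) F2(X_24) F3(X_34) F4(X_13).
F1 = biv_binary_cdf(0.6, 0.5, 0.40)   # over (X1, X2)
F2 = biv_binary_cdf(0.5, 0.7, 0.45)   # over (X2, X4)
F3 = biv_binary_cdf(0.4, 0.7, 0.35)   # over (X3, X4)
F4 = biv_binary_cdf(0.6, 0.4, 0.30)   # over (X1, X3)

def F(x1, x2, x3, x4):
    """Joint CDF as a product of clique factors (each factor being a CDF
    is the sufficient condition quoted on the slide)."""
    return F1(x1, x2) * F2(x2, x4) * F3(x3, x4) * F4(x1, x3)

def pmf(x):
    """CDF-to-PMF by finite differences (inclusion-exclusion over {0,1}^4)."""
    return sum((-1) ** sum(z) * F(*(xi - zi for xi, zi in zip(x, z)))
               for z in itertools.product([0, 1], repeat=4))

P = {x: pmf(x) for x in itertools.product([0, 1], repeat=4)}
assert all(p >= -1e-12 for p in P.values())    # a valid PMF...
assert abs(sum(P.values()) - 1.0) < 1e-12      # ...that sums to one

def marg(idx, val):
    return sum(p for x, p in P.items()
               if all(x[i] == v for i, v in zip(idx, val)))

# X1 and X4 never appear in the same factor, so they are marginally independent.
for a, b in itertools.product([0, 1], repeat=2):
    assert abs(marg((0, 3), (a, b)) - marg((0,), (a,)) * marg((3,), (b,))) < 1e-12
```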

The Cumulative Distribution Network (CDN) approach
- Which extra constraints? With F(X_123) = F_1(X_12) F_2(X_23), the event "X_1 ≤ x_1" is independent of "X_3 ≤ x_3" given "X_2 ≤ x_2"
- Clearly not true in general distributions
- If there is no natural order for the variable values, the encoding does matter
[Figure: bi-directed chain X_1 <-> X_2 <-> X_3]

Relationship
- CDN: the resulting PMF comes from the usual CDF-to-PMF transform
- Moebius: the resulting PMF is equivalent
- Notice: q_B = P(X_B = 0) = P(X_{V\B} ≤ 1, X_B ≤ 0)
- However, in a CDN the parameters further factorize over cliques: q_1234 = q_12 q_13 q_24 q_34

Relationship
- Calculating likelihoods can easily be reduced to inference in a factor graph over a "pseudo-distribution"
- Example: find the joint distribution of X_1, X_2, X_3 below
[Figure: the bi-directed model over X_1, X_2, X_3 reduces to a factor graph with auxiliary variables Z_1, Z_2, Z_3; P(X = x) is then obtained by summing the signed CDF factors over the Z's]

Relationship
- CDN models are a strict subset of marginal independence models
- Binary case: Moebius should still be the approach of choice when only independence constraints are the target
  - e.g., jointly testing the implications of independence assumptions
- But... CDN models have a reasonable number of parameters, any fitting criterion is tractable for small tree-widths, and learning is trivially tractable anyway by marginal composite likelihood estimation (more on that later)
- Take-home message: a still flexible bi-directed graph model, with no need for latent variables to make fitting tractable

The Mixed CDN model (MCDN)
- How do we construct a distribution Markov with respect to an ADMG like this?
[Figure: an example ADMG]
- The binary ADMG parameterization by Richardson (2009) is complete, but has the same computational difficulties
- And how do we easily extend it to non-Gaussian, infinite discrete cases, etc.?

Step 1: The high-level factorization
- A district is a maximal set of vertices connected by bi-directed edges
- For an ADMG G with vertex set X_V and districts {D_i}, define
  P(X_V) = Π_i P_i(X_{D_i} | X_{pa_G(D_i) \ D_i})
  where P(.) is a density/mass function and pa_G(.) are the parents of the given set in G
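The notion of a district is easy to compute: districts are just the connected components of the bi-directed part of the graph. The helper below is my own sketch; the example ADMG is hypothetical, chosen to be consistent with the two-district factorization used a couple of slides later.

```python
def districts(vertices, bidirected_edges):
    """Districts of an ADMG: connected components of its bi-directed part."""
    adj = {v: set() for v in vertices}
    for a, b in bidirected_edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining, out = set(vertices), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        remaining -= comp
        out.append(comp)
    return out

def district_parents(district, directed_edges):
    """Parents of a district that lie outside it: pa_G(D) minus D."""
    return {a for a, b in directed_edges if b in district and a not in district}

# Hypothetical example ADMG: X4 -> X2, X1 -> X3, with X1 <-> X2 and X3 <-> X4.
V = ["X1", "X2", "X3", "X4"]
bi = [("X1", "X2"), ("X3", "X4")]
di = [("X4", "X2"), ("X1", "X3")]
for D in sorted(sorted(d) for d in districts(V, bi)):
    print(D, "<- parents outside:", sorted(district_parents(set(D), di)))
# ['X1', 'X2'] <- parents outside: ['X4']
# ['X3', 'X4'] <- parents outside: ['X1']
```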

Step 1: The high-level factorization
- Also, assume that each P_i(. | .) is Markov with respect to the subgraph G_i, the graph we obtain from the corresponding subset of vertices
- We can show the resulting distribution is Markov with respect to the ADMG
[Figure: the subgraphs G_i for the running example]

Step 1: The high-level factorization
- Despite the seemingly "cyclic" appearance, this factorization always gives a valid P(.) for any choice of P_i(. | .):
  P(X_134) = Σ_{x_2} P(X_1, x_2 | X_4) P(X_3, X_4 | X_1) = P(X_1 | X_4) P(X_3, X_4 | X_1)
  P(X_13) = Σ_{x_4} P(X_1 | x_4) P(X_3, x_4 | X_1) = Σ_{x_4} P(X_1) P(X_3, x_4 | X_1) = P(X_1) P(X_3 | X_1)
  (the middle step uses P(X_1 | x_4) = P(X_1), which follows from P_1 being Markov with respect to G_1)

Step 2: Parameterizing P_i (barren case)
- D_i is a "barren" district if there is no directed edge within it
[Figure: examples of a barren district and a non-barren district]

Step 2: Parameterizing P_i (barren case)
- For a district D_i with clique set C_i (with respect to the bi-directed structure), start with a product of conditional CDFs:
  F_i(x_{D_i} | x_{pa}) = Π_{S ∈ C_i} F_S(x_S | x_{pa(S)})
- Each factor F_S(x_S | x_P) is a conditional CDF, P(X_S ≤ x_S | X_P = x_P). (They have to be transformed back to PMFs/PDFs when writing the full likelihood function.)
- On top of that, each F_S(x_S | x_P) is defined to be Markov with respect to the corresponding G_i
- We show that the corresponding product is Markov with respect to G_i
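A minimal sketch of this "product of conditional CDFs" starting point, in my own hypothetical interface rather than the paper's notation: one conditional CDF per bi-directed clique of the district, multiplied together, with a differencing step to get back a conditional PMF over binary variables. The copula correction of Step 2a still has to be applied on top of this to control the marginals.

```python
import itertools

def district_conditional_cdf(clique_factors):
    """'Start with a product of conditional CDFs': one factor per bi-directed
    clique. Each entry is (factor, scope, parent_scope), where factor maps
    (x_S values, parent values) -> F_S(x_S | x_parents)."""
    def F(x, x_pa):
        out = 1.0
        for factor, scope, pa_scope in clique_factors:
            out *= factor([x[v] for v in scope], [x_pa[v] for v in pa_scope])
        return out
    return F

def conditional_pmf(F, variables, x, x_pa):
    """Difference the conditional CDF over the district's binary variables."""
    total = 0.0
    for z in itertools.product([0, 1], repeat=len(variables)):
        shifted = {v: x[v] - zv for v, zv in zip(variables, z)}
        if any(val < 0 for val in shifted.values()):
            continue  # the CDF is 0 below the support
        total += (-1) ** sum(z) * F(shifted, x_pa)
    return total
```

Concrete clique factors would come from the copula construction described in the next slides.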

Step 2a: A copula formulation of P_i
- Implementing the local factor restriction could be complicated, but the problem can easily be approached by adopting a copula formulation
- A copula function is just a CDF with uniform [0, 1] marginals
- Main point: to provide a parameterization of a joint distribution that unties the parameters of the marginals from the remaining parameters of the joint

Step 2a: A copula formulation of P_i
- Gaussian latent variable analogy:
  X_1 = λ_1 U + e_1, e_1 ~ N(0, v_1)
  X_2 = λ_2 U + e_2, e_2 ~ N(0, v_2)
  U ~ N(0, 1)
  Marginal of X_1: N(0, λ_1^2 + v_1)
  Covariance of X_1, X_2: λ_1 λ_2
- Parameter sharing: λ_1 appears in both the marginal of X_1 and the covariance

Step 2a: A copula formulation of P_i
- Copula idea: start from
  F(X_1, X_2) = F(F_1^{-1}(F_1(X_1)), F_2^{-1}(F_2(X_2)))
  then define H(Y_a, Y_b) ≡ F(F_1^{-1}(Y_a), F_2^{-1}(Y_b)) accordingly, where 0 ≤ Y_* ≤ 1
- H(., .) will be a CDF with uniform [0, 1] marginals
- For any F_i(.) of choice, U_i ≡ F_i(X_i) gives a uniform [0, 1] variable
- We mix-and-match any marginals we want with any copula function we want

Step 2a: A copula formulation of P_i
- The idea is to use a conditional marginal F_i(X_i | pa(X_i)) within a copula
- Example:
  U_2(x_1) ≡ P_2(X_2 ≤ x_2 | x_1)
  U_3(x_4) ≡ P_3(X_3 ≤ x_3 | x_4)
  P(X_2 ≤ x_2, X_3 ≤ x_3 | x_1, x_4) = H(U_2(x_1), U_3(x_4))
- Check:
  P(X_2 ≤ x_2 | x_1, x_4) = H(U_2(x_1), 1) = U_2(x_1) = P_2(X_2 ≤ x_2 | x_1)
[Figure: fragment X_1 -> X_2 <-> X_3 <- X_4]
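To make the check concrete, here is a small numerical sketch of my own, with made-up logistic conditionals and a Frank copula (the family mentioned later in the talk): it builds U_2(x_1) and U_3(x_4) from conditional marginals, couples them with the copula, and verifies that setting the other argument to 1 recovers the original conditional marginal.

```python
import math

def frank_copula(u, v, theta=3.0):
    """Bivariate Frank copula CDF C_theta(u, v), theta != 0."""
    num = (math.exp(-theta * u) - 1.0) * (math.exp(-theta * v) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta

def conditional_marginal_cdf(x, parent, slope=1.5):
    """A made-up conditional CDF P(X <= x | parent) for a binary X:
    P(X = 0 | parent) follows a logistic curve in the parent's value."""
    p0 = 1.0 / (1.0 + math.exp(slope * parent))   # P(X = 0 | parent)
    if x < 0:
        return 0.0
    return p0 if x < 1 else 1.0

# U_2(x_1) = P(X_2 <= x_2 | x_1), U_3(x_4) = P(X_3 <= x_3 | x_4)
x2, x3 = 0, 0
for x1 in (0, 1):
    for x4 in (0, 1):
        u2 = conditional_marginal_cdf(x2, x1)
        u3 = conditional_marginal_cdf(x3, x4)
        joint = frank_copula(u2, u3)           # P(X2<=x2, X3<=x3 | x1, x4)
        # Check: sending the other argument to 1 recovers the marginal.
        assert abs(frank_copula(u2, 1.0) - u2) < 1e-12
        assert abs(frank_copula(1.0, u3) - u3) < 1e-12
```

The same check works for any copula H with uniform marginals; the Frank family is just a convenient single-parameter choice.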

Step 2a: A copula formulation of P_i
- Not done yet! We need the whole product over bi-directed cliques to be a copula, and a product of copulas is not, in general, a copula
- However, results in the literature are helpful here: it can be shown that plugging in U_i^{1/d(i)} instead of U_i turns the product into a copula, where d(i) is the number of bi-directed cliques containing X_i
(Liebscher, 2008)
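A quick numerical sanity check of this trick (again my own sketch, using Frank copulas over the edges of a bi-directed cycle, where every variable sits in d(i) = 2 cliques): the naive product does not have uniform marginals, while the product with the U_i^{1/d(i)} substitution does.

```python
import math

def frank(u, v, theta=3.0):
    """Bivariate Frank copula CDF."""
    num = (math.exp(-theta * u) - 1.0) * (math.exp(-theta * v) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta

# Bi-directed 4-cycle with cliques {1,2}, {2,4}, {3,4}, {1,3}:
# each variable belongs to d(i) = 2 cliques.
def naive_product(u1, u2, u3, u4):
    return frank(u1, u2) * frank(u2, u4) * frank(u3, u4) * frank(u1, u3)

def liebscher_product(u1, u2, u3, u4):
    r = lambda u: u ** 0.5          # U_i^{1/d(i)} with d(i) = 2
    return (frank(r(u1), r(u2)) * frank(r(u2), r(u4))
            * frank(r(u3), r(u4)) * frank(r(u1), r(u3)))

for u in (0.2, 0.5, 0.8):
    # Marginal in the first argument: set the other arguments to 1.
    print(u, naive_product(u, 1, 1, 1), liebscher_product(u, 1, 1, 1))
# The middle column is u^2 (wrong marginal); the right column returns u.
```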

Step 3: The non-barren case
- What should we do in this case?
[Figure: examples of a barren district and a non-barren district]

Step 3: The non-barren case

Parameter learning
- For the purposes of illustration, assume a finite mixture of experts for the conditional marginals of continuous data
- For discrete data, just use the standard CPT formulation found in Bayesian networks
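A hedged sketch of what such a conditional marginal could look like for continuous data (the talk does not specify the exact form; the softmax gates and linear-Gaussian experts below are my own choice):

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mixture_of_experts_cdf(y, x, gates, intercepts, slopes, sds):
    """F(y | x) = sum_k pi_k(x) * Phi((y - a_k - b_k * x) / s_k):
    softmax gates pi_k(x) and Gaussian experts whose means are linear in the
    parent value x. Any such mixture is a valid conditional CDF in y."""
    logits = [g0 + g1 * x for g0, g1 in gates]
    m = max(logits)
    ws = [math.exp(l - m) for l in logits]
    z = sum(ws)
    return sum((w / z) * norm_cdf((y - a - b * x) / s)
               for w, a, b, s in zip(ws, intercepts, slopes, sds))

# Example with two experts and parent value x = 1.0:
print(mixture_of_experts_cdf(0.0, 1.0,
                             gates=[(0.0, 0.5), (0.0, -0.5)],
                             intercepts=[0.0, 0.0],
                             slopes=[1.0, -1.0],
                             sds=[1.0, 2.0]))
```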

Parameter learning
- Copulas: we use a bivariate formulation only (so we take products "over edges" instead of "over cliques")
- In the experiments: the Frank copula

Parameter learning
- Suggestion: two-stage quasi-Bayesian learning, analogous to other approaches in the copula literature
  - fit the marginal parameters using the posterior expected value of the parameters of each individual mixture of experts
  - plug those into the model, then run MCMC on the copula parameters
- Relatively efficient, with decent mixing even under random walk proposals
- Nothing stops you from using a fully Bayesian approach, but mixing might be bad without smarter proposals
- Notice: this needs constant CDF-to-PDF/PMF transformations!

Experiments

The story so far
- A general toolbox for constructing ADMG models
- Alternative estimators would be welcome:
  - Bayesian inference is still "doubly-intractable" (Murray et al., 2006), but district sizes might be small enough even when there are many variables
  - either way, composite likelihood remains simple; combined with the Huang and Frey dynamic programming method, it could go a long way
- Hybrid Moebius/CDN parameterizations remain to be exploited
- Empirical applications: problems with extreme-value issues, exploring non-independence constraints, relations to effect models in the potential outcome framework, etc.

Back to: Learning Latent Structure
- Difficulty in computing scores or tests
- Identifiability: theoretical issues and implications for optimization
[Figure: the two candidate latent-variable structures (Candidate I and Candidate II) from earlier, over observed variables Y_1, ..., Y_6]

Leveraging Domain Structure
- Exploiting "main" factors
[Figure: questionnaire items Y_7a, ..., Y_7e grouped under factor X_7, and items Y_12a, Y_12b, Y_12c grouped under factor X_12]
(NHS Staff Survey, 2009)

The “Structured Canonical Correlation” Structural Space  Set of pre-specified latent variables X, observations Y  Each Y in Y has a pre-specified single parent in X  Set of unknown latent variables X  X  Each Y in Y can have potentially infinite parents in X   “Canonical correlation” in the sense of modeling dependencies within a partition of observed variables Y1Y1 Y2Y2 Y3Y3 Y4Y4 Y5Y5 X1X1 X2X2 Y6Y6 X3X3 X4X4 X5X5 XX X

The “Structured Canonical Correlation”: Learning Task  Assume a partition structure of Y according to X is known  Define the mixed graph projection of a graph over (X, Y) by a bi-directed edge Y i  Y j if they share a common ancestor in X   Practical assumption: bi-directed substructure is sparse  Goal: learn bi-directed structure (and parameters) so that one can estimate functionals of P(X | Y) Y1Y1 Y2Y2 Y3Y3 Y4Y4 Y5Y5 X1X1 X2X2 Y6Y6

Parametric Formulation  X ~ N(0,  ),  positive definite  Ignore possibility of causal/sparse structure in X for simplicity  For a fixed graph G, parameterize the conditional cumulative distribution function (CDF) of Y given X according to bi-directed structure:  F(y | x)  P(Y  y | X = x)   P i (Y i  y i | X [i] = x [i] )  Each set Y i forms a bi-directed clique in G, X [i] being the corresponding parents in X of the set Y i  We assume here each Y is binary for simplicity

Parametric Formulation  In order to calculate the likelihood function, one should convert from the (conditional) CDF to the probability mass function (PMF)  P(y, x) = {  F(y | x)} P(x)   F(y | x) represents a difference operator. As we discussed before, for p-dimensional binary (unconditional) F(y) this boils down to

Learning with Marginal Likelihoods  For X j parent of Y i in X:  Let  Marginal likelihood:  Pick graph G m that maximizes the marginal likelihood (maximizing also with respect to  and  ), where  parameterizes local conditional CDFs F i (y i | x [i] )

Computational Considerations
- Intractable, of course, including the possibly large tree-width of the bi-directed component
- First option: a marginal bivariate composite likelihood, summing pairwise marginal likelihoods over pairs of observed variables
- G_m^{+/-} is the space of graphs that differ from G_m by at most one bi-directed edge
- Integrates θ_ij and X^{1:N} with a crude quadrature method
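A rough sketch of what a marginal bivariate composite likelihood score could look like. Everything concrete here is hypothetical (the local pairwise model, the prior, and the evenly spaced θ_ij grid); it only illustrates the "sum of pairwise log marginal likelihoods, integrated by crude quadrature" idea from the slide.

```python
import numpy as np

def log_pair_marginal(yi, yj, pair_loglik, theta_grid, log_prior):
    """log p(Y_i^{1:N}, Y_j^{1:N}) = log integral of p(Y_i, Y_j | theta) p(theta),
    approximated by a crude quadrature rule on an evenly spaced theta grid."""
    width = theta_grid[1] - theta_grid[0]
    vals = np.array([pair_loglik(yi, yj, th) + log_prior(th) for th in theta_grid])
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum() * width)   # log-sum-exp + grid width

def pairwise_composite_score(Y, pairs, pair_loglik, theta_grid, log_prior):
    """Sum of bivariate log marginal likelihoods over the chosen pairs."""
    return sum(log_pair_marginal(Y[:, i], Y[:, j], pair_loglik, theta_grid, log_prior)
               for i, j in pairs)
```

Presumably pair_loglik would itself integrate out the latent parents of Y_i and Y_j (the X^{1:N} mentioned on the slide); candidate graphs in G_m^{+/-} would then be compared through this score.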

Beyond Pairwise Models  Wanted: to include terms that account for more than pairwise interactions  Gets expensive really fast  An indirect compromise:  Still only pairwise terms just like PCL  However, integrate  ij not over the prior, but over some posterior that depends on more than on Y i 1:N, Y j 1:N :  Key idea: collect evidence from p(  ij | Y S 1:N ), {i, j}  S, plug it into the expected log of marginal likelihood. This corresponds to bounding each term of the log-composite likelihood score with different distributions for  ij :

Beyond Pairwise Models  New score function  S k : observed children of X k in X  Notice: multiple copies of likelihood for  ij when Y i and Y j have the same latent parent  Use this function to optimize parameters { ,  }  (but not necessarily structure)

Learning with Marginal Likelihoods  Illustration: for each pair of latents X i, X j, do i j q ij (  ij )  p(  ij | Y S, ,  )  ijk Compute by Laplace approximation and dynamic programming q ij (  ij ) Defines ~ … + E q(  ijk) [log P(Y ij1a, Y ij1b,  ijk | ,  )] + … Y ij1a Y ij1b Marginalize and add term ~

Algorithm 2
- q_mn comes from conditioning on all variables that share a parent with Y_i and Y_j
- In practice, we use PCL when optimizing the structure
- EM issues with discrete optimization: the model without an edge has an advantage, and we sometimes hit a bad saddle point

Experiments: Synthetic Data  20 networks of 4 latent variables with 4 children per latent variable  Average number of bi-directed edges: ~18  Evaluation criteria:  Mean-squared error of estimate of slope  for each observed variable  Edge omission error (false negatives)  Edge commission error (false positives)  Comparison against “single-shot” learning  Fit model without bi-directed edges, add edge Y i  Y j if implied pairwise distribution P(Y i, Y j ) doesn’t fit the data  Essentially a single iteration of Algorithm 1

Experiments: Synthetic Data  Quantify results by taking the difference between number of times Algorithm 2 does better than Algorithm 1 and 0 (“single-shot” learning)  The number of times where the difference is positive with the corresponding p-values for a Wilcoxon signed rank test (stars indicate numbers less than 0.05)

Experiments: NHS Data  Fit model with 9 factors and 50 variables on the NHS data, using questionnaire as the partition structure  100,000 points in training set, about 40 edges discovered  Evaluation:  Test contribution of bi-directed edge dependencies to P(X | Y): compare against model without bi-directed edges  Comparison by predictive ability: find embedding for each X (d) given Y (d) by maximizing  Test on independent 50,000 points by evaluating how well we can predict other 11 answers based on latent representation using logistic regression

Experiments: NHS Data  MCCA: mixed graph structured canonical correlation model  SCCA: null model (without bi-directed edges)  Table contains AUC scores for each of the 11 binary prediction problems using estimated X as covariates:

Conclusion
- Marginal composite likelihood and mixed graph models are a good match
- It still requires some choices: approximations for the posteriors over parameters, and numerical methods for integration
- Future work:
  - theoretical properties of the alternative marginal composite likelihood estimator
  - identifiability issues
  - reducing the number of evaluations of q_mn
  - non-binary data
  - which families could avoid multiple passes over the data?