Learning Graphical Models Eric Xing Lecture 12, August 14, 2010

Presentation transcript:

Machine Learning: Learning Graphical Models
Eric Xing, Lecture 12, August 14, 2010
© Eric Xing @ CMU, 2006-2010
Reading:

Inference and Learning
A BN M describes a unique probability distribution P. Typical tasks:
Task 1: How do we answer queries about P? We use "inference" as the name for the process of computing answers to such queries. So far we have learned several algorithms for exact and approximate inference.
Task 2: How do we estimate a plausible model M from data D? We use "learning" as the name for the process of obtaining a point estimate of M. Bayesians, however, seek p(M | D), which is itself an inference problem. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

Learning Graphical Models
The goal: given a set of independent samples (assignments of random variables), find the best (the most likely?) graphical model (both the graph and the CPDs).
Samples: (B,E,A,C,R) = (T,F,F,T,F), (B,E,A,C,R) = (T,F,T,T,F), ..., (B,E,A,C,R) = (F,T,T,T,F)
[Figure: network structures over the nodes E, R, B, A, C, with a CPD table P(A | E,B) whose rows are 0.9/0.1, 0.2/0.8, 0.01/0.99.]

Learning Graphical Models
Scenarios: completely observed GMs (directed, undirected); partially observed GMs (undirected is an open research topic).
Estimation principles: maximum likelihood estimation (MLE), Bayesian estimation, maximum conditional likelihood, maximum "margin".
We use learning as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.

ML Parameter Est. for completely observed GMs of given structure
[Figure: graphical model over nodes Z and X.]
The data: {(z(1),x(1)), (z(2),x(2)), (z(3),x(3)), ..., (z(N),x(N))}

The basic idea underlying MLE
Likelihood (for now let's assume that the structure is given); log-likelihood; data log-likelihood; MLE.
[Figure: graphical model over nodes X1, X2, X3.]

Example 1: conditional Gaussian
The completely observed model:
Z is a class indicator vector, and a datum is in class i with probability $\pi_i$, so that $p(z) = \prod_m \pi_m^{z^m}$. All except one of these terms will be one.
X is a conditional Gaussian variable with a class-specific mean.
[Figure: graphical model over nodes Z and X.]

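Example 1: conditional Gaussian
Data log-likelihood and MLE:
$\hat{\pi}_m = \frac{\sum_n z_n^m}{N}$: the fraction of samples of class m.
$\hat{\mu}_m = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}$: the average of samples of class m.
[Figure: graphical model over nodes Z and X.]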
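The two estimators named on this slide (the fraction of samples of class m and the average of samples of class m) can be sketched in a few lines. This is a minimal illustration only: the function name `fit_conditional_gaussian` and the toy data are assumptions, the class indicator is stored as an integer label instead of an indicator vector, and the variance is not estimated.

```python
from collections import defaultdict

def fit_conditional_gaussian(labels, values):
    """MLE of class priors and class-specific means from fully observed (z, x) pairs."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for z, x in zip(labels, values):
        counts[z] += 1
        sums[z] += x
    n = len(labels)
    priors = {m: counts[m] / n for m in counts}        # fraction of samples of class m
    means = {m: sums[m] / counts[m] for m in counts}   # average of samples of class m
    return priors, means

# Hypothetical toy data: two classes, one-dimensional observations.
labels = [0, 0, 1, 1, 1]
values = [1.0, 3.0, 10.0, 11.0, 12.0]
priors, means = fit_conditional_gaussian(labels, values)
print(priors, means)
```

Because both variables are observed, no iteration is needed: one pass over the data yields the sufficient statistics.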

Example 2: HMM: two scenarios
Supervised learning: estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$, i.e., maximum likelihood (ML) estimation.

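Recall definition of HMM
Transition probabilities between any two states: $a_{ij} = p(y_t = j \mid y_{t-1} = i)$.
Start probabilities: $\pi_i = p(y_1 = i)$.
Emission probabilities associated with each state: $b_{ik} = p(x_t = k \mid y_t = i)$, or in general $p(x_t \mid y_t = i, \theta_i)$.
[Figure: HMM chain with hidden states y1, y2, y3, ..., yT, observations x1, x2, x3, ..., xT, and transition matrix A.]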

Supervised ML estimation
Given x = x1…xN for which the true state path y = y1…yN is known, define:
$A_{ij}$ = # times state transition i→j occurs in y
$B_{ik}$ = # times state i in y emits k in x
We can show that the maximum likelihood parameters $\theta$ are:
$a_{ij}^{ML} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$ and $b_{ik}^{ML} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$
If x is continuous, we can treat the data as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian.

Supervised ML estimation, ctd.
Intuition: when we know the underlying states, the best estimate of $\theta$ is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: $P(x \mid \theta)$ is maximized, but $\theta$ is unreasonable. 0 probabilities are VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0; bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
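The counting estimate can be checked directly on the ten casino rolls. The sketch below is illustrative: the function name `supervised_hmm_mle` is an assumption, and rows for states that never occur in y (here L) are set to zero, which is a choice of this sketch. Note that the counts give bF6 = 2/10 = .2 (the face 6 appears twice), so the emission row sums to one.

```python
from collections import Counter

def supervised_hmm_mle(x, y, states, symbols):
    """Relative-frequency estimates of HMM transitions and emissions from a labeled path."""
    trans = Counter(zip(y[:-1], y[1:]))   # A_ij: # times transition i -> j occurs in y
    emit = Counter(zip(y, x))             # B_ik: # times state i emits symbol k
    a, b = {}, {}
    for i in states:
        row_total = sum(trans[(i, j)] for j in states)
        for j in states:
            a[(i, j)] = trans[(i, j)] / row_total if row_total else 0.0
        emit_total = sum(emit[(i, k)] for k in symbols)
        for k in symbols:
            b[(i, k)] = emit[(i, k)] / emit_total if emit_total else 0.0
    return a, b

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
y = list("FFFFFFFFFF")
a, b = supervised_hmm_mle(x, y, states="FL", symbols=range(1, 7))
print(a[("F", "F")], a[("F", "L")])
print([b[("F", k)] for k in range(1, 7)])
```

The zero for bF4 is the overfitting problem the slide warns about: a single 4 rolled by the fair die would get probability zero under this model.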

Pseudocounts
Solution for small training sets: add pseudocounts.
$A_{ij}$ = # times state transition i→j occurs in y + $R_{ij}$
$B_{ik}$ = # times state i in y emits k in x + $S_{ik}$
$R_{ij}$ and $S_{ik}$ are pseudocounts representing our prior belief.
Total pseudocounts: $R_i = \sum_j R_{ij}$ and $S_i = \sum_k S_{ik}$ give the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior.
Larger total pseudocounts ⇒ strong prior belief.
Small total pseudocounts: just to avoid 0 probabilities (smoothing).
This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.
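As a small illustration of smoothing, the sketch below adds the same pseudocount to every emission count of one state before normalizing. The function name `smoothed_emissions` and the choice of one imaginary observation per face are assumptions made for this example; the data are the ten casino rolls from the supervised example.

```python
from collections import Counter

def smoothed_emissions(x, y, state, symbols, pseudocount=1.0):
    """Emission estimates for one state with a constant pseudocount S_ik added to each count."""
    counts = Counter(k for s, k in zip(y, x) if s == state)
    total = sum(counts[k] + pseudocount for k in symbols)
    return {k: (counts[k] + pseudocount) / total for k in symbols}

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
y = list("FFFFFFFFFF")
b_fair = smoothed_emissions(x, y, "F", range(1, 7), pseudocount=1.0)
print(b_fair)
```

With six imaginary instances added to ten real ones, the unseen face 4 gets probability 1/16 instead of 0, and the observed frequencies are pulled toward uniform.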

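MLE for general BN parameters
If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid x_{n,\pi_i}, \theta_i) = \sum_i \sum_n \log p(x_{n,i} \mid x_{n,\pi_i}, \theta_i)$
where $\pi_i$ denotes the parents of node i.
[Figure: panels labeled X2=1, X2=0, X5=0, X5=1.]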

Example: decomposable likelihood of a directed model
Consider the distribution defined by the directed acyclic GM over X1, X2, X3, X4. Learning it is exactly like learning four separate small BNs, each of which consists of a node and its parents.
[Figure: DAG over X1, X2, X3, X4, and the same nodes split into families.]

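E.g.: MLE for BNs with tabular CPDs
Assume each CPD is represented as a table (multinomial), where $\theta_{ijk} = p(X_i = j \mid X_{\pi_i} = k)$.
Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.
The sufficient statistics are counts of family configurations: $n_{ijk}$ = # samples with $X_i = j$ and $X_{\pi_i} = k$.
The log-likelihood is $\ell(\theta; D) = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$.
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get $\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$.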
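Counting family configurations and normalizing within each parent configuration can be sketched as follows. The function name `fit_tabular_cpds`, the dictionary-based data layout, and the two-node toy network B → A are assumptions for illustration; parent configurations that never occur in the data simply have no entry in the resulting table.

```python
from collections import Counter

def fit_tabular_cpds(samples, parents):
    """MLE of tabular CPDs: family counts normalized within each parent configuration.

    samples: list of dicts mapping variable name -> value
    parents: dict mapping variable name -> tuple of parent names
    """
    cpds = {}
    for var, pa in parents.items():
        # n_ijk: count of each (parent configuration, child value) pair
        family = Counter((tuple(s[p] for p in pa), s[var]) for s in samples)
        parent_totals = Counter()
        for (pa_val, _), n in family.items():
            parent_totals[pa_val] += n
        cpds[var] = {(pa_val, val): n / parent_totals[pa_val]
                     for (pa_val, val), n in family.items()}
    return cpds

# Hypothetical fully observed data for a two-node network B -> A.
samples = [
    {"B": 0, "A": 0}, {"B": 0, "A": 0}, {"B": 0, "A": 1},
    {"B": 1, "A": 1}, {"B": 1, "A": 1},
]
parents = {"B": (), "A": ("B",)}
cpds = fit_tabular_cpds(samples, parents)
print(cpds["B"])
print(cpds["A"])
```

Each node is fitted independently of the others, which is the decomposition property in action: the loop body touches only a node and its parents.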

Learning partially observed GMs
[Figure: graphical model over nodes Z and X.]
The data: {(x(1)), (x(2)), (x(3)), ..., (x(N))}

What if some nodes are not observed?
Consider the distribution defined by the directed acyclic GM over X1, X2, X3, X4.
Need to compute $p(x_H \mid x_V)$ ⇒ inference
[Figure: DAG over X1, X2, X3, X4.]

Recall: EM Algorithm
A way of maximizing the likelihood function for latent variable models. It finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior), which is the E-step, and updating the parameters based on this guess, which is the M-step.
In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.
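The alternation described on this slide is commonly written as coordinate ascent on a lower bound. The block below is a sketch in standard notation, not a transcription of the slide: the symbols $F$, $q$, and $z$ (the latent variables) are assumptions introduced here, with $F(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \le \log p(x \mid \theta)$.

```latex
\begin{aligned}
\text{E-step:}\quad & q^{t+1}(z) = \arg\max_{q} F(q, \theta^{t}) = p(z \mid x, \theta^{t}) \\
\text{M-step:}\quad & \theta^{t+1} = \arg\max_{\theta} F(q^{t+1}, \theta)
  = \arg\max_{\theta} \sum_{z} q^{t+1}(z) \log p(x, z \mid \theta)
\end{aligned}
```

Setting $q$ to the posterior makes the bound tight at $\theta^{t}$, which is why the likelihood cannot decrease from one iteration to the next.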

EM for general BNs
while not converged
    % E-step
    for each node i
        ESS_i = 0    % reset expected sufficient statistics
    for each data sample n
        do inference with X_{n,H}
        for each node i
            add the expected sufficient statistics of node i under the inferred posterior to ESS_i
    % M-step
    for each node i
        θ_i := MLE(ESS_i)

Example: HMM
Supervised learning: estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$, i.e., maximum likelihood (ML) estimation.

The Baum-Welch algorithm
The complete log-likelihood and the expected complete log-likelihood.
EM: the E step, then the M step ("symbolically" identical to MLE).

Unsupervised ML estimation
Given x = x1…xN for which the true state path y = y1…yN is unknown:
EXPECTATION MAXIMIZATION
0. Start with our best guess of a model M and parameters $\theta$.
1. Estimate $A_{ij}$ and $B_{ik}$ in the training data. How? Use expected counts under the current parameters: $A_{ij} = \sum_t \langle y_{t-1}^i y_t^j \rangle$ and $B_{ik} = \sum_t \langle y_t^i \rangle x_t^k$.
2. Update $\theta$ according to $A_{ij}$ and $B_{ik}$. This is now a "supervised learning" problem.
3. Repeat 1 and 2 until convergence.
This is called the Baum-Welch algorithm. We can get to a provably more (or equally) likely parameter set $\theta$ at each iteration.
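The loop above can be sketched for a discrete HMM with a single observation sequence. This is a minimal illustration, not the lecture's implementation: the function names `forward_backward` and `baum_welch`, the scaling scheme, the initial parameters, and the roll sequence (faces coded 0 to 5) are all assumptions chosen for the example.

```python
import math

def forward_backward(x, pi, a, b):
    """Scaled forward-backward: state posteriors, pairwise posteriors, log-likelihood."""
    T, K = len(x), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(K):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * a[i][j] for i in range(K))
            alpha[t][j] = prev * b[j][x[t]]
        scale[t] = sum(alpha[t])                      # p(x_t | x_1..x_{t-1})
        alpha[t] = [v / scale[t] for v in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(a[i][j] * b[j][x[t + 1]] * beta[t + 1][j]
                             for j in range(K)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(K)] for t in range(T)]
    xi = [[[alpha[t][i] * a[i][j] * b[j][x[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(K)] for i in range(K)] for t in range(T - 1)]
    return gamma, xi, sum(math.log(s) for s in scale)

def baum_welch(x, pi, a, b, n_iter=20):
    K, M = len(pi), len(b[0])
    history = []
    for _ in range(n_iter):
        gamma, xi, loglik = forward_backward(x, pi, a, b)     # E-step
        history.append(loglik)
        # Expected counts A_ij and B_ik
        A = [[sum(xi[t][i][j] for t in range(len(x) - 1)) for j in range(K)]
             for i in range(K)]
        B = [[sum(gamma[t][i] for t in range(len(x)) if x[t] == k) for k in range(M)]
             for i in range(K)]
        # M-step: normalize expected counts, as in the supervised case
        pi = gamma[0][:]
        a = [[A[i][j] / sum(A[i]) for j in range(K)] for i in range(K)]
        b = [[B[i][k] / sum(B[i]) for k in range(M)] for i in range(K)]
    return pi, a, b, history

rolls = [0, 5, 5, 2, 5, 5, 5, 1, 3, 5, 4, 0, 2, 5, 5, 5, 5, 1, 3, 0]
pi0 = [0.5, 0.5]
a0 = [[0.9, 0.1], [0.1, 0.9]]
b0 = [[1.0 / 6.0] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]
pi_hat, a_hat, b_hat, history = baum_welch(rolls, pi0, a0, b0, n_iter=15)
print(history[0], history[-1])
```

The `history` list records the log-likelihood before each update, so it gives a direct check of the monotone-improvement property. The result depends on the starting point, since EM only finds a local optimum.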

ML Structural Learning for completely observed GMs
[Figure: data.]

Information Theoretic Interpretation of ML
From a sum over data points to a sum over counts of variable states.

Information Theoretic Interpretation of ML (cont'd)
A decomposable score and a function of the graph structure.

Structural Search
How many graphs over n nodes? How many trees over n nodes?
But it turns out that we can find an exact solution for an optimal tree (under MLE)!
Trick: in a tree each node has only one parent!
Chow-Liu algorithm

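Chow-Liu tree learning algorithm
Objective function: the ML score, which for a tree reduces to a sum of mutual-information terms, one per edge.
Chow-Liu:
For each pair of variables $x_i$ and $x_j$:
Compute the empirical distribution: $\hat{p}(x_i, x_j) = \frac{\mathrm{count}(x_i, x_j)}{N}$
Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\,\hat{p}(x_j)}$
Define a graph with nodes $x_1, \ldots, x_n$, where edge (i, j) gets weight $\hat{I}(X_i, X_j)$.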

Chow-Liu algorithm (cont'd)
Objective function: as on the previous slide.
Chow-Liu, optimal tree BN:
Compute the maximum-weight spanning tree.
Direction in the BN: pick any node as root, and do breadth-first search to define directions.
I-equivalence: different choices of root give I-equivalent trees.
[Figure: trees over nodes A, B, C, D, E.]
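The two Chow-Liu slides can be sketched end to end: empirical mutual information as edge weights, a maximum-weight spanning tree, and breadth-first orientation from a chosen root. The function names, the use of Kruskal's algorithm with a union-find forest, and the small hand-built dataset (in which variable 1 agrees with variables 0 and 2 more often than 0 and 2 agree with each other) are assumptions for illustration.

```python
import math
from collections import Counter, deque
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information between columns i and j (natural log)."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / n) * math.log(c * n / (pi[u] * pj[v])) for (u, v), c in pij.items())

def chow_liu_tree(data, n_vars, root=0):
    """Directed edges (parent, child) of a maximum mutual-information spanning tree."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))              # union-find forest for Kruskal's algorithm

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    neighbors = {v: [] for v in range(n_vars)}
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                          # keep the edge only if it joins two components
            parent[ri] = rj
            neighbors[i].append(j)
            neighbors[j].append(i)
    # Orient edges away from the root with breadth-first search.
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                queue.append(v)
    return directed

data = ([(0, 0, 0)] * 4 + [(1, 1, 1)] * 4 +
        [(0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 0)])
tree = chow_liu_tree(data, n_vars=3, root=0)
print(tree)
```

Choosing a different `root` changes only the edge directions, not the undirected skeleton.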

Structure Learning for general graphs
Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.
Most structure learning approaches use heuristics that exploit score decomposition.
Two heuristics that exploit decomposition in different ways: greedy search through the space of node orders, and local search over graph structures.

Inferring gene regulatory networks
Network of cis-regulatory pathways.
Success stories in sea urchin, fruit fly, etc., from decades of experimental research.
Statistical modeling and automated learning have just started.

Gene Expression Profiling by Microarrays

Structure Learning Algorithms
[Figure: expression data passed to a learning algorithm, producing a network over nodes A, B, C.]
Structural EM (Friedman 1998): the original algorithm.
Sparse Candidate Algorithm (Friedman et al.): discretizing array signals; hill-climbing search using local operators (add/delete/swap of a single edge); feature extraction (Markov relations, order relations); re-assembling high-confidence sub-networks from features.
Module network learning (Segal et al.): heuristic search of structure in a "module graph"; module assignment; parameter sharing; prior knowledge of possible regulators (TF genes).

Learning GM structure
Learning the best CPDs given the DAG is easy: collect statistics of the values of each node given a specific assignment to its parents.
Learning the graph topology (structure) is NP-hard: heuristic search must be applied, and it generally leads to a locally optimal network.
Overfitting: it turns out that richer structures give higher likelihood P(D|G) to the data (adding an edge is always preferable): more parameters to fit => more freedom => there always exists a more "optimal" CPD(C).
We prefer simpler (more explanatory) networks. Practical scores regularize the likelihood improvement of complex networks.
[Figure: network over nodes A, B, C.]

Learning Graphical Model Structure via Neighborhood Selection

Undirected Graphical Models
Why? Sometimes an UNDIRECTED association graph makes more sense and/or is more informative: gene expressions may be influenced by unobserved factors that are post-transcriptionally regulated.
The unavailability of the state of B results in a constraint over A and C.
[Figure: three graphs over nodes A, B, C.]

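Gaussian Graphical Models
Multivariate Gaussian density: $p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \}$
WLOG: let $\mu = 0$ and $Q = \Sigma^{-1}$, so that $p(x \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{p/2}} \exp\{ -\frac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i < j} q_{ij} x_i x_j \}$
We can view this as a continuous Markov Random Field with potentials defined on every node and edge.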

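The covariance and the precision matrices
Covariance matrix $\Sigma$: $\Sigma_{ij} = 0$ means $X_i$ and $X_j$ are marginally independent.
Graphical model interpretation?
Precision matrix $Q = \Sigma^{-1}$: $q_{ij} = 0$ means $X_i$ and $X_j$ are conditionally independent given all the other variables.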

Sparse precision vs. sparse covariance in GGM
[Figure: graph over nodes 1, 2, 3, 4, 5.]
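The contrast between a sparse precision matrix and a dense covariance matrix can be seen on a five-node example. The sketch below assumes a chain 1 - 2 - 3 - 4 - 5 with a tridiagonal precision matrix (diagonal 2, off-diagonal -1); both the chain and the numbers are illustrative assumptions, and `invert` is a small hypothetical helper so that the example needs only the standard library.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small, well-conditioned matrices)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Precision matrix of a chain: nonzero only on the diagonal and between neighbors.
Q = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(5)]
     for i in range(5)]
Sigma = invert(Q)

for row in Sigma:
    print(["%.3f" % v for v in row])
```

Every entry of `Sigma` is nonzero even though `Q` has zeros for all non-adjacent pairs: nodes 1 and 5 are marginally correlated through the chain, yet conditionally independent given the nodes in between.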

Another example
How do we estimate this MRF?
What if p >> n? The MLE does not exist in general!
What about only learning a "sparse" graphical model? This is possible when s = o(n).
Very often it is the structure of the GM that is more interesting.

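Single-node Conditional
The conditional distribution of a single node i given the rest of the nodes is a Gaussian whose mean is linear in the remaining nodes.
WLOG: let $\mu = 0$.
We can write the following conditional auto-regression function for each node: $X_i = \sum_{j \neq i} \theta_{ij} X_j + \epsilon_i$, with $\theta_{ij} = -q_{ij} / q_{ii}$ and $\epsilon_i \sim N(0, 1/q_{ii})$.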

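Conditional independence
From the single-node conditional, let $s_i$ be the neighborhood of node i, i.e., the set of nodes with nonzero regression coefficients.
Given an estimate of the neighborhood $s_i$, we have: $p(X_i \mid X_{-i}) = p(X_i \mid X_{s_i})$
Thus the neighborhood $s_i$ defines the Markov blanket of node i.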

Learning (sparse) GGM
Multivariate Gaussian over all continuous expressions.
The precision matrix $Q = \Sigma^{-1}$ reveals the topology of the (undirected) network.
Learning algorithm: covariance selection. We want a sparse matrix Q.
As shown in the previous slides, we can use $L_1$-regularized linear regression to obtain a sparse estimate of the neighborhood of each variable.
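Neighborhood selection with $L_1$-regularized regression can be sketched with coordinate descent. To keep the example deterministic, the sketch below works on a known covariance matrix instead of samples, minimizing $\frac{1}{2} \beta^T \Sigma_{-i,-i} \beta - \Sigma_{-i,i}^T \beta + \lambda \|\beta\|_1$ for each node i; that substitution, the five-node chain covariance, the penalty $\lambda = 0.05$, and the function names are all assumptions for illustration.

```python
def soft_threshold(value, lam):
    """Proximal operator of the L1 penalty."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_neighborhood(cov, i, lam, n_sweeps=500):
    """Nodes with a nonzero lasso coefficient when node i is regressed on all the others."""
    others = [j for j in range(len(cov)) if j != i]
    beta = {j: 0.0 for j in others}
    for _ in range(n_sweeps):
        for j in others:
            # Covariance of x_j with the partial residual of x_i.
            partial = cov[j][i] - sum(cov[j][k] * beta[k] for k in others if k != j)
            beta[j] = soft_threshold(partial, lam) / cov[j][j]
    return {j for j in others if beta[j] != 0.0}

# Covariance of a 5-node chain whose precision matrix is tridiagonal (2 on the
# diagonal, -1 between neighbors); entries follow the closed form for its inverse.
Sigma = [[(min(i, j) + 1) * (5 - max(i, j)) / 6.0 for j in range(5)] for i in range(5)]

neighborhoods = {i: lasso_neighborhood(Sigma, i, lam=0.05) for i in range(5)}
print(neighborhoods)
```

Each regression keeps only the chain neighbors of its node, although every pair of variables is marginally correlated. With finite samples the covariance would be replaced by the data, and the recovered graph would depend on the sample size and on the choice of the penalty.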

Learning Ising Model (i.e. pairwise MRF)
Assuming the nodes are discrete and the edges are weighted, then for a sample $x_d$ we have: $P(x_d \mid \Theta) = \exp\{ \sum_i \theta_{ii} x_{d,i} + \sum_{i < j} \theta_{ij} x_{d,i} x_{d,j} - A(\Theta) \}$
It can be shown, following the same logic, that we can use $L_1$-regularized logistic regression to obtain a sparse estimate of the neighborhood of each variable in the discrete case.

Recall lasso

Graph Regression
Lasso: neighborhood selection

Graph Regression

Graph Regression
It can be shown that, given iid samples and under several technical conditions (e.g., "irrepresentable"), the recovered structure is "sparsistent" even when p >> n.

Consistency
Theorem: for the graphical regression algorithm, under certain verifiable conditions (omitted here for simplicity):
Note that from this theorem one should see that the regularizer is not actually used to introduce an "artificial" sparsity bias, but is a device to ensure consistency under finite data and high-dimension conditions.

Recent trends in GGM
Covariance selection (classical method):
Dempster [1972]: sequentially pruning the smallest elements in the precision matrix.
Drton and Perlman [2008]: improved statistical tests for pruning.
Serious limitations in practice: breaks down when the covariance matrix is not invertible.
L1-regularization based methods (hot!):
Meinshausen and Bühlmann [Ann. Stat. 06]: used LASSO regression for neighborhood selection.
Banerjee [JMLR 08]: block sub-gradient algorithm for finding the precision matrix.
Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm.
Structure learning is possible even when # variables > # samples.

Learning GM
Learning the best CPDs given the DAG is easy: collect statistics of the values of each node given a specific assignment to its parents.
Learning the graph topology (structure) is NP-hard: heuristic search must be applied, and it generally leads to a locally optimal network.
We prefer simpler (more explanatory) networks.
Regularized graph regression.