Latent Tree Models Part II: Definition and Properties


Latent Tree Models Part II: Definition and Properties (AAAI 2014 Tutorial)
Nevin L. Zhang, Dept. of Computer Science & Engineering, The Hong Kong Univ. of Sci. & Tech. http://www.cse.ust.hk/~lzhang

Now we move on to Part II of the tutorial. Here I will give the precise definition of latent tree models, discuss their properties, and explain how they are related to other models in the literature.

Part II: Concept and Properties
Latent Tree Models: definition, relationship with finite mixture models, relationship with phylogenetic trees
Basic Properties

Basic Latent Tree Models (LTM)
A Bayesian network in which all variables are discrete, the structure is a rooted tree, the leaf nodes are observed (manifest variables), and the internal nodes are not observed (latent variables). Parameters: P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), ... Also known as hierarchical latent class (HLC) models (Zhang, JMLR 2004).

The basic latent tree model is a BN where all variables are discrete, the structure is a rooted tree, the leaf nodes are observed and are sometimes called manifest variables, and the internal nodes are not observed and are called latent variables. The probability parameters of the model are a marginal distribution for the root node and a conditional distribution for each non-root node given its parent. The product of all those distributions is a joint distribution over all the variables. In early publications, latent tree models were called hierarchical latent class models, because they generalize latent class models, a class of finite mixture models for discrete data.
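As a concrete illustration, consider only the fragment of the model named in the parameter list above, with root Y1, latent child Y2, and observed children X1 and X2 of Y2 (the rest of the slide's structure is not recoverable from the transcript). The joint distribution is simply the product of the parameter distributions:

P(Y1, Y2, X1, X2, ...) = P(Y1) P(Y2 | Y1) P(X1 | Y2) P(X2 | Y2) ...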

Joint Distribution over Observed Variables
Marginalizing out the latent variables in the joint distribution, we get a joint distribution over the observed variables. In comparison with a Bayesian network without latent variables, an LTM is computationally very simple to work with, yet it can represent complex relationships among the manifest variables. (What would the structure look like without the latent variables?)

An LTM represents a joint distribution over the observed and the latent variables. If we marginalize out the latent variables, we get a distribution over the observed variables only. So an LTM can also be said to represent a joint distribution over the observed variables. To represent such a distribution we could instead use a Bayesian network without latent variables. In comparison, the advantages of the LTM are that, on the one hand, it is computationally simple to work with because of the tree structure, and on the other hand, it can represent complex relationships among the observed variables. To see this, imagine the relationships among the observed variables in this model without the latent variables: we would need a complete model in which every pair of variables is connected by an edge. These two characteristics make LTMs a very interesting class of models. A small numerical sketch of the marginalization is given below.
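As a minimal sketch of this marginalization (the tree structure and all probability values below are made up for illustration; they are not taken from the slides), the observed-variable distribution of a toy LTM can be computed by brute-force summation over the latent variables:

```python
import itertools
import numpy as np

# Toy LTM (hypothetical): Y1 -> Y2, Y1 -> X3, Y2 -> X1, Y2 -> X2; all variables binary.
P_Y1 = np.array([0.6, 0.4])                       # P(Y1)
P_Y2_Y1 = np.array([[0.9, 0.1], [0.2, 0.8]])      # P(Y2 | Y1), row = state of Y1
P_X1_Y2 = np.array([[0.8, 0.2], [0.3, 0.7]])      # P(X1 | Y2)
P_X2_Y2 = np.array([[0.7, 0.3], [0.1, 0.9]])      # P(X2 | Y2)
P_X3_Y1 = np.array([[0.95, 0.05], [0.4, 0.6]])    # P(X3 | Y1)

def p_joint(y1, y2, x1, x2, x3):
    # Joint distribution: root marginal times one conditional per non-root node.
    return (P_Y1[y1] * P_Y2_Y1[y1, y2] *
            P_X1_Y2[y2, x1] * P_X2_Y2[y2, x2] * P_X3_Y1[y1, x3])

def p_observed(x1, x2, x3):
    # Marginalize out the latent variables Y1 and Y2.
    return sum(p_joint(y1, y2, x1, x2, x3)
               for y1, y2 in itertools.product(range(2), repeat=2))

# Sanity check: the observed-variable distribution sums to 1.
print(sum(p_observed(*x) for x in itertools.product(range(2), repeat=3)))
```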

Pouch Latent Tree Models (PLTM)
An extension of the basic LTM (Poon et al., ICML 2010). The structure is a rooted tree; the internal nodes represent discrete latent variables; each leaf node consists of one or more continuous observed variables and is called a pouch.

Pouch latent tree models are an extension of the basic LTM. The model is still a rooted tree, and the internal nodes still represent discrete latent variables. However, each leaf node consists of one or more continuous observed variables; because it may contain several variables, it is called a pouch. For each pouch, there is a conditional Gaussian distribution over its observed variables given the parent. For the pouch at the bottom left corner of the slide, for example, we have a Gaussian distribution for X1 and X2 given the parent Y2.
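In symbols, for the X1, X2 pouch with parent Y2 mentioned in the narration (the mean vectors and covariance matrices below are unspecified parameters, not values from the slides), the pouch distribution is a conditional Gaussian:

P(X1, X2 | Y2 = j) = N((X1, X2) | μ_j, Σ_j)   for each state j of Y2,

that is, one joint Gaussian over the variables in the pouch for every state of the parent latent variable.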

More General Latent Variable Tree Models
Some internal nodes can be observed; internal nodes can be continuous; the structure can be a forest. Primary focus of this tutorial: the basic LTM. (Choi et al., JMLR 2011)

In the literature, there are more general latent variable tree models. In some cases, some internal nodes can be observed, as in this example taken from Choi et al. (2011), where the nodes SLB and UTX are observed. In addition, the internal nodes may be continuous, and the overall structure may be a forest rather than a tree. In this tutorial, our primary focus is on the basic LTM, although we will touch on the other models here and there.

Part II: Concept and Properties
Latent Tree Models: definition, relationship with finite mixture models, relationship with phylogenetic trees
Basic Properties

Latent tree models are closely related to finite mixture models. We will see how in the next few minutes.

Finite Mixture Models (FMM)
Gaussian mixture models (GMM): continuous attributes; graphical model.

The most common finite mixture model is the Gaussian mixture model. It is for continuous data. It assumes that the data consist of K clusters. The distribution of each cluster is a normal distribution with mean vector μ_k and covariance matrix Σ_k. The distribution of the whole data set is a mixture of those Gaussian distributions with mixing coefficients π_k. The red figure on the slide is a trivial graphical representation of a GMM: the latent variable Z represents the K clusters, while X represents the vector of attributes. If we spell out all the attributes, we get the graphical model on the right.
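Written out with the quantities named in the narration, the GMM density of an attribute vector x is

p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),

where the mixing coefficients π_k are non-negative and sum to one.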

Finite Mixture Models (FMM)
GMM with independence assumptions: block diagonal covariance matrix; graphical model.

Sometimes independence assumptions are made, that is, the covariance matrix is assumed to have a block diagonal form. In this example, X3 is assumed to be independent of X1 and X2 within each cluster. To represent the independence graphically, we can use two pouches, one containing X1 and X2 and the other containing X3. To put it another way, the figure at the bottom of the slide shows a GMM with an independence assumption.
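Under this block diagonal assumption, and using the X1, X2, X3 example from the narration, the density of each cluster factorizes as

p(x1, x2, x3 | Z = k) = N((x1, x2) | μ_{k,12}, Σ_{k,12}) · N(x3 | μ_{k,3}, σ²_{k,3}),

where the subscripts 12 and 3 (notation introduced here, not from the slides) label the two blocks of the mean vector and covariance matrix.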

Finite Mixture Models (FMM)
Latent class models (LCM): discrete attributes; the distribution for cluster k is a product multinomial distribution. In all FMMs there is one latent variable, yielding one partition of the data. Graphical model.

The counterpart of the GMM for discrete data is the latent class model. It assumes that the data consist of K clusters. Within each cluster, the attributes are mutually independent, so the distribution for each cluster is a product multinomial distribution. The distribution of the whole data set is a mixture of those product multinomial distributions. In all finite mixture models, there is only one latent variable, and only one partition of the data is produced.
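In symbols (with Y denoting the single latent class variable and X1, ..., Xn the discrete attributes), the LCM distribution is

P(X1, ..., Xn) = Σ_{k=1}^{K} P(Y = k) ∏_{i=1}^{n} P(Xi | Y = k),

a mixture of K product multinomial distributions, as described above.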

From FMMs to LTMs
Start with several GMMs, each based on a distinct subset of attributes. Each partitions the data from a certain perspective, and the different partitions are independent of each other. Link them up to form a tree model and we get a pouch LTM, which considers the different perspectives in a single model and yields multiple partitions of the data that are correlated.

Conceptually, a latent tree model can be viewed as a collection of finite mixture models that are linked up to form one single model. For example, we can start with the three GMMs on the slide. The first two have independence assumptions, while the third one does not. They are based on different attributes, and each can produce a partition of the data, so three partitions are obtained. Different partitions focus on different aspects of the data, and they are independent of each other. Now we can link them up to form one single model. Here an extra latent variable is introduced; in general, this is not necessary. The bigger model produces multiple partitions of the data, just as the collection of smaller models above does. The only difference is that the different partitions are now related.

From FMMs to LTMs
Start with several LCMs, each based on a distinct subset of attributes. Each partitions the data from a certain perspective, and the different partitions are independent of each other. Link them up to form a tree model and we get an LTM, which considers the different perspectives in a single model and yields multiple partitions of the data that are correlated.

Here we have three latent class models, each based on a distinct subset of attributes. They can give us three independent partitions of the data, each focusing on one aspect of the data. If we link up the three models, we get the overall model on the slide, which produces three related partitions of the data. In summary, an LTM can be viewed as a collection of FMMs with their latent variables linked up to form a tree structure. In this sense, LTM is an extension of LCM, and pouch LTM is an extension of GMM.

Part II: Concept and Properties
Latent Tree Models: definition, relationship with finite mixture models, relationship with phylogenetic trees
Basic Properties

Latent tree models are also closely related to phylogenetic trees.

Phylogenetic Trees
Taxa (sequences) identify species; edge lengths represent evolution time; usually a bifurcating tree topology. Durbin et al. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Phylogenetic trees depict the evolutionary history of species. Each species is represented by a sequence, called a taxon. The edge lengths represent evolution time. Usually, phylogenetic trees are binary (bifurcating), meaning that at each branching one species evolves into two.

Probabilistic Models of Evolution
Two assumptions: (1) there are only substitutions, no insertions or deletions (the sequences are aligned), so there is a one-to-one correspondence between sites in different sequences; (2) each site evolves independently and identically. Under these assumptions,

P(x | y, t) = ∏_{i=1}^{m} P(x(i) | y(i), t),   where m is the sequence length.

The site transition probability P(x(i) | y(i), t) is given by the Jukes-Cantor (character evolution) model [1969], with rate of substitution α.

In probabilistic models of evolution, it is typically assumed that evolution happens only through substitutions of characters; there are no insertions or deletions. It is further assumed that each site evolves independently. Under those assumptions, the probability that a sequence y evolves into another sequence x in time t is the product of the site evolution probabilities, where P(x(i) | y(i), t) is the probability that the character at site i of y evolves into the character at site i of x in time t. In the Jukes-Cantor character evolution model, this probability is given by the matrix on the slide, where α is the rate of substitution. When t = 0, the matrix is the identity matrix, which indicates no evolution; as t goes to infinity, all the entries of the matrix go to 1/4.
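The Jukes-Cantor matrix itself did not survive the transcript, so the sketch below uses one common parameterization of it: the probability of staying at the same character after time t is 1/4 + (3/4)exp(-4αt), and the probability of changing to any particular other character is 1/4 - (1/4)exp(-4αt). This reproduces the two limiting behaviors described above (the identity matrix at t = 0, and all entries tending to 1/4 as t grows):

```python
import numpy as np

def jukes_cantor(alpha, t):
    """4x4 site transition matrix P(x(i) | y(i), t) under the Jukes-Cantor model."""
    e = np.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * e          # probability of no net change at the site
    diff = 0.25 - 0.25 * e          # probability of changing to a specific other character
    return np.full((4, 4), diff) + (same - diff) * np.eye(4)

print(jukes_cantor(0.1, 0.0))       # identity matrix: no time, no evolution
print(jukes_cantor(0.1, 1000.0))    # every entry is approximately 1/4
```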

Phylogenetic Trees are Special LTMs
When we focus on one site, phylogenetic trees are special latent tree models: the structure is a binary tree, the variables share the same state space, and each conditional distribution is characterized by only one parameter, namely the length of the corresponding edge.

Because different sites evolve independently and identically, we can focus on the evolution of one site. When we do that, a phylogenetic tree becomes a latent tree model in which the structure is binary, every observed and latent variable shares the same state space (A, C, G, T), and each conditional distribution is characterized by a single parameter, the length of the corresponding edge. So latent tree models can be viewed as a generalization of phylogenetic trees in which a node can have more than two children, different variables may have different state spaces, and the conditional distributions are general multinomial distributions.

Hidden Markov Models
Hidden Markov models are also special latent tree models: all latent variables share the same state space, all observed variables share the same state space, and P(yt | st) and P(st+1 | st) are the same for different t's.

Finally, hidden Markov models are obviously special latent tree models.
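Concretely, for hidden states s1, ..., sT and observations y1, ..., yT, the HMM is the latent tree model (a chain of latent variables, with one observed leaf attached to each) whose joint distribution is

P(s1, ..., sT, y1, ..., yT) = P(s1) ∏_{t=1}^{T-1} P(st+1 | st) ∏_{t=1}^{T} P(yt | st),

with the transition distribution P(st+1 | st) and the emission distribution P(yt | st) shared across t, as stated above.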

Part II: Concept and Basic Properties
Latent Tree Models: definition, relationship with finite mixture models, relationship with phylogenetic trees
Basic Properties

I have now finished discussing the concept of latent tree models and how they are related to various other models. Next, I will discuss some basic properties of latent tree models.

Two Concepts of Models
So far, a model consists of: observed and latent variables; connections among the variables; probability values. For the rest of Part II, a model consists of: observed and latent variables; connections among the variables; probability parameters.

To discuss these properties, it is helpful to distinguish between two different concepts of a model. So far in this tutorial, our view has been that a model consists of some observed and latent variables, connections among the variables, and probability values. For the rest of Part II, we take the view that a model consists of some observed and latent variables, connections among the variables, and probability parameters rather than probability values.

Model Inclusion
Let me first introduce the notion of model inclusion. Suppose we have two latent tree models m and m' with the same observed variables. By setting the probability parameters to different values, each model represents different distributions over the observed variables, so each model corresponds to a set of joint distributions over the observed variables. We say that m includes m' if the collection of joint distributions that m can represent is a superset of the collection that m' can represent. In other words, for every vector of parameter values θ' for m', we can find a vector of parameter values θ for m such that the two models give the same distribution over the observed variables.
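Written out (with X denoting the observed variables and P(X | m, θ) the distribution that model m with parameter values θ defines over them), the definition in the narration is:

m includes m'  if and only if  for every θ' there exists θ such that P(X | m, θ) = P(X | m', θ').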

Model Equivalence
If m includes m' and vice versa, then they are marginally equivalent. If they also have the same number of free parameters, then they are equivalent. It is not possible to distinguish between equivalent models based on data.

If m includes m' and m' also includes m, then the two models can represent exactly the same collection of distributions over the observed variables. In that case, we say that they are marginally equivalent. If they also have the same number of free parameters, then we say that they are equivalent. It is not possible to distinguish between equivalent models based on data.

Root Walking
Root walking is an operation that we can apply to a latent tree model. It changes the root of the model; to be more specific, it makes a neighbor of the current root the new root.

Root Walking Example
Root walks to Y2; root walks to Y3.

As an example, suppose we start from the model at the top, whose root is Y1. We can let the root walk from Y1 to Y2; then we get the model on the left, where the root is Y2. On the other hand, if we let the root walk from Y1 to Y3, we get the model on the right, where the root is Y3.

Root Walking
Theorem: Root walking leads to equivalent latent tree models (Zhang, JMLR 2004). This is a special case of covered arc reversal in general Bayesian networks: Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. UAI.

It has been shown that root walking leads to equivalent latent tree models. In our example, all three models are equivalent. The result is a special case of a more general result for Bayesian networks called covered arc reversal.

Implication
Edge orientations in latent tree models are not identifiable. Technically, it is better to start with an alternative definition of LTM: a latent tree model is a Markov random field over an undirected tree, or a tree-structured Markov network, in which the variables at the leaf nodes are observed and the variables at the internal nodes are hidden.

The implication of the theorem is that edge orientations in latent tree models cannot be determined from data. Given this fact, a technically cleaner way is to define an LTM as a Markov random field over an undirected tree, where the leaf variables are observed and the internal variables are not.
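One way to see why no directions are needed: for a tree-structured Markov network, the joint distribution can be expressed purely in terms of node and edge marginals,

P(x1, ..., xn) = ∏_{nodes v} P(xv) · ∏_{edges (u,v)} P(xu, xv) / (P(xu) P(xv)),

which is symmetric in the two endpoints of every edge.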

Implication
For technical convenience, we often root an LTM at one of its latent nodes and regard it as a directed graphical model. Rooting the model at different latent nodes leads to equivalent directed models. This is why we introduced LTMs as directed models.

For convenience, we often root an LTM at one of its latent nodes and regard it as a directed model. The choice of root does not matter, as different choices lead to equivalent models.

Regularity
|X|: cardinality of variable X, i.e., the number of states.

The next issue is regularity. An LTM is regular if each latent variable does not have too many states relative to its neighbors. To be specific, the cardinality of a latent variable Z should be no greater than the product of the cardinalities of all its neighbors divided by the maximum of the neighbor cardinalities; the inequality is strict when Z has only two neighbors. The condition is written out below.
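Writing neighbors(Z) for the set of neighbors of Z in the tree (notation introduced here for convenience), the condition in the narration is

|Z| ≤ ( ∏_{X ∈ neighbors(Z)} |X| ) / ( max_{X ∈ neighbors(Z)} |X| ),

with strict inequality required when Z has exactly two neighbors.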

Regularity
We can focus on regular models only: irregular models can be made regular, and the regularized models are better than the irregular ones. Theorem: The set of all regular models for a given set of observed variables is finite (Zhang, JMLR 2004).

It has been shown that an irregular model can be reduced to another model that is marginally equivalent and has fewer free parameters. As such, we can focus on regular models only. It has also been shown that, for a given set of observed variables, there are only finitely many regular models.