Latent Tree Models Part I: Concept and Algorithms


1 Latent Tree Models Part I: Concept and Algorithms
Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology

2 Latent Tree Models (LTMs)
Tree-structured Bayesian network. All variables are discrete. Variables at leaf nodes are observed; variables at internal nodes are latent. Parameters: P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), … Semantics: the joint distribution is the product of these conditional distributions. Also known as hierarchical latent class (HLC) models (Zhang, JMLR 2004).
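For concreteness, here is a minimal sketch of the semantics, with made-up binary variables and CPTs (not taken from the slides) for a tiny LTM Y1 -> Y2 -> {X1, X2}: the joint distribution is the product of the CPTs, and the distribution over the observed leaves is obtained by summing out the latent variables.

```python
import numpy as np

# Made-up CPTs for a tiny LTM: Y1 -> Y2 -> {X1, X2}, all variables binary.
p_y1 = np.array([0.6, 0.4])                      # P(Y1)
p_y2_given_y1 = np.array([[0.8, 0.3],            # P(Y2 | Y1): rows Y2, columns Y1
                          [0.2, 0.7]])
p_x1_given_y2 = np.array([[0.9, 0.1],            # P(X1 | Y2)
                          [0.1, 0.9]])
p_x2_given_y2 = np.array([[0.7, 0.2],            # P(X2 | Y2)
                          [0.3, 0.8]])

# Semantics: the joint distribution is the product of the CPTs.
joint = np.einsum("a,ba,cb,db->abcd",
                  p_y1, p_y2_given_y1, p_x1_given_y2, p_x2_given_y2)

# Marginal over the observed leaves, obtained by summing out Y1 and Y2.
p_x1_x2 = joint.sum(axis=(0, 1))
print(p_x1_x2, p_x1_x2.sum())                    # a valid distribution, sums to 1.0
```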

3 Extensions of Latent Tree Models
Pouch latent tree models: rooted tree; internal nodes represent discrete latent variables; each leaf node consists of one or more continuous observed variables, called a pouch. (Poon et al. ICML 2010) There are variations of the basic latent tree model. One variation is the pouch latent tree model, where a leaf node represents one or more continuous observed variables.

4 Extensions of Latent Tree Models
Internal nodes can be observed and/or continuous. Structure can be a forest. Primary focus of this talk: the basic LTM. (Choi et al. JMLR 2011) (Mourad et al. 2011) In other variations, internal nodes can be observed and/or continuous, and the overall structure can be a forest rather than a tree. In this talk, I will focus on the basic latent tree model.

5 Co-Occurrence Modelling and Multidimensional Clustering
Repeated event co-occurrences might be due to common hidden causes or direct correlations, OR be coincidental, especially in big data. Challenge: identify co-occurrences due to hidden causes or correlations. LTMs are useful here because, by fitting a latent tree model, we can detect co-occurrences that can be statistically explained by a tree of latent variables. In addition, each latent variable gives a way to partition the data. Hence, LTMs are a tool for multidimensional clustering.

6 Modelling Word Co-Occurrences in Documents
Latent variables at the lowest level reveal word co-occurrence patterns; at higher levels they reveal co-occurrences of patterns at the level below. Latent states identify clusters of documents and can be interpreted as topics. LTMs can thus be used as a tool for hierarchical topic detection, which significantly outperforms alternative methods.

7 Modelling Co-Consumption of Items in Online Transaction Data
Latent variables reveal co-consumption of items by users. Latent states identify user clusters with different tastes. A user should share the interests of other users with the same tastes. This leads to a novel collaborative filtering method for implicit feedback, which outperforms all alternatives.

8 Modelling Symptom Co-Occurrences in Medical Data
Latent variables reveal co-occurrences of symptoms. Latent states identify patient clusters with the co-occurrence patterns, which can be targets for further basic research. This is of fundamental importance to traditional Chinese medicine (TCM) because TCM patient class definitions are subjective and vague.

9 Analysis of Gene Expression Data
(Gitter et al., 2016) Use latent tree models to model co-expression of genes; helpful in reconstructing transcriptional regulatory networks. Recently, Gitter and others have learned a latent tree model to explain co-expression of genes, which can help to reconstruct transcriptional regulatory networks. So, there are many things we can do with latent tree models.

10 Modelling Co-Occurrences (Correlations) in Marketing Data
Latent variables reveal correlations among customers' responses to different products (here, beers sold in Denmark). Latent states identify customer clusters with different opinions. Useful for marketing strategy formation.

11 Outline Part I: Concept and Algorithms Theoretical issues
Basic learning algorithms Scaling up Learning hierarchical models Part II: Applications Hierarchical topic detection Item recommendation with implicit feedback Others

12 Identifiability Issues
A root change leads to an equivalent model, so edge orientations are unidentifiable. Hence, we are really talking about undirected models: an undirected LTM represents an equivalence class of directed LTMs. In implementation, it is represented as a directed model instead of an MRF so that the partition function is always 1. LTMs have some intrinsic identifiability issues. First, changing the root of the model leads to an equivalent model. This implies that edge orientations are not identifiable. So, when we talk about latent tree models, we are really talking about undirected models. In implementation, we still represent an LTM as a directed model, instead of a Markov random field, so that the partition function is always 1. (Zhang, JMLR 2004)

13 Identifiability Issues
|X|: cardinality of variable X, i.e., the number of states. (Zhang, JMLR 2004) Also, latent variables cannot have too many states; otherwise, the model would not be identifiable (it would not be regular). Theorem: the set of all regular models for a given set of observed variables is finite.

14 Outline Part I: Concept and Algorithms Theoretical issues
Basic learning algorithms Scaling up Learning hierarchical models Part II: Applications Hierarchical topic detection Item recommendation with implicit feedback Others

15 Latent Tree Analysis (LTA)
Learning latent tree models: Determine Number of latent variables Numbers of possible states for each latent variable Connections among variables Probability distributions

16 Latent Tree Analysis (LTA)
Learning latent tree models: Determine Number of latent variables Numbers of possible states for each latent variable Connections among variables Probability distributions

17 Three Settings for Algorithm Development
Setting 1: CLRG (Choi et al., 2011; Huang et al., 2015). Assume data are generated from an unknown LTM. Investigate properties of LTMs and use them for learning, e.g., recover model structure from the tree additivity of information distance. Theoretical guarantees to recover the generative model under conditions. Setting 2: EAST, BI (Chen et al., 2012; Liu et al., 2013). Do not assume data are generated from an LTM. Try to find the model with the highest BIC score via search or heuristics. It does not make sense to talk about theoretical guarantees. Obtains better models than Setting 1 because the assumption is usually untrue.

18 Setting 1 Assumption: Data generated from an unknown LTM.
Task: Recover the generative model from data

19 The Strategy Define a distance between variables that is additive over trees. Estimate distances between observed variables from data. Infer model structure from those distance estimates. Further assumptions: latent variables have equal cardinality, and it is known; in some cases, it equals the cardinality of the observed variables. Or, all variables are continuous. Representative algorithms: Recursive Grouping (Choi et al., JMLR 2011); Neighbor Joining (Saitou & Nei, 1987; Studier & Keppler, 1988, for phylogenetic trees). The basic idea of distance-based algorithms is to define a distance measure between variables that is additive over trees. I will explain what this means in a moment. When analyzing data, we estimate distances between observed variables from data, and infer model structure from those distance estimates.

20 Information Distance

Information distance between two discrete variables Xi and Xj (Lake 1994). For two discrete variables Xi and Xj with equal cardinality, the information distance is d(Xi, Xj) = -ln ( |det P_XiXj| / sqrt( det diag(P_Xi) * det diag(P_Xj) ) ). The numerator is the absolute value of the determinant of the joint probability matrix P_XiXj. When both variables are binary, the joint probability matrix is: at cell (0,0), the probability of Xi=0 and Xj=0; at cell (0,1), the probability of Xi=0 and Xj=1; and so on. In the denominator, we have the determinants of the marginal probability matrices of Xi and Xj. The marginal probability matrix of Xi is diagonal: at cell (0,0), the probability of Xi=0; at cell (1,1), the probability of Xi=1; the off-diagonal cells are 0.
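A minimal sketch of this computation, assuming the two variables have the same number of states so that the joint probability matrix is square:

```python
import numpy as np

def information_distance(joint):
    """Information distance from a square joint probability matrix
    P(Xi, Xj); rows index Xi, columns index Xj."""
    mi = np.diag(joint.sum(axis=1))     # diagonal marginal matrix of Xi
    mj = np.diag(joint.sum(axis=0))     # diagonal marginal matrix of Xj
    return -np.log(abs(np.linalg.det(joint)) /
                   np.sqrt(np.linalg.det(mi) * np.linalg.det(mj)))

# Example with two binary variables (made-up joint distribution)
print(information_distance(np.array([[0.4, 0.1],
                                     [0.2, 0.3]])))
```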

21 Additivity of Information Distance on Trees
Theorem (Erdos, Szekely, Steel, & Warnow, 1999): If a probability distribution factorizes according to a tree, then the information distances between variables are additive with respect to the tree, i.e., the distance between any two nodes is the sum of the distances along the edges on the path between them. It has been shown that the information distance is additive over trees. In the example, the distance between nodes 1 and 2 equals d(1, h1) + d(h1, h2) + d(h2, 2).
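To illustrate the theorem numerically, here is a small check on a made-up chain X1 - Y - X2 (X1 and X2 conditionally independent given Y), reusing the information_distance function from the previous sketch:

```python
import numpy as np

p_y = np.array([0.5, 0.5])
p_x1_given_y = np.array([[0.9, 0.2],    # [x1, y]
                         [0.1, 0.8]])
p_x2_given_y = np.array([[0.7, 0.4],    # [x2, y]
                         [0.3, 0.6]])

p_x1_y = p_x1_given_y * p_y                              # joint P(X1, Y)
p_y_x2 = (p_x2_given_y * p_y).T                          # joint P(Y, X2)
p_x1_x2 = p_x1_given_y @ np.diag(p_y) @ p_x2_given_y.T   # joint P(X1, X2)

# Additivity: d(X1, X2) = d(X1, Y) + d(Y, X2); the two printed values agree.
print(information_distance(p_x1_x2))
print(information_distance(p_x1_y) + information_distance(p_y_x2))
```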

22 Testing Node Relationships
The additivity of information distance allows us to determine relationships among nodes. For example, if j is a leaf node and i is the parent of j, then for any other node k we have d_jk = d_ij + d_ik; in other words, d_jk - d_ik = d_ij. This is true no matter what k is, so the difference d_jk - d_ik is a constant. Interestingly, the equality holds only when j is a leaf and i is its parent. In the first figure on the right, j is not a leaf and the equality does not hold. In the second figure, i is not the parent of j and the equality does not hold either. So, when the equality holds, we know that j is a leaf and i is its parent. The equality does not hold if j is not a leaf, or if i is not the parent of j.

23 Testing Node Relationships
If i and j are leaf nodes and they share the same parent h, then for any other node k, the difference d_jk - d_ik equals d_jh - d_ih. This implies that the difference d_jk - d_ik is a constant: it does not change with k. And obviously, the constant lies strictly between -d_ij and d_ij. This allows us to determine whether two leaf nodes are siblings: for any pair of nodes i and j, if the difference d_jk - d_ik does not change with k and lies strictly between -d_ij and d_ij, then i and j must be leaf nodes and siblings.

24 The Recursive Grouping (RG) Algorithm
RG is an algorithm that determines model structure using the two properties mentioned earlier. Explain RG with an example: assume data are generated by the following model; the data contain no information about latent variables; the task is to recover the model structure. Recursive grouping is an algorithm for determining model structure using the two properties we just discussed. I will explain the algorithm with an example. We assume this is the true model. Note that the internal node 2 is observed, so we are talking about general tree models, rather than the basic latent tree model, where all internal nodes are latent. We assume that data on the observed nodes are generated from this model, and that the data allow us to compute the distances between observed nodes exactly.

25 Recursive Grouping

Step 1: Estimate from data the information distance between each pair of observed variables. Step 2: Identify (leaf, parent-of-leaf) and (leaf-sibling) pairs. For each pair i, j: if d_jk - d_ik = c (a constant) for all k != i, j, then: if c = d_ij, then j is a leaf and i is its parent; if -d_ij < c < d_ij, then i and j are leaves and siblings. Here is the algorithm. It first estimates the information distance between each pair of observed nodes from data. Then it considers every pair of nodes i and j and tries to identify (leaf, parent-of-leaf) pairs and leaf-sibling pairs. To do so, it computes the difference d_jk - d_ik for all other nodes k. If the difference is a constant and equals d_ij, then j is a leaf and i is the parent of j. If the constant is not d_ij but lies between -d_ij and d_ij, then i and j are leaf nodes and siblings. In this example, we are able to determine that 4 is a leaf and 2 is its parent, and that 5 and 6 are leaves and siblings.
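A sketch of the Step 2 test, assuming d is a symmetric dict of estimated pairwise information distances keyed by node pairs; with estimated (noisy) distances, the exact equalities become tolerance checks:

```python
def classify_pair(i, j, nodes, d, tol=1e-6):
    """Test the relationship between nodes i and j from pairwise
    information distances d (symmetric dict keyed by node pairs)."""
    diffs = [d[j, k] - d[i, k] for k in nodes if k not in (i, j)]
    if max(diffs) - min(diffs) > tol:        # difference is not constant in k
        return None
    c = diffs[0]
    if abs(c - d[i, j]) < tol:
        return "j is a leaf and i is its parent"
    if abs(c + d[i, j]) < tol:
        return "i is a leaf and j is its parent"
    if -d[i, j] < c < d[i, j]:
        return "i and j are leaves and siblings"
    return None
```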

26 Recursive Grouping NOTE: No need to determine model parameters here.
Step 3: Introduce a hidden parent node for each sibling group without a parent. In Step 3, we introduce a new latent variable for each group of nodes that are known to be leaves and siblings of each other. In our example, we introduce a latent variable h1 for 5 and 6. Note that there is no need to determine the model parameters here.

27 Recursive Grouping Step 4. Compute the information distances for the new hidden nodes. We then calculate the distance between each new node and the other nodes using additivity. For example, d(5, h1) = ( d(5, 6) + d(5, 3) - d(6, 3) ) / 2.

28 Recursive Grouping Step 5. Remove the identified child nodes and repeat Steps 2-4. Parameters of the final model can be determined using EM if needed. Next we remove the nodes that are already known to be children of other nodes. In our example, we remove 4, 5 and 6. After that, we have 1, 2, h1 and 3 to work on. By repeating the procedure, we find that 1 and 2 should be the children of another latent variable, and that h1 and 3 should be the children of a third latent variable. So we introduce two more latent variables. After that, we remove 1, 2, h1 and 3 from consideration, and we are left with only h2 and h3, which we simply connect. The parameters can be determined by EM if necessary.

29 CL Recursive Grouping (CLRG)
Making Recursive Grouping more efficient. Step 1: Construct the Chow-Liu tree over the observed variables only, i.e., the maximum spanning tree with mutual-information edge weights (equivalently, a minimum spanning tree with respect to information distance). It turns out that RG is not very efficient. CLRG is proposed to speed up RG. The basic idea is to first construct a Chow-Liu tree among the observed variables.
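A minimal sketch of Step 1, assuming dist is a dict of estimated pairwise information distances; the Chow-Liu tree is built here as a minimum spanning tree over those distances:

```python
import itertools
import networkx as nx

def chow_liu_tree(nodes, dist):
    """Chow-Liu tree over the observed nodes: minimum spanning tree with
    respect to estimated information distances (equivalently, a maximum
    spanning tree with mutual-information weights)."""
    g = nx.Graph()
    for a, b in itertools.combinations(nodes, 2):
        g.add_edge(a, b, weight=dist[a, b])
    return nx.minimum_spanning_tree(g, weight="weight")
```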

30 CL Recursive Grouping (CLRG)
Step 2: Select an internal node and its neighbors, and apply the recursive grouping (RG) algorithm. (Much cheaper.) Step 3: Replace the sub-tree spanning the neighborhood with the output of RG. RG is thus applied to the neighborhood of each internal node.

31 CL Recursive Grouping (CLRG)
Repeat Steps 2-3 until all internal nodes have been operated on. Theorem: Both RG and CLRG are consistent. With sufficient data generated from an LTM, they recover the generative model correctly with high probability. Note: the theorem is not applicable if the data are not generated from a tree model. It has been shown that both RG and CLRG are consistent; in other words, they can correctly reconstruct the generative structure under some conditions.

32 Result of CLRG on subset of 20 Newsgroups Dataset
(BIC Score: -116,973) This is the structure that CLRG obtained on the newsgroups data. It is quite interesting. For example, in this area we have all the words about sports; in this area we have all the words related to computers; and so on. It is more compact than the structure given by BIN-G, but not as compact as the structure given by BI. A special feature is that it has some observed variables at internal nodes.

33 Setting 2 Do not assume that data are generated from an LTM.
It does not make sense to talk about theoretical guarantees. Strategy: find the model with the highest BIC score, BIC(m|D) = log P(D|m, θ*) - (d/2) log N, where d is the number of free parameters, N is the sample size, and θ* is the MLE of θ, estimated using the EM algorithm. The likelihood term of BIC measures how well the model fits the data; the second term is a penalty for model complexity. Representative algorithms: EAST (Chen et al., 2012), BI (Chen et al., 2012; Liu et al., 2013).
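A small sketch of the score computation; the parameter count follows the usual Bayesian-network counting, and both functions are illustrative rather than taken from the papers:

```python
import math

def num_free_params(card, parent):
    """Number of free parameters d of a tree-structured Bayesian network.
    card: dict mapping each node to its cardinality;
    parent: dict mapping each node to its parent (None for the root)."""
    d = 0
    for node, k in card.items():
        p = parent[node]
        d += (k - 1) * (card[p] if p is not None else 1)
    return d

def bic_score(loglik, d, n):
    """BIC(m | D) = log P(D | m, theta*) - (d / 2) * log N."""
    return loglik - 0.5 * d * math.log(n)
```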

34 The Bridged Islands (BI) Algorithm
Partition the observed variables into clusters. Partition criterion: the clusters must be unidimensional (UD). Learn a latent class model (LCM) for each cluster: these are the islands. Link up the islands using Chow-Liu's algorithm: a mutual information (MI)-based maximum spanning tree over the latent variables. The bridged islands (BI) algorithm has three steps. In the first step, it partitions the observed variables into subsets, and in the second step, it learns a latent class model for each subset. These latent class models are referred to as islands. In the third step, BI links up the islands to get a flat model. This is done using Chow-Liu's algorithm: we estimate the MI between every pair of latent variables, form a complete weighted graph with the MI values as edge weights, and find the maximum spanning tree of the weighted graph (see the sketch below). A key question for BI is how to divide the observed variables into clusters. What criterion do we use for this task? The answer is that the variable clusters must be unidimensional. Next, we explain what this means.
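A top-level sketch of the three steps; build_islands and estimate_latent_mi are hypothetical helpers (the first returns unidimensional variable clusters, the second estimates the mutual information between the latent variables of two islands):

```python
import itertools
import networkx as nx

def bridged_islands(data, variables, build_islands, estimate_latent_mi):
    """Top-level sketch of BI under the assumptions stated above."""
    islands = build_islands(data, variables)          # Steps 1-2: islands (LCMs)
    g = nx.Graph()
    for a, b in itertools.combinations(range(len(islands)), 2):
        g.add_edge(a, b, weight=estimate_latent_mi(data, islands[a], islands[b]))
    # Step 3: bridge the islands with an MI-based maximum spanning tree.
    return nx.maximum_spanning_tree(g, weight="weight"), islands
```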

35 Unidimensionality Unidimensionality (UD) Test
A set of variables is unidimensional if the correlations among them can be properly modeled using a single latent variable. Unidimensionality (UD) test. Example: S = {A, B, C, D, E}. Learn two models: m1, the best LTM with one latent variable, and m2, the best LTM with two latent variables. The UD-test passes if and only if BIC(m2|D) - BIC(m1|D) <= δ for some threshold δ. Conceptually, a set of variables is unidimensional if the correlations among them can be properly modeled using a single latent variable. We determine whether a set S is unidimensional using the unidimensionality test, or the UD-test for short. To perform the test, we learn two models: the first is the best model among all models that contain only one latent variable, and the second is the best model with two latent variables. Then we compare the BIC scores of the two models. If the BIC score of the two-latent-variable model does not exceed that of the single-latent-variable model by more than the threshold δ, we conclude that the data can be properly modeled using a single latent variable, and the test passes. In other words, if the use of two latent variables does not give a significantly better model, then we use one latent variable.
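A sketch of the test; learn_best_ltm is a hypothetical helper that returns the best LTM over the given variables with the requested number of latent variables, together with its BIC score:

```python
def ud_test(data, variables, learn_best_ltm, delta=3.0):
    """Unidimensionality test (a sketch under the assumptions above)."""
    m1, bic1 = learn_best_ltm(data, variables, num_latent=1)
    m2, bic2 = learn_best_ltm(data, variables, num_latent=2)
    passed = (bic2 - bic1) <= delta    # two latent variables do not help much
    return passed, m1, m2
```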

36 Bayes Factor The Bayes factor uses the ratio of marginal likelihoods to compare two models: K = P(D|M2) / P(D|M1). The strength of evidence in favor of M2 depends on the value of K (Kass and Raftery 1995). The UD-test is related to the Bayes factor from statistics. The Bayes factor compares two models M1 and M2 based on data. Unlike the likelihood ratio test, the two models are not required to be nested. The strength of evidence in favor of M2 depends on the value of K: if 2 ln K is between 0 and 2, the evidence is negligible; if it is between 2 and 6, there is positive evidence in favor of M2; if it is between 6 and 10, there is strong evidence in favor of M2; and if it is larger than 10, there is very strong evidence in favor of M2.

37 Bayes Factor and UD-Test
The statistic U = BIC(m2|D) - BIC(m1|D) is a large-sample approximation of ln K. The strength of evidence in favor of two latent variables depends on U. In the UD-test, we usually set δ = 3: conclude a single latent variable if there is no strong evidence for more than one latent variable. The statistic we use in the UD-test, BIC(m2) - BIC(m1), is a large-sample approximation of ln K. So, if U is between 1 and 3, there is positive evidence in favor of two latent variables; if U is between 3 and 5, there is strong evidence in favor of two latent variables. In the UD-test, we usually set δ = 3. This means that we conclude there should be a single latent variable if there is no strong evidence suggesting multiple latent variables.

38 Building Islands Build first Island
Start with the three variables that are most strongly correlated. Grow the cluster by adding other strongly correlated variables one by one. Perform the UD-test after each step. Obtain the first island when the UD-test fails. (Optimize the cardinality of Y if desired.) Repeat on the remaining variables for more islands. How do we divide the observed variables into unidimensional subsets? In other words, how do we obtain the islands? To answer this question, we need only consider how to build the first island, because after building it, we can remove its variables from the data set and repeat the same process to build the second island, the third island, and so on (see the sketch below). To build the first island, we start with the three variables that are most strongly correlated. Then we add other variables to the cluster one by one. At each step, we consider the variable that is most closely related to the variables already in the island, and we perform the UD-test to determine whether the island is still unidimensional if the variable is added. In our running example, assume we start with the variables "nasa, space, shuttle". Next, we add "mission"; the UD-test passes. So we add another variable, "moon"; again, the UD-test passes. So we add another variable, "lunar". This time, the UD-test fails: the model with two latent variables is better than the model with one latent variable, so we stop the process. There are two islands in the two-latent-variable model; we pick the one that does not contain the latest variable, "lunar", as our first island.
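A sketch of the island-building loop. The helpers are hypothetical: most_correlated_triple picks the three most strongly correlated variables, best_candidate picks the variable most closely related to the current island, learn_best_ltm and ud_test are as in the previous sketch. For simplicity this sketch stops growing an island when the UD-test fails; full BI instead keeps the sub-island of m2 that excludes the newly added variable.

```python
def build_islands(data, variables, learn_best_ltm,
                  most_correlated_triple, best_candidate, delta=3.0):
    """Sketch of BI island building under the assumptions above."""
    islands, remaining = [], list(variables)
    while len(remaining) > 3:
        island = most_correlated_triple(data, remaining)
        candidates = [v for v in remaining if v not in island]
        while candidates:
            x = best_candidate(data, island, candidates)
            passed, m1, m2 = ud_test(data, island + [x], learn_best_ltm, delta)
            if not passed:
                break                     # island is complete
            island.append(x)
            candidates.remove(x)
        islands.append(island)
        remaining = [v for v in remaining if v not in island]
    if remaining:
        islands.append(remaining)         # lump the leftover variables
    return islands
```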

39 Result of BI on subset of 20 Newsgroups Dataset
(BIC Score: -116,451) Here is the result that BI obtained on the newsgroups data. The structure is quite interesting. At the lower right corner, we have words related to NASA; under Y15, we have words about cars; under Y11-14, we have the sports words. The words on the middle row are mostly related to computers. At the top, from left to right, we have words about medicine, food, religion and government. In comparison with the results obtained by BIN-G, here a latent variable can be connected to multiple observed variables. This makes the model more compact. A more detailed comparison will be given later.

40 Empirical Comparisons

41 Empirical Comparisons
Algorithms compared. Setting 1 algorithms: CLRG (Choi et al., JMLR 2011), CLNJ (Saitou & Nei, 1987). Setting 2 algorithms: EAST (Chen et al., AIJ 2011), BIN (Harmeling & Williams, PAMI 2011), BI (Liu et al., MLJ 2013). Data: synthetic data and real-world data. Measurements: running time and model quality. I will empirically compare five algorithms: EAST, BIN, BI, CLRG and CLNJ. I will use both synthetic data and real-world data, and both running time and model quality will be considered.

42 Generative Models 4-complete model (M4C):
Every latent node has 4 neighbors. All variables are binary. Parameter values are randomly generated such that the normalized MI between each pair of neighbors is between 0.05 and 0.75. I will first present results on synthetic data. To get synthetic data, we need generative models. This is one of the generative models. It is called the 4-complete model, or M4C, because every latent node has 4 neighbors. All the variables are binary, and the parameter values are generated randomly.

43 Generative Models M4CF: Obtained from M4C
More variables are added such that each latent node has 3 observed neighbors, making it a flat model. Other models and their total numbers of observed variables: here is another generative model. It is obtained from M4C by adding more observed variables so that each latent node has 3 observed neighbors; it is hence a flat model. The code name for this model is M4CF, where F stands for flat. A number of other generative models are considered. This table shows the number of observed variables in each of them: 36 in M4C, 51 in M4CF, 90 in M5C, 104 in M5CF, 252 in M7C, and 300 in M7CF.

44 Synthetic Data and Evaluation Criteria
Training: 5,000; testing: 5,000; no information on latent variables. Evaluation criterion (distribution): m0 is the generative model, m is the learned model. Empirical KL divergence on the testing data: KL(m0, m) = (1/N) Σ_d log P(d|m0) - (1/N) Σ_d log P(d|m), where the sum runs over the N test cases. The smaller the better. The second term is the hold-out likelihood of m: the larger the better. From each generative model, a training set of 5,000 observations and a testing set of 5,000 observations are sampled. The data contain information about the observed variables, but not the latent variables. Each algorithm is run on the training set, and the resulting model m is compared with the generative model m0 using the empirical KL divergence on the testing data, given by the formula above. The first term is the hold-out likelihood of the generative model; the second term is the hold-out likelihood of m. Hold-out likelihood measures how well a model predicts unseen data. The smaller the KL value, the better the model m.
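The criterion is straightforward to compute once both models have been scored on the test set; a minimal sketch, taking per-instance log-likelihoods as input:

```python
import numpy as np

def empirical_kl(loglik_m0, loglik_m):
    """Empirical KL divergence on a test set, from per-instance
    log-likelihoods under the generative model m0 and the learned model m."""
    return float(np.mean(loglik_m0) - np.mean(loglik_m))
```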

45 Evaluation Criteria: Structure
Example: for the two models on the right. m0: Y2-Y1: X1X2X3 | X4X5X6X7; Y1-Y3: X1X2X3X4 | X5X6X7; X1-Y2: X1 | X2X3X4X5X6X7; … m: Y2-Y1: X1X2 | X3X4X5X6X7; Y1-Y3: X1X2X3X4 | X5X6X7; … dRF = (1 + 1)/2 = 1. Empirical KL compares models in terms of the distributions they represent. It is also interesting to compare model structures, that is, to see how similar the learned structure is to the generative structure. For this purpose, we use the Robinson-Foulds (RF) distance, which measures how much two tree structures differ: dRF(m, m0) = ( |C(m) \ C(m0)| + |C(m0) \ C(m)| ) / 2, where C(m) is the set of bipartitions of the observed variables given by the edges of m. As an example, consider the model m0 on the right. The edge Y2-Y1 gives one bipartition; the edge Y1-Y3 gives another; the edge X1-Y2 gives another; and so on. There are 9 edges in total, and hence 9 bipartitions. The two models share most bipartitions, such as those given by the edges Y1-Y3 and X1-Y2. Model m has only one bipartition not shared by m0, so the first term in the numerator is 1. Similarly, m0 has one bipartition not shared by m. So the RF distance between the two structures is (1 + 1)/2 = 1. The RF distance is not defined for forests.
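A minimal sketch of the distance, given the bipartition sets; the two example bipartition sets below are a reduced, made-up version of the slide's example that still yields dRF = 1:

```python
def rf_distance(parts_m, parts_m0):
    """Robinson-Foulds distance between two trees, given the sets of
    bipartitions of the observed variables induced by their edges."""
    return (len(parts_m - parts_m0) + len(parts_m0 - parts_m)) / 2

def split(a, b):
    # Each bipartition is a frozenset of two frozensets of observed variables.
    return frozenset([frozenset(a.split()), frozenset(b.split())])

parts_m0 = {split("X1 X2 X3", "X4 X5 X6 X7"),
            split("X1 X2 X3 X4", "X5 X6 X7")}
parts_m = {split("X1 X2", "X3 X4 X5 X6 X7"),
           split("X1 X2 X3 X4", "X5 X6 X7")}
print(rf_distance(parts_m, parts_m0))   # (1 + 1) / 2 = 1.0
```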

46 Running Times (Seconds)
Now, let me present the empirical results. These are the running time statistics. We see that EAST is by far the slowest; it is unable to handle data sets with more than 100 observed variables. CLRG is the fastest, followed by CLNJ, BIN and BI. The running times of those four algorithms increase at more or less the same rate as the number of observed variables increases. EAST was too slow on data sets with more than 100 attributes. CLRG was the fastest, followed by CLNJ, BIN and BI.

47 Model Quality Flat generative models Non-flat generative models
EAST found the best models on data with dozens of attributes; BI found the best models on data with hundreds of attributes. BIN is the worst (no RF values because it produces forests, not trees). In terms of model quality, EAST is the best on the data sets it can handle (shown with red rings). On all the other data sets, BI found the best models, in terms of both the empirical KL divergence from the generative model and the RF distance. The differences are large on M5CF, M7CF, and M7C (shown with purple rings). Among the other three algorithms, BIN is the worst, and there are no RF values for BIN because it produces forests, not trees; RF is not defined for forests.

48 When latent variables have different cardinalities …
Make latent variables have different cardinalities in the generative models: 3 for those at levels 1 and 3, and 2 for those at level 2. All algorithms perform worse than before. EAST and BI still found the best models. CLRG and CLNJ are especially bad on M7CF1; they assume all latent variables have equal cardinality. In the generative models considered so far, all latent variables have 2 possible states. What if different latent variables have different cardinalities? In this case, all algorithms perform worse than before. CLRG and CLNJ are especially bad on M7CF1, probably because of their assumption that all latent variables have equal cardinality.

49 Real-World Data Sets Data Evaluation criteria:
BIC score on the training set: a measure of model fit. Log-likelihood on the testing set (held-out likelihood): measures how well the learned model predicts unseen data. Next, let us see results on real-world data sets. Four data sets are used: Coil, Alarm, News-100, and WebKB. They have between 42 and 336 observed variables, and the sample sizes range from a few hundred to thousands. For evaluation, we use both the BIC score on the training set and the log-likelihood on the testing set. The BIC score on the training set measures how well a model fits the training data, and the likelihood on the testing set measures how well it predicts unseen data.

50 Running Times (seconds)
Here are the running time statistics. CLNJ and CLRG are not applicable to Coil-42 and Alarm because different attributes have different cardinalities. EAST did not finish on News-100 and WebKB. We see that CLRG is the fastest, followed by CLNJ, BIN and BI. This is consistent with the results on synthetic data. CLNJ and CLRG are not applicable to Coil-42 and Alarm because different attributes have different cardinalities. EAST did not finish on News-100 and WebKB within 60 hours. CLRG was the fastest, followed by CLNJ, BIN and BI.

51 Model Quality EAST and BI found best models. BIN found the worst.
Here are the model quality statistics. On Coil-42 and Alarm, EAST found the best models (shown with red rings). On News-100 and WebKB, BI found the best models. In all cases, BIN performed the worst. The structures of the models for News-100 were shown earlier. We saw that BI introduced far fewer latent variables than the other algorithms, so the models it obtains are more compact than those of the other algorithms. EAST and BI found the best models; BIN found the worst.

52 Summary of Empirical Comparisons
Setting 2 algorithms (EAST and BI) found better models than Setting 1 algorithms (CLRG and CLNJ) because the latter: need the assumption that data are generated from tree models to work correctly, but the assumption is seldom true; and produce models with observed variables at internal nodes, which imply unrealistic independence assumptions. To summarize: Setting 2 algorithms found better models for these reasons.

53 Outline Part I: Concept and Algorithms Theoretical issues
Basic learning algorithms Scaling up BI Learning hierarchical models Part II: Applications Hierarchical topic detection Item recommendation with implicit feedback Others

54 BI does not Scale Up BI does not scale up well because:
It examines a large number of intermediate models during structure learning, and the parameters of those models are estimated using EM, which is slow. To scale up, we need a fast parameter estimation method for intermediate models. It does not need to be very good in terms of parameter quality.

55 Method of Moments

Idea: relate model parameters to marginal distributions of small sets of observed variables (Pearson 1894; Anandkumar et al. COLT 2012). Theorem (Zhang et al. PGM 2014): P_A|Y diag(P_b|Y) P_A|Y^-1 = P_AbC P_AC^-1, where P_AC is the matrix for P(A, C), P_A|Y is the matrix for P(A|Y), P_AbC is the matrix for P(A, B=b, C), and P_b|Y is the vector for P(B=b|Y). To estimate P(B|Y): obtain the empirical distributions P^ABC and P^AC from data (the moments), and solve the equation (by finding the eigenvalues of the matrix on the RHS) to get P_b|Y. Only 3 observed variables are involved. The basic idea of the method of moments is to relate model parameters to marginal distributions of a small number of observed variables, and to estimate the parameters using this relationship and the empirical marginals. In particular, for the model shown here, we have the equation above. Here P_AC is the matrix representation of the joint distribution of A and C; P_A|Y is the matrix representation of the conditional distribution of A given Y; P_AbC stands for the joint distribution of A, B and C with the value of B fixed at a particular value b, so it too can be represented as a matrix. Finally, P_b|Y is the conditional distribution of B taking value b given Y; it is a vector, and diag(P_b|Y) is the diagonal matrix with this vector on the diagonal and zeros elsewhere. The equation can be used to estimate P(B|Y) as follows: first, obtain the empirical distributions of (A, B, C) and (A, C) and form the matrix on the right-hand side; then find its eigenvalues; they are the entries of the vector P_b|Y. Note that this method involves only 3 observed variables. If you are familiar with MoM, please note that the setting here is different from the one you may be used to. Here we are talking about three different variables; in the more common setting, each variable is a vector and one works with the outer product of three copies of the same vector. But the essential ingredients are the same: invert a joint of two variables, combine it with a joint of three, and find eigenvalues/eigenvectors.
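A numeric check of the relation on a made-up naive-Bayes fragment Y -> A, B, C with binary variables (the parameters below are illustrative, not from the slides):

```python
import numpy as np

p_y = np.array([0.4, 0.6])
p_a_y = np.array([[0.9, 0.2],            # P(A | Y): rows A, columns Y
                  [0.1, 0.8]])
p_b_y = np.array([[0.7, 0.3],            # P(B | Y)
                  [0.3, 0.7]])
p_c_y = np.array([[0.8, 0.1],            # P(C | Y)
                  [0.2, 0.9]])

b = 0
p_ac = p_a_y @ np.diag(p_y) @ p_c_y.T                   # P(A, C)
p_abc = p_a_y @ np.diag(p_b_y[b] * p_y) @ p_c_y.T       # P(A, B=b, C)

# Eigenvalues of P_AbC * P_AC^{-1} are the entries of P(B=b | Y).
rhs = p_abc @ np.linalg.inv(p_ac)
print(np.sort(np.linalg.eigvals(rhs).real))   # ~ [0.3, 0.7]
print(np.sort(p_b_y[b]))                      # [0.3, 0.7]
```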

56 Progressive Parameter Estimation
A multi-step process, with a small number of parameters estimated at each step. P(B|Y) in the red part, as described above. P(A|Y) in the red part (swap the roles of the variables). P(C|Y) in the red part (swap the roles of the variables). P(Y): diag(P_Y) = P_A|Y^-1 P_AC (P_C|Y^-1)^T. P(D|Y), P(E|Y) in the blue part. P(Z|Y): estimate P(E|Y) in the purple part and P(E|Z) in the green part, then P_Z|Y = P_E|Z^-1 P_E|Y. Using the method, we can estimate all the parameters in stages. First, we focus on the part of the model inside the red enclosure. Here, we can first estimate P(B|Y) as just described. Then, by swapping the roles of the variables, we can also estimate P(A|Y) and P(C|Y). After that, we can calculate the marginal distribution P(Y) from P_A|Y, P_C|Y and P_AC. Next, we shift our attention to the part of the model inside the blue enclosure. There, we can estimate P(D|Y) and P(E|Y). We call this way of estimating the parameters in multiple steps progressive parameter estimation. It can also be applied to more complex models such as the one shown at the bottom.

57 Method of Moments

Works if: the model structure matches the "true model" behind the data, and the sample size is large enough, so that P^ABC and P^AC are close to P_ABC and P_AC. Breaks down if: the data are not from an LTM, or not from the LTM being estimated, or the sample size is not large enough. In such cases (almost always the case), it produces poor estimates and can even give negative values for probabilities. An alternative that avoids the problem: progressive EM. The method requires two conditions to work. First, we are talking about parameter estimation, so there is a fixed model structure; that model must match the true model behind the data. Second, the sample size must be large enough. Only under these two conditions are the empirical marginals good approximations of the true marginals, so that we get high-quality estimates. However, the first condition is almost always violated: our data might not be from a latent tree model at all, or might not be from the latent tree model being estimated. Consequently, the method of moments often produces poor estimates and can even give negative values for probabilities. To overcome this difficulty, we propose progressive EM.

58 Progressive EM Idea: run EM multiple times, each time on a small part of the model (Chen et al. AAAI 2016). Example 1: estimate P(Y), P(A|Y), P(B|Y), P(C|Y) by running EM on the red part; then estimate P(D|Y) and P(E|Y) by running EM on the blue part with P(Y) and P(C|Y) fixed. Example 2: estimate P(Y), P(A|Y), P(B|Y), P(C|Y) by running EM on the purple part; then estimate P(Z|Y), P(C|Z), P(E|C) in the green part with the other parameters fixed. Quality: never gives negative probabilities. Efficiency: the data set consists of only 8 or 16 distinct cases when projected onto 3 or 4 binary variables! The idea is to run EM multiple times, each time on a small part of the model. For example, we can run EM on the part inside the red enclosure and estimate P(Y), P(A|Y), P(B|Y), and P(C|Y). Then we run EM again on the part inside the blue enclosure to estimate P(D|Y) and P(E|Y); here, we can keep P(C|Y) fixed. For the more complex model at the bottom, we can first run EM inside the purple rectangle, and then run EM inside the green rectangle. In comparison with the method of moments, progressive EM gives better estimates: it never gives negative probability values. Equally importantly, it is very efficient, because all we do is consider sub-models with 3 or 4 observed variables. No matter how large the data set is, when projected onto 3 or 4 binary variables, it yields only 8 or 16 distinct cases. This is why the method scales up.
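A minimal sketch of EM on one small fragment Y -> A, B, C with binary variables, corresponding to a single progressive-EM step; the full method would keep previously estimated parameters fixed when moving to the next fragment.

```python
import numpy as np

def em_fragment(counts, num_iters=200, seed=0):
    """EM on a fragment Y -> A, B, C (all binary).  counts: array of shape
    (2, 2, 2) with the number of times each configuration (a, b, c) occurs;
    only 8 distinct cases, which is why this step is cheap regardless of the
    size of the data set."""
    rng = np.random.default_rng(seed)
    p_y = rng.dirichlet([1.0, 1.0])
    # p_obs[v][x, y] = P(X_v = x | Y = y); each column sums to one.
    p_obs = [rng.dirichlet([1.0, 1.0], size=2).T for _ in range(3)]
    for _ in range(num_iters):
        # E-step: posterior P(Y | a, b, c) for every configuration.
        post = np.zeros((2, 2, 2, 2))                 # indexed by (a, b, c, y)
        for a in range(2):
            for b in range(2):
                for c in range(2):
                    joint = p_y * p_obs[0][a] * p_obs[1][b] * p_obs[2][c]
                    post[a, b, c] = joint / joint.sum()
        # M-step: expected counts -> updated parameters.
        w = counts[..., None] * post
        n_y = w.sum(axis=(0, 1, 2))
        p_y = n_y / n_y.sum()
        p_obs[0] = w.sum(axis=(1, 2)) / n_y           # P(A | Y)
        p_obs[1] = w.sum(axis=(0, 2)) / n_y           # P(B | Y)
        p_obs[2] = w.sum(axis=(0, 1)) / n_y           # P(C | Y)
    return p_y, p_obs
```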

59 Progressive EM in Island Building
Model with parameters previously estimated. Add "lunar": P(lunar|Y) by EM in the red part; P(lunar|Z), P(moon|Z), P(Z|Y) by EM in the blue part. Only 3 or 4 observed variables are involved. It is very easy to incorporate progressive EM into the island-building process. Suppose we have already built the island at the top and all its parameters have been estimated. Next, we consider adding a new variable, "lunar", to the island. To do that, we consider these two models. In the first model, we need to estimate P(lunar|Y); this can be done by running EM on the red part with all other parameters fixed. In the second model, we need P(lunar|Z), P(moon|Z), and P(Z|Y); we can estimate them in the blue part with all other parameters fixed. Note that, in both cases, we have no more than 4 observed variables to deal with.

60 Outline Part I: Concept and Algorithms Theoretical issues
Basic learning algorithms Scaling up Learning hierarchical models Part II: Applications Hierarchical topic detection Item recommendation with implicit feedback Others

61 Hierarchical Latent Tree Analysis (HLTA)
Learn a flat LTM using BI: each latent variable is connected to observed variables. Turn the top latent variables into observed variables. Repeat step 1. Stack the model from step 2 on top of the model from step 1. Repeat steps 2-3 until termination. Global parameter optimization (stepwise EM). Phase 1: model construction. Here is the top-level control of HLTA. It has 5 steps (see the sketch below). The first step is to learn a flat latent tree model; here, a flat model is one where each latent variable is connected to some observed variables. In a hierarchical model, by contrast, only the latent variables at the lowest level are connected to observed variables. This is the key step, and I will say more about it later. The second step is to turn the latent variables in the flat model into observed variables through data completion, and then repeat step 1 on those variables. This way we get another flat model, whose observed variables are the latent variables of the first model. Next, we stack the second model on top of the first model, and thereby get a model with two layers of latent variables. Then, we repeat steps 2-3 until a termination condition is met; the condition is usually an upper bound on the number of latent variables at the top. Finally, we optimize the parameters of the global model. Conceptually, the algorithm consists of two phases. The first phase is model construction, and it consists of the first 4 steps. The second phase is parameter estimation, and it consists of step 5. Phase 2: parameter estimation.
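A top-level sketch of this control flow; all helpers (learn_flat_ltm, complete_data, stack, num_top_latents, stepwise_em) are hypothetical stand-ins for the steps listed on the slide.

```python
def hlta(data, learn_flat_ltm, complete_data, stack, num_top_latents,
         stepwise_em, max_top=5):
    """Top-level sketch of HLTA under the assumptions stated above."""
    model = learn_flat_ltm(data)                        # Step 1: flat model (e.g., via BI)
    layer_data = data
    while num_top_latents(model) > max_top:             # Step 4: termination test
        layer_data = complete_data(model, layer_data)   # Step 2: top latents -> "observed" data
        upper = learn_flat_ltm(layer_data)              #         flat model over them
        model = stack(upper, model)                     # Step 3: stack the new layer on top
    return stepwise_em(model, data)                     # Step 5: global parameter optimization
```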

62 Estimating Parameters of the Final Model
EM is too slow. Use stepwise EM (Liang et al. ACL 2009): similar to stochastic gradient descent. Divide the data into minibatches. In each iteration, update the parameters after processing each minibatch, so there are multiple updates in one iteration.
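A sketch of the update scheme. The helpers e_step (expected sufficient statistics of a minibatch) and to_params (normalize statistics into parameters) are hypothetical, the statistics are assumed to support array arithmetic, and the decaying step size (k + 2) ** (-alpha) is one common choice rather than necessarily the one used in the cited work.

```python
def stepwise_em(data, init_stats, e_step, to_params,
                num_epochs=5, batch_size=1000, alpha=0.75):
    """Sketch of stepwise EM under the assumptions stated above."""
    mu = init_stats                       # running sufficient statistics
    params = to_params(mu)
    k = 0
    for _ in range(num_epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            s = e_step(params, batch)      # expected stats for this minibatch
            eta = (k + 2) ** (-alpha)      # decaying step size
            mu = (1 - eta) * mu + eta * s  # interpolate old and new statistics
            params = to_params(mu)         # parameters updated after every minibatch
            k += 1
    return params
```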

63 Current Capabilities

64 Summary of Algorithms Setting 1: CLRG (Choi et al., 2011; Huang et al., 2015). Assume that data are generated from an unknown LTM. Investigate properties of LTMs and use them for learning, e.g., recover model structure from the tree additivity of information distance. Theoretical guarantees to recover the generative model under conditions. Setting 2: EAST, BI (Chen et al., 2012; Liu et al., 2013). Do not assume that data are generated from an LTM. Find the model with the highest BIC score via search or heuristics. It does not make sense to talk about theoretical guarantees. Obtains better models than Setting 1 because the assumption is usually untrue. Setting 3: HLTA (Liu et al., 2014; Chen et al., 2016). Consider usefulness in addition to model fit. Hierarchy of latent variables.

