Lecture 15: Hierarchical Latent Class Models. Based on N. L. Zhang (2002). Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, to appear. COMP 538: Introduction to Bayesian Networks.

2 Outline
- Motivation
  - Application of LCA in medicine
  - Model-based clustering and TCM diagnosis
  - Need for more general models
- Theoretical issues
- Learning algorithm
- Empirical results
- Related work

3 Motivations/LCA in Medicine
In medical diagnosis, sometimes a gold standard exists.
- Example: lung cancer
  - Symptoms: persistent cough, hemoptysis (coughing up blood), constant chest pain, shortness of breath, fatigue, etc.
  - Information for diagnosis: symptoms, medical history, smoking history, X-ray, sputum.
  - Gold standard: biopsy, the removal of a small sample of tissue for examination under a microscope by a pathologist.

4 Motivations/LCA in Medicine
Sometimes a gold standard does not exist.
- Example: rheumatoid arthritis (RA)
  - Symptoms: back pain, neck pain, joint pain, joint swelling, morning joint stiffness, etc.
  - Information for diagnosis: symptoms, medical history, physical exam, lab tests including a test for rheumatoid factor. (Rheumatoid factor is an antibody found in the blood of about 80 percent of adults with RA.)
  - No gold standard: none of the symptoms or their combinations are clear-cut indicators of RA, and the presence or absence of rheumatoid factor alone does not establish whether one has RA.

5 Motivations/LCA in Medicine
Questions:
- How many diagnostic categories should there be?
- What rules should be used when making a diagnosis?
Note: these questions cannot be answered using regression (supervised learning), because the true "disease type" is never directly observed; it is latent.
Ideas:
- Each "disease type" should correspond to a cluster of people.
- People in different clusters demonstrate different symptom patterns (otherwise diagnosis is hopeless).
Possible solution: perform cluster analysis of symptom data to reveal the patterns.

6 Motivations/LCA in Medicine
Latent class analysis (LCA): cluster analysis based on the latent class (LC) model.
- Observed variables Y_j: symptoms.
- Latent variable X: "disease type".
- Assumption (local independence): the Y_j's are mutually independent given X.
Given: data on the Y_j's. Determine:
- the number of states for X,
- the prevalence P(X), and
- the class-specific probabilities P(Y_j | X).
[Figure: LC model with latent variable X and observed children Y_1, Y_2, ..., Y_p]
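To make the local-independence assumption concrete, here is a minimal sketch in Python. The prevalence and class-specific probabilities are made-up illustrative numbers, not estimates from any data set:

```python
import itertools
import numpy as np

# P(X): prevalence over three hypothetical "disease type" classes.
p_x = np.array([0.6, 0.3, 0.1])

# P(Y_j = 1 | X): class-specific probability of showing symptom j.
# Rows: latent classes; columns: four hypothetical symptoms.
p_y_given_x = np.array([
    [0.05, 0.10, 0.05, 0.02],  # class 0: "disease free"
    [0.80, 0.70, 0.20, 0.10],  # class 1
    [0.90, 0.85, 0.75, 0.60],  # class 2
])

def lc_marginal(y):
    """P(y) = sum_x P(x) * prod_j P(y_j | x), by local independence."""
    per_class = np.where(y, p_y_given_x, 1.0 - p_y_given_x)  # P(y_j | x)
    return float(p_x @ per_class.prod(axis=1))

# Sanity check: the marginal over all 2^4 symptom patterns sums to 1.
patterns = itertools.product([0, 1], repeat=4)
print(sum(lc_marginal(np.array(y)) for y in patterns))  # ~1.0
```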

7 Motivations/LCA in Medicine
LC analysis of the Hannover rheumatoid arthritis data. The class-specific probabilities suggest four clusters:
- Cluster 1: "disease free"
- Cluster 2: "back-pain type"
- Cluster 3: "joint type"
- Cluster 4: "severe type"

8 Model-Based Clustering and TCM Diagnosis
Diagnosis in traditional Chinese medicine (TCM). Example: kidney deficiency (肾虚).
- Symptoms: lassitude and pain in the loins (腰酸软而痛), tinnitus (耳鸣), dripping urine (小便余沥不尽), etc.
Similar to rheumatoid arthritis:
- Diagnosis is based on symptoms.
- No gold standard exists.

9 Model-Based Clustering and TCM Diagnosis
Current status:
- Researchers have been searching for laboratory indices that can serve as gold standards. All such efforts have failed.
- In practice, diagnosis is quite subjective and differs considerably between doctors.
- This hinders practice and prevents international recognition.

10 Model-Based Clustering and TCM Diagnosis
How can TCM diagnosis be given a scientific foundation? Model-based cluster analysis, and statistical methods more broadly, might be the answer:
- TCM diagnosis is based on experience (of contemporary practitioners and ancient doctors).
- Experiences are summaries of patient cases.
- Summarizing patient cases by human brains leads to subjectivity; summarizing patient cases by computer avoids it.

11 Need for More General Models
Preliminary analysis of TCM data using LCA could not find models that fit the data well.
Reason: latent class (LC) models are too simplistic.
- Local independence: observed variables are mutually independent given the latent variable.
Need: more realistic models.

12 Need for More General Models
Hierarchical latent class (HLC) models: tree-structured Bayesian networks where leaf nodes are observed and all other nodes are not. Manifest variables = observed variables.
- Maybe still too simplistic, but a good first step.
- More general than LC models.
- Nice computational properties.
Task: learn HLC models from data, i.e., learn latent structures from what we can observe.

13 Theoretical Issues
What latent structures can be learned from data?
An HLC model M is parsimonious if there does NOT exist another model M' that
- is marginally equivalent to M, i.e., P(manifest vars | M) = P(manifest vars | M'), and
- has fewer independent parameters than M.
Occam's razor prefers parsimonious models over non-parsimonious ones.

14 Theoretical Issues
Regular HLC models: an HLC model is regular if for any latent node Z with neighbors X1, X2, ..., Xk,

|Z| ≤ (∏_{i=1}^{k} |X_i|) / max_i |X_i|,

where the strict inequality holds when Z has only two neighbors.
- Irregular models are not parsimonious. (This gives an operational characterization of parsimony.)
- The set of all possible regular HLC models for a given set of manifest variables is finite. (Finite search space for the learning algorithm.)
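A small sketch of this regularity test, under the inequality as reconstructed above; the function name and interface are mine, not the paper's:

```python
from math import prod

def is_regular_latent_node(card_z, neighbor_cards):
    """Check |Z| <= prod_i |X_i| / max_i |X_i| (strict when k = 2).

    Since max(neighbor_cards) is one of the factors, the bound equals
    the integer product of the remaining neighbors' cardinalities.
    """
    bound = prod(neighbor_cards) // max(neighbor_cards)
    if len(neighbor_cards) == 2:
        return card_z < bound  # strict inequality for two neighbors
    return card_z <= bound

print(is_regular_latent_node(3, [2, 2, 2]))  # True: bound = 2 * 2 = 4
print(is_regular_latent_node(2, [2, 2]))     # False: needs |Z| < 2
```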

15 Theoretical Issues
Model equivalence: root walking.
- M1: the root walks to X2.
- M2: the root walks to X3.
Root walking leads to equivalent models.

16 Theoretical Issues
Unrooted HLC models:
- The root of an HLC model can walk to any latent node.
- Unrooted model: an HLC model with undirected edges.
- We can only learn unrooted models.
Question: which latent node should be the class node?
Answer: any of them, depending on the semantics and purpose of the clustering. Learn one model for multiple clusterings.

17 Theoretical Issues
Measure of model complexity:
- With no latent variables: the number of free parameters (standard dimension).
- With latent variables: the effective dimension instead.
For n binary manifest variables, P(Y1, Y2, ..., Yn) spans a (2^n − 1)-dimensional space S if there are no constraints. An HLC model imposes constraints on the joint, so it spans a subspace of S; the effective dimension of the model is the dimension of this subspace. It is HARD to compute.
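For contrast with the effective dimension, the standard dimension is easy to count: each node contributes (cardinality − 1) free parameters per configuration of its parent. A minimal sketch, on a hypothetical rooted structure of my own choosing:

```python
def standard_dimension(cards, parent):
    """cards: node -> cardinality; parent: node -> parent (root maps to None)."""
    dim = 0
    for node, card in cards.items():
        parent_card = cards[parent[node]] if parent[node] is not None else 1
        dim += (card - 1) * parent_card  # one CPT column per parent state
    return dim

# Hypothetical example: ternary root X1 with two ternary latent children,
# each with three binary leaves.
cards = {"X1": 3, "X2": 3, "X3": 3,
         "Y1": 2, "Y2": 2, "Y3": 2, "Y4": 2, "Y5": 2, "Y6": 2}
parent = {"X1": None, "X2": "X1", "X3": "X1",
          "Y1": "X2", "Y2": "X2", "Y3": "X2",
          "Y4": "X3", "Y5": "X3", "Y6": "X3"}
print(standard_dimension(cards, parent))  # 2 + 6 + 6 + 6*3 = 32
```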

18 Theoretical Issues
Reduction theorem for regular HLC models (Kocka and Zhang 2002):
D(M) = D(M1) + D(M2) − number of common parameters.
The problem thus reduces to computing the effective dimension of LC models, for which good approximations exist.

19 Theoretical Issues
Example: a model with standard dimension 110 but effective dimension 61.

20 Learning HLC Models
Given: i.i.d. samples generated by some regular HLC model.
Task: reconstruct the HLC model from the data.
Hill-climbing algorithm:
- Scoring metric: we experiment with AIC, BIC, CS (Cheeseman-Stutz), and holdout LS (sketched below; experiments with the effective dimension are yet to be run).
- Search space: the set of all possible regular HLC models for the given manifest variables. We structure the space into two levels according to two subtasks:
  1. Given a model structure, estimate the cardinalities of the latent variables.
  2. Find an optimal model structure.
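For concreteness, here is what the penalized-likelihood metrics look like; the maximized log likelihood and the model dimension are assumed to be supplied by the caller (e.g., from an EM run):

```python
import math

def aic_score(loglik, dim):
    """AIC(M | D) = loglik - dim(M); higher is better."""
    return loglik - dim

def bic_score(loglik, dim, n_samples):
    """BIC(M | D) = loglik - (dim(M) / 2) * log(N); higher is better."""
    return loglik - 0.5 * dim * math.log(n_samples)
```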

21 Learning HLC Models
Estimating the cardinalities of latent variables given a model structure (see the sketch below):
- Search space: all regular models with the given model structure.
- Hill climbing:
  - Start: all latent variables have minimum cardinality (usually 2).
  - Search operator: increase the cardinality of one latent variable by one.
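A sketch of this cardinality hill climb: start every latent variable at cardinality 2 and greedily increase one cardinality at a time while the score improves. `fit_and_score` (estimate parameters, e.g. by EM, then score the model) is a hypothetical helper, not from the paper:

```python
def search_cardinalities(structure, latent_vars, data, fit_and_score):
    cards = {z: 2 for z in latent_vars}  # start at minimum cardinality
    best = fit_and_score(structure, cards, data)
    improved = True
    while improved:
        improved = False
        for z in latent_vars:
            # Candidate: increase one latent variable's cardinality by one.
            candidate = dict(cards, **{z: cards[z] + 1})
            score = fit_and_score(structure, candidate, data)
            if score > best:
                cards, best, improved = candidate, score, True
    return cards, best
```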

22 Learning HLC Models
Finding optimal model structures:
- Search space: the set of all regular unrooted HLC model structures for the given manifest variables.
- Hill climbing:
  - Start: the unrooted LC model structure.
  - Search operators: node introduction, node elimination, neighbor relocation. One can go between any two model structures using these operators. (Node introduction is sketched below.)
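A sketch of the node-introduction operator on an adjacency-set representation of the unrooted structure; the representation and names are mine. The new latent node takes over two of an existing latent node's neighbors, which can capture local dependence between them:

```python
def introduce_node(adj, z, x1, x2, z_new):
    """Insert latent node z_new between z and its neighbors x1, x2."""
    for x in (x1, x2):
        adj[z].remove(x)        # detach x from z...
        adj[x].remove(z)
        adj[x].add(z_new)       # ...and attach it to the new node
    adj[z].add(z_new)
    adj[z_new] = {z, x1, x2}
    return adj

# Example: flat LC structure X-{Y1..Y4}; group Y1, Y2 under a new latent Z2.
adj = {"X": {"Y1", "Y2", "Y3", "Y4"},
       "Y1": {"X"}, "Y2": {"X"}, "Y3": {"X"}, "Y4": {"X"}}
introduce_node(adj, "X", "Y1", "Y2", "Z2")
print(adj["X"])   # {'Y3', 'Y4', 'Z2'} (set order may vary)
print(adj["Z2"])  # {'X', 'Y1', 'Y2'}
```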

23 Learning HLC Models
Motivations for the search operators:
- Node introduction (M1′ → M2′): deals with local dependence. Opposite: node elimination.
- Neighbor relocation (M2′ → M3′): the result of a tradeoff. Opposite: itself.
- Operators are not allowed to yield irregular model structures.

24 Empirical Results
Synthetic data:
- Generative model randomly parameterized; all latent variables have 3 states.
- Sample sizes: 5k, 10k, 50k, 100k.
Log scores on testing data (computed as sketched below):
- Close to those of the generative model.
- Do not vary much across scoring metrics.
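The held-out log score here is just the average log probability that a learned model assigns to test records. A minimal sketch, where `model_prob` (the model's marginal over the manifest variables) is a hypothetical helper:

```python
import math

def holdout_log_score(model_prob, test_records):
    """Mean log P(record | M) over the test set; closer to the
    generative model's own score is better."""
    return sum(math.log(model_prob(r)) for r in test_records) / len(test_records)
```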

25 Empirical Results
Learned structures: number of search steps to the true structure.

26 Empirical Results
Cardinality of latent variables: better results are obtained with more skewed parameters.

27 Empirical Results
Hannover rheumatoid arthritis data: 5 binary manifest variables (back pain, neck pain, joint swelling, ...), 7,162 records.
- Analysis by Kohlmann and Formann (1997): a 4-class LC model.
- Our algorithm: exactly the same model.
Coleman data: 4 binary manifest variables, 3,398 records.
- Analysis by Goodman (1974) and Hagenaars (1988): M1, M2.
- Our algorithm: M3.

28 Empirical Results
HIV data: 4 binary manifest variables, 428 records.
- Analysis by Uebersax (2000): [model figure not in transcript]
- Our algorithm: [model figure not in transcript]
House-building data: 4 binary manifest variables, 1,185 records.
- Analysis by Hagenaars (1988): M2, M3, M4.
- Our algorithm: a 4-class LC model, which fits the data poorly. A failure; the reason is a limitation of HLC models.

29 Related Work
Phylogenetic trees represent the relationships between a set of species.
Probabilistic model:
- Taxa are aligned; sites evolve i.i.d.
- Conditional probabilities: a character-evolution model.
- Parameters: edge lengths, representing time.
Restricted to one site, a phylogenetic tree is an HLC model where:
- the tree structure is binary and all variables share the same state space;
- the conditional probabilities are parameterized by edge lengths;
- the model is the same for different sites.
[Figure: phylogenetic tree with sequences such as AGGGCAT, TAGCCCA, TAGACTT at the leaves and ancestral sequences at internal nodes]

30 Related Work
Tree reconstruction. Given: the current taxa. Find: the tree topology and edge lengths.
Methods:
- Hill climbing:
  - stepwise addition of taxa;
  - star decomposition, similar to node introduction in HLC models;
  - branch swapping, similar to neighbor relocation in HLC models.
- Structural EM (Friedman et al. 2002): uses the fact that all variables have the same state space.
- Neighbor joining (Saitou & Nei, 1987): uses the facts that the parameters are edge lengths and that distances are additive.

31 Related Work
- Connolly (1993): a heuristic method for constructing HLC models. Mutual information is used to group variables, and one latent variable is introduced for each group. Cardinalities of latent variables are determined using conceptual clustering.
- Martin and VanLehn (1994): a heuristic method for learning two-level Bayesian networks where the top level is latent.
- Elidan et al. (2001): learning latent variables for general Bayesian networks. Aim: simplification. Idea: structural signatures.
- Model-based hierarchical clustering (Hansen et al. 1991): hierarchically structures the state space of ONE cluster variable.

32 Related Work
Diagnostics for local dependence in LC models:
- Hagenaars (1988): standardized residuals.
- Espeland & Handelmann (1988): likelihood ratio statistic.
- Garrett & Zeger (2000): log odds ratios.
Modeling local dependence in LC models: joint variables (M2), multiple indicators (M3), loglinear models (M4).