Latent Structure Models and Statistical Foundation for TCM
Nevin L. Zhang
Department of Computer Science & Engineering
The Hong Kong University of Science & Technology

INCOB 2007/ Slide 2
Outline
- Hierarchical Latent Class (HLC) Models
- Motivation
- Empirical Results on TCM Data
- Empirical Results on Other Data
- Conclusions

Hierarchical Latent Class (HLC) Models
- Bayesian networks with
  - Rooted tree structure
  - Discrete random variables
  - Leaves observed (manifest variables)
  - Internal nodes latent (latent variables)
- Later renamed latent tree models
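
Because an HLC model is a Bayesian network over a rooted tree, its joint distribution factorizes along the tree edges, and the distribution over the observed leaves is obtained by summing out the latent nodes. The notation below is assumed here for illustration and does not come from the slides: X_1, ..., X_m are the latent variables, Y_1, ..., Y_n the manifest variables, R the root, and pa(V) the parent of node V.

```latex
P(X_1,\dots,X_m,\,Y_1,\dots,Y_n) \;=\; P(R)\prod_{V \neq R} P\bigl(V \mid \mathrm{pa}(V)\bigr),
\qquad
P(Y_1,\dots,Y_n) \;=\; \sum_{X_1,\dots,X_m} P(X_1,\dots,X_m,\,Y_1,\dots,Y_n).
```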

Example
- Manifest variables
  - Math Grade, Science Grade, Literature Grade, History Grade
- Latent variables
  - Analytic Skill, Literal Skill, Intelligence
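
A minimal sketch of how this example could be encoded and checked against the HLC definition (single root, observed leaves, latent internal nodes). The edges are an assumption, since the slide's figure is not part of this transcript: Intelligence is taken as the root, with the two skills as its children and the grades as leaves. The names `parent`, `latent`, and `is_latent_tree` are hypothetical.

```python
# Assumed structure (not stated in the transcript): child -> parent, root -> None.
parent = {
    "Intelligence": None,
    "Analytic Skill": "Intelligence",
    "Literal Skill": "Intelligence",
    "Math Grade": "Analytic Skill",
    "Science Grade": "Analytic Skill",
    "Literature Grade": "Literal Skill",
    "History Grade": "Literal Skill",
}
latent = {"Intelligence", "Analytic Skill", "Literal Skill"}


def is_latent_tree(parent, latent):
    """Check: one root, no cycles, leaves observed, internal nodes latent."""
    # Every referenced parent must itself be a node of the model.
    if any(p is not None and p not in parent for p in parent.values()):
        return False
    # Exactly one root.
    if sum(1 for p in parent.values() if p is None) != 1:
        return False
    # Following parent links from any node must reach the root without a cycle.
    for node in parent:
        seen = set()
        while node is not None:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    # A node is internal exactly when some other node names it as parent.
    internal = {p for p in parent.values() if p is not None}
    return all((node in internal) == (node in latent) for node in parent)


if __name__ == "__main__":
    print(is_latent_tree(parent, latent))
```

Storing only the parent of each node is enough here because every node in a rooted tree has at most one parent.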

Learning Latent Tree Models: The Problem

Y1  Y2  …  Y6  Y7
1   0   …  1   1
1   1   …  0   0
0   1   …  0   1
…   …   …  …   …

Determine
- Number of latent variables
- Cardinality of each latent variable
- Model structure
- Conditional probability distributions
Two perspectives
- Latent structure discovery
- Multidimensional clustering
  - Generalizing latent class analysis

Learning Latent Tree Models: The Algorithms
- Model selection
  - Several scores examined: BIC, BICe, CS, AIC, holdout likelihood
  - BIC: best choice for the time being
- Model optimization
  - Double hill climbing (DHC), 2002: 7 manifest variables
  - Single hill climbing (SHC), 2004: 12 manifest variables
  - Heuristic SHC (HSHC), 2004: 50 manifest variables
  - EAST, 2007: as efficient as HSHC, and finds better models
  - EAST + divide-and-conquer: 100+ manifest variables

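
To make the model selection criterion concrete, the sketch below computes the standard BIC score, BIC = log-likelihood - (d / 2) * log N, where d is the number of free parameters and N the sample size. For a tree-structured network of discrete variables, the root contributes |R| - 1 free parameters and every other node V contributes (|V| - 1) * |pa(V)|. The tree, the cardinalities, the log-likelihood value, and the sample size are all hypothetical placeholders, not results from the talk, and the search procedures (DHC, SHC, HSHC, EAST) are not implemented here.

```python
import math

# Hypothetical tree (child -> parent) and state counts, for illustration only.
parent = {
    "Intelligence": None,
    "Analytic Skill": "Intelligence",
    "Literal Skill": "Intelligence",
    "Math Grade": "Analytic Skill",
    "Science Grade": "Analytic Skill",
    "Literature Grade": "Literal Skill",
    "History Grade": "Literal Skill",
}
card = {name: 2 for name in parent}  # assume every variable is binary


def free_parameters(parent, card):
    """Count independent probabilities in the tree's conditional tables."""
    d = 0
    for node, par in parent.items():
        if par is None:
            d += card[node] - 1                 # marginal distribution of the root
        else:
            d += (card[node] - 1) * card[par]   # one distribution per parent state
    return d


def bic(loglik, d, n):
    """BIC score (higher is better): maximized log-likelihood minus penalty."""
    return loglik - 0.5 * d * math.log(n)


if __name__ == "__main__":
    d = free_parameters(parent, card)
    # -2000.0 and 1000 are made-up values standing in for a fitted model.
    print(d, bic(-2000.0, d, 1000))
```

A search procedure would compare this score across candidate models; raising the cardinality of a latent variable increases d, so the likelihood gain has to outweigh the larger penalty.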
Illustration of the search process

Motivation
- Latent structure discovery and multidimensional clustering are potentially useful in many applications.
- Our work is driven by research on traditional Chinese medicine (TCM).

What is there to be done?
- TCM statement:
  - Yang deficiency (阳虚): intolerance to cold (畏寒), cold limbs (肢冷), cold lumbus and back (腰背冷), and so on.
  - Regarded by many as unscientific, even groundless.
- Two aspects to the meaning
  1. Claim: There exists a class of patients who characteristically have the cold symptoms; that is, the cold symptoms co-occur in a group of people.
  2. Explanation offered: The symptoms are due to deficiency of Yang, which fails to warm the body.
- What to do?
  - Previous work focused on 2.
  - New idea: do data analysis for 1.

Objectivity of the Claimed Pattern
- TCM claim: there exists a class of patients in whom symptoms such as ‘intolerance to cold’, ‘cold limbs’, ‘cold lumbus and back’, and so on co-occur.
- How to prove or disprove that such claimed TCM classes exist in the world?
  - Systematically collect data about symptoms of patients.
  - Perform cluster analysis to obtain natural clusters of patients.
  - If the natural clusters correspond to the TCM classes, then YES:
    1. Existence of TCM classes validated
    2. Descriptions of TCM classes refined and systematically expanded
    3. A statistical foundation for TCM established

Why Latent Tree Models?
- TCM uses multiple interrelated latent concepts to explain co-occurrence of symptoms
  - Kidney Yang deficiency (肾阳虚), Kidney Yin deficiency (肾阴虚), Kidney Essence insufficiency (肾精亏虚), …
- Need latent structure models
  - With multiple interrelated latent variables
- Latent tree models are the simplest such models

Empirical Results
- Can we find the claimed TCM classes using latent tree models?
  - We collected a data set about kidney deficiency (肾虚)
  - 35 symptom variables, 2600 records

Result of Data Analysis
- Y0-Y34: manifest variables from data
- X0-X13: latent variables introduced by data analysis
- The structure is interesting and supports TCM’s theories about various symptoms.

Latent Clusters
- X1:
  - 5 states: s0, s1, s2, s3, s4
  - Samples grouped into 5 clusters
- Cluster X1=s4: {sample | P(X1=s4|sample) > 0.95}
  - Cold symptoms co-occur in these samples
- Class implicitly claimed by TCM found!
- Description of class refined
  - By math vs. by words
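
The cluster definition {sample | P(X1=s4|sample) > 0.95} can be sketched as a posterior computation followed by a threshold. As a simplifying assumption, the sketch uses a single latent variable with conditionally independent binary symptoms (a latent class model); in the full latent tree the posterior would come from inference over the whole tree. All probabilities and the function names `posterior` and `hard_cluster` are hypothetical.

```python
def posterior(prior, cond, sample):
    """Return P(X = s | sample) for every latent state s, by Bayes' rule.

    prior[s]   = P(X = s)
    cond[s][i] = P(Y_i = 1 | X = s)
    sample[i]  = observed value of Y_i (0 or 1)
    """
    joint = []
    for s, p in enumerate(prior):
        for i, y in enumerate(sample):
            p *= cond[s][i] if y == 1 else 1.0 - cond[s][i]
        joint.append(p)
    total = sum(joint)
    return [j / total for j in joint]


def hard_cluster(prior, cond, samples, state, threshold=0.95):
    """Keep the samples whose posterior for `state` exceeds the threshold."""
    return [x for x in samples if posterior(prior, cond, x)[state] > threshold]


if __name__ == "__main__":
    # Made-up parameters: state 1 plays the role of the "cold symptoms" class.
    prior = [0.7, 0.3]
    cond = [[0.1, 0.1, 0.1],   # symptom probabilities in state 0
            [0.9, 0.9, 0.9]]   # symptom probabilities in state 1
    samples = [(1, 1, 1), (1, 1, 0), (0, 0, 0)]
    print(hard_cluster(prior, cond, samples, state=1))
```

Samples whose posterior stays below the threshold for every state are left unassigned, which is why the cluster is described by the high-confidence cases only.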

Other TCM Data Sets
- From Beijing U of TCM, 973 project
  - Depression
  - Hepatitis B
  - Chronic Renal Failure
  - Other data to be analyzed
- China Academy of TCM
  - Subhealth
  - Type 2 Diabetes
  - More analysis to come under a new 973 project
- In all cases, claimed TCM classes
  - Validated
  - Quantified and refined

Results on a Marketing Data Set
- CoIL Challenge 2000
- Customer records of a Dutch insurance company
- 42 manifest variables, 5822 records

Results on a Danish Beer Data Set
- Market research
- 783 samples
- States of manifest variables
  1. Never heard of
  2. Heard of but not tasted
  3. Tasted but don’t drink regularly
  4. Drink regularly

Result on a Survey Data Set
- Survey on corruption
- 31 manifest variables, records

Conclusions
- Latent tree models, and latent structure models in general,
  - Offer a framework for latent structure discovery and multidimensional clustering.
  - Can play a fundamental role in modernizing TCM.
  - Can be useful in many other areas, such as marketing and survey studies.
- We have only scratched the surface. A lot of interesting research work is yet to be done.