
Slide 1: L11: Uses of Bayesian Networks
Nevin L. Zhang, Department of Computer Science & Engineering, The Hong Kong University of Science & Technology
http://www.cse.ust.hk/~lzhang/

Slide 2: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Slide 3: Traditional Uses
- Probabilistic expert systems
  - Diagnosis
  - Prediction
- Example: a BN for diagnosing "blue baby" syndrome over the phone at a London hospital
  - Comparable to a specialist; better than others

Slide 4: Traditional Uses
- A language for describing probabilistic models in science and engineering
- Example: BN for turbo codes

Slide 5: Traditional Uses
- A language for describing probabilistic models in science and engineering
- Example: a BN from bioinformatics

Slide 6: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Slide 7: BN for Structure Discovery
- Given: a data set D on variables X1, X2, ..., Xn.
- Discover dependence, independence, and even causal relationships among the variables.
- Example: evolution trees.

Slide 8: Phylogenetic Trees
- Assumption:
  - All organisms on Earth have a common ancestor.
  - This implies that any set of species is related.
- Phylogeny: the relationship between any set of species.
- Phylogenetic tree: usually the relationship can be represented by a tree, called a phylogenetic (evolution) tree; this is not always true.

Slide 9: Phylogenetic Trees
[Figure: a phylogenetic tree with current-day species at the bottom (giant panda, lesser panda, moose, goshawk, vulture, duck, alligator); the vertical axis is time.]

Slide 10: Phylogenetic Trees
- TAXA (sequences) identify species.
- Edge lengths represent evolution time.
- Assumption: bifurcating tree topology.
[Figure: a binary tree over time, with current-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT at the leaves and ancestral sequences AAGACTT, AGCACTT, AAGGCCT, AAGGCAT at internal nodes.]

Slide 11: Probabilistic Models of Evolution
- Characterize the relationship between taxa using substitution probabilities:
  - P(x | y, t): the probability that ancestral sequence y evolves into sequence x along an edge of length t.
  - The model consists of P(X7), P(X5 | X7, t5), P(X6 | X7, t6), P(S1 | X5, t1), P(S2 | X5, t2), ...
[Figure: a tree with leaves s1-s4, internal nodes x5-x7, and edge lengths t1-t6.]
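Spelled out, the model is the product of these conditionals. A sketch of the joint distribution for the tree in the figure, assuming (as the edge labels suggest) that s1, s2 attach to x5 and s3, s4 attach to x6:

P(s_1, s_2, s_3, s_4, x_5, x_6, x_7) = P(x_7)\, P(x_5 \mid x_7, t_5)\, P(x_6 \mid x_7, t_6)\, P(s_1 \mid x_5, t_1)\, P(s_2 \mid x_5, t_2)\, P(s_3 \mid x_6, t_3)\, P(s_4 \mid x_6, t_4)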

Slide 12: Probabilistic Models of Evolution
- What should P(x | y, t) be?
- Two assumptions of commonly used models:
  - There are only substitutions, no insertions/deletions (the sequences are aligned), so there is a one-to-one correspondence between sites in different sequences.
  - Each site evolves independently and identically.
- Hence P(x \mid y, t) = \prod_{i=1}^{m} P(x(i) \mid y(i), t), where m is the sequence length.

Slide 13: Probabilistic Models of Evolution
- What should P(x(i) | y(i), t) be?
- Jukes-Cantor (character evolution) model [1969], with rate of substitution α (constant or parameter?):
  - The transition matrix over {A, C, G, T} has r_t on the diagonal and s_t off the diagonal, where
    r_t = \frac{1}{4}(1 + 3e^{-4\alpha t}), \quad s_t = \frac{1}{4}(1 - e^{-4\alpha t})
- Multiplicativity (lack of memory).
- Limit values when t = 0 or t = infinity?
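A minimal sketch of the Jukes-Cantor substitution probability, combined with the site-independence assumption from Slide 12 to score whole sequences; function and variable names are my own, not from the slides:

import math

def jc_site_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor probability that nucleotide y evolves into x over time t."""
    r = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))  # no change (diagonal entry)
    s = 0.25 * (1 - math.exp(-4 * alpha * t))      # change to one specific base
    return r if x == y else s

def jc_seq_prob(x, y, t, alpha=1.0):
    """P(x | y, t): sites evolve independently and identically."""
    assert len(x) == len(y)  # aligned sequences, no insertions/deletions
    p = 1.0
    for xi, yi in zip(x, y):
        p *= jc_site_prob(xi, yi, t, alpha)
    return p

# Probability that ancestral AAGACTT evolves into AGGGCAT along an edge of length 0.5:
print(jc_seq_prob("AGGGCAT", "AAGACTT", t=0.5))

This also answers the slide's closing question: as t -> 0, r_t -> 1 and s_t -> 0 (no change), and as t -> infinity, both tend to 1/4 (the stationary uniform distribution over bases).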

Slide 14: Tree Reconstruction
- Given: a collection of current-day taxa, e.g. AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.
- Find: a tree
  - Tree topology: T
  - Edge lengths: t
- Maximum likelihood: find the tree that maximizes P(data | tree).

Slide 15: Tree Reconstruction
- When restricted to one particular site, a phylogenetic tree is an LT (latent tree) model where
  - the structure is a binary tree and all variables share the same state space,
  - the conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization, and
  - the model is the same for different sites.

Slide 16: Tree Reconstruction
- Current-day taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.
- These are samples for the LT model, one sample per site; the samples are i.i.d.
  - 1st site: (A, T, T, A, A)
  - 2nd site: (G, A, A, G, G)
  - 3rd site: (G, G, G, C, C)
  - ...

Slide 17: Tree Reconstruction
- Finding the ML phylogenetic tree == finding the ML LT model.
- Model space:
  - Model structures: binary trees where all variables share the same state space, which is known.
  - Parameterization: one parameter (the edge length) per edge. (In general, P(x|y) has (|x|-1)|y| parameters.)
- The objective is to find relationships among variables.
- Applying new LTM algorithms to phylogenetic tree reconstruction?
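To make "find the tree that maximizes P(data | tree)" concrete, here is a brute-force likelihood sketch for the small four-leaf tree of Slide 11, summing over all internal-node states per site. The uniform root prior, the fixed topology, and all names are my assumptions for illustration; real reconstruction would also search over topologies and optimize the edge lengths (e.g., with a numerical optimizer).

import math
from itertools import product

BASES = "ACGT"

def jc(x, y, t, alpha=1.0):
    """Jukes-Cantor single-site substitution probability (see Slide 13)."""
    r = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))
    s = 0.25 * (1 - math.exp(-4 * alpha * t))
    return r if x == y else s

def site_likelihood(s1, s2, s3, s4, t):
    """P(s1..s4 | tree) for the Slide 11 tree; t = (t1..t6), root x7 uniform."""
    total = 0.0
    for x7, x5, x6 in product(BASES, repeat=3):  # sum out internal nodes
        total += (0.25                            # assumed uniform prior P(x7)
                  * jc(x5, x7, t[4]) * jc(x6, x7, t[5])
                  * jc(s1, x5, t[0]) * jc(s2, x5, t[1])
                  * jc(s3, x6, t[2]) * jc(s4, x6, t[3]))
    return total

def log_likelihood(seqs, t):
    """Sites are i.i.d., so the log-likelihood is a sum over sites."""
    return sum(math.log(site_likelihood(*site, t)) for site in zip(*seqs))

# Four aligned current-day taxa (from Slide 16) and guessed edge lengths:
seqs = ["AGGGCAT", "TAGCCCA", "TAGACTT", "AGCACAA"]
print(log_likelihood(seqs, t=(0.3, 0.3, 0.3, 0.3, 0.2, 0.2)))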

Slide 18: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Slide 19: BN for Density Estimation
- Given: a data set D on variables X1, X2, ..., Xn.
- Estimate: P(X1, X2, ..., Xn) under some constraints.
- Uses of the estimate:
  - Inference
  - Classification

Slide 20: BN Methods for Density Estimation
- Chow-Liu tree with X1, X2, ..., Xn as nodes (see the sketch below):
  - Easy to compute
  - Easy to use
  - Might not be a good estimate of the "true" distribution
- BN with X1, X2, ..., Xn as nodes:
  - Can be a good estimate of the "true" distribution
  - Might be difficult to find
  - Might be complex to use
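A minimal sketch of the Chow-Liu construction for binary data: compute pairwise empirical mutual information, then take a maximum-weight spanning tree. The Prim-style tree growing and all names are mine, not from the slides:

import numpy as np
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information between binary columns i and j."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((data[:, i] == a) & (data[:, j] == b))
            p_a = np.mean(data[:, i] == a)
            p_b = np.mean(data[:, j] == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """Edges of a maximum-weight spanning tree under mutual information."""
    n_vars = data.shape[1]
    weights = {(i, j): mutual_information(data, i, j)
               for i, j in combinations(range(n_vars), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:  # Prim's algorithm: grow from node 0
        i, j = max(((i, j) for (i, j) in weights
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: weights[e])
        edges.append((i, j))
        in_tree |= {i, j}
    return edges

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 5))
print(chow_liu_tree(data))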

Slide 21: BN Methods for Density Estimation
- LC model with X1, X2, ..., Xn as manifest variables (Lowd and Domingos 2005):
  - Determine the cardinality of the latent variable using hold-out validation.
  - Optimize the parameters using EM.
  - Easy to compute
  - Can be a good estimate of the "true" distribution
  - Might be complex to use (the cardinality of the latent variable might be very large)

Slide 22: BN Methods for Density Estimation
- LT models for density estimation.
- Pearl 1988: as models over manifest variables, LTMs
  - are computationally very simple to work with, and
  - can represent complex relationships among manifest variables.

Slide 23: BN Methods for Density Estimation
- A new approximate inference algorithm for Bayesian networks (Wang, Zhang and Chen, AAAI 08; JAIR 32:879-900, 2008).
[Diagram: a Sample / Learn pipeline, with sparse and dense annotations on the models involved.]

Slide 24: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Slide 25: Bayesian Networks for Classification
- The problem:
  - Given data of the form:
    A1 A2 ... An | C
    0  1  ... 0  | T
    1  0  ... 1  | F
    ...
  - Find a mapping f: (A1, A2, ..., An) -> C.
- Possible solutions:
  - ANN
  - Decision trees (Quinlan)
  - ...
  - (SVM: continuous data)

Slide 26: Bayesian Networks for Classification
- Naïve Bayes model:
  - From data, learn P(C) and P(Ai | C).
  - Classification: arg max_c P(C = c | A1 = a1, ..., An = an).
  - Very good in practice (see the sketch below).
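A minimal naïve Bayes sketch for discrete attributes. The add-one (Laplace) smoothing is my addition, not something the slides specify:

from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(C) and P(Ai | C) counts from data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)              # cond[(i, c)][a] = count
    for row, c in zip(rows, labels):
        for i, a in enumerate(row):
            cond[(i, c)][a] += 1
    return prior, cond, len(labels)

def classify_nb(prior, cond, n, row):
    """arg max_c P(C=c) * prod_i P(Ai=ai | C=c), with add-one smoothing."""
    def score(c):
        s = prior[c] / n
        for i, a in enumerate(row):
            s *= (cond[(i, c)][a] + 1) / (prior[c] + 2)  # binary attributes
        return s
    return max(prior, key=score)

rows = [(0, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
labels = ["T", "F", "T", "F"]
model = train_nb(rows, labels)
print(classify_nb(*model, (0, 1, 0)))   # -> 'T'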

Slide 27: Bayesian Networks for Classification
- Drawback of NB:
  - It assumes the attributes are mutually independent given the class variable.
  - This is often violated, leading to double counting of evidence.
- Fixes:
  - General BN classifiers
  - Tree-augmented naïve Bayes (TAN) models
  - Hierarchical NB
  - Bayes rule + density estimation
  - ...

Slide 28: Bayesian Networks for Classification
- General BN classifier:
  - Treat the class variable just as another variable.
  - Learn a BN.
  - Classify the next instance based on the values of the variables in the Markov blanket of the class variable.
  - Performs pretty badly: because classification is restricted to the Markov boundary, it does not utilize all available information.

Slide 29: Bayesian Networks for Classification
- TAN model:
  - Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:131-163.
  - Captures dependence among attributes using a tree structure.
  - Learning:
    - First learn a tree among the attributes using the Chow-Liu algorithm.
    - Then add the class variable and estimate the parameters.
  - Classification: arg max_c P(C = c | A1 = a1, ..., An = an).

Slide 30: Bayesian Networks for Classification
- Hierarchical naïve Bayes models:
  - N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.
  - Capture dependence among attributes using latent variables.
  - Detect interesting latent structures besides classification.
  - Algorithm in the spirit of DHC.
  - ...

Slide 31: Bayesian Networks for Classification
- Bayes rule + density estimation, with the joint estimated by:
  - Chow-Liu trees
  - LC models
  - LT models
- Wang Yi: Bayes rule + LT model is superior.

Slide 32: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Slide 33: BN for Clustering
- Latent class (LC) model:
  - One latent variable Y
  - A set of manifest variables Xi
- Conditional independence assumption:
  - The Xi's are mutually independent given Y.
  - Also known as the local independence assumption.
- Used for cluster analysis of categorical data (see the EM sketch below):
  - Determine the cardinality of Y: the number of clusters.
  - Determine P(Xi | Y): the characteristics of the clusters.
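A minimal EM sketch for fitting an LC model with binary manifest variables; k is the assumed number of clusters, and the random initialization, smoothing, and names are mine:

import numpy as np

def lc_em(data, k, n_iter=100, seed=0):
    """EM for a latent class model: learns P(Y) and P(Xi=1 | Y) for binary Xi."""
    n, d = data.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                 # P(Y)
    theta = rng.uniform(0.25, 0.75, (k, d))  # theta[c, i] = P(Xi=1 | Y=c)
    for _ in range(n_iter):
        # E-step: responsibilities P(Y=c | x) for each row, via log-space scores
        log_p = (np.log(pi)
                 + data @ np.log(theta).T
                 + (1 - data) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(Y) and P(Xi | Y), with add-one smoothing
        pi = resp.mean(axis=0)
        theta = (resp.T @ data + 1) / (resp.sum(axis=0)[:, None] + 2)
    return pi, theta

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(300, 6)).astype(float)
pi, theta = lc_em(data, k=2)
print(pi)

The row-wise posterior (resp) is also what assigns each record to a cluster: pick the cluster with the largest responsibility.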

Slide 34: BN for Clustering: Clustering Criteria
- Distance-based clustering:
  - Minimize intra-cluster variation and/or maximize inter-cluster variation.
- LC model-based clustering:
  - The criterion follows from the conditional independence assumption:
  - Divide the data into clusters such that, within each cluster, the manifest variables are mutually independent under the empirical distribution.

Slide 35: BN for Clustering
- The local independence assumption is often not true.
- LT models generalize LC models:
  - They relax the independence assumption.
  - Each latent variable gives a way to partition the data: multidimensional clustering.

Slide 36: ICAC Data
// 31 variables, 1200 samples
C_City: s0 s1 s2 s3                // very common, quite common, uncommon, ...
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3       // totally intolerable, intolerable, tolerable, ...
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2          // yes, no, depends
LeaveContactInfo: s0 s1            // yes, no
I_EncourageReport: s0 s1 s2 s3 s4  // very sufficient, sufficient, average, ...
I_Effectiveness: s0 s1 s2 s3 s4    // very effective, effective, average, ineffective, very ineffective
I_Deterrence: s0 s1 s2 s3 s4       // very sufficient, sufficient, average, ...
...
-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0
-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0
-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0
...

Slide 37: Latent Structure Discovery (Zhang, Poon, Wang and Chen 2008)
- Y2: demographic info; Y3: tolerance toward corruption
- Y4: ICAC performance; Y7: ICAC accountability
- Y5: change in level of corruption; Y6: level of corruption

Slide 38: Interpreting Partitions
- Y2 partitions the population into 4 clusters.
- What is the partition about? What is the "criterion"?
- On which manifest variables do the clusters differ the most?
- Mutual information: the larger I(Y2; X), the more the 4 clusters differ on X.
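For reference, the mutual information in question is the standard quantity, presumably computed here from the learned model's joint distribution over (Y2, X):

I(Y_2; X) = \sum_{y} \sum_{x} P(y, x) \log \frac{P(y, x)}{P(y)\, P(x)}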

Slide 39: Interpreting Partitions
- Information curves:
  - The partition given by Y2 is based on Income, Age, Education, Sex.
  - Interpretation: Y2 represents a partition of the population based on demographic information.
  - Y3 represents a partition based on tolerance toward corruption.

Slide 40: Interpreting Clusters
- Y2=s0: low-income youngsters
- Y2=s1: women with no/low income
- Y2=s2: people with good education and good income
- Y2=s3: people with poor education and average income

Slide 41: Interpreting Clusters
- Y3=s0: people who find corruption totally intolerable (57%)
- Y3=s1: people who find corruption intolerable (27%)
- Y3=s2: people who find corruption tolerable (15%)
- Interesting finding:
  - Y3=s2: 29+19 = 48% find C-Gov totally intolerable or intolerable; only 5% for C-Bus.
  - Y3=s1: 54% find C-Gov totally intolerable; only 2% for C-Bus.
  - Y3=s0: same attitude toward C-Gov and C-Bus.
  - People who are tough on corruption are equally tough toward C-Gov and C-Bus; people who are relaxed about corruption are more relaxed toward C-Bus than C-Gov.

Slide 42: Relationship Between Dimensions
- Interesting finding: the relationship between background and tolerance toward corruption.
  - Y2=s2 (good education and good income): the least tolerant; 4% find corruption tolerable.
  - Y2=s3 (poor education and average income): the most tolerant; 32% find corruption tolerable.
  - The other two classes are in between.

Slide 43: Result of LCA
- The partition is not meaningful.
- Reason: local independence does not hold.
- Another way to look at it:
  - LCA assumes that all the manifest variables jointly define a meaningful way to cluster the data.
  - This is obviously not true for the ICAC data.
  - Instead, one should look for subsets of variables that do define meaningful partitions and perform cluster analysis on them.
  - This is what we do with LTA.

Slide 44: Finite Mixture Models
- Y: discrete latent variable.
- Xi: continuous.
- P(X1, X2, ..., Xn | Y): usually multivariate Gaussian.
- No independence assumption.
- Assume the states of Y are 1, 2, ..., k; then
  P(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{k} P(Y=i)\, P(X_1, X_2, \ldots, X_n \mid Y=i),
  a mixture of k Gaussian components.
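A minimal sketch of evaluating such a mixture density; the weights and Gaussian parameters below are made up purely for illustration:

import numpy as np
from scipy.stats import multivariate_normal

# A 2-component Gaussian mixture over (X1, X2); parameters are illustrative.
weights = [0.6, 0.4]                                  # P(Y=i)
components = [
    multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.2], [0.2, 1.0]]),
    multivariate_normal(mean=[3.0, 3.0], cov=[[1.0, -0.3], [-0.3, 2.0]]),
]

def mixture_density(x):
    """P(x) = sum_i P(Y=i) * P(x | Y=i)."""
    return sum(w * comp.pdf(x) for w, comp in zip(weights, components))

print(mixture_density([1.0, 1.0]))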

Slide 45: Finite Mixture Models
- Used to cluster continuous data.
- Learning: determine
  - k, the number of clusters,
  - P(Y), and
  - P(X1, ..., Xn | Y).
- They also assume that all attributes define one coherent partition:
  - Not realistic.
  - LT models are a natural framework for clustering high-dimensional data.

Slide 46: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Slide 47: Observations on How the Human Brain Thinks
- Human beings often invoke latent variables to explain regularities that we observe.
- Example 1:
  - Observed regularity: beers and diapers are often bought together in the early evening.
  - Hypothesize a (latent) cause: there must be a common (latent) cause.
  - Identify the cause and explain the regularity: shopping by fathers of babies on the way home from work, based on our understanding of the world.

Slide 48: Observations on How the Human Brain Thinks
- Example 2:
  - Background: at night, watching light through the windows of apartments in big buildings.
  - Observed regularity: the light from several apartments changed in brightness and color at the same times and in perfect synchrony.
  - Hypothesize a common (latent) cause: there must be a common (latent) cause.
  - Identify the cause and explain the phenomenon: people watching the same TV channel, based on our understanding of the world.

Slide 49: Back to Ancient Times
- Observed regularity: several symptoms often occur together, e.g. intolerance to cold, cold limbs, and cold lumbus and back.
- Hypothesize a common latent cause: there must be a common latent cause.
- Identify the cause:
  - The answer was based on the understanding of the world at that time, which was primitive.
  - Conclusion: Yang deficiency (阳虚).
  - Explanation: Yang is like the sun; it warms your body. If you do not have enough of it, you feel cold.

Slide 50: Back to Ancient Times
- Observed regularity: several symptoms often occur together, e.g. tidal fever (潮热), heat sensation in the palms and feet (手足心热), palpitations (心慌心跳), and a thready and rapid pulse (脉细数).
- Hypothesize a common latent cause: there must be a common latent cause.
- Identify the cause and explain the regularity:
  - Yin deficiency causing internal heat (阴虚内热).
  - Yin and Yang should be in balance. If Yin is deficient, Yang will be in relative excess and hence causes heat.

Slide 51: Traditional Chinese Medicine (TCM)
- Claim: TCM theories = statistical regularities + subjective interpretations.
- How can the claim be justified?

Slide 52: A Case Study
- We collected a data set about kidney deficiency (肾虚):
  - 35 symptom variables, 2600 records.

Slide 53: Result of Data Analysis
- Y0-Y34: manifest variables from the data.
- X0-X13: latent variables introduced by the data analysis.
- The structure is interesting and supports TCM's theories about the various symptoms.

Slide 54: Other TCM Data Sets
- From Beijing University of TCM (973 project):
  - Depression
  - Hepatitis B
  - Chronic renal failure
  - COPD
  - Menopause
- From the China Academy of TCM:
  - Subhealth
  - Diabetes
- In all cases, the results of LT analysis match the relevant TCM theories.

Slide 55: Result on the Depression Data

Slide 56: Significance
- Conclusion: TCM theories = statistical regularities + subjective interpretations.
- Significance:
  - TCM theories are partially based on objective facts, which boosts user confidence.
  - This can help to lay a modern statistical foundation for TCM:
    - Systematically identify statistical regularities in the occurrence of symptoms and find natural partitions.
    - Establish objective and quantitative diagnosis standards.
    - Assist in double-blind experiments to evaluate and improve the efficacy of TCM treatments.

Slide 57: Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

