L11: Uses of Bayesian Networks Nevin L. Zhang Department of Computer Science & Engineering The Hong Kong University of Science & Technology

Page 2 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Page 3 Traditional Uses
- Probabilistic expert systems
  - Diagnosis
  - Prediction
- Example: BN for diagnosing "blue baby" syndrome over the phone at a London hospital
  - Comparable to specialists, better than other clinicians

Page 4 Traditional Uses
- Language for describing probabilistic models in science and engineering
- Example: BN for turbo codes

Page 5 Traditional Uses
- Language for describing probabilistic models in science and engineering
- Example: BN from bioinformatics

Page 6 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Page 7 BN for Structure Discovery
- Given: data set D on variables X1, X2, …, Xn
- Discover dependence, independence, and even causal relationships among the variables
- Example: evolution trees

Page 8 Phylogenetic Trees
- Assumption
  - All organisms on Earth have a common ancestor
  - This implies that any set of species is related
- Phylogeny
  - The relationship between any set of species
- Phylogenetic tree
  - Usually the relationship can be represented by a tree, called a phylogenetic (evolution) tree; this is not always true

Page 9 Phylogenetic Trees
- [Figure: example phylogenetic tree with current-day species (giant panda, lesser panda, moose, goshawk, vulture, duck, alligator) at the bottom; the vertical axis represents time]

Page 10 Phylogenetic Trees
- Taxa (sequences) identify species
- Edge lengths represent evolution time
- Assumption: bifurcating tree topology
- [Figure: tree with example sequences (AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT) at the leaves and ancestral sequences (AAGACTT, AGCACTT, AAGGCCT, AAGGCAT) at internal nodes; the vertical axis is time]

Page 11 Probabilistic Models of Evolution
- Characterize the relationship between taxa using the substitution probability P(x | y, t): the probability that ancestral sequence y evolves into sequence x along an edge of length t
- The model is specified by P(X7), P(X5 | X7, t5), P(X6 | X7, t6), P(S1 | X5, t1), P(S2 | X5, t2), …
- [Figure: tree with leaves s1–s4, internal nodes x5–x7, and edge lengths t1–t6]

Page 12 Probabilistic Models of Evolution
- What should P(x | y, t) be?
- Two assumptions of commonly used models
  - There are only substitutions, no insertions/deletions (sequences are aligned), so there is a one-to-one correspondence between sites in different sequences
  - Each site evolves independently and identically
- Hence P(x | y, t) = ∏_{i=1}^{m} P(x(i) | y(i), t), where m is the sequence length
- [Figure: the same example tree of sequences as on the previous slides]

Page 13 Probabilistic Models of Evolution
- What should P(x(i) | y(i), t) be?
  - Jukes-Cantor (character evolution) model [1969]
  - Rate of substitution α (constant or parameter?)
- The substitution matrix over {A, C, G, T} has r_t on the diagonal and s_t off the diagonal, where
  r_t = 1/4 (1 + 3e^{-4αt}) and s_t = 1/4 (1 − e^{-4αt})
- Multiplicativity (lack of memory)
- Limit values when t = 0 or t = ∞?
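These Jukes-Cantor probabilities, combined with the site-independence assumption from the previous slide, are straightforward to compute. Below is a minimal Python sketch (not from the slides); the substitution rate alpha and the example sequences are illustrative choices.

```python
import math

def jc_prob(x: str, y: str, t: float, alpha: float = 1.0) -> float:
    """Jukes-Cantor P(x | y, t): probability that nucleotide y becomes x
    along a branch of length t, with substitution rate alpha."""
    same = 0.25 * (1.0 + 3.0 * math.exp(-4.0 * alpha * t))   # r_t
    diff = 0.25 * (1.0 - math.exp(-4.0 * alpha * t))         # s_t
    return same if x == y else diff

def seq_prob(x: str, y: str, t: float, alpha: float = 1.0) -> float:
    """P(x | y, t) for aligned sequences: product over independent sites."""
    assert len(x) == len(y)
    p = 1.0
    for xi, yi in zip(x, y):
        p *= jc_prob(xi, yi, t, alpha)
    return p

# Limit checks: t -> 0 gives identity, t -> infinity gives the uniform 1/4.
print(jc_prob('A', 'A', 0.0))          # 1.0
print(jc_prob('A', 'C', 1e6))          # ~0.25
print(seq_prob('AGCACTT', 'AAGACTT', 0.5))
```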

Page 14 Tree Reconstruction
- Given: a collection of current-day taxa, e.g. AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
- Find: a tree
  - Tree topology: T
  - Edge lengths: t
- Maximum likelihood
  - Find the tree that maximizes P(data | tree)

Page 15 Tree Reconstruction
- When restricted to one particular site, a phylogenetic tree is an LT model where
  - The structure is a binary tree and all variables share the same state space
  - The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization
  - The model is the same for different sites

Page 16 Tree Reconstruction
- Current-day taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
- Samples for the LT model: one sample per site; the samples are i.i.d.
  - 1st site: (A, T, T, A, A)
  - 2nd site: (G, A, A, G, G)
  - 3rd site: (G, G, G, C, C)
  - …

Page 17 Tree Reconstruction
- Finding the ML phylogenetic tree == finding the ML LT model
- Model space:
  - Model structures: binary trees where all variables share the same state space, which is known
  - Parameterization: one parameter for each edge (in general, P(x|y) has (|x|−1)·|y| free parameters)
- The objective is to find relationships among variables
- Applying new LTM algorithms to phylogenetic tree reconstruction?
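Not part of the slides: the likelihood P(data | tree) that ML reconstruction maximizes is usually evaluated site by site with Felsenstein's pruning (peeling) algorithm, then multiplied across sites. The sketch below assumes the Jukes-Cantor model from the previous slides, a uniform distribution over the root state, and a small made-up topology with made-up branch lengths.

```python
import math

NUCS = "ACGT"

def jc_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor single-site substitution probability P(x | y, t)."""
    s = 0.25 * (1.0 - math.exp(-4.0 * alpha * t))
    return 1.0 - 3.0 * s if x == y else s

def site_likelihood(tree, observed, alpha=1.0):
    """P(observed leaf bases | tree) for one site, by bottom-up pruning.
    tree: dict mapping an internal node to [(child, branch_length), ...];
    observed: dict mapping each leaf to its base at this site."""
    def up(node):
        if node in observed:                              # leaf: point mass on its base
            return {b: 1.0 if b == observed[node] else 0.0 for b in NUCS}
        lik = {b: 1.0 for b in NUCS}
        for child, t in tree[node]:
            child_lik = up(child)
            for b in NUCS:
                lik[b] *= sum(jc_prob(c, b, t, alpha) * child_lik[c] for c in NUCS)
        return lik
    root_lik = up("root")
    return sum(0.25 * root_lik[b] for b in NUCS)          # uniform root distribution

# Illustrative 3-taxon tree with made-up branch lengths (not the slides' example).
tree = {"root": [("x5", 0.3), ("s3", 0.9)],
        "x5": [("s1", 0.2), ("s2", 0.4)]}
print(site_likelihood(tree, {"s1": "A", "s2": "A", "s3": "T"}))
```

Searching over topologies and branch lengths (e.g. by hill climbing or structural EM) would sit on top of this likelihood computation.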

Page 18 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Page 19 BN for Density Estimation
- Given: data set D on variables X1, X2, …, Xn
- Estimate: P(X1, X2, …, Xn) under some constraints
- Uses of the estimate:
  - Inference
  - Classification

Page 20 BN Methods for Density Estimation
- Chow-Liu tree with X1, X2, …, Xn as nodes
  - Easy to compute
  - Easy to use
  - Might not be a good estimate of the "true" distribution
- BN with X1, X2, …, Xn as nodes
  - Can be a good estimate of the "true" distribution
  - Might be difficult to find
  - Might be complex to use
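As an illustration of the first option, here is a rough sketch of Chow-Liu tree construction: compute pairwise empirical mutual information and keep a maximum-weight spanning tree (Kruskal's algorithm is used below; the toy data are made up).

```python
import numpy as np
from itertools import combinations

def mutual_information(a, b):
    """Empirical mutual information I(A; B) between two discrete data columns."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(data):
    """Edges of the maximum-weight spanning tree with pairwise MI as edge weights."""
    n_vars = data.shape[1]
    weights = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                      for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                     # keep the edge only if it does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Toy binary data: X1 is a noisy copy of X0, X2 and X3 are independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = np.where(rng.random(500) < 0.9, x0, 1 - x0)
data = np.column_stack([x0, x1, rng.integers(0, 2, 500), rng.integers(0, 2, 500)])
print(chow_liu_edges(data))              # the strongest edge should link columns 0 and 1
```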

Page 21 BN Methods for Density Estimation
- LC model with X1, X2, …, Xn as manifest variables (Lowd and Domingos 2005)
  - Determine the cardinality of the latent variable using hold-out validation
  - Optimize the parameters using EM
  - Easy to compute
  - Can be a good estimate of the "true" distribution
  - Might be complex to use (the cardinality of the latent variable might be very large)
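A sketch of the procedure described on this slide, assuming all manifest variables are discrete: EM for a latent class model plus hold-out selection of the latent cardinality. The toy data, the number of EM iterations, and the candidate range of cardinalities are arbitrary choices.

```python
import numpy as np

def fit_lc_em(data, k, n_iter=200, seed=0):
    """EM for a latent class model: latent Y with k states, discrete X1..Xn
    assumed mutually independent given Y. data: (N, n) array of category codes."""
    rng = np.random.default_rng(seed)
    N, n = data.shape
    n_states = data.max(axis=0) + 1                                 # cardinality of each Xi
    prior = np.full(k, 1.0 / k)                                     # P(Y)
    cond = [rng.dirichlet(np.ones(c), size=k) for c in n_states]    # P(Xi | Y)
    for _ in range(n_iter):
        # E-step: posterior P(Y | record) for every record
        log_post = np.log(prior) + sum(np.log(cond[i][:, data[:, i]]).T for i in range(n))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(Y) and each P(Xi | Y)
        prior = post.mean(axis=0)
        for i in range(n):
            for v in range(n_states[i]):
                cond[i][:, v] = post[data[:, i] == v].sum(axis=0) + 1e-6
            cond[i] /= cond[i].sum(axis=1, keepdims=True)
    return prior, cond

def loglik(data, prior, cond):
    """Log-likelihood of data under a fitted LC model."""
    n = data.shape[1]
    log_joint = np.log(prior) + sum(np.log(cond[i][:, data[:, i]]).T for i in range(n))
    m = log_joint.max(axis=1, keepdims=True)
    return float((m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))).sum())

# Choose the latent cardinality on a hold-out set (toy data, illustrative only).
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(600, 5))
train, hold = data[:400], data[400:]
best = max(range(2, 6), key=lambda k: loglik(hold, *fit_lc_em(train, k)))
print("selected cardinality:", best)
```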

Page 22 BN Methods for Density Estimation
- LT models for density estimation
- Pearl 1988: as models over the manifest variables, LTMs
  - Are computationally very simple to work with
  - Can represent complex relationships among the manifest variables

Page 23 BN Methods for Density Estimation
- New approximate inference algorithm for Bayesian networks (Wang, Zhang and Chen, AAAI 08; JAIR 32, 2008)
- [Figure: pipeline with "Sample" and "Learn" steps, annotated with "sparse" and "dense"]

Page 24 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Page 25 Bayesian Networks for Classification
- The problem:
  - Given data of the form:
    A1  A2  …  An  |  C
    0   1   …  0   |  T
    1   0   …  1   |  F
    …
  - Find a mapping (A1, A2, …, An) → C
- Possible solutions
  - ANN
  - Decision tree (Quinlan)
  - …
  - (SVM: continuous data)

Page 26 Bayesian Networks for Classification
- Naïve Bayes model
  - From data, learn P(C) and P(Ai | C)
  - Classification: arg max_c P(C=c | A1=a1, …, An=an)
  - Very good in practice
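A small self-contained sketch of this recipe for discrete attributes (Laplace smoothing and the toy data are my additions, not from the slides):

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Learn P(C) and P(Ai | C) from discrete data with Laplace smoothing."""
    classes = np.unique(y)
    prior = np.array([np.mean(y == c) for c in classes])
    cond = []                               # cond[i][class_index, value] = P(Ai=value | C)
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        tab = np.array([[np.sum((y == c) & (X[:, i] == v)) + alpha for v in values]
                        for c in classes])
        cond.append(tab / tab.sum(axis=1, keepdims=True))
    return classes, prior, cond

def predict_nb(x, classes, prior, cond):
    """arg max_c P(C=c) * prod_i P(Ai=ai | C=c), computed in log space."""
    scores = np.log(prior) + sum(np.log(cond[i][:, xi]) for i, xi in enumerate(x))
    return classes[np.argmax(scores)]

# Toy binary attributes and class labels (illustrative only).
X = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 1], [1, 0, 0], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1])
model = train_nb(X, y)
print(predict_nb(np.array([0, 1, 0]), *model))
```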

Page 27 Bayesian Networks for Classification
- Drawback of NB:
  - Attributes are assumed mutually independent given the class variable
  - This assumption is often violated, leading to double counting of evidence
- Fixes:
  - General BN classifiers
  - Tree augmented Naïve Bayes (TAN) models
  - Hierarchical NB
  - Bayes rule + density estimation
  - …

Page 28 Bayesian Networks for Classification
- General BN classifier
  - Treat the class variable just as another variable
  - Learn a BN
  - Classify a new instance based on the values of the variables in the Markov blanket of the class variable
  - Performs rather poorly: because only the Markov blanket (boundary) of the class variable is used, the classifier does not utilize all available information

Page 29 Bayesian Networks for Classification
- TAN model
  - Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29.
  - Captures dependence among attributes using a tree structure
  - During learning:
    - First learn a tree among the attributes using the Chow-Liu algorithm
    - Then add the class variable and estimate the parameters
  - Classification: arg max_c P(C=c | A1=a1, …, An=an)
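In TAN, the tree over attributes is found with the Chow-Liu procedure, but with conditional mutual information I(Ai; Aj | C) as the edge weight. A sketch of that weight, assuming small integer codes for the variables (the toy data are made up):

```python
import numpy as np

def cond_mutual_information(a, b, c):
    """Empirical conditional mutual information I(A; B | C) for discrete columns."""
    cmi = 0.0
    for vc in np.unique(c):
        mask = (c == vc)
        p_c = mask.mean()
        for va in np.unique(a[mask]):
            for vb in np.unique(b[mask]):
                p_abc = np.mean(mask & (a == va) & (b == vb))
                p_ac = np.mean(mask & (a == va))
                p_bc = np.mean(mask & (b == vb))
                if p_abc > 0:
                    cmi += p_abc * np.log(p_c * p_abc / (p_ac * p_bc))
    return cmi

# These weights would replace pairwise MI in the Chow-Liu spanning-tree step;
# the class variable is then made a parent of every attribute.
rng = np.random.default_rng(0)
c = rng.integers(0, 2, 400)
a = np.where(rng.random(400) < 0.8, c, 1 - c)        # A depends on C
b = np.where(rng.random(400) < 0.8, a, 1 - a)        # B depends on A beyond C
print(cond_mutual_information(a, b, c))               # noticeably greater than zero
```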

Page 30 Bayesian Networks for Classification
- Hierarchical Naïve Bayes models
  - N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.
  - Capture dependence among attributes using latent variables
  - Detect interesting latent structures besides classification
  - Algorithm in the style of DHC
  - …

Page 31 Bayesian Networks for Classification
- Bayes rule + density estimation, where P(A1, …, An | C) is estimated with:
  - Chow-Liu tree
  - LC model
  - LT model
- Wang Yi: Bayes rule + LT model is far superior
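A sketch of the generic "Bayes rule + density estimation" classifier: any per-class density estimator (Chow-Liu tree, LC model, LT model) can be plugged in. The stand-in densities below are deliberately trivial placeholders, not real estimators.

```python
import numpy as np

def bayes_rule_classifier(class_priors, class_densities):
    """Build a classifier from P(C) and per-class density estimates P(A1..An | C).
    class_densities: dict mapping class -> function returning P(attributes | class)."""
    def classify(attributes):
        scores = {c: np.log(class_priors[c]) + np.log(class_densities[c](attributes))
                  for c in class_priors}
        return max(scores, key=scores.get)
    return classify

# Illustrative plug-in: trivially simple per-class densities (stand-ins for real estimators).
clf = bayes_rule_classifier(
    {"T": 0.6, "F": 0.4},
    {"T": lambda a: 0.8 if a[0] == 1 else 0.2,
     "F": lambda a: 0.3 if a[0] == 1 else 0.7})
print(clf((1, 0, 1)))   # "T"
```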

Page 32 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Page 33 BN for Clustering
- Latent class (LC) model
  - One latent variable Y
  - A set of manifest variables X1, …, Xn
- Conditional independence assumption:
  - The Xi's are mutually independent given Y
  - Also known as the local independence assumption
- Used for cluster analysis of categorical data
  - Determine the cardinality of Y: the number of clusters
  - Determine P(Xi | Y): the characteristics of the clusters
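Once an LC model has been learned (e.g. with the EM sketch in the density estimation part), clustering amounts to computing the posterior P(Y | X1, …, Xn) for each record. A minimal illustration with made-up parameters for three binary manifest variables:

```python
import numpy as np

# Toy fitted LC model: latent Y with 2 states, three binary manifest variables (illustrative numbers).
prior = np.array([0.6, 0.4])                               # P(Y)
cond = [np.array([[0.9, 0.1], [0.2, 0.8]]),                # P(X1 | Y)
        np.array([[0.8, 0.2], [0.3, 0.7]]),                # P(X2 | Y)
        np.array([[0.7, 0.3], [0.4, 0.6]])]                # P(X3 | Y)

def cluster_posterior(record):
    """P(Y | x1, x2, x3) by Bayes rule under the local independence assumption."""
    joint = prior * np.prod([cond[i][:, v] for i, v in enumerate(record)], axis=0)
    return joint / joint.sum()

print(cluster_posterior((0, 0, 0)))   # mostly cluster 0
print(cluster_posterior((1, 1, 1)))   # mostly cluster 1
```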

Page 34 BN for Clustering: Clustering Criteria
- Distance-based clustering:
  - Minimizes intra-cluster variation and/or maximizes inter-cluster variation
- LC model-based clustering:
  - The criterion follows from the conditional independence assumption
  - Divide the data into clusters such that, within each cluster, the manifest variables are mutually independent under the empirical distribution

Page 35 BN for Clustering
- The local independence assumption is often not true
- LT models generalize LC models
  - They relax the independence assumption
  - Each latent variable gives a way to partition the data: multidimensional clustering

Page 36 ICAC Data
// 31 variables, 1200 samples
C_City: s0 s1 s2 s3 // very common, quite common, uncommon, ...
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3 // totally intolerable, intolerable, tolerable, ...
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2 // yes, no, depends
LeaveContactInfo: s0 s1 // yes, no
I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
I_Effectiveness: s0 s1 s2 s3 s4 // very effective, effective, average, ineffective, very ineffective
I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
…

Page 37 Latent Structure Discovery
- Y2: demographic information; Y3: tolerance toward corruption
- Y4: ICAC performance; Y7: ICAC accountability
- Y5: change in level of corruption; Y6: level of corruption
(Zhang, Poon, Wang and Chen 2008)

Page 38 Interpreting Partition
- Y2 partitions the population into 4 clusters
- What is the partition about? What is the "criterion"?
- On what manifest variables do the clusters differ the most?
- Mutual information:
  - The larger I(Y2, X), the more the 4 clusters differ on X
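A sketch of this mutual information computation from a joint probability table P(Y2, X), which could be read off the learned model; the numbers below are illustrative only.

```python
import numpy as np

def mutual_information_from_joint(joint):
    """I(Y; X) from a joint probability table P(Y, X) (rows: Y clusters, cols: X values)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    py = joint.sum(axis=1, keepdims=True)
    px = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (py @ px)[nz])))

# Illustrative joint of a 4-cluster latent variable with a 3-valued manifest variable.
print(mutual_information_from_joint([[0.10, 0.05, 0.05],
                                     [0.05, 0.15, 0.05],
                                     [0.05, 0.05, 0.20],
                                     [0.15, 0.05, 0.05]]))
```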

Page 39 Interpreting Partition
- Information curves:
  - The partition given by Y2 is based on Income, Age, Education, Sex
  - Interpretation:
    - Y2 represents a partition of the population based on demographic information
    - Y3 represents a partition based on tolerance toward corruption

Page 40 Interpreting Clusters
- Y2=s0: low-income youngsters
- Y2=s1: women with no/low income
- Y2=s2: people with good education and good income
- Y2=s3: people with poor education and average income

Page 41 Interpreting Clustering
- Y3=s0: people who find corruption totally intolerable (57%)
- Y3=s1: people who find corruption intolerable (27%)
- Y3=s2: people who find corruption tolerable (15%)
- Interesting finding:
  - Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus
  - Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus
  - Y3=s0: same attitude toward C-Gov and C-Bus
  - People who are tough on corruption are equally tough toward C-Gov and C-Bus; people who are relaxed about corruption are more relaxed toward C-Bus than C-Gov

Page 42 Relationship Between Dimensions
- Interesting finding: relationship between background and tolerance toward corruption
  - Y2=s2 (good education and good income): the least tolerant; 4% find corruption tolerable
  - Y2=s3 (poor education and average income): the most tolerant; 32% find corruption tolerable
  - The other two classes are in between

Page 43 Result of LCA
- The partition is not meaningful
- Reason:
  - Local independence does not hold
- Another way to look at it:
  - LCA assumes that all the manifest variables jointly define a meaningful way to cluster the data
  - This is obviously not true for the ICAC data
  - Instead, one should look for subsets of variables that do define meaningful partitions and perform cluster analysis on them
  - This is what we do with LTA

Page 44 Finite Mixture Models
- Y: discrete latent variable
- Xi: continuous
- P(X1, X2, …, Xn | Y): usually multivariate Gaussian
- No independence assumption
- Assume the states of Y are 1, 2, …, k:
  P(X1, X2, …, Xn) = Σ_{i=1}^{k} P(Y=i) P(X1, X2, …, Xn | Y=i): a mixture of k Gaussian components
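A minimal illustration of this mixture density using SciPy's multivariate normal for the Gaussian components; the weights, means and covariances are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

weights = [0.5, 0.3, 0.2]                                # P(Y = i)
components = [multivariate_normal(mean=[0, 0], cov=np.eye(2)),
              multivariate_normal(mean=[3, 3], cov=[[1.0, 0.5], [0.5, 1.0]]),
              multivariate_normal(mean=[-2, 4], cov=2 * np.eye(2))]

def mixture_density(x):
    """P(x) = sum_i P(Y=i) * N(x; mu_i, Sigma_i)."""
    return sum(w * comp.pdf(x) for w, comp in zip(weights, components))

print(mixture_density([0.5, 0.2]))
```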

Page 45 Finite Mixture Models
- Used to cluster continuous data
- Learning
  - Determine k: the number of clusters
  - Determine P(Y)
  - Determine P(X1, …, Xn | Y)
- Also assumes that all attributes define a single coherent partition
  - Not realistic
  - LT models are a natural framework for clustering high-dimensional data
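A sketch of how k, P(Y) and the component Gaussians could be learned in practice, here with scikit-learn's GaussianMixture and BIC for model selection; the two-cluster toy data and the candidate range of k are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
                  rng.normal([4, 4], 0.8, size=(200, 2))])   # toy two-cluster data

# Fit candidate models, choose k by BIC, then read off P(Y) and the assignments.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(data) for k in range(1, 6)}
best_k = min(models, key=lambda k: models[k].bic(data))
gmm = models[best_k]
print("k =", best_k)
print("P(Y) =", gmm.weights_)
print("cluster assignments:", gmm.predict(data[:5]))
```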

Page 46 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project

Page 47 Observation on How the Human Brain Does Thinking
- Human beings often invoke latent variables to explain regularities that we observe
- Example 1
  - Observed regularity:
    - Beer and diapers are often bought together in the early evening
  - Hypothesize a (latent) cause:
    - There must be a common (latent) cause
  - Identify the cause and explain the regularity:
    - Shopping by fathers of babies on the way home from work
    - Based on our understanding of the world

Page 48 Observation on How the Human Brain Does Thinking
- Example 2
  - Background: at night, watching the light through the windows of apartments in big buildings
  - Observed regularity:
    - The light from several apartments changed in brightness and color at the same times and in perfect synchrony
  - Hypothesize a common (latent) cause:
    - There must be a common (latent) cause
  - Identify the cause and explain the phenomenon:
    - People watching the same TV channel
    - Based on our understanding of the world

Page 49 Back to Ancient Times
- Observed regularity
  - Several symptoms often occur together: 'intolerance to cold', 'cold limbs', and 'cold lumbus and back'
- Hypothesize a common latent cause:
  - There must be a common latent cause
- Identify the cause
  - The answer is based on the (primitive) understanding of the world at that time
  - Conclusion: Yang deficiency ( 阳虚 )
  - Explanation: Yang is like the sun; it warms your body. If you don't have enough of it, you feel cold.

Page 50 Back to Ancient Times
- Observed regularity:
  - Several symptoms often occur together: tidal fever ( 潮热 ), heat sensation in the palms and feet ( 手足心热 ), palpitation ( 心慌心跳 ), thready and rapid pulse ( 脉细数 )
- Hypothesize a common latent cause:
  - There must be a common latent cause
- Identify the cause and explain the regularity:
  - Yin deficiency causing internal heat ( 阴虚内热 )
  - Yin and Yang should be in balance. If Yin is deficient, Yang will be relatively in excess and hence causes heat.

Page 51 Traditional Chinese Medicine (TCM)
- Claim
  - TCM Theories = Statistical Regularities + Subjective Interpretations
- How to justify the claim?

Page 52 A Case Study
- We collected a data set about kidney deficiency ( 肾虚 )
- 35 symptom variables, 2600 records

Page 53 Result of Data Analysis
- Y0-Y34: manifest variables from the data
- X0-X13: latent variables introduced by the data analysis
- The structure is interesting and supports TCM theories about various symptoms

Page 54 Other TCM Data Sets
- From Beijing University of TCM, 973 project
  - Depression
  - Hepatitis B
  - Chronic renal failure
  - COPD
  - Menopause
- From the China Academy of TCM
  - Subhealth
  - Diabetes
- In all cases, the results of LT analysis match the relevant TCM theories

Page 55 Result on the Depression Data

Page 56 Significance
- Conclusion
  - TCM Theories = Statistical Regularities + Subjective Interpretations
- Significance
  - TCM theories are partially based on objective facts
    - Boosts user confidence
  - Can help lay a modern statistical foundation for TCM
    - Systematically identify statistical regularities in the occurrence of symptoms and find natural partitions
    - Establish objective and quantitative diagnosis standards
    - Assist in double-blind experiments to evaluate and improve the efficacy of TCM treatments

Page 57 Outline
- Traditional Uses
- Structure Discovery
- Density Estimation
- Classification
- Clustering
- An HKUST Project