1
Lecture 16: Wrap-Up
COMP 538: Introduction to Bayesian Networks
2
Phylogeny / Slide 2 Nevin L. Zhang, HKUST
Recap
- Latent class models
  - Clustering
  - Clustering criterion: conditional independence
  - Drawback: assumption too strong
- Hierarchical latent class (HLC) models
  - Identifiability issues: regularity, equivalence
  - Hill climbing algorithm
3
Today
- Phylogenetic (evolution) trees
  - Closely related to HLC models
  - An example of viewing existing models in the framework of BNs
    - Another example: HMM
  - Interesting because
    - It eases understanding
    - Techniques from one field can be applied to another
- Structural EM for phylogenetic trees
- Dynamic BNs for speech understanding
  - Development of general-purpose algorithms
- Bayesian networks for classification
  - Hand waving only
4
Phylogenetic Tree Outline
- Introduction to phylogenetic trees
- Probabilistic models of evolution
- Tree reconstruction
5
Phylogenetic Trees
- Assumption
  - All organisms on Earth have a common ancestor
  - This implies that any set of species is related
- Phylogeny
  - The relationship between any set of species
- Phylogenetic tree
  - Usually, the relationship can be represented by a tree, which is called a phylogenetic (evolution) tree
    - This is not always true
6
Phylogenetic Trees
- Phylogenetic trees
[Figure: phylogenetic tree over giant panda, lesser panda, moose, goshawk, vulture, duck, and alligator. The vertical axis is time, with current-day species at the bottom.]
7
Phylogenetic Trees
- Taxa (sequences) identify species
- Edge lengths represent evolution time
- Assumption: bifurcating tree topology
[Figure: bifurcating tree drawn along a time axis, with nodes labeled by the sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT, AAGACTT, AGCACTT, AAGGCCT, AAGGCAT.]
8
Probabilistic Models of Evolution
- Characterize the relationship between taxa using substitution probabilities:
  - P(x | y, t): probability that ancestral sequence y evolves into sequence x along an edge of length t
  - P(X_7), P(X_5 | X_7, t_5), P(X_6 | X_7, t_6), P(S_1 | X_5, t_1), P(S_2 | X_5, t_2), …
[Figure: tree with nodes x_5, x_6, x_7, leaves s_1 to s_4, and edge lengths t_1 to t_6.]
9
Probabilistic Models of Evolution
- What should P(x | y, t) be?
- Two assumptions of commonly used models
  - There are only substitutions, no insertions/deletions (the sequences are aligned)
    - One-to-one correspondence between sites in different sequences
  - Each site evolves independently and identically:
    $P(x \mid y, t) = \prod_{i=1}^{m} P(x(i) \mid y(i), t)$
  - m is the sequence length
[Figure: the example tree of aligned sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT, AAGACTT, AGCACTT, AAGGCCT, AAGGCAT.]
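The site-independence assumption turns the sequence-level probability into a product of per-site terms. The sketch below illustrates only that product. The per-site model `toy_site_prob` is a hypothetical placeholder (not a model from the lecture), and the function names are illustrative.

```python
import math

def sequence_prob(x, y, t, site_prob):
    """P(x | y, t) as a product of per-site probabilities.

    x is the descendant sequence, y the ancestral sequence, t the edge length.
    site_prob(a, b, t) must return P(a | b, t) for single characters.
    """
    if len(x) != len(y):
        raise ValueError("sequences must be aligned (same length)")
    return math.prod(site_prob(a, b, t) for a, b in zip(x, y))

def toy_site_prob(a, b, t):
    # Hypothetical placeholder that ignores t: a site keeps its character
    # with probability 0.7 and changes to each other base with probability 0.1.
    return 0.7 if a == b else 0.1

p = sequence_prob("AGCACTT", "AGCACAA", 1.0, toy_site_prob)
```

Here five sites agree and two differ, so `p` is 0.7 raised to the fifth power times 0.1 squared.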
10
Probabilistic Models of Evolution
- What should P(x(i) | y(i), t) be?
  - Jukes-Cantor (character evolution) model [1969]
    - Rate of substitution α (constant or parameter?)
- Multiplicativity (lack of memory)
- Substitution matrix over A, C, G, T:

        A    C    G    T
    A  r_t  s_t  s_t  s_t
    C  s_t  r_t  s_t  s_t
    G  s_t  s_t  r_t  s_t
    T  s_t  s_t  s_t  r_t

  $r_t = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t})$, $s_t = \tfrac{1}{4}\,(1 - e^{-4\alpha t})$
- Limit values when t = 0 or t = infinity?
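The two questions on this slide, the limit values and multiplicativity, can be checked numerically. The following is a minimal sketch, assuming the standard Jukes-Cantor form in which the exponent is -4αt. The parameter name `alpha` is an assumption of this sketch, and `alpha=1.0` gives the exponent -4t.

```python
import math

BASES = "ACGT"

def jc_matrix(t, alpha=1.0):
    """4x4 Jukes-Cantor substitution matrix for an edge of length t."""
    e = math.exp(-4.0 * alpha * t)
    r = 0.25 * (1.0 + 3.0 * e)   # probability of ending in the same base
    s = 0.25 * (1.0 - e)         # probability of ending in one particular other base
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

at_zero = jc_matrix(0.0)      # identity: no time, no change
at_large = jc_matrix(50.0)    # every entry close to 1/4
composed = matmul(jc_matrix(0.3), jc_matrix(0.5))
direct = jc_matrix(0.8)       # multiplicativity: composed should match direct
```

At t = 0 the matrix is the identity. As t grows, every entry approaches 1/4, so the descendant base no longer depends on the ancestor. Multiplying the matrices for edge lengths 0.3 and 0.5 gives the matrix for 0.8, which is the lack-of-memory property.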
11
Tree Reconstruction
- Given: a collection of current-day taxa
  - AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
- Find: a tree
  - Tree topology: T
  - Edge lengths: t
- Maximum likelihood
  - Find the tree that maximizes P(data | tree)
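To score a candidate tree, P(data | tree) has to sum over the unobserved ancestral sequences. The slides do not say how this is computed. One standard way is Felsenstein's pruning recursion, sketched below under several assumptions of this example: Jukes-Cantor substitution probabilities with rate 1, a uniform distribution at the root, a nested-tuple tree encoding, and made-up edge lengths.

```python
import math

BASES = "ACGT"

def jc_prob(child, parent, t, alpha=1.0):
    """Jukes-Cantor P(child base | parent base, t); alpha is assumed to be 1."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if child == parent else 0.25 * (1.0 - e)

def partial(node, site):
    """Return {base: P(observed leaves below node | node has this base)}."""
    if isinstance(node, str):                      # leaf: an observed sequence
        return {b: 1.0 if node[site] == b else 0.0 for b in BASES}
    (left, t_left), (right, t_right) = node        # internal: two (child, edge length) pairs
    pl, pr = partial(left, site), partial(right, site)
    return {
        b: sum(jc_prob(c, b, t_left) * pl[c] for c in BASES)
           * sum(jc_prob(c, b, t_right) * pr[c] for c in BASES)
        for b in BASES
    }

def log_likelihood(tree, n_sites):
    """log P(data | tree), summing over sites, which are treated as independent."""
    total = 0.0
    for i in range(n_sites):
        p = partial(tree, i)
        total += math.log(sum(0.25 * p[b] for b in BASES))  # uniform root prior
    return total

# Hypothetical four-leaf tree; the edge lengths are made up for illustration.
x5 = (("AGGGCAT", 0.2), ("TAGCCCA", 0.2))
x6 = (("TAGACTT", 0.2), ("AGCACAA", 0.2))
x7 = ((x5, 0.3), (x6, 0.3))
ll = log_likelihood(x7, 7)
```

A maximum likelihood search would call `log_likelihood` for many candidate topologies and edge lengths and keep the best one. That search is not shown here.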
12
Tree Reconstruction
- When restricted to one particular site, a phylogenetic tree is an HLC model where
  - The structure is a binary tree and all variables share the same state space.
  - The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization.
  - The model is the same for different sites.
[Figure: the example tree with sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT, AAGACTT, AGCACTT, AAGGCCT.]
13
Tree Reconstruction
- Current-day taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
- Samples for the HLC model: one sample per site. The samples are i.i.d.
  - 1st site: (A, T, T, A, A)
  - 2nd site: (G, A, A, G, G)
  - 3rd site: (G, G, G, C, C)
  - …
[Figure: the example tree with sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT, AAGACTT, AGCACTT, AAGGCCT.]
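Turning the aligned taxa into per-site samples is a transposition of the alignment. The sketch below shows one way to do it. The helper name `site_samples` is illustrative.

```python
def site_samples(seqs):
    """Return one tuple per site, holding that site's character in each taxon."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned (same length)")
    return list(zip(*seqs))

taxa = ["AGGGCAT", "TAGCCCA", "TAGACTT", "AGCACAA", "AGCGCTT"]
samples = site_samples(taxa)
# samples[0] is ('A', 'T', 'T', 'A', 'A'), the first-site sample listed on the slide
```

Each of the seven tuples is one i.i.d. observation of the five leaf variables of the HLC model.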
14
Tree Reconstruction
- Finding the ML phylogenetic tree == finding the ML HLC model
- Model space:
  - Model structures: binary trees where all variables share the same state space, which is known.
  - Parameterization: one parameter for each edge. (In general, P(x | y) has (|x| - 1)|y| free parameters, where |x| and |y| are the numbers of states.)
15
Bayesian Networks for Classification
- The problem:
  - Given data:

        A1  A2  …  An  C
        0   1   1  0   T
        1   0   1  1   F
        ..

  - Find a mapping
    - (A1, A2, …, An) → C
- Possible solutions
  - ANN
  - Decision tree (Quinlan)
  - …
16
Bayesian Networks for Classification
- Naïve Bayes model
  - From data, learn
    - P(C), P(Ai | C)
  - Classification
    - arg max_c P(C=c | A1=a1, …, An=an)
  - Very good in practice
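Under the Naïve Bayes assumption, P(C=c | a1, …, an) is proportional to P(c) times the product of the P(ai | c), so the arg max can be computed from counts. The following is a minimal sketch. The add-one smoothing, the function names, and the toy data are assumptions of this example and do not come from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, smoothing=1.0):
    """Collect the counts needed for P(C) and P(Ai | C)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values = defaultdict(set)             # attribute index -> values seen in training
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values, smoothing

def classify_nb(model, row):
    """Return arg max_c of log P(c) + sum_i log P(a_i | c)."""
    class_counts, value_counts, values, k = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for i, v in enumerate(row):
            # Smoothed estimate of P(Ai = v | C = c); smoothing is an assumption here.
            score += math.log((value_counts[(i, c)][v] + k) / (nc + k * len(values[i])))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data in the shape of the table from the problem statement.
rows = [(0, 1, 1, 0), (1, 0, 1, 1), (0, 1, 0, 0), (1, 0, 0, 1)]
labels = ["T", "F", "T", "F"]
model = train_nb(rows, labels)
prediction = classify_nb(model, (0, 1, 1, 0))
```

Working in log space avoids numerical underflow when the number of attributes is large.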
17
Bayesian Networks for Classification
- Drawback of NB:
  - Attributes are assumed mutually independent given the class variable
  - This is often violated, leading to double counting of evidence.
- Fixes:
  - General BN classifiers
  - Tree augmented Naïve Bayes (TAN) models
  - Hierarchical NB
  - …
18
Bayesian Networks for Classification
- General BN classifier
  - Treat the class variable as just another variable
  - Learn a BN.
  - Classify the next instance based on the values of the variables in the Markov blanket of the class variable.
  - Pretty bad, because it does not utilize all available information
19
Bayesian Networks for Classification
- TAN model
  - Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:131-163.
  - Capture dependence among attributes using a tree structure.
  - During learning,
    - First learn a tree among the attributes: use the Chow-Liu algorithm
    - Add the class variable and estimate parameters
  - Classification
    - arg max_c P(C=c | A1=a1, …, An=an)
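The tree-learning step can be sketched as a maximum-weight spanning tree over the attributes. As I understand the cited paper, the edge weights in TAN are conditional mutual information values I(Ai; Aj | C) estimated from data. The sketch below follows that reading and uses Prim's algorithm for the spanning tree. The function names and toy data are hypothetical, and parameter estimation is left out.

```python
import math
from collections import Counter

def cond_mutual_info(rows, labels, i, j):
    """Empirical I(Ai; Aj | C) from counts."""
    n = len(rows)
    c_cnt = Counter(labels)
    ic = Counter((r[i], c) for r, c in zip(rows, labels))
    jc = Counter((r[j], c) for r, c in zip(rows, labels))
    ijc = Counter((r[i], r[j], c) for r, c in zip(rows, labels))
    total = 0.0
    for (a, b, c), n_abc in ijc.items():
        # P(a, b, c) * log( P(a, b | c) / (P(a | c) * P(b | c)) )
        total += (n_abc / n) * math.log(n_abc * c_cnt[c] / (ic[(a, c)] * jc[(b, c)]))
    return total

def tan_attribute_tree(rows, labels, root=0):
    """Return {child: parent} edges of a maximum-weight spanning tree over attributes."""
    m = len(rows[0])
    weight = {(u, v): cond_mutual_info(rows, labels, u, v)
              for u in range(m) for v in range(u + 1, m)}
    parent, in_tree = {}, {root}
    while len(in_tree) < m:
        # Prim's step: heaviest edge from the current tree to an attribute outside it.
        _, u, v = max((weight[min(u, v), max(u, v)], u, v)
                      for u in in_tree for v in range(m) if v not in in_tree)
        parent[v] = u
        in_tree.add(v)
    return parent

# Hypothetical data: attribute 1 copies attribute 0; attribute 2 is unrelated.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 2
labels = ["T"] * 4 + ["F"] * 4
parent = tan_attribute_tree(rows, labels)
```

In the resulting TAN structure, each attribute keeps its tree parent and additionally gets the class variable as a parent. In the toy data the strongly dependent pair (attributes 0 and 1) ends up connected.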
20
Bayesian Networks for Classification
- Hierarchical Naïve Bayes models
  - N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.
  - Capture dependence among attributes using latent variables
  - Detect interesting latent structures besides classification
  - Currently, slow