Advanced Algorithms and Models for Computational Biology -- a machine learning approach Molecular Ecolution: Phylogenetic trees Eric Xing Lecture 21, April.

Slides:



Advertisements
Similar presentations
Hidden Markov Model in Biological Sequence Analysis – Part 2
Advertisements

Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
HMM II: Parameter Estimation. Reminder: Hidden Markov Model Markov Chain transition probabilities: p(S i+1 = t|S i = s) = a st Emission probabilities:
Probabilistic Modeling of Molecular Evolution Using Excel, AgentSheets, and R Jeff Krause (Shodor)
Hidden Markov Models. A Hidden Markov Model consists of 1.A sequence of states {X t |t  T } = {X 1, X 2,..., X T }, and 2.A sequence of observations.
Phylogenetic Trees Lecture 4
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Hidden Markov Models Eine Einführung.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Markov Chains Lecture #5
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
… Hidden Markov Models Markov assumption: Transition model:
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Hidden Markov Models Lecture 5, Tuesday April 15, 2003.
Molecular Evolution with an emphasis on substitution rates Gavin JD Smith State Key Laboratory of Emerging Infectious Diseases & Department of Microbiology.
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Phylogeny Tree Reconstruction
Probabilistic Approaches to Phylogeny Wouter Van Gool & Thomas Jellema.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
CISC667, F05, Lec16, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (III) Probabilistic methods.
Probabilistic methods for phylogenetic trees (Part 2)
Phylogeny Tree Reconstruction
. Class 5: Hidden Markov Models. Sequence Models u So far we examined several probabilistic model sequence models u These model, however, assumed that.
Deepak Verghese CS 6890 Gene Finding With A Hidden Markov model Of Genomic Structure and Evolution. Jakob Skou Pedersen and Jotun Hein.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
BINF6201/8201 Molecular phylogenetic methods
Molecular phylogenetics
Tree Inference Methods
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Calculating branch lengths from distances. ABC A B C----- a b c.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Comp. Genomics Recitation 8 Phylogeny. Outline Phylogeny: Distance based Probabilistic Parsimony.
The famous “sprinkler” example (J. Pearl, Probabilistic Reasoning in Intelligent Systems, 1988)
Modeling Molecular Substitution Von Bing Yap Statistics Department, UC Berkeley
Rooting Phylogenetic Trees with Non-reversible Substitution Models Von Bing Yap* and Terry Speed § *Statistics and Applied Probability, National University.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
EVOLUTIONARY HMMS BAYESIAN APPROACH TO MULTIPLE ALIGNMENT Siva Theja Maguluri CS 598 SS.
Phylogeny Ch. 7 & 8.
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
Sequence Alignment.
Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore.
CS Statistical Machine learning Lecture 25 Yuan (Alan) Qi Purdue CS Nov
Probabilistic methods for phylogenetic tree reconstruction BMI/CS 576 Colin Dewey Fall 2015.
Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
Building Phylogenies Maximum Likelihood. Methods Distance-based Parsimony Maximum likelihood.
Modelling evolution Gil McVean Department of Statistics TC A G.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Graphical Models for Segmenting and Labeling Sequence Data Manoj Kumar Chinnakotla NLP-AI Seminar.
Maximum likelihood (ML) method
Multiple Alignment and Phylogenetic Trees
Models of Sequence Evolution
Hidden Markov Models Part 2: Algorithms
The Most General Markov Substitution Model on an Unrooted Tree
Hidden Markov Model Lecture #6
Presentation transcript:

Advanced Algorithms and Models for Computational Biology -- a machine learning approach Molecular Ecolution: Phylogenetic trees Eric Xing Lecture 21, April 5, 2006 Reading: DTW book, Chap 12 DEKM book, Chap 7, 8

A pair of homologous bases Typically, the ancestor is unknown. ancestor A C QhQh QmQm T years ?

How does sequence variation arise? Mutation: (a) Inherent: DNA replication errors are not always corrected. (b) External: exposure to chemicals and radiation. Selection: Deleterious mutations are removed quickly. Neutral and rarely, advantageous mutations, are tolerated and stick around. Fixation: It takes time for a new variant to be established (having a stable frequency) in a population.

Modeling DNA base substitution Strictly speaking, only applicable to regions undergoing little selection. Standard assumptions (sometimes weakened)  Site independence.  Site homogeneity.  Markovian: given current base, future substitutions independent of past.  Temporal homogeneity: stationary Markov chain.

More assumptions Q h = s h Q and Q m = s m Q, for some positive s h, s m, and a rate matrix Q. The ancestor is sampled from the stationary distribution  of Q. Q is reversible: for a, b, t  0  (a)P(t,a,b) = P(t,b,a)  (b), (detailed balance).

The stationary distribution A probability distribution  on {A,C,G,T} is a stationary distribution of the Markov chain with transition probability matrix P = P(i,j), if for all j,  i  (i) P(i,j) =  (j). Exercise. Given any initial distribution, the distribution at time t of a chain with transition matrix P converges to  as t  . Thus,  is also called an equilibrium distribution. Exercise. For the Jukes-Cantor and Kimura models, the uniform distribution is stationary. (Hint: diagonalize their infinitesimal rate matrices.) We often assume that the ancestor sequence is i.i.d .

Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is pretty certain – avoid areas with (too many) gaps Major methods: Parsimony phylogeny methods Likelihood methods Phylogeny methods

ancestor ~  A C Q Q s h T PAMs s m T PAMs New picture

Joint probability of A and C Under the model in the previous slides, the joint probability is where t = s h T+ s m T is the (evolutionary) distance between A and C. Note that s h, s m and T are not identifiable. The matrix F(t) is symmetric. It is equally valid to view A as the ancestor of C or vice versa.

Estimating the evolutionary distance between two sequences Suppose two aligned protein sequences a 1 …a n and b 1 …b n are separated by t PAMs. Under a reversible substitution model that is IID across sites, the likelihood of t is where c(a,b) = # {k : a k = a, b k = b}. Maximizing this quantity gives the maximum likelihood estimate of t. This generalizes the distance correction with Jukes-Cantor.

Phylogeny The shaded nodes represent the observed nucleotides at a given site for a set of organisms The unshaded nodes represent putative ancestral nucleotides Transitions between nodes capture the dynamic of evolution

A tree, with branch lengths, and the data at a single site. Since the sites evolve independently on the same tree, ACAGTGACGCCCCAAACGT ACAGTGACGCTACAAACGT CCTGTGACGTAACAAACGA CCTGTGACGTAGCAAACGA t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 t8t8 t7t7 Likelihood methods

t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 t8t8 t7t7 x y z w Likelihood at one site on a tree We can compute this by summing over all assignments of states x, y, z and w to the interior nodes: Due to the Markov property of the tree, we can factorize the complete likelihood according to the tree topology: Summing this up, there are 256 terms in this case!

( ()) Getting a recursive algorithm when we move the summation signs as far right as possible:

Felsenstein’s Pruning Algorithm To calculate P(x 1, x 2, …, x N | T, t) Initialization: Set k = 2N – 1 Recursion: Compute P(L k | a) for all a   If k is a leaf node: Set P(L k | a) = 1(a = x k ) If k is not a leaf node: 1. Compute P(L i | b), P(L j | b) for all b, for daughter nodes i, j 2. Set P( L k | a) =  b, c P(b | a, t i )P(L i | b) P(c | a, t j ) P(L j | c) Termination: Likelihood at this column = P(x 1, x 2, …, x N | T, t) =  a P(L 2N-1 | a)P(a) This algorithm can easily handle Ambiguity and error in the sequences (how?) A c A b a i j k a k

Finding the ML tree So far I have just talked about the computation of the likelihood for one tree with branch lengths known. To find a ML tree, we must search the space of tree topologies, and for each one examined, we need to optimize the branch lengths to maximize the likelihood.

Bayesian phylogeny methods Bayesian inference has been applied to inferring phylogenies (Rannala and Yang, 1996;Mau and Larget, 1997; Li, Pearl and Doss, 2000). All use a prior distribution on trees. The prior has enough influence on the result that its reasonableness should be a major concern. In particular, the depth of the tree may be seriously affected by the distribution of depths in the prior. All use Markov Chain Monte Carlo (MCMC) methods. They sample from the posterior distribution. When these methods make sense they not only get you a point estimate of the phylogeny, they get you a distribution of possible phylogenies.

Modeling rate variation among sites A G A G A C A A A A A G A A A T A G...

A model of variation in evolutionary rates among sites The basic idea is that the rate at each site is drawn independently from a distribution of rates. The most widely used choice is the Gamma distribution, which has density function: Gamma distributions ( ,  )

Unrealistic aspects of the model: There is no reason, aside from mathematical convenience, to assume that the Gamma is the right distribution. A common variation is to assume there is a separate probability f0 of having rate 0. Rates at different sites appear to be correlated, which this model does not allow. Rates are not constant throughout evolution, they change with time.

Rates varying among sites If L (i) ( r i ) is the likelihood of the tree for site i given that the rate of evolution at site i is r i, we can integrate this over a gamma density: so that the overall likelihood is Unfortunately these integrals cannot be evaluated for trees with more than a few tips as the quantities L (i) ( r i ) becomes complicated.

There are a finite number of rates (denote rate i as r i ). There are probabilities p i of a site having rate i. A process not visible to us ("hidden") assigns rates to sites. The probability of our seeing some data are to be obtained by summing over all possible combinations of rates, weighting appropriately by their probabilities of occurrence. Modeling rate variation among sites

AAAAAAAAAAAAAAAAA CGTAGAAAAGAGTCAAT iiiiiie1e1 e2e2 e3e3 e4e4 e5e5 e6e6 e7e7 e8e8 iii... Y1Y1 Y2Y2 Y3Y3 YTYT X1X1 X2X2 X3X3 XTXT Rocall the HMM The shaded nodes represent the observed nucleotides at particular sites of an organism's genome For discrete Y i, widely used in computational biology to represent segments of sequences gene finders and motif finders profile models of protein domains models of secondary structure

Definition (of HMM) Observation space Alphabetic set: Euclidean space: Index set of hidden states Transition probabilities between any two states or Start probabilities Emission probabilities associated with each state or in general: AAAA x2x2 x3x3 x1x1 xTxT y2y2 y3y3 y1y1 yTyT... Graphical model K 1 … 2 State automata

A G A G A C A A A A A G A A A T A G... Hidden Markov Phylogeny Replacing the standard emission model with a tree A process not visible to us (.hidden") assigns rates to sites. It is a Markov process working along the sequence. For example it might have transition probability Prob ( j | i ) of changing to rate j in the next site, given that it is at rate i in this site. These are the most widely used models allowing rate variation to be correlated along the sequence.

The Forward Algorithm We can compute for all k, t, using dynamic programming! Initialization: Iteration: Termination:

The Backward Algorithm We can compute for all k, t, using dynamic programming! Initialization: Iteration: Termination:

A G A G A C A A A A A G A A A T A G... Hidden Markov Phylogeny this yields a gene finder that exploits evolutionary constraints

Based on sequence data from primate species, McAuliffe et al (2003) obtained sensitivity of 100%, with a specificity of 89%. Genscan (state-of-the-art gene finder) yield a sensitivity of 45%, with a specificity of 34%. A Comparison of comparative genomic gene-finding and isolated gene-finding

Observation: Finding a good phylogeny will help in finding the genes. Finding the genes will help to find biologically meaningful phylogenetic trees Which came first, the chicken or the egg? Open questions (philosophical)

Open questions (technical) How to learn a phylogeny (topology and transition prob.)? Should different site use the same phylogeny? Function- specific phylogeny? Other evolutionary events: duplication, rearrangement, lateral transfer, etc.

Acknowledgments Terry Speed: for some of the slides modified from his lectures at UC Berkeley Phil Green and Joe Felsenstein: for some of the slides modified from his lectures at Univ. of Washington