1 Molecular evolution, cont. Estimating rate matrices Lecture 15, Statistics 246 March 11, 2004

2 REVIEW: The Jukes-Cantor model (1969)

    Q = ( -3α   α    α    α  )        P(t) = ( r  s  s  s )
        (  α   -3α   α    α  )               ( s  r  s  s )
        (  α    α   -3α   α  )               ( s  s  r  s )
        (  α    α    α   -3α )               ( s  s  s  r )

    r = (1 + 3e^(−4αt))/4,   s = (1 − e^(−4αt))/4.

[Figure: human (now) and orang (now) joined to their common ancestor, each branch t time units long.] Consider e.g. the 2nd position in the α-globin Alu 1 repeat.

3 REVIEW: The Jukes-Cantor adjustment

[Figure: the same tree, with G observed in orang and C in human, still at the 2nd position of the α-globin Alu 1 repeat.]

Assume that the common ancestor has A, G, C or T, each with probability 1/4. Then the chance of the two nucleotides differing is

    p≠ = (3/4)(1 − e^(−8αt)) = (3/4)(1 − e^(−4k/3)),   since k = 2 × 3αt.

Solving for k estimates the distance in PAMs. [Figure: p≠ plotted against t, rising towards the asymptote 3/4.]
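As a small numerical illustration of this adjustment (not from the lecture; the 30% figure below is made up), the correction can be inverted in a couple of lines of Python:

    import numpy as np

    def jukes_cantor_distance(p_diff):
        """Invert p_diff = (3/4) * (1 - exp(-4k/3)) to recover the corrected
        distance k, the expected number of substitutions per site over
        both branches."""
        if p_diff >= 0.75:
            raise ValueError("observed proportion of differences must be < 3/4")
        return -0.75 * np.log(1.0 - 4.0 * p_diff / 3.0)

    # e.g. two aligned sequences differing at 30% of sites (illustrative)
    print(jukes_cantor_distance(0.30))   # ~ 0.38 substitutions per site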

4 REVIEW: Estimating the evolutionary distance between two sequences

Suppose two aligned protein sequences a_1…a_n and b_1…b_n are separated by t PAMs. Under a reversible substitution model that is i.i.d. across sites, the likelihood function of t is

    L(t) = pr(a_1…a_n, b_1…b_n | model) = ∏_k F(t, a_k, b_k) = ∏_{a,b} F(t,a,b)^c(a,b),

where c(a,b) = #{k : a_k = a, b_k = b}, and F(t,a,b) = π(a)P(t,a,b) = π(b)P(t,b,a) = F(t,b,a). Maximizing this quantity in t with F known gives the maximum likelihood estimate of t. This generalizes the Jukes-Cantor distance correction.
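A minimal numerical sketch of this ML distance estimate, assuming a calibrated reversible rate matrix Q and its stationary distribution π are already in hand (the function name and the optimizer bounds below are my own choices, not the lecturer's code):

    import numpy as np
    from scipy.linalg import expm
    from scipy.optimize import minimize_scalar

    def ml_distance(counts, Q, pi, t_max=10.0):
        """counts[a, b] = c(a, b), the pair frequency table of the two aligned
        sequences; Q a reversible rate matrix with stationary distribution pi.
        Maximizes sum_{a,b} c(a,b) * log F(t,a,b) over t."""
        def neg_log_lik(t):
            F = pi[:, None] * expm(t * Q)     # F(t,a,b) = pi(a) P(t,a,b)
            return -np.sum(counts * np.log(F))
        res = minimize_scalar(neg_log_lik, bounds=(1e-6, t_max), method="bounded")
        return res.x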

5 Acknowledgement Von Bing Yap, for joint work summarized here and in the previous and next lecture.

6 From aligned DNA or protein sequences to evolutionary trees

The starting point for a molecular phylogenetic analysis is a set of sequences, almost always aligned. The end result is almost always a tree. Along the way, attention needs to be paid to the substitution process operating in the sequences, and to possible rate variation along the sequences and down the tree. The two main approaches to tree building are a) distance-based methods, which work from pairwise distances between the sequences, and b) character-based methods, which work directly from the multiply aligned sequences. We’ll briefly mention both, referring you to the literature for fuller details. Both make use of rate matrices.

7 Building trees: distance methods

There are many ways of building trees using distance methods. All start by computing the pairwise distances between the sequences that will sit at the tips of the tree, usually along the lines we discussed in the last lecture, i.e. as ML distances using a rate matrix. One of the oldest distance methods, still widely used though rather discredited in the molecular evolutionary context, is UPGMA, which stands for unweighted pair group method with arithmetic means. It is easy to understand quickly, and so I will describe it; I don’t recommend it. A more recent and much more satisfactory method in molecular evolution is the neighbour-joining approach, abbreviated NJ. It takes longer to explain, so I won’t give it here. There are many places where the details of this and other methods are given, including Durbin et al (1998) and the recent excellent book by the master: Joseph Felsenstein, Inferring Phylogenies, Sinauer, 2004.
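Since the slide only names UPGMA, here is a minimal Python sketch of the algorithm for readers who want the mechanics (my own illustrative implementation, not the lecturer's; the toy distances at the end are made up):

    import numpy as np

    def upgma(D, labels):
        """UPGMA clustering. D: symmetric matrix of pairwise distances,
        labels: leaf names. Returns nested tuples (left, right, height)."""
        clusters = {i: (labels[i], 1) for i in range(len(labels))}   # id -> (subtree, size)
        dist = {(i, j): D[i][j] for i in clusters for j in clusters if i < j}
        next_id = len(labels)
        while len(clusters) > 1:
            i, j = min(dist, key=dist.get)          # closest pair of clusters
            height = dist[(i, j)] / 2.0
            (ti, ni), (tj, nj) = clusters[i], clusters[j]
            for k in clusters:                      # arithmetic-average distance update
                if k in (i, j):
                    continue
                dik = dist[(min(i, k), max(i, k))]
                djk = dist[(min(j, k), max(j, k))]
                dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
            clusters.pop(i); clusters.pop(j)
            dist = {p: d for p, d in dist.items() if i not in p and j not in p}
            clusters[next_id] = ((ti, tj, height), ni + nj)
            next_id += 1
        (tree, _), = clusters.values()
        return tree

    # toy example with made-up distances, purely illustrative
    D = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
    print(upgma(D, ["A", "B", "C"]))   # ('C', ('A', 'B', 1.0), 3.0)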

8 Beta-globins revisited

[Multiple alignment of beta-globin amino-acid sequences: BG-human (the reference, beginning M V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q …), BG-macaque, BG-bovine, BG-platypus, BG-chicken and BG-shark. In the non-reference rows, "." means same as the reference sequence and "-" means a deletion.]

9 UPGMA tree for beta-globins

[Figure: UPGMA tree with tips BG-human, BG-macaque, BG-bovine, BG-platypus, BG-chicken and BG-shark.]

10 Neighbor-joining tree for globins

[Figure: neighbour-joining tree with tips alpha-human, beta-human, delta-human, gamma-human, epsilon-human and myo-human.]

11 Today’s main task

We will discuss three methods of estimating a calibrated reversible rate matrix Q, given aligned leaf sequences on multiple unrooted phylogenetic trees whose topologies and branch lengths are known. All three methods are consistent, in the sense that they are asymptotically unbiased as the sequences become infinitely long. Moreover, evolutionary distances between sequences are explicitly accounted for, unlike with the PAM method. The maximum likelihood (ML) method is natural for any parametric model, and has well-known theoretical properties. Maximum partial likelihood (MPL) is particularly well suited to Markov processes, and can be efficiently implemented via an EM algorithm. The resolvent method (RES) is quite a different technique.

In practice, the phylogenetic tree can also be estimated from the data. When the tree topology is known, e.g. when there are just 2 or 3 leaf nodes, the branch lengths can be estimated by ML, given the rate matrix. One can estimate both the rate matrix and the branch lengths by alternating between the two steps: estimate Q given the branch lengths; estimate the branch lengths given Q. Estimating the tree topology as well is a harder problem, on which much has been written.

12 Parametrization of reversible rate matrices

A transition matrix P(t) with stationary distribution π is reversible if for all a, b, π(a)P(a,b;t) = π(b)P(b,a;t). If P(t) = exp(tQ), then we can write P(t) ≈ I + tQ for small t, and conclude that if P is reversible, then for all a, b we have

    π(a)Q(a,b) = π(b)Q(b,a).   (*)

Since π(a) > 0 for all a may be readily assumed, write R(a,b) = π(a)^(−1) Q(b,a). Then (*) implies that R is symmetric (check), and so if we write Π for the diagonal matrix with the value π(a) in row-column a, we have proved that for a reversible chain, Q = RΠ for a unique symmetric matrix R.

Now let us see that such Q are diagonalizable. Let A be the square root of Π. Then Q = RΠ = A^(−1)(ARA)A, where ARA is symmetric. We can now write ARA = VΛV′, where V is orthogonal and Λ is diagonal. Then Q is seen to be diagonalizable: Q = A^(−1)VΛV′A. This makes lots of things easier when we deal with numerical work involving the class REV of reversible rate matrices.
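A short numpy sketch of this factorization and of how it pays off numerically (helper names are my own; only the off-diagonal entries of R are used, the diagonal of Q being fixed by the zero row-sum condition): it builds Q = RΠ and computes P(t) = exp(tQ) through the symmetric matrix ARA, A = Π^(1/2), so that only a symmetric eigendecomposition is needed.

    import numpy as np

    def reversible_Q(R, pi):
        """Q = R * diag(pi) off the diagonal, with Q(a,a) set so rows sum to zero."""
        Q = R * pi[None, :]                      # Q(a,b) = R(a,b) * pi(b)
        np.fill_diagonal(Q, 0.0)
        np.fill_diagonal(Q, -Q.sum(axis=1))
        return Q

    def transition_matrix(Q, pi, t):
        """P(t) = exp(tQ) via the symmetric matrix S = A Q A^(-1) = ARA, A = diag(sqrt(pi))."""
        A = np.sqrt(pi)
        S = (A[:, None] * Q) / A[None, :]        # similar to Q, and symmetric
        lam, V = np.linalg.eigh(S)               # S = V diag(lam) V'
        expS = (V * np.exp(t * lam)) @ V.T
        return (expS / A[:, None]) * A[None, :]  # undo the similarity transform

    # illustrative 4-state example (numbers made up)
    pi = np.array([0.1, 0.2, 0.3, 0.4])
    R = np.ones((4, 4))                          # any symmetric R with positive off-diagonals
    P = transition_matrix(reversible_Q(R, pi), pi, t=0.5)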

13 Maximum likelihood

Suppose that the ancestral sequences at the internal nodes are known. Take any leaf sequence for the root (cf. last lecture), let t_i be the length of the ith branch, and let N(i) be the frequency table for the pair of sequences connected by this branch: N(i,a,b) = |{ancestor = a, descendant = b}|. Denote the row sums of the frequency table corresponding to the root node by f. Then the probability of all the sequences is

    ∏_a π(a)^f(a)  ∏_i ∏_{a,b} P(t_i,a,b)^N(i,a,b).

By reversibility, this value is independent of the choice of root. In almost all applications, the ancestral sequences are unknown. Imputation by parsimony is biased when the sequences are not closely related, and even when they are, it is better to weight the unobserved sequences according to some probabilistic model. In principle we can sum the above probability over all possible ancestral sequences to get the probability of the observed sequences. Felsenstein has given an efficient algorithm for doing this, by recursively moving up the nodes to the root. It is a tree version of one of the HMM calculations.
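A sketch of Felsenstein's pruning recursion for a single site, using my own minimal tree encoding (an internal node is a tuple (left, t_left, right, t_right); a leaf is the integer index of its observed residue). This is only an illustration of "recursively moving up the nodes to the root", not the lecturer's code.

    import numpy as np
    from scipy.linalg import expm

    def conditional_likelihoods(node, Q, n_states=4):
        """L(a) = pr(observed leaves below node | state a at node)."""
        if isinstance(node, (int, np.integer)):    # leaf: indicator of the observed state
            L = np.zeros(n_states)
            L[node] = 1.0
            return L
        left, t_left, right, t_right = node
        Lleft = expm(t_left * Q) @ conditional_likelihoods(left, Q, n_states)
        Lright = expm(t_right * Q) @ conditional_likelihoods(right, Q, n_states)
        return Lleft * Lright                      # the two subtrees are conditionally independent

    def site_likelihood(root, Q, pi):
        """pr(observed leaf residues at one site) = sum_a pi(a) L_root(a)."""
        return pi @ conditional_likelihoods(root, Q, len(pi))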

14 Maximum likelihood, cont.

In order to maximize the likelihood, we need to choose a parametrization. A natural choice is to use the equilibrium frequencies π and the top off-diagonal elements of a symmetric matrix R in the factorization Q = RΠ, with constraints

    for all a, π(a) ≥ 0;   for all a < b, R(a,b) ≥ 0;   Σ_a π(a) = 1;   2 Σ_{a<b} π(a)π(b)R(a,b) = 1, say.

The last constraint is the calibration: it fixes the expected substitution rate −Σ_a π(a)Q(a,a), and hence the time scale. Even in the simplest case, where each tree consists of a pair of sequences, it is difficult to get closed-form expressions for the MLE, and so numerical maximization must be used.
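The calibration is easy to impose after the fact by rescaling; a tiny helper, under the assumed convention that a calibrated Q has expected rate 1 at stationarity:

    import numpy as np

    def calibrate(Q, pi):
        """Rescale Q so that -sum_a pi(a) Q(a,a), which equals
        2 * sum_{a<b} pi(a) pi(b) R(a,b), is 1."""
        rate = -np.sum(pi * np.diag(Q))
        return Q / rate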

15 Maximum likelihood, cont.

If the tree consists of just two sequences a and b, the likelihood function depends on the tree only through the evolutionary distance t between the sequences, and can be written out explicitly. Let N denote the frequency table for the pair (see two slides back). Then the likelihood has the multinomial form we saw for estimating the distance between two sequences, namely

    ∏_{a,b} F(t,a,b)^N(a,b),

where the matrix F(t) represents the joint distribution of homologous residues at distance t, given by F(t,a,b) = π(a)P(t,a,b) = π(b)P(t,b,a) = F(t,b,a). The symmetry of F(t) is a consequence of reversibility, and embodies the fact that any intermediate sequence between a and b may be regarded as the ancestor.

16 Maximum partial likelihood

For a given tree, instead of maximizing the probability of the leaf sequences, one can maximize their conditional probability given the ancestral sequence. This yields the maximum partial likelihood estimator (MPLE), which can be proved consistent, though less efficient, in comparison with the MLE. If all ancestral sequences are observed, and all branches have the same length t, then the partial likelihood is

    ∏_i ∏_{a,b} P(t,a,b)^N(i,a,b) = ∏_{a,b} P(t,a,b)^C(a,b),   where C(a,b) = Σ_i N(i,a,b).

The MPLE for P(t) is readily seen to be P^(t,a,b) = C(a,b) / Σ_b C(a,b), which is just Dayhoff’s P. Of course the two assumptions just made (observed ancestors, equal branch lengths) are unrealistic.
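The closed-form MPLE above is just row normalization of the pooled count table; as a two-line sketch (C a square numpy array of pooled ancestor-descendant counts):

    import numpy as np

    def mple_transition_matrix(C):
        """P_hat(t,a,b) = C(a,b) / sum_b C(a,b): Dayhoff-style row normalization."""
        C = np.asarray(C, dtype=float)
        return C / C.sum(axis=1, keepdims=True)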

17 Maximum partial likelihood, cont.

When the ancestral sequences are unknown, we may regard any leaf as the ancestor, since the substitution process is reversible. Fix a leaf sequence, and consider the relabelled tree with that leaf as root. Then summing over all possible intermediate sequences (internal nodes) yields the partial likelihood based on the observed sequences. As with ML, numerical maximization is necessary, but in this case the partial likelihood can be efficiently maximized using an EM algorithm, as shown by Holmes and Rubin (2002). We present the EM algorithm for two sequences generated by a discrete-time Markov chain. The extension to continuous time is straightforward. If time permits, we’ll discuss the tree (multiple sequence) version.

The EM algorithm, unlike ML, does not impose reversibility on its estimates. It works for any time-homogeneous substitution process on a rooted tree, provided the root sequence is observed. Further, the initial distribution need not be the stationary distribution. In practice, the root sequence is almost never observed, so by assuming reversibility the EM can be applied starting from any leaf. To ensure that the EM leads to a reversible rate matrix, the frequency tables are symmetrized at the start.

18 Pair EM, discrete-time case

Let P be a reversible transition matrix, with equilibrium and initial distribution π. Let the sequence at time s be X_{s,1}, …, X_{s,n}. Let t be a fixed positive integer, and suppose that only the sequences of length n at times 0 and t are observed. We denote the observed data by O, and the full data by F. For s = 1, …, t, let N(s) be the frequency table for transitions between times s−1 and s: N(s,a,b) = |{k : X_{s−1,k} = a, X_{s,k} = b}|. Then the partial likelihood for the full data (conditional on the sequence at time 0) is

    L(P; F) = ∏_{s=1}^{t} ∏_{a,b} P(a,b)^N(s,a,b).

19 PMLE, cont.

The key idea of the EM algorithm is that replacing the unobserved counts by their conditional expectations given the observed data, evaluated at the current estimate of P, gives a new estimate with a higher partial likelihood. The algorithm iterates between two steps:

E-step: given a current estimate P_0, compute the quantity G(P_1, P_0) = E_{π_0, P_0}{log L(P_1; F) | O};
M-step: maximize G(P_1, P_0) as a function of P_1.

By site independence, G(P_1, P_0) is a sum of similar terms. We fix a site, and suppose that X_0 = a, X_t = b. Its contribution to G(P_1, P_0) is then

    Σ_{s=1}^{t} Σ_{c,d} [ P_0^(s−1)(a,c) P_0(c,d) P_0^(t−s)(d,b) / P_0^t(a,b) ] log P_1(c,d),

the bracketed ratio being the conditional probability, under P_0, that the chain visits c at time s−1 and d at time s, given X_0 = a and X_t = b.

20 PMLE, almost completed

Summing these contributions over all sites gives the expected full-data counts

    C^(c,d) = E_{P_0}{ Σ_s N(s,c,d) | O },

and the M-step maximizes Σ_{c,d} C^(c,d) log P_1(c,d) subject to P_1 having unit row sums, giving the update P_1(c,d) = C^(c,d) / Σ_d C^(c,d), i.e. row normalization of the expected counts.

21 PMLE, completed

The high powers of P appearing in these conditional expectations are easily computed when Q, and hence P, is diagonalizable. If there are independent pairs of sequences separated by possibly different known times, then in the E-step G(P_1, P_0) is just the sum of the contributions from the pairs.
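Putting slides 17-21 together, here is a compact (and deliberately unoptimized) Python sketch of the discrete-time pair EM. It is my own illustration, not the Holmes-Rubin code; it uses repeated matrix multiplication rather than diagonalization for the powers of P, and it symmetrizes the observed table at the start as described on slide 17.

    import numpy as np

    def pair_em(C_obs, t, n_states=4, n_iter=100):
        """C_obs[a, b] = number of sites with X_0 = a and X_t = b (numpy array);
        t = number of discrete time steps separating the two observed
        sequences.  Returns an EM estimate of the one-step matrix P."""
        C_obs = (C_obs + C_obs.T) / 2.0                      # symmetrize (reversibility)
        P = np.full((n_states, n_states), 1.0 / n_states)    # crude starting value
        for _ in range(n_iter):
            powers = [np.eye(n_states)]                      # powers[s] = P^s
            for _ in range(t):
                powers.append(powers[-1] @ P)
            Pt = powers[t]
            C_exp = np.zeros_like(P)                         # expected one-step counts
            for a in range(n_states):
                for b in range(n_states):
                    if C_obs[a, b] == 0:
                        continue
                    for s in range(1, t + 1):
                        # pr(X_{s-1} = c, X_s = d | X_0 = a, X_t = b), as a matrix over (c, d)
                        w = (powers[s - 1][a, :, None] * P *
                             powers[t - s][:, b][None, :]) / Pt[a, b]
                        C_exp += C_obs[a, b] * w
            P = C_exp / C_exp.sum(axis=1, keepdims=True)     # M-step: row normalization
        return P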