CISC667, F05, Lec16, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (III) Probabilistic methods.

Slides:



Advertisements
Similar presentations
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Advertisements

Probabilistic Modeling of Molecular Evolution Using Excel, AgentSheets, and R Jeff Krause (Shodor)
Phylogenetic Trees Lecture 4
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
Lecture 6, Thursday April 17, 2003
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
CISC667, F05, Lec14, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I) Maximum Parsimony.
Heuristic alignment algorithms and cost matrices
CISC667, F05, Lec21, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction 3-Dimensional Structure.
Phylogeny Tree Reconstruction
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
Molecular Evolution with an emphasis on substitution rates Gavin JD Smith State Key Laboratory of Emerging Infectious Diseases & Department of Microbiology.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Phylogeny Tree Reconstruction
Probabilistic Approaches to Phylogeny Wouter Van Gool & Thomas Jellema.
Class 3: Estimating Scoring Rules for Sequence Alignment.
Sequence Alignments Revisited
Probabilistic methods for phylogenetic trees (Part 2)
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Phylogeny Tree Reconstruction
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Variants of HMMs. Higher-order HMMs How do we model “memory” larger than one time point? P(  i+1 = l |  i = k)a kl P(  i+1 = l |  i = k,  i -1 =
. Phylogenetic Trees Lecture 13 This class consists of parts of Prof Joe Felsenstein’s lectures 4 and 5 taken from:
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Phylogenetic analyses Kirsi Kostamo. The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
BINF6201/8201 Molecular phylogenetic methods
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Boltzmann Machine (BM) (§6.4) Hopfield model + hidden nodes + simulated annealing BM Architecture –a set of visible nodes: nodes can be accessed from outside.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
MOLECULAR PHYLOGENETICS Four main families of molecular phylogenetic methods :  Parsimony  Distance methods  Maximum likelihood methods  Bayesian methods.
1 Evolutionary Change in Nucleotide Sequences Dan Graur.
Construction of Substitution Matrices
Calculating branch lengths from distances. ABC A B C----- a b c.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Molecular Ecolution: Phylogenetic trees Eric Xing Lecture 21, April.
Phylogenetic Trees - Parsimony Tutorial #13
Construction of Substitution matrices
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Probabilistic methods for phylogenetic tree reconstruction BMI/CS 576 Colin Dewey Fall 2015.
Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Evolutionary Change in Sequences
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Genetic Algorithm. Outline Motivation Genetic algorithms An illustrative example Hypothesis space search.
Models for DNA substitution
Maximum likelihood (ML) method
3-Dimensional Structure
Variants of HMMs.
Modeling Signals in DNA
BNFO 602 Phylogenetics – maximum likelihood
BNFO 602 Phylogenetics Usman Roshan.
Boltzmann Machine (BM) (§6.4)
Presentation transcript:

CISC667, F05, Lec16, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (III) Probabilistic methods

CISC667, F05, Lec16, Liao2 Bayes rule revisited. P(model|data) = P(data|model) P(model)/P(data) model includes –our evolution theory, –a specific phylogenetic tree (topology and edge lengths), and –assignment of sequences to the tree leaves. Data: a set of sequences that are used to infer phylogeny. Let P(x|y, t) be the probability in that sequence x is evolved from an ancestral sequence y over an edge of length t. P( x 1, x 2, x 3, x 4, x 5 |T, t.) = P( x 1 | x 4, t 1 ) P( x 2 | x 4, t 2 ) P( x 3 | x 5, t 3 ) P( x 4 | x 5, t 4 ) P( x 5 ) A shorthand notation P( x · |T, t.) is used where x · stands for a set of sequences, and t. for edge lengths of the tree T. x 1 2 x 2 x 3 x 4 x 5 t 1 t 2 t 3 t 4

CISC667, F05, Lec16, Liao3 In general, if we know P(x|y, t) for any x, y, and t A tree T, and assignment of sequences to tree nodes, Then we can compute the likelihood for observing the sequences as they are assigned onto the tree leaves. Q: Given n, the number of leaves, there are (2n-3)!! different trees (plus many different ways to assign length to tree edges), which tree can best interpret the data? A: The tree that gives the maximum likelihood (ML). In practice, to implement the ML method, two issues we need to address 1.A model of evolution, which gives the conditional probabilities P(x|y, t) 2.Method to find the maximum likelihood. For this, any optimization method may be utilized, such as Descent gradient Simulated annealing Genetic algorithm

CISC667, F05, Lec16, Liao4 Models of evolution Independence assumption: mutations occur independently at different positions. Therefore, P(x|y, t) =  u P(x u |y u, t), where P(x u |y u, t) is the probability that residue x u in sequence x mutates to residue y u in sequence y over time t. multiplicative assumption: P(b|a, t +  t) =  c P(c|a, t)·P(b|c,  t). For DNA sequences, probabilities for all possible mutations among four nucleotides during a given time period t form a 4 by 4 matrix And, we have multiplicative property for these matrices S(t) ·S(  t) = S(t +  t). P(A|A, t)P(A|C, t)P(A|G, t)P(A|T, t) P(C|A, t)P(C|C, t)P(C|G, t)P(C|T, t) P(G|A, t) P(T|A, t) P(G|C, t) P(T|C, t) P(G|G, t)P(G|T, t) P(T|T, t)P(T|G, t) S(t) =

CISC667, F05, Lec16, Liao5 For each P(b|a, t) in the substitution matrix S(t), it is reasonable to assume that no mutation can occur at zero time interval: P(a|a, 0) = 1 P(b|a, 0) = 0. That is, Further, let’s assume that P(b|a,  t) over a infinitesimal interval  t is proportional to  t by a constant r, called mutation rate: P(b|a,  t) = r  t. Jukes-Cantor model for DNA sequences assumes that all nucleotide mutations have the same rate. That is, S(0) = 1-3r rr r r rr r r r r r r S(  t)=  t  I + R  t

CISC667, F05, Lec16, Liao6 Therefore, S(t +  t) = S(t) ·S(  t) = S(t) ·[I + R  t] [S(t +  t) - S(t)] /  t = S(t)·R S’(t) = S(t)·R when  t  0. Solving the differential equations, we have P(a|a, t) = ¼ (1+ 3e -4rt ) for a  {A,C, G, T} P(b|a, t) = ¼ (1- e -4rt ) for a  b, a and b  {A,C, G, T} In this model, when t = , P(b|a,  ) = ¼. That is, the nucleotide equilibrium frequencies are all equal.

CISC667, F05, Lec16, Liao7 Kimura model for DNA sequences assumes different rates for transitions (A  G and C  T) and transversions (A  T, G  T, A  C, and C  G). That is, Similar models are proposed for mutations among amino acids. If we were able to quantify the “time” as how many number mutations have occurred, the substitute matrices in those models would correspond to PAM matrices at respective times. 1-2r-s rs r r rs s r r s r r S(  t)=  t  I+R  t A C G T ACGTACGT

CISC667, F05, Lec16, Liao8 Maximum Likelihood The case of two sequences aligned with no gaps P( x 1, x 2 |T, t 1, t 2 ) =  u=1 to N P( x 1 u, x 2 u |T, t 1, t 2 ) Let x 1 = CCGGCCGCGCG x 2 = CGGGCCGCCCG P(C,C| T, t 1, t 2 ) = P(C|A, t 2 ) P(A|C, t 1 ) P(A) + P(C|C, t 2 ) P(C|C, t 1 ) P(C) + P(C|G, t 2 ) P(G|C, t 1 ) P(G) + P(C|T, t 2 ) P(T|C, t 1 ) P(T) = ¼ [3  ¼ (1- e -4rt1 )  ¼ (1- e -4rt2 ) + ¼ (1+3 e -4rt1 )  ¼ (1+3 e -4rt2 ) ] = 1/16 (1+3 e -4r(t1 +t2) ). P(C,G| T, t 1, t 2 ) = 1/16 (1- e -4r(t1 +t2) ). Therefore, for an alignment that has n1 identical sites and n2 mutational sites, we have P( x 1, x 2 |T, t 1, t 2 ) = 1/ 16 n1+n2  (1+3 e -4r(t1 +t2) ) n1  (1- e -4r(t1 +t2) ) n2, which is a function of edge lengths in tree T. t1t1 t2t2 x1x1 x2x2

CISC667, F05, Lec16, Liao9 In general, the probability can be computed by working up the tree from the leaves using post-order traversal. This is done by Felsenstein’s algorithm (1981). Once the probability is available, optimizing the assignments of edge lengths t in the tree amounts to Where t m is the tree length assignment that maximize the likelihood.  t  P tmtm = 0

CISC667, F05, Lec16, Liao10 How to optimize tree topology? –Discrete structure, therefore cannot take derivatives. Basic strategy for searching the tree space –A tree generation algorithm that can generate trees –Assess the likelihood Accept Reject A genetic algorithm implementation [Matsuda 1998]

CISC667, F05, Lec16, Liao11 Genetic algorithm Input –P, the population, –r: the fraction of population to be replaced, –f, a fitness, –ft, the fitness_threshold, –m: the rate for mutation. Initialize population (randomly) Evaluate: for each h in P, compute Fitness(h) While [Max h f(h)] < ft do 1.Select 2.Crossover 3.Mutate 4.Update P with the new generation Ps 5.Evaluate: f(h) for all h  P Return the h in P that has the best fitness

CISC667, F05, Lec16, Liao12 Key components for implementing genetic algorithms –Representing hypotheses (which are the trees here) New Hampshire format (A,(F,(G,(B,D)))) –Genetic operators Crossover: –Single –Two point –uniform Mutation: point –Fitness function: we use the likelihood computed for each tree using FCBDA

CISC667, F05, Lec16, Liao13 More advanced topics in phylogenetic analysis –Different heuristics for sampling the tree space Monte Carlo … –More realistic evolutionary models With gaps Non-uniform: different rates at different sites … –Using different data sets and reconciliation Sequences Gene positions [Nadeau & Taylor 1984, PNAS 81: , Pavzner, Sankoff, …] …