1 Building Phylogenetic Trees Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Management Providence University, Taiwan WWW:

Presentation transcript:


2 Phylogenetic Tree. Topology: bifurcating –Leaves: 1 … N –Internal nodes: N+1 … 2N-2 (Figure: a tree with a leaf, a branch, and an internal node labeled.)

3 Orthologues / Paralogues

4 Rooted / Unrooted Tree

5 Counting Trees

6 (2N - 5)!! = number of unrooted trees for N taxa; (2N - 3)!! = number of rooted trees for N taxa. (Figure: example trees on subsets of taxa A to F.)
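The two counting formulas can be checked numerically. The following is a minimal sketch; the function names are chosen here for illustration and do not come from the slides.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... ; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labeled leaves: (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating trees on n_taxa labeled leaves: (2N - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted bifurcating tree")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

The rooted count for N taxa equals the unrooted count for N + 1 taxa, since a root can be treated as one extra leaf.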

7 Rooting the tree: To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D. (Figure: unrooted tree on taxa A, B, C, D with the root position marked, next to the resulting rooted tree.)

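8 UPGMA -- Unweighted Pair Group Method with Arithmetic mean. The simplest method: a sequential clustering algorithm that assumes rate constancy among lineages (an assumption that is often violated).
Step 1: from the distance matrix with entries dAB, dAC, dBC, join the closest pair (here A and B) and place their common node at height dAB / 2.
Step 2: replace A and B by the cluster (AB), with d(AB)C = (dAC + dBC) / 2, and join (AB) with C at height d(AB)C / 2.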
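The clustering loop can be sketched in a few lines. This is an illustrative sketch, not the slides' own code: the function name `upgma`, the nested-tuple tree representation, and the input format (a dict keyed by pairs of names) are all assumptions. Cluster distances are averaged with weights equal to cluster sizes, which for two single leaves A and B reduces to (dAC + dBC) / 2.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster `names` by UPGMA.

    dist maps a pair of names (any 2-element iterable) to their distance.
    Returns (root, height): root is a nested tuple of names, and height maps
    every leaf and cluster to its height above the leaves.
    """
    size = {name: 1 for name in names}        # active clusters and their sizes
    height = {name: 0.0 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Join the closest pair of active clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        height[merged] = d[frozenset((a, b))] / 2
        # Size-weighted average distance from the new cluster to every other one.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, height


if __name__ == "__main__":
    # Hypothetical three-taxon distances, for illustration only.
    root, height = upgma(["A", "B", "C"], {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 6})
    print(root, height[root])
```

Because the heights are half the cluster distances, the resulting tree is always ultrametric, which is exactly where the rate-constancy assumption enters.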

9 UPGMA step 1: combine B and C

10 UPGMA step 2: combine BC and D; averaged distances (10+12)/2 and (4+6)/2

11 UPGMA step 3: combine A and E

12 UPGMA step 4: combine AE and BCD

13 UPGMA Result

14 UPGMA Result

15 UPGMA(1)

16 UPGMA(2)

17 UPGMA -- Illustrations

18 When UPGMA fails …

19 Neighbor Joining: –Very popular method. –Does not make the molecular clock assumption: a modified distance matrix is constructed to adjust for differences in the rate of evolution of each taxon. –Produces an unrooted tree. –Assumes additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them. –Like UPGMA, constructs the tree by sequentially joining subtrees.
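Whether a given distance matrix is additive can be tested directly with the four-point condition, a standard equivalent characterization: for every four leaves, the two largest of the three pairwise sums must be equal. The sketch below assumes a symmetric matrix given as a list of lists; the function name and tolerance are illustrative choices.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix d."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted(
            [d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]]
        )
        # The two largest sums must coincide for a tree metric.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical five-taxon matrix, for illustration only.
    d = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(is_additive(d))
```

The check enumerates all quadruples, so it runs in O(n^4) time; it is meant for validating small inputs rather than for tree building.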

20 Additivity

21 Naïve NJ by Additivity? There are O(n^2) candidate (i,j) pairs and O(n^2) (k,l) pairs; (k,l) “rejects” (i,j) whenever additivity fails, so picking one (i,j) neighbor pair takes O(n^4) time. Since O(n) pairs must be picked, O(n^5) time suffices in total.

22 Neighbor Joining: Once we know the correct (i,j) pair

23 Neighbor Joining: why not pick the smallest (i,j) pair?

24 Neighbor Joining (3)

25 Neighbor Joining: Algorithm

26 Neighbor-Joining: Algorithm (Figure labels: i, j, k, m.)

27 Neighbor-Joining: Complexity. Each iteration searches the O(n^2) matrix entries for the pair to join and then updates the distance matrix in at most O(n^2) time. Over the O(n) iterations this gives a total time complexity of O(n^3) and a space complexity of O(n^2).
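The cost per iteration is easiest to see in code. The sketch below uses the Studier and Keppler form of the selection criterion, Q(i,j) = (n - 2) d(i,j) - r(i) - r(j) with r the row sums; this is an assumption about which formulation to illustrate, and it may differ in notation from the S and R quantities on the slides. The function name, the edge-list output, and the internal node labels "u0", "u1", … are hypothetical (leaf names must not collide with them).

```python
def neighbor_joining(names, d):
    """Build an unrooted tree by neighbor joining.

    names: leaf labels; d: symmetric matrix (list of lists) in the same order.
    Returns a list of edges (node, node, branch_length).
    """
    nodes = list(names)
    dist = {(a, b): d[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums: O(n^2) work per iteration.
        r = {a: sum(dist[a, b] for b in nodes) for a in nodes}
        # Search all pairs for the minimum of the Q criterion: O(n^2).
        i, j = min(
            ((a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]),
            key=lambda p: (n - 2) * dist[p] - r[p[0]] - r[p[1]],
        )
        u = f"u{next_id}"
        next_id += 1
        # Branch lengths from i and j to the new internal node u.
        li = dist[i, j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dist[i, j] - li
        edges += [(i, u, li), (j, u, lj)]
        # Distances from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[u, k] = dist[k, u] = (dist[i, k] + dist[j, k] - dist[i, j]) / 2
        dist[u, u] = 0.0
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, dist[a, b]))
    return edges


if __name__ == "__main__":
    # Hypothetical additive five-taxon matrix, for illustration only.
    d = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    for edge in neighbor_joining(list("abcde"), d):
        print(edge)
```

Both the row sums and the pair search touch O(n^2) entries in each of the n - 2 iterations, which is where the O(n^3) total comes from; the dictionary of distances accounts for the O(n^2) space.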

28 Reasoning about the NJ Method: Where do the ideas of S_{i,j} and R_i come from? How correct is the algorithm? Is it a heuristic or an exact solution?

29 The “1-star” Sum of the Branch Lengths: D denotes the distance between OTUs and L the branch length between nodes. Each branch is counted N-1 times when all distances are added.

30 The “paired-2-star” Sum of the Branch Lengths

31 The “paired-2-star” Tree Size

32 The Distance and Branch Lengths between a Combined OTU and another One

33 Before the proof

34 Before the proof (Cont.)

35 Neighbor-Joining: The proof

36 Lemma

37 Lemma (Cont.)

38 Proof

39 Proof of the Theorem: by contradiction. Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. If m is a fifth OTU, then it joins the tree at a point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u. Type 1 summand: A = -2D_ux - 2D_uv. Type 2 summand: B = -4D_vx + 2D_uv. For the sum in formula (b) to be nonnegative, the OTUs of type 2 must outnumber those of type 1. (Figure: tree on i, j, k, l with nodes u, v, w, OTUs r, s, and attachment points x marking cases A and B.)

40 Proof of the theorem (Cont.) If m is of type 1, then the corresponding summand in formula (b) is -2D_ux - 2D_uv. If m is of type 2, then the corresponding summand in formula (b) is -4D_vx + 2D_uv. For the sum in formula (b) to be nonnegative, there must be at least as many terms corresponding to OTUs m of type 2 as there are terms corresponding to OTUs m of type 1. It follows that there are more OTUs that join the path from i to j at u than there are OTUs that join that path at all other nodes combined. Because neither i nor j has a neighbor, there must be a pair r, s of neighbors that joins the path from i to j at a node w that is different from u. By the above argument applied to w, there are more OTUs that join the path from i to j at w than there are OTUs that join that path at all other nodes combined. The conclusions about u and w contradict each other, and the theorem follows.

41 Speeding up Neighbor-Joining Tree Construction: In this paper, the authors present several heuristics for speeding up the NJ method. The heuristics attempt to reduce the search time by using a quad-tree. The worst-case time complexity remains O(n^3), and the space complexity after adding the quad-tree is still O(n^2). The authors have implemented a tool, QuickJoin.

42 Previous Work: The neighbor-joining method was introduced by Saitou and Nei. The algorithm was later amended by Studier and Keppler, with a running time of O(n^3). BIONJ -- Gascuel et al. produced an O(n^3) implementation of a variant of the NJ algorithm that produces more accurate trees in many cases. QuickTree -- Durbin et al. produced a code-optimized implementation of the NJ algorithm.

43 Appendix: Proof of neighbour-joining

44 +/- of distance methods. Advantages: –easy to perform –quick calculation –fit for sequences with high similarity scores. Disadvantages: –the sequences are not considered as such (loss of information) –all sites are generally treated equally (differences in substitution rates are not taken into account) –not applicable to distantly divergent sequences.

45 Parsimony

46 Maximum Parsimony Method. Principle: search for the tree that requires the smallest number of character state changes between the OTUs. Informative sites: those that favor some trees over others; operationally, a site with at least two different kinds of residues, each of which is found in at least two of the OTU sequences. (Table: example alignment, site by OTU: T C A G A T C T A G T T A G A A C T A G T T C G A T C G A G T T C T A A G G A C)
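The operational definition of an informative site translates directly into a small check. This is a minimal sketch under stated assumptions: the alignment is a list of equal-length strings, gap characters are treated like any other residue, and the function names are illustrative.

```python
from collections import Counter


def is_informative(column):
    """True if at least two residue kinds each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of the informative columns of an alignment."""
    return [i for i, column in enumerate(zip(*alignment)) if is_informative(column)]


if __name__ == "__main__":
    # Hypothetical four-sequence alignment, for illustration only.
    print(informative_sites(["AGA", "AGC", "GGC", "GTA"]))
```

A column such as A, A, A, G is variable but not informative, because the single G costs one change on every tree and therefore cannot favor one topology over another.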

47 Evaluating Parsimony Scores How do we compute the Parsimony score for a given tree? Traditional Parsimony –Each base change has a cost of 1 Weighted Parsimony –Each change is weighted by the score c(a,b)

48 Traditional Parsimony: solved independently for each position, with a linear-time solution. (Figure: leaves a, g, a; the internal node is labeled {a,g} and the root {a}.)
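A linear-time solution for one position is Fitch's set-based pass, sketched below. The tree encoding (a leaf is a one-character string, an internal node is a pair of subtrees) and the function name are assumptions made for this example.

```python
def fitch(tree):
    """Return (candidate_states, cost) for one position under unit-cost parsimony.

    tree is either a leaf state such as "a" or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change is needed.
        return common, left_cost + right_cost
    # Otherwise take the union and pay for one change.
    return left_states | right_states, left_cost + right_cost + 1


if __name__ == "__main__":
    states, cost = fitch((("a", "g"), "a"))
    print(states, cost)
```

For leaves a, g, a arranged as ((a, g), a), the inner node receives the set {a, g} at a cost of one change, and the root receives {a} with no further cost.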

49 Traditional Parsimony

50 Evaluating Weighted Parsimony: dynamic programming on the tree.
Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞.
Iteration: if k is a node with children i and j, then S(k,a) = min_b (S(i,b) + c(a,b)) + min_b (S(j,b) + c(a,b)).
Termination: the cost of the tree is min_a S(r,a), where r is the root.
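The recursion can be written almost verbatim. The sketch below assumes the same hypothetical tree encoding as a nested pair of subtrees with one-character leaves, and takes the cost function c(a,b) as a parameter; the transition/transversion costs in the usage example are invented for illustration.

```python
import math


def sankoff(tree, cost, alphabet="ACGT"):
    """Return {a: S(k, a)} for the root k of `tree` at a single position.

    tree is a leaf character or a pair (left_subtree, right_subtree);
    cost(a, b) is the weight of changing a into b along a branch.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed state, infinity otherwise.
        return {a: 0 if a == tree else math.inf for a in alphabet}
    left, right = (sankoff(child, cost, alphabet) for child in tree)
    # Iteration: best child state b for each parent state a, on both sides.
    return {
        a: min(left[b] + cost(a, b) for b in alphabet)
        + min(right[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony_score(tree, cost, alphabet="ACGT"):
    """Termination: the minimum of S(root, a) over all states a."""
    return min(sankoff(tree, cost, alphabet).values())


if __name__ == "__main__":
    purines = {"A", "G"}

    def ts_tv_cost(a, b):
        # Hypothetical weights: transitions cost 1, transversions cost 2.
        if a == b:
            return 0
        return 1 if (a in purines) == (b in purines) else 2

    print(weighted_parsimony_score((("A", "G"), "C"), ts_tv_cost))
```

With unit costs this reduces to the traditional parsimony score, which gives a convenient cross-check against the set-based method.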

51 Example
Aardvark A : CAGGTA
Bison B : CAGACA
Chimp C : CGGGTA
Dog D : TGCACT
Elephant E : TGCGTA

52 Cost of Evaluating Parsimony: The score is evaluated on each position independently, and the scores are then summed over all positions. If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk). By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node.

53 Weighted Parsimony

54 Traditional Parsimony is not “complete”

55 Parsimony: Searching over all trees by Branch and Bound

56 Assessing the trees: the bootstrap

57

58 Simultaneous alignment and phylogeny(1)

59 Inferring trees – Maximum Likelihood method: Maximum likelihood supposes a model of evolution along tree branches. Strategy: find the parameters (tree, branch lengths, substitution rate) that maximize the likelihood assigned to the data. Note: the model of evolution does not include indels! In the Phylip package: program PROTML

60 Probabilistic Methods The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. Background probabilities: q(a) Mutation probabilities: P(a|b, t) Models for evolutionary mutations –Jukes Cantor –Kimura 2-parameter model Such models are used to derive the probabilities

61 Jukes Cantor model A model for mutation rates Mutation occurs at a constant rate Each nucleotide is equally likely to mutate into any other nucleotide with rate a.

62 Kimura 2-parameter model Allows a different rate for transitions and transversions.

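63 Mutation Probabilities: The rate matrix R is used to derive the mutation probability matrix S(t), which is obtained by integration: S(t) = exp(Rt). For Jukes Cantor with rate a, the probability that a nucleotide is unchanged after time t is (1 + 3 exp(-4at)) / 4, and the probability of each specific change is (1 - exp(-4at)) / 4. q can be obtained by setting t to infinity, which gives q = 1/4 for every nucleotide.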
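For the Jukes Cantor model the integration has a well-known closed form, which the sketch below evaluates. The function name, the dictionary-of-dictionaries layout, and the default rate are illustrative assumptions; `alpha` is the rate at which a nucleotide changes into each specific other nucleotide.

```python
import math


def jukes_cantor(t, alpha=1.0, alphabet="ACGT"):
    """Return S(t) as a nested dict: S[a][b] = probability that a becomes b in time t."""
    decay = math.exp(-4 * alpha * t)
    same = 0.25 * (1 + 3 * decay)   # no net change
    diff = 0.25 * (1 - decay)       # change to one specific other nucleotide
    return {a: {b: same if a == b else diff for b in alphabet} for a in alphabet}


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 100.0):
        s = jukes_cantor(t)
        print(t, round(s["A"]["A"], 4), round(s["A"]["C"], 4))
```

At t = 0 the matrix is the identity, and as t grows every entry approaches 1/4, which is the background distribution q for this model.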

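64 Mutation Probabilities: Both models satisfy the following properties. Lack of memory: P(a|b, t+t') = Σ_c P(a|c, t) P(c|b, t'). Reversibility: there exist stationary probabilities {P_a} such that P_a P(b|a, t) = P_b P(a|b, t). (Figure: the four nucleotides A, G, T, C.)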

65 Probabilistic Approach: Given P, q, the tree topology and the branch lengths, we can compute the probability of an assignment to the nodes. (Figure: tree with nodes x1, x2, x3, x4, x5 and branch lengths t1, t2, t3, t4.)

66 Computing the Tree Likelihood: –We are interested in the probability of the observed data given the tree and the branch “lengths”. –It is computed by summing over the internal nodes. –This can be done efficiently using an upward traversal of the tree.

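67 Tree Likelihood Computation: Define P(L_k|a) = probability of the leaves below node k given that x_k = a.
Initialization: for leaves, P(L_k|a) = 1 if x_k = a, and 0 otherwise.
Iteration: if k is a node with children i and j, then P(L_k|a) = [Σ_b P(b|a, t_i) P(L_i|b)] · [Σ_c P(c|a, t_j) P(L_j|c)].
Termination: the likelihood is Σ_a q(a) P(L_root|a).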
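The upward pass is commonly known as Felsenstein's pruning algorithm. The sketch below is one way to write it for a single position, under several assumptions made for illustration: the Jukes Cantor model supplies P(b|a,t), the background q is uniform by default, and a tree is encoded as a leaf character or a pair ((left, t_left), (right, t_right)).

```python
import math


def jc_prob(child, parent, t, alpha=1.0):
    """P(child | parent, t) under the Jukes Cantor model with rate alpha."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if child == parent else 0.25 * (1 - decay)


def conditional_likelihoods(node, alphabet="ACGT"):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`."""
    if isinstance(node, str):
        # Initialization: an observed leaf has probability 1 for its own state.
        return {a: 1.0 if a == node else 0.0 for a in alphabet}
    (left, t_left), (right, t_right) = node
    p_left = conditional_likelihoods(left, alphabet)
    p_right = conditional_likelihoods(right, alphabet)
    # Iteration: sum over the child states on each side, then multiply.
    return {
        a: sum(jc_prob(b, a, t_left) * p_left[b] for b in alphabet)
        * sum(jc_prob(c, a, t_right) * p_right[c] for c in alphabet)
        for a in alphabet
    }


def tree_likelihood(root, q=None, alphabet="ACGT"):
    """Termination: weight the root's conditional likelihoods by q and sum."""
    q = q or {a: 1.0 / len(alphabet) for a in alphabet}
    p_root = conditional_likelihoods(root, alphabet)
    return sum(q[a] * p_root[a] for a in alphabet)


if __name__ == "__main__":
    # Hypothetical three-leaf tree with made-up branch lengths.
    tree = (((("A", 0.1), ("A", 0.2)), 0.05), ("G", 0.3))
    print(tree_likelihood(tree))
```

For a full alignment, the per-position likelihoods are multiplied (or their logarithms summed) under the assumption of independent positions.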

68 Maximum Likelihood (ML): Score each tree by its likelihood, computed under the assumption of independent positions. Branch lengths t can be optimized by –gradient ascent –EM. We look for the highest-scoring tree by –exhaustive search –sampling methods (Metropolis)

69 Optimal Tree Search: Perform a search over possible topologies T1, T2, T3, T4, …, Tn. For each topology, parametric optimization (EM) over the parameter space can end in local maxima.

70 Computational Problem Such procedures are computationally expensive! Computation of optimal parameters, per candidate, requires non-trivial optimization step. Spend non-negligible computation on a candidate, even if it is a low scoring one. In practice, such learning procedures can only consider small sets of candidate structures

71 Max Likelihood versus Parsimony (Example from BSA p. 225): Choose tree T, with unequal branch lengths. Generate 1000 sequences of length N according to the probabilistic model. (A) Reconstruction by ML (B) Reconstruction by Parsimony. (Tables: topologies T1, T2, T3 reconstructed for each N.) Conclusion: ML infers the right tree as N gets larger; Parsimony does not necessarily.

72 Max Likelihood versus NJ (Example from BSA p. 225): Choose tree T, with unequal branch lengths. Generate 1000 sequences of length N according to the probabilistic model. (A) Reconstruction by ML (B) Reconstruction by NJ. (Tables: topologies T1, T2, T3 reconstructed for each N.) Conclusion: ML infers the right tree as N gets larger. If the probabilistic model is correct, the ML distances should be very close to additive, and therefore the NJ method predicts the correct tree.

73 Phylip - practicalities
Menu-driven, no command line.
Input file format:
–First line: number of sequences and sequence length
–Next lines: the sequences. The first ten characters are the sequence name; the sequence follows. Spaces and newlines are allowed. Dashes (-) signify gaps.
Example:
4 46
hba1      MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERMFLSFPTTKTYFP
beta      MVHLTPEEKSAVTALWGKVN VDEVGGEALGRLLVVYPWTQRFFESF
Myoglobin -MGLSDGEWQLVLNVWGKVE ADIPGHGQEVLIRLFKGHPETLEKFD
Leghemogl MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS
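A reader for this layout can be sketched in a few lines. The sketch makes simplifying assumptions that are not part of the format description: each sequence sits on a single line (no continuation lines and no interleaving), the header holds exactly two integers, and the function name `parse_phylip` is hypothetical.

```python
def parse_phylip(text):
    """Parse a simple sequential PHYLIP-style alignment into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_seqs, length = (int(field) for field in lines[0].split())
    records = {}
    for line in lines[1:1 + n_seqs]:
        name = line[:10].strip()            # first ten characters: the name
        seq = "".join(line[10:].split())    # drop spaces inside the sequence
        if len(seq) != length:
            raise ValueError(f"{name}: expected {length} characters, got {len(seq)}")
        records[name] = seq
    if len(records) != n_seqs:
        raise ValueError(f"expected {n_seqs} sequences, got {len(records)}")
    return records


if __name__ == "__main__":
    # Hypothetical two-sequence input, for illustration only.
    text = "2 6\n" + "seqA".ljust(10) + "ACG TTA\n" + "seqB".ljust(10) + "AC-GTA\n"
    print(parse_phylip(text))
```

Checking the sequence length against the header catches the most common formatting mistake, a name that is not padded to exactly ten characters.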

74 The End