Phylogenetic Trees, Lecture 3. Based on: Durbin et al. 7.4; Gusfield 17.


2 Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (e.g. teeth structure) or molecular (e.g. homologous DNA sequences). One common approach is Maximum Parsimony.
Assumptions:
- Independence of characters (no interactions)
- The best tree is the one that requires the fewest changes

3 1. Maximum Parsimony
Input: four nucleotide sequences, AAG, AAA, GGA and AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree.
[Figure: an example tree over the four sequences; total #substitutions = 4.]

4 Example Continued
There are many possible trees.
[Figure: two example trees over the same sequences. Left tree: total #substitutions = 3. Right tree: total #substitutions = 4.]
The left tree is preferred over the right tree. The total number of changes is called the parsimony score.

5 Simple Example
- Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position
- The minimal tree has one evolutionary change
[Figure: tree with leaves C, C, C, T, T and a single change between C and T on one edge.]

6 Extension to Many Letters
What is the parsimony score of a tree over the following species?
Aardvark  A: CAGGTA
Bison     B: CAGACA
Chimp     C: CGGGTA
Dog       D: TGCACT
Elephant  E: TGCGTA
We do it character after character; each score is computed independently of the others.

7 Fitch’s Algorithm for Evaluating Trees
1. Traverse the tree from leaves to root, determining the set of possible states (e.g. nucleotides) for each internal node
2. Traverse the tree from root to leaves, picking ancestral states for the internal nodes

8 Fitch’s Algorithm – Step 1
# of changes = # of union operations
[Figure: example tree with leaf states and the state sets computed at its internal nodes.]

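9 Fitch’s Algorithm – Step 1
- Do a post-order (from leaves to root) traversal of the tree
- Determine the possible states $R_i$ of internal node $i$ with children $j$ and $k$: $R_i = R_j \cap R_k$ if $R_j \cap R_k \neq \emptyset$, otherwise $R_i = R_j \cup R_k$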
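
The bottom-up pass can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the nested-tuple tree encoding, the function name `fitch_sets` and the example topology are all assumptions made here. It uses the five-species example (three ‘C’, two ‘T’) and counts one change per union operation.

```python
def fitch_sets(tree, leaf_states):
    """Post-order pass of Fitch's algorithm for one character.

    tree: a leaf name (str) or a pair (left, right) of subtrees (assumed encoding).
    Returns (set of possible states at this node, number of union operations).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    r_j, cost_j = fitch_sets(left, leaf_states)
    r_k, cost_k = fitch_sets(right, leaf_states)
    common = r_j & r_k
    if common:
        # Children agree on at least one state: no change needed here.
        return common, cost_j + cost_k
    # Children disagree: take the union and count one change.
    return r_j | r_k, cost_j + cost_k + 1


# Hypothetical topology over five species; three carry C and two carry T.
tree = (("A", "B"), ("C", ("D", "E")))
states = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
root_set, changes = fitch_sets(tree, states)
print(root_set, changes)
```

On this topology the pass performs a single union, which matches the one evolutionary change of the simple example.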

10 Fitch’s Algorithm – Step 2
[Figure: the example tree from Step 1, shown in successive frames as states are selected from the root down to the leaves.]

11 Fitch’s Algorithm – Step 2
- Do a pre-order (from root to leaves) traversal of the tree
- Select the state $r_j$ of internal node $j$ with parent $i$: $r_j = r_i$ if $r_i \in R_j$, otherwise an arbitrary state in $R_j$
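
Both passes together can be sketched as follows. The dictionary-based tree encoding, the node names and the tie-breaking rule (`min` of the state set, chosen only to make the output deterministic) are assumptions of this sketch rather than part of the lecture.

```python
def fitch(children, leaf_states, root):
    """Two-pass Fitch algorithm for one character.

    children: internal node -> (left child, right child) (assumed encoding).
    leaf_states: leaf -> observed state.
    Returns (state assigned to every node, number of changes).
    """
    sets = {}
    changes = 0

    def up(node):
        # Step 1: post-order pass computing the candidate state sets R_i.
        nonlocal changes
        if node in leaf_states:
            sets[node] = {leaf_states[node]}
            return
        j, k = children[node]
        up(j)
        up(k)
        common = sets[j] & sets[k]
        if common:
            sets[node] = common
        else:
            sets[node] = sets[j] | sets[k]
            changes += 1

    assigned = {}

    def down(node, parent_state):
        # Step 2: keep the parent's state when allowed, else pick any state.
        if parent_state in sets[node]:
            assigned[node] = parent_state
        else:
            assigned[node] = min(sets[node])  # arbitrary but deterministic
        for child in children.get(node, ()):
            down(child, assigned[node])

    up(root)
    down(root, None)
    return assigned, changes


children = {"r": ("u", "v"), "u": ("A", "B"), "v": ("C", "w"), "w": ("D", "E")}
states = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
assigned, changes = fitch(children, states, "r")
print(assigned, changes)
```

The number of edges whose endpoints receive different states equals the count returned by the first pass, which is a quick sanity check on any assignment produced this way.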

12 Weighted Version of Fitch’s Algorithm
- Instead of assuming all state changes are equally likely, use different costs $c(a, b)$ for different changes
- The first step of the algorithm is to propagate costs up through the tree

13 Weighted Version of Fitch’s Algorithm
Want to determine the minimal cost $S(i, a)$ of assigning character $a$ to node $i$.
For leaves: $S(i, a) = 0$ if leaf $i$ has character $a$, and $S(i, a) = \infty$ otherwise.

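14 Weighted Version of Fitch’s Algorithm
Want to determine the minimal cost $S(i, a)$ of assigning character $a$ to node $i$.
For an internal node $i$ with children $j$ and $k$: $S(i, a) = \min_b \big(S(j, b) + c(a, b)\big) + \min_b \big(S(k, b) + c(a, b)\big)$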

15 Weighted Version of Fitch’s Algorithm – Step 2
- Do a pre-order (from root to leaves) traversal of the tree
- Select the minimal-cost character for the root
- For each internal node $j$, select the character that produced the minimal cost at its parent $i$

16 Weighted Parsimony Scores
Weighted parsimony score: each change is weighted by a score $c(a, b)$. The weighted parsimony score reduces to the parsimony score when $c(a, a) = 0$ and $c(a, b) = 1$ for all $b \neq a$.

17 Evaluating Weighted Parsimony Scores
Each position is independent and is scored by itself. Use dynamic programming on a given tree.
- $S(i, a)$ is the minimum score of the subtree rooted at $i$ when $i$ has character $a$
- If $i$ is a node with children $j$ and $k$, then $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$
[Figure: node $i$ with children $j$ and $k$, annotated with $S(j, x)$, $S(k, y)$ and $S(i, a)$.]

18 Evaluating Parsimony Scores
Dynamic programming on a given tree:
- Initialization: for each leaf $i$, set $S(i, a) = 0$ if $i$ is labeled by $a$, otherwise $S(i, a) = \infty$
- Iteration: if $i$ is a node with children $j$ and $k$, then $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$
- Termination: the cost of the tree is $\min_x S(r, x)$, where $r$ is the root
Comment: To reconstruct an optimal assignment, we need to keep, in each node $i$ and for each character $a$, the two characters $x$, $y$ that attain the minimum when $i$ has character $a$.
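
The recurrence can be written almost literally as a recursive function. The sketch below is illustrative only: the nested-tuple tree, the names `sankoff` and `unit_cost`, and the example data are assumptions, and no back-pointers are stored, so it returns the score but not the optimal assignment.

```python
import math


def sankoff(tree, leaf_states, alphabet, cost):
    """Return {a: S(i, a)} for the root i of a nested-tuple tree (one character)."""
    if isinstance(tree, str):
        # Initialization: 0 for the observed state, infinity for the others.
        return {a: (0 if leaf_states[tree] == a else math.inf) for a in alphabet}
    left, right = tree
    s_j = sankoff(left, leaf_states, alphabet, cost)
    s_k = sankoff(right, leaf_states, alphabet, cost)
    # Iteration: S(i,a) = min_x (S(j,x)+c(a,x)) + min_y (S(k,y)+c(a,y)).
    return {
        a: min(s_j[x] + cost(a, x) for x in alphabet)
        + min(s_k[y] + cost(a, y) for y in alphabet)
        for a in alphabet
    }


def unit_cost(a, b):
    """c(a,a) = 0 and c(a,b) = 1: weighted parsimony reduces to plain parsimony."""
    return 0 if a == b else 1


tree = (("A", "B"), ("C", ("D", "E")))
states = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
root_costs = sankoff(tree, states, "ACGT", unit_cost)
score = min(root_costs.values())  # Termination: min over root states
print(root_costs, score)
```

Each internal node does $k^2$ work per character, which is where the $k^2$ factor in the overall running time comes from. Passing a different `cost` function gives the weighted score without changing anything else.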

19 Cost of Evaluating Parsimony for binary trees
If there are $n$ nodes, $m$ characters, and $k$ possible values for each character, then the complexity is $O(nmk^2)$. Of course, we still need to search over all possible trees and find the best one. One usually resorts to heuristic search techniques.

20 Exploring the Space of Trees
- We’ve considered how to find the minimum number of changes for a given tree topology
- We need some search procedure for exploring the space of tree topologies
- Given $n$ sequences there are $3 \cdot 5 \cdots (2n-3)$ possible rooted trees

21 Counting Trees
n = 3: one tree. n = 4: 3 trees.
A rooted tree with $n$ leaves has $2n-1$ nodes and $2n-2$ edges, discounting the edge to the root; hence an unrooted tree has $2n-3$ edges. Each additional leaf adds two edges.
A new leaf can be attached to any edge of an unrooted tree, so there are $1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted trees with $n$ leaves. Each such tree has $2n-3$ edges, any of which can be chosen to hold the root of a rooted tree. Hence there are $1 \cdot 3 \cdot 5 \cdots (2n-5) \cdot (2n-3)$ rooted trees with $n$ leaves.
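
The counting argument translates directly into a short loop. The function names below are illustrative, and the sketch assumes $n \geq 3$.

```python
def num_unrooted_trees(n):
    """1 * 3 * 5 * ... * (2n-5): unrooted binary trees with n >= 3 labeled leaves."""
    count = 1  # a single unrooted tree on 3 leaves
    for leaves in range(4, n + 1):
        # The tree with leaves-1 leaves has 2(leaves-1)-3 = 2*leaves-5 edges,
        # and the new leaf can be attached to any one of them.
        count *= 2 * leaves - 5
    return count


def num_rooted_trees(n):
    """Place the root on any of the 2n-3 edges of an unrooted tree."""
    return num_unrooted_trees(n) * (2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Already at 10 leaves there are about two million unrooted and over thirty-four million rooted topologies, which is why exhaustive search is limited to small numbers of taxa.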

22 Exploring the Space of Trees
[Table: taxa (n) against # of rooted trees.]

23 Maximum Parsimony
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?

24 Maximum Parsimony
How many possible unrooted trees?
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G

25 Maximum Parsimony
How many substitutions? MP

26 Maximum Parsimony
[Figure: the four-species alignment, annotated with the counts 0, 0, 0.]

27 Maximum Parsimony
[Figure: the four-species alignment, annotated with the counts 0, 3.]

28 Maximum Parsimony
Second alignment column: 1 - G, 2 - C, 3 - T, 4 - A
[Figure: candidate trees for this column, annotated with the counts 3, 3, 3.]

29 Maximum Parsimony
[Figure: the four-species alignment, annotated with the counts 0, 3, 2.]

30 Maximum Parsimony
[Figure: the four-species alignment.]

31 Maximum Parsimony
Fourth alignment column: 1 - G, 2 - A, 3 - A, 4 - G
[Figure: candidate trees for this column, annotated with the counts 2, 2, 1.]

32 Maximum Parsimony

33 Maximum Parsimony
[Figure: the four-species alignment.]
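
The four-species example can be checked mechanically. The sketch below scores each of the three unrooted topologies by summing, over all ten alignment columns, the number of union operations from the bottom-up Fitch pass. The tree encoding (nested tuples of row indices), the function name `fitch_cost` and the topology labels are assumptions of this sketch. Each unrooted tree is rooted on its internal edge, which does not affect the unweighted score.

```python
def fitch_cost(tree, column):
    """Return (state set, changes) for one alignment column.

    tree: a row index (int) or a pair of subtrees (assumed encoding).
    """
    if isinstance(tree, int):
        return {column[tree]}, 0
    (s1, c1), (s2, c2) = fitch_cost(tree[0], column), fitch_cost(tree[1], column)
    common = s1 & s2
    return (common, c1 + c2) if common else (s1 | s2, c1 + c2 + 1)


# Rows are Species 1..4 from the slides.
alignment = ["AGGGTAACTG", "ACGATTATTA", "ATAATTGTCT", "AATGTTGTCG"]

# The three unrooted topologies on four taxa, rooted on the internal edge.
topologies = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}

scores = {
    name: sum(fitch_cost(tree, column)[1] for column in zip(*alignment))
    for name, tree in topologies.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```

Under these assumptions the topology grouping species 1 with 2 and species 3 with 4 needs the fewest substitutions. Columns 1, 2 and 3 cost 0, 3 and 2 on every topology, and column 4 costs 2, 2 and 1 on the three topologies, in line with the counts shown on the slides.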

34 Finding most parsimonious trees - exact solutions
- Exact solutions can only be used for small numbers of taxa.
- Exhaustive search examines all possible trees.
- Typically used for problems with fewer than 10 taxa.

35 Finding most parsimonious trees - exhaustive search
- (1) Starting tree: any 3 taxa (A, B, C)
- (2a), (2b), (2c) Add the fourth taxon (D) in each of the three possible positions: three trees
- Add the fifth taxon (E) in each of the five possible positions on each of the three trees: 15 trees, and so on
[Figure: the starting tree, the three four-taxon trees, and the five positions for E on one of them.]

36 Finding most parsimonious trees - exact solutions
- Branch and bound saves time by discarding, during tree construction, families of trees that cannot be shorter than the shortest tree found so far. (Here “shorter” means more parsimonious.)
- Can be enhanced by specifying an initial upper bound for tree length.
- Typically used only for problems with fewer than 20 taxa.

37 Finding most parsimonious trees: branch and bound
[Figure: branch-and-bound search tree. From the three-taxon tree on A, B, C, trees B1, B2 and B3 add taxon D; trees C1.1 to C1.5, C2.1 to C2.5 and C3.1 to C3.5 add taxon E to B1, B2 and B3 respectively.]

38 Finding most parsimonious trees - heuristics
- The number of possible trees grows faster than exponentially with the number of taxa, making exhaustive searches impractical for many data sets (finding a most parsimonious tree is an NP-hard problem)
- Heuristic methods are used to search tree space for most parsimonious trees
- The trees found are not guaranteed to be the most parsimonious; they are best guesses

39 Finding most parsimonious trees - heuristics
- Stepwise addition:
  - Asis: taxa are added in the order of the data matrix
  - Closest: starts with the shortest 3-taxon tree and adds taxa in the order that produces the least increase in tree length
  - Simple: the first taxon in the matrix is taken as a reference; taxa are added to it in order of decreasing similarity to the reference
  - Random: taxa are added in a random sequence; many different sequences can be used
- Recommend random with as many addition sequences as practical

40 Finding most parsimonious trees - heuristics
Branch swapping:
- Nearest neighbor interchange (NNI)
- Subtree pruning and regrafting (SPR)
- Tree bisection and reconnection (TBR)

41 Finding most parsimonious trees - heuristics
1. Nearest neighbor interchange (NNI)
[Figure: a tree on taxa A-G and two rearrangements obtained by interchanging neighboring subtrees.]

42 Finding most parsimonious trees - heuristics
2. Subtree pruning and regrafting (SPR)
[Figure: a tree on taxa A-G; a subtree is pruned and regrafted elsewhere on the tree.]

43 Finding most parsimonious trees - heuristics
3. Tree bisection and reconnection (TBR)
[Figure: a tree on taxa A-G is bisected into two subtrees, which are then reconnected.]

44 Finding most parsimonious trees - heuristics - summary
- Branch swapping:
  - Nearest neighbor interchange (NNI)
  - Subtree pruning and regrafting (SPR)
  - Tree bisection and reconnection (TBR)
- The nature of heuristic searches means we cannot know which method will find the most parsimonious trees or all such trees.
- However, TBR is the most extensive swapping routine and its use with multiple random addition sequences should work well.

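45 Tree space may be populated by local minima and islands of most parsimonious trees
[Figure: tree length across tree space, with a global minimum and several local minima. Random addition sequence replicates start from different points; branch swapping leads some to the global minimum (success) and others to a local minimum (failure).]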

46 Multiple most parsimonious trees
- Many parsimony analyses yield multiple equally optimal trees
- Multiple trees are due to either:
  - Alternative equally parsimonious optimizations of homoplastic characters
  - Missing data
  - Or both
- We can further select among these trees with additional criteria, but
- Most commonly, relationships common to all the optimal trees are summarized with consensus trees

47 Consensus methods - 1
- A consensus tree is a summary of the agreement among a set of fundamental trees
- There are many different consensus methods that differ in:
  1. the kind of agreement
  2. the level of agreement
- Consensus methods can be used with any type of tree, not just parsimony trees

48 Strict consensus methods - 1
- Strict consensus methods require agreement across all the fundamental trees
- They show only those relationships that are unambiguously supported by the parsimonious interpretation of the data
- The commonest method (strict component consensus) focuses on clades
- This method produces a consensus tree that includes all and only those clades found in all the fundamental trees
- Other relationships (those in which the fundamental trees disagree) are shown as unresolved polytomies
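
Strict component consensus amounts to intersecting the clade sets of the fundamental trees. The sketch below shows only that step: the nested-tuple encoding of rooted trees, the function names and the two example trees are assumptions made here, and building the consensus tree from the retained clades is left out.

```python
def clades(tree):
    """Return (leaf set, set of non-trivial clades) of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
        if len(child_leaves) > 1:
            found.add(child_leaves)  # every internal child defines a clade
    return leaves, found


def strict_consensus_clades(trees):
    """Clades present in every fundamental tree."""
    return set.intersection(*(clades(tree)[1] for tree in trees))


# Two hypothetical fundamental trees that disagree inside the clade {A, B, C}.
tree1 = ((("A", "B"), "C"), ("D", "E"))
tree2 = ((("A", "C"), "B"), ("D", "E"))
shared = strict_consensus_clades([tree1, tree2])
print(sorted(sorted(clade) for clade in shared))
```

Here the two trees agree on {A, B, C} and {D, E} but not on the grouping inside {A, B, C}, so the consensus would show A, B and C as a polytomy.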

49 Strict consensus methods - 2
[Figure: two fundamental trees on taxa A-G and their strict component consensus tree.]

50 Majority-rule consensus methods
- Majority-rule consensus methods require agreement across a majority of the fundamental trees
- May include relationships that are not supported by the most parsimonious interpretation of the data
- The commonest method focuses on clades
- This method produces a consensus tree that includes all and only those clades found in a majority (>50%) of the fundamental trees
- Other relationships are shown as unresolved polytomies
- Of particular use in bootstrapping
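
The majority rule can be sketched as a frequency count over clades. In this illustration each fundamental tree is given directly as its set of clades (frozensets of taxon names); that input format, the function name and the three example trees are assumptions of the sketch.

```python
from collections import Counter


def majority_rule_clades(clade_sets):
    """Return {clade: frequency} for clades found in more than half of the trees."""
    counts = Counter(clade for tree_clades in clade_sets for clade in tree_clades)
    n_trees = len(clade_sets)
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count > n_trees / 2
    }


# Three hypothetical fundamental trees on taxa A-E, each listed by its clades.
trees = [
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AC"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AB"), frozenset("CDE"), frozenset("DE")},
]
majority = majority_rule_clades(trees)
for clade, freq in sorted(majority.items(), key=lambda item: sorted(item[0])):
    print(sorted(clade), round(freq, 2))
```

The returned frequencies correspond to the numbers usually drawn on the branches of a majority-rule consensus tree. Clades that each occur in more than half of the trees cannot conflict with one another, so the retained set always fits on a single tree.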

51 Majority rule consensus
[Figure: three fundamental trees on taxa A-G and their majority-rule component consensus tree. Numbers indicate the frequency of clades in the fundamental trees.]

52 Reduced consensus methods - 1
- Focus upon any cladistic relationships (statements that some taxa are more closely related to each other than to some other taxa)
- Reduced consensus methods occur in strict and majority-rule varieties
- Other relationships are shown as unresolved polytomies
- May be more sensitive than methods focusing only on clades

53 Reduced consensus methods - 2
[Figure: two fundamental trees on taxa A-G. The strict component consensus is completely unresolved; the strict reduced cladistic consensus tree excludes taxon G.]

54 Consensus methods - 2
[Figure: three fundamental trees on Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tetrahymena, Spirostomum, Tracheloraphis, Euplotes and Gruberia, together with their majority-rule, strict (component) and strict reduced cladistic consensus trees. Euplotes is excluded from the strict reduced cladistic consensus.]

55 Consensus methods - 3
- Use strict methods to identify those relationships unambiguously supported by parsimonious interpretation of the data
- Use reduced methods where consensus trees are poorly resolved
- Use majority-rule methods in bootstrapping
- Avoid other methods, which have ambiguous interpretations

56 Parsimony - advantages
- A simple method with an easily understood operation
- Does not seem to depend on an explicit model of evolution
- Gives both trees and associated hypotheses of character evolution
- Should give reliable results if the data are well structured and homoplasy is either rare or randomly distributed on the tree

57 Parsimony - disadvantages
- May give misleading results if homoplasy is common or concentrated in particular parts of the tree, e.g.:
  - thermophilic convergence
  - base composition biases
  - long branch attraction
- Underestimates branch lengths
- The model of evolution is implicit, so the behaviour of the method is not well understood
- Parsimony is often justified on purely philosophical grounds (we must prefer the simplest hypotheses), particularly by morphologists
- For most molecular systematists this is uncompelling

58 Parsimony can be inconsistent
- Felsenstein (1978) developed a simple model phylogeny including four taxa and a mixture of short and long branches
- Under this model parsimony will give the wrong tree
With more data, the certainty that parsimony will give the wrong tree increases, so parsimony is statistically inconsistent. Advocates of parsimony initially responded by claiming that Felsenstein’s result showed only that his model was unrealistic. It is now recognized that long-branch attraction (the Felsenstein Zone) is one of the most serious problems in phylogenetic inference.
[Figure: model tree on taxa A, B, C, D with rates or branch lengths p and q, p >> q, next to the wrong tree recovered by parsimony. Long branches are attracted, but the similarity is homoplastic.]

59 2. Perfect Phylogeny
Data on species is given by a character state matrix. Cell (p, i) has value j iff character i of object (species) p has state j.
Goal: construct an evolutionary tree for the species.

Object  c1  c2  c3  c4  c5
A       1   1   2   0   0
B       2   0   1   2   1
C       3   2   3   3   1
D       0   3   4   1   0
E       1   1   0   0   1

60 Motivation: Evolution Tree
Internal nodes correspond to speciation events, where some character (attribute) is acquired.
Assumptions:
1. No reversals (characters are not lost)
2. No convergences (a character is created only once)


62 Perfect Phylogeny for a 0-1 Matrix
A 0-1 matrix: each character is either 0 (absent) or 1 (present).
A perfect phylogeny for the matrix is a tree T with no convergences and no reversals:
- Each of the n objects labels exactly one leaf of T
- Each of the m characters labels exactly one edge of T
- Object p has exactly the characters labeling the path from p to the root

A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 0 1
D  0 0 1 1 0
E  0 1 0 0 0
[Figure: perfect phylogeny for this matrix, with leaves A, E, D, C, B.]

63 The (Binary) Perfect Phylogeny Problem
Problem: Given a 0-1 matrix M, determine whether it has a perfect phylogeny, and construct one if it does.
(Note: edges are labeled by characters; an edge labeled by i represents changing the state of character i from 0 to 1.)

A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 0 1
D  0 0 1 1 0
E  0 1 0 0 0
[Figure: perfect phylogeny for this matrix, with leaves A, E, D, C, B.]

64 Solution to Perfect Phylogeny Problem
Definition: Given a 0-1 matrix M, $O_k = \{j : M_{jk} = 1\}$; i.e., $O_k$ is the set of objects that have character k.
Theorem: M has a perfect phylogenetic tree iff the sets $\{O_i\}$ are laminar, i.e., for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other.

Laminar:
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 0 1
D  0 0 1 1 0
E  0 1 0 0 0

Not laminar:
A  1 1 0 0 0
B  0 0 1 0 1
C  1 1 0 0 1
D  0 0 1 1 0
E  0 1 0 0 1
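
The theorem gives an immediate, if quadratic, test: build the sets $O_k$ and compare every pair. The sketch below is a direct transcription under assumed names (`character_sets`, `is_laminar`) and an assumed input format (object name mapped to its row as a 0/1 string), applied to the two matrices above.

```python
from itertools import combinations


def character_sets(matrix):
    """Return [O_1, ..., O_m]: for each character, the objects that have it."""
    m = len(next(iter(matrix.values())))
    return [{obj for obj, row in matrix.items() if row[k] == "1"} for k in range(m)]


def is_laminar(sets):
    """Every pair of sets is disjoint or nested."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a for a, b in combinations(sets, 2)
    )


laminar = {"A": "11000", "B": "00100", "C": "11001", "D": "00110", "E": "01000"}
not_laminar = {"A": "11000", "B": "00101", "C": "11001", "D": "00110", "E": "01001"}

print(is_laminar(character_sets(laminar)))
print(is_laminar(character_sets(not_laminar)))
```

In the second matrix the last character is carried by B, C and E while the first is carried by A and C, so the two sets overlap in C without either containing the other. This pairwise check compares all pairs of characters, so it is slower than the O(mn) procedure.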

65 Proof
($\Rightarrow$) Assume M has a perfect phylogeny, and let i, j be given. Consider the edges labeled i and j.
Case 1: There is a root-to-leaf path containing both edges. Then one of $O_i$, $O_j$ is included in the other (characters 2 and 1 in the figure).
Case 2: Not case 1. Then $O_i$ and $O_j$ are disjoint (characters 2 and 3 in the figure).
[Figure: perfect phylogeny with leaves A, E, D, C, B.]

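66 Proof (cont.)
($\Leftarrow$) Assume that for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other. We prove by induction on the number of characters that M has a perfect phylogeny.
Basis: one character. Then there are at most two distinct objects, one with and one without this character.
[Figure: the one-character matrix with A = 1 and B = 0, and its tree, in which the edge labeled 1 leads to A.]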

67 Proof (cont.)
($\Leftarrow$) Induction step: Assume correctness for n-1 characters, and consider a matrix with n characters (non-zero columns). WLOG assume that $O_1$ is not contained in $O_j$ for j > 1. Let $S_1$ be the set of objects that have character 1, and $S_2$ be the remaining objects. Then each other character belongs only to objects in $S_1$ or only to objects in $S_2$, but not both. By induction there are trees $T_1$ and $T_2$ for $S_1$ and $S_2$. Combining them, with $T_1$ placed below the edge labeled 1, gives the desired tree.

A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 0 1
D  0 0 1 1 0
E  1 0 0 0 0
[Figure: the combined tree, with $T_1$ below the edge labeled 1 and $T_2$ attached at the root.]

68 Efficient Implementation
1. Sort the columns by decreasing value when considered as binary numbers. (Time complexity: O(mn), using radix sort.)
Claim: If the binary value of column i is larger than that of column j, then $O_i$ is not a proper subset of $O_j$.
Proof: In the most significant row where the two columns differ, column i has a 1 and column j has a 0, so some object of $O_i$ is not in $O_j$.

Before sorting:
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 0 1
D  0 0 1 1 0
E  1 0 0 0 0

After sorting:
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

69 Efficient Implementation (2)
2. Make a backwards linked list of the 1’s in each row (the leftmost 1 in each row points at itself). Time complexity: O(mn).

A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Claim: If the columns are sorted, then the set of columns is laminar iff, for each column i, all the links leaving column i point at the same column. This can be checked in O(mn) time.

70 Examples
Laminar:
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0

Not laminar:
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 1 1 0
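
Steps 1 and 2 can be sketched together as follows. The function names and the list-of-strings input are assumptions of this sketch. For brevity the columns are ordered with Python's built-in `sorted` on the column values rather than with the radix sort mentioned on the slide, so the sorting step here does not achieve the O(mn) bound.

```python
def sort_columns(rows):
    """Column indices ordered by decreasing binary value (first row = top bit)."""
    m = len(rows[0])

    def value(k):
        return int("".join(row[k] for row in rows), 2)

    return sorted(range(m), key=value, reverse=True)


def laminar_by_links(rows):
    """Check laminarity via the backward links of the 1's in each row."""
    order = sort_columns(rows)
    target = {}  # sorted column position -> position all its links must point at
    for row in rows:
        prev = None
        for pos, k in enumerate(order):
            if row[k] == "1":
                # The leftmost 1 points at itself, later 1's at the previous 1.
                link = pos if prev is None else prev
                if target.setdefault(pos, link) != link:
                    return False
                prev = pos
    return True


unsorted_rows = ["11000", "00100", "11001", "00110", "10000"]
laminar_rows = ["11000", "00100", "11010", "00101", "10000"]
bad_rows = ["11000", "00100", "11010", "00101", "10110"]

print(sort_columns(unsorted_rows))
print(laminar_by_links(laminar_rows), laminar_by_links(bad_rows))
```

In the non-laminar example the third column has links from rows B and D pointing at itself and a link from row E pointing at the first column, so the check fails as soon as row E is scanned.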

71 Efficient Implementation (3)
3. When the matrix is laminar, the tree edges corresponding to characters are defined by the backwards links in the matrix. The remaining edges and leaves are determined by the characters of each object. Needs O(mn) time.

A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
[Figure: the resulting tree, with leaves A, E, D, C, B.]
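
Reading the tree off the links can be sketched as below. The representation is an assumption of this sketch: the node at the lower end of the edge labeled k is simply called k, the root is called "root", and each object hangs as a leaf below the node of its last character. The input is assumed to be already sorted and laminar; the function does not check this.

```python
def perfect_phylogeny(matrix):
    """Build parent pointers from the backward links of a sorted, laminar matrix.

    matrix: object -> 0/1 string. Returns (char_parent, leaf_parent), where
    char_parent[k] is the node above the edge labeled k (columns numbered from 1)
    and leaf_parent[obj] is the node the leaf obj hangs from.
    """
    char_parent, leaf_parent = {}, {}
    for obj, row in matrix.items():
        prev = "root"
        for k, bit in enumerate(row, start=1):
            if bit == "1":
                char_parent[k] = prev  # backward link of this 1
                prev = k
        leaf_parent[obj] = prev  # the leaf sits below its last character
    return char_parent, leaf_parent


matrix = {"A": "11000", "B": "00100", "C": "11010", "D": "00101", "E": "10000"}
char_parent, leaf_parent = perfect_phylogeny(matrix)
print(char_parent)
print(leaf_parent)
```

Following the parent pointers from any leaf up to the root collects exactly the characters set to 1 in that object's row, which is the defining property of a perfect phylogeny.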