13. Lecture WS 2004/05Bioinformatics III1 V13 Prediction of Phylogenies based on single genes Material of this lecture taken from - chapter 6, DW Mount.

13. Lecture WS 2004/05Bioinformatics III1 V13 Prediction of Phylogenies based on single genes Material of this lecture taken from - chapter 6, DW Mount „Bioinformatics“ and from Julian Felsenstein‘s book. A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution. Placing the sequences as outer branches on a tree, the evolutionary relationships among the sequences are depicted. Phylogenies, or evolutionary trees, are the basic structures to describe differences between species, and to analyze them statistically. They have been around for over 140 years. Statistical, computational, and algorithmic work on them is ca. 40 years old.

13. Lecture WS 2004/05Bioinformatics III2 3 main approaches in single-gene phylogeny - maximum parsimony - distance - maximum likelihood Popular programs: PHYLIP (phylogenetic inference package – J Felsenstein) PAUP (phylogenetic analysis using parsimony – Sinauer Assoc

13. Lecture WS 2004/05Bioinformatics III3 Methods for Single-Gene Phylogeny Choose set of related sequences Obtain multiple sequence alignment Is there strong sequence similarity? Maximum parsimony methods Yes No Is there clearly recogniza- ble sequence similarity? Yes Distance methods No Maximum likelihood methods Analyze how well data support prediction

13. Lecture WS 2004/05Bioinformatics III4 Parsimony methods Edwards & Cavalli-Sforza (1963): that evolutionary tree is to be preferred that involves „the minimum net amount of evolution“.  seek that phylogeny on which, when we reconstruct the evolutionary events leading to our data, there are as few events as possible. (1) We must be able to make a reconstruction of events, involving as few events as possible, for any proposed phylogeny. (2) We must be able to search among all possible phylogenies for the one or ones that minimize the number of events.

13. Lecture WS 2004/05Bioinformatics III5 A simple example Suppose that we have 5 species, each of which has been scored for 6 characters  (0,1) We will allow changes 0  1 and 1  0. The initial state at the root of a tree may be either state 0 or state 1.

13. Lecture WS 2004/05Bioinformatics III6 Evaluating a particular tree To find the most parsimonious tree, we must have a way of calculating how many changes of state are needed on a given tree. This tree represents the phylogeny of character 1. Reconstruct phylogeny of character 1 on this tree.

13. Lecture WS 2004/05Bioinformatics III7 Evaluating a particular tree There are 2 equally good reconstructions, each involving just one change of character state. They differ in which state they assume at the root of the tree, and they differ in which branch they place the single change.

13. Lecture WS 2004/05Bioinformatics III8 Evaluating a particular tree 3 equally good reconstructions for character 2, which needs two changes of state.

13. Lecture WS 2004/05Bioinformatics III9 Evaluating a particular tree A single reconstruction for character 3, involving one change of state.

13. Lecture WS 2004/05Bioinformatics III10 on the right: 2 reconstructions for character 4 and 5 because these characters have identical patterns. single reconstruction for character 6, one change of state. Evaluating a particular tree

13. Lecture WS 2004/05Bioinformatics III11 Evaluating a particular tree The total number of changes of character state needed on this tree is 1 + 2 + 1 + 2 + 2 + 1 = 9 Reconstruction of the changes in state on this tree

13. Lecture WS 2004/05Bioinformatics III12 Evaluating a particular tree Alternative tree with only 8 changes of state. The minimum number of changes of state would be 6, as there are 6 characters that can each have 2 states. Thus, we have two „extra“ changes  called „homoplasmy“.

13. Lecture WS 2004/05Bioinformatics III13 Evaluating a particular tree Figure right shows another tree also requiring 8 changes. These two most parsimonious trees are the same tree when the roots of the tree are removed.

13. Lecture WS 2004/05Bioinformatics III14 Methods of rooting the tree There are many rooted trees, one for each branch of this unrooted tree, and all have the same number of changes of state. The number of changes of state only depends on the unrooted tree, and not at all on where the tree is then rooted. Biologists want to think of trees as rooted  need method to place the root in an otherwise unrooted tree. (1) Outgroup criterion (2) Use a molecular clock.

13. Lecture WS 2004/05Bioinformatics III15 Outgroup criterion Assumes that we know the answer in advance. Suppose that we have a number of great apes, plus a single old-world monkey. Suppose that we know that the great apes are a monophyletic group. If we infer a tree of these species, we know that the root must be placed on the lineage that connects the old-world monkey (outgroup) to the great apes (ingroup).

13. Lecture WS 2004/05Bioinformatics III16 Molecular clock If an equal amount of changes were observed on all lineages, there should be a point on the tree that has equal amounts of change (branch lengths) from there to all tips. With a molecular clock, it is only the expected amounts of change that are equal. The observed amounts may not be.  using various methods find a root that makes the amounts of change approximately equal on all lineages.

13. Lecture WS 2004/05Bioinformatics III17 Branch lengths Having found an unrooted tree, locate the changes on it and find out how many occur in each of the branches. The location of the changes can be ambiguous.  average over all possible reconstructions of each character for which there is ambiguity in the unrooted tree. Fractional numbers in some branches of left tree add up to (integer) number of changes (right)

13. Lecture WS 2004/05Bioinformatics III18 Open questions * Particularly for larger data sets, need to know how to count number of changes of state by use of an algorithm. * need to know algorithm for reconstructing states at interior nodes of the tree. * need to know how to search among all possible trees for the most parsimonious ones, and how to infer branch lengths. * sofar only considered simple model of 0/1 characters. DNA sequences have 4 states, protein sequences 20 states. * Justification: is it reasonable to use the parsimony criterion? If so, what does it implicitly assume about the biology? * What is the statistical status of finding the most parsimonious tree? Can we make statements how well-supported it is compared to other trees?

13. Lecture WS 2004/05Bioinformatics III19 Counting evolutionary changes 2 related dynamic programming algorithms: Fitch (1971) and Sankoff (1975) - evaluate a phylogeny character by character - for each character, consider it as rooted tree, placing the root wherever seems appropriate. - update some information down a tree; when we reach the bottom, the number of changes of state is available. Do not actually locate changes or reconstruct interior states at the nodes of the tree.

13. Lecture WS 2004/05Bioinformatics III20 Fitch algorithm intended to count the number of changes in a bifurcating tree with nucleotide sequence data, in which any one of the 4 bases (A, C, G, T) can change to any other. At the particular site, we have observed the bases C, A, C, A and G in the 5 species. Give them in the order in which they appear in the tree, left to right.

13. Lecture WS 2004/05Bioinformatics III21 Fitch algorithm For the left two, at the node that is their immediate common ancestor, attempt to construct the intersection of the two sets. But as {C}  {A} =  instead construct the union {C}  {A} = {AC} and count 1 change of state. For the rightmost pair of species, assign common ancestor as {AG}, since {A}  {G} =  and count another change of state..... proceed to bottom Total number of changes = 3. Algorithm works on arbitrarily large trees.

13. Lecture WS 2004/05Bioinformatics III22 Complexity of Fitch algorithm Fitch algorithm can be carried out in a number of operations that is proportional to the number of species (tips) on the tree. Don‘t we need to multiply this by the number of sites n ? Any site that is invariant (which has the same base in all species, e.g. AAAAA) can be dropped. Other sites with a single variant base (e.g. ATAAA) will only require a single change of state on all trees. These too can be dropped. For sites with the same pattern (e.g. CACAG) that we have already seen, simply use number of changes previously computed. Pattern following same symmetry (e.g. TCTCA = CACAG) need same number of changes  numerical effort rises slower than linearly with the number of sites.

13. Lecture WS 2004/05Bioinformatics III23 Sankoff algorithm Fitch algorithm is very effective – but we can‘t understand why it works. Sankoff algorithm: more complex, but its structure is more apparent. Assume that we have a table of the cost of changes c ij between each character state i and each other state j. Compute the total cost of the most parsimonious combinations of events by computing it for each character. For a given character, compute for each node k in the tree a quantity S k (i). This is interpreted as the minimal cost, given that node k is assigned state i, of all the events upwards from node k in the tree.

13. Lecture WS 2004/05Bioinformatics III24 Sankoff algorithm If we can compute these values for all nodes, we can also compute them for the bottom node in the tree. Simply choose the minimum of these values which is the desired total cost we seek, the minimum cost of evolution for this character. At the tips of the tree, the S(i) are easy to compute. The cost is 0 if the observed state is state i, and infinite otherwise. If we have observed an ambigous state, the cost is 0 for all states that it could be, and infinite for the rest. Now we just need an algorithm to calculate the S(i) for the immediate common ancestor of two nodes.

13. Lecture WS 2004/05Bioinformatics III25 Sankoff algorithm Suppose that the two descendant nodes are called l and r (for „left“ and „right“). For their immediate common ancestor, node a, we compute The smallest possible cost given that node a is in state i is the cost c ij of going from state i to state j in the left descendant lineage, plus the cost S l (j) of events further up in the subtree gien that node l is in state j. Select value of j that minimizes that sum. Same calculation for right descendant lineage  sum of these two minima is the smallest possible cost for the subtree above node a, given that node a is in state i. Apply equation successively to each node in the tree, working downwards. Finally compute all S 0 (i) and use previous eq. to find minimum cost for whole tree.

13. Lecture WS 2004/05Bioinformatics III26 Sankoff algorithm The array (6,6,7,8) at the bottom of the tree has a minimum value of 6 = minimum total cost of the tree for this site.

13. Lecture WS 2004/05Bioinformatics III27 Finding the best tree by heuristic search The obvious method for searching for the most parsimonious tree is to consider ALL trees and evaluate each one. Unfortunately, generally the number of possible trees is too large.  use heuristic search methods that attempt to find the best trees without looking at all possible trees. (1) Make an initial estimate of the tree and make small rearrangements of it = find „neighboring“ trees. (2) If any of these neighbors are better, consider them and continue search.

13. Lecture WS 2004/05Bioinformatics III28 Nearest-neighbor interchanges

13. Lecture WS 2004/05Bioinformatics III29 Nearest-neighbor interchanges

13. Lecture WS 2004/05Bioinformatics III30 Subtree pruning and regrafting

13. Lecture WS 2004/05Bioinformatics III31 Branch-and-Bound find global optimum, NP-hard problem

13. Lecture WS 2004/05Bioinformatics III32 Resolve Incongruences in Phylogeny Many possible reasons that may make decisions on how to handle conflicts in larger sets of molecular data difficult. E.g. two genes with different evolutionary history (e.g. owing to hybridization or horizontal transfer) will necessarily give incongruent pictures while still depicting true histories. Here: compare genome sequence data for 7 Saccharomyces yeast species: S. cerevisae S. paradoxus S. mikatae S. kudriavzevii S. bayanus S. castelli S. kluyveri plus one outgroup fungus Candida albicans. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III33 Resolve Incongruences in Phylogeny Identify orthologous genes to serve as phylogenetic markers: 106 genes which are distributed throughout the S. cerevisae genome on all 16 chromosomes and comprise a total length of 127026 nt = 42342 amino acids corresponding to roughly 1% of the genomic sequence and 2% of the predicted genes. Criteria to select genes spaced ca. every 40 kb: (1) genes have homologous sequence in each of the 8 species (2) genes have at least two homologous flanking syntenic genes (3) genes can be aligned over most of the protein. 3 types of analysis: - maximum likelihood (ML) analysis of nucleotide data - maximum parsimony (MP) analysis of nucleotide data - MP of the amino acid data Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III34 Resolve Incongruences in Phylogeny Align individual genes with ClustalW. Edit manually to exclude indels and areas of uncertain alignment  left with 76% of the sequence of each gene on average. Tree construction with PAUP by branch-and-bound algorithm which guarantees to find the optimal tree. Estimate tree reliability using non-parametric bootstrap resampling. Analysis of the 106 genes gave more than 20 alternative ML or MP trees. Generate 50% majority-rule consensus trees by bootstrapping. Next slide shows several strongly supported trees. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III35 A method for testing how well a particular data set fits a model. E.g. the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments. The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis. Here: swap individual nucleotide sites or positions of genes (bootstrap replicas). Bootstrap analysis.

13. Lecture WS 2004/05Bioinformatics III36 Alternative Tree topologies Single-gene data sets generate multiple, robustly supported alternative topologies. Representative alternative trees recovered from analyses of nucleotide data of 106 selected single genes and six commonly used genes are shown. The trees are the 50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b), YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f). These 6 genes were selected without consideration of their function. Maybe commonly used, well known genes of important functions provide a better resolution? Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III37 Results from the commonly used genes actin (g), hsp70 (h),  -tubulin (i), RNA polymerase II (j) elongation factor 1-  (k) and 18S rDNA (l). Numbers above branches indicate bootstrap values (ML on nucleotides/MP on nucleotides).  Same problem of alternative topologies as before. Alternative Tree topologies Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III38 The alternative phylogenies could have resulted from a number of different scenarios: (1) most genes could have weakly supported most phylogenies and strongly supported only a few alternative trees, (2) most genes could have strongly supported one phylogeny and a few genes strongly supported only a small number of alternatives, (3) there could have been some combinations of these scenarios so that each branch among alternative phylogenies had either weak or strong support depending on the gene. To distinguish between these possibilities, identify all branches recovered during single-gene analyses, record each bootstrap value with respect to the gene and method of analysis.  8 branches were shared by all three analyses with multiple instances of bootstrap values > 50%. Explanations? Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III39 Common Branches The distribution of bootstrap values for the eight prevalent branches recovered from 106 single-gene analyses highlights the pervasive conflict among single- gene analyses. a, Majority-rule consensus tree of the 106 ML trees derived from single-gene analyses. Across all analyses, there were eight commonly observed branches; the five branches in the consensus tree (numbers 1–5; a) and the three branches (numbers 6–8) shown in b. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III40 c, For each of the eight branches, the ranked distribution of per cent bootstrap values recovered from the three analyses of 106 genes is shown. Results from ML (blue) and MP (red) analyses of nucleotide data sets, and MP analyses of amino acid data sets (black), are shown. For each branch, the mean bootstrap value and 95% confidence intervals from the ML analyses and the percentage of ML trees supporting this branch (in parentheses) are indicated below each graph. Although the ranked distributions of bootstrap values from the three analyses are remarkably similar for most branches, on a gene-by-gene basis there is no tight correspondence between bootstrap values from ML and MP analyses Bootstrap Values of Common Branches Only branches 1 and 4 are supported by a majority of genes. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III41 How different are the trees? The degree of conflict among the trees could be relatively minor. Determine how many taxa (genes) would need to be removed to make two trees congruent (deckungsgleich). Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III42 Reversal distance problem Extensive incongruence between trees derived from the 106 individual-gene data sets. Pairwise comparisons between 50% majority-rule consensus trees from 106 single-gene ML analyses of nucleotide data (black bars), MP analyses of nucleotide data (white bars), and MP analyses of amino acid data (grey bars) were categorized on the basis of the minimum number of taxa that need to be removed for two trees to reach congruence (x axis). For each of the analyses, the majority of pairwise comparisons require the removal of two or more taxa before congruence is attained. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III43 What leads to incongruence? Many factors were checked that could lead to incongruence between single-gene phylogenies: - outgroup choice repeat all analyses without C. albicans - number of variable sitessignificantly correlated with - number of parsimony-informative sitesbootstrap values for some - gene sizebranches - rate of evolution - nucleotide composition - base compositional bias - genome location - gene ontology no parameters can systematically account for or predict the performance of single genes! } Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III44 Can incongruence be overcome? Although we do not know the cause(s) of incongruence between single-gene phylogenies, the critical question is how this incongruence between single trees might be overcome to arrive at the actual species tree. Can single gene trees be concatenated into one large data set? Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III45 Concatenation of single genes gives a single tree! Phylogenetic analyses of the concatenated data set composed of 106 genes yield maximum support for a single tree, irrespective of method and type of character evaluated. Numbers above branches indicate bootstrap values (ML on nucleotides/MP on nucleotides/MP on amino acids). All alternative topologies were rejected. This level of support for a single tree with 5 internal branches is unprecedented. This tree can now be referred to as species tree. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III46 How much data is required? The concatanated data recovered a tree with maximum support on all branches, despite divergent levels of support for each branch among single-gene analyses.  At what size did the data set arrive at the species tree? Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III47 Convergence on single tree A minimum of 20 genes is required to recover >95% bootstrap values for each branch of the species tree. a, b, The bootstrap values for branches 3 (a) and 5 (b) were constructed from the concatenation of randomly re-sampled orthologous nucleotides (left) or random subsets of genes (right). The species tree is recovered with robust support (>95% bootstrap values in all branches at 95% confidence interval) by analyses of a minimum of 20 concatenated genes. All analyses were performed using MP. branch 3 branch 5 Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III48 Independent evolution? It has been suggested that nucleotides within a given gene do not evolve independently. Re-sample subset of orthologous nucleotides from the total data set. Only 3000 randomly chosen nucleotide positions (corresponding to less than three concatenated genes) are sufficient to generate single tree with > 95% confidence. This indicates that nucleotides in genes have not evolved independently (because when using complete genes more than 20 genes are necessary to generate single tree). Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III49 Implications for resolution of phylogenies Unreliability of single-gene data sets stems from the fact that each gene is shaped by a unique set of functional constraints through evolution. Phylogenetic algorithms are sensitive to such constraints. Such problems can be avoided with genome-wide sampling of independently evolving genes. In other cases the amount of sequence information needed to resolve specific relationships will be dependent on the particular phylogenetic history under examination. Branches depicting speciation events separated by long time intervals may be resolved with a smaller amount of data, and those depicting speciation events separated by shorter invtervals may be much harder to resolve. Rokas et al. Nature 425, 798 (2003)

13. Lecture WS 2004/05Bioinformatics III50 Summary Robust strategies exist for phylogenies built on single-gene comparisons (maximum parsimony, distance, maximum likelihood). Problem of incongruence of phylogenies derived from individual genes. Can be resolved by integrative analysis of multiple (here > 20) genes. It is desirable to combine results from phylogenies constructed from local sequence information with trees constructed from genome rearrangement. The power of genome rearrangement studies is the construction of ancestral genomes. Then one can derive the speed of evolution at different times, disect mutation biases at different times from the influence of genomic context... and possibly derive the driving forces of biological evolution.

13. Lecture WS 2004/05Bioinformatics III1 V13 Prediction of Phylogenies based on single genes Material of this lecture taken from - chapter 6, DW Mount.

Similar presentations

Presentation on theme: "13. Lecture WS 2004/05Bioinformatics III1 V13 Prediction of Phylogenies based on single genes Material of this lecture taken from - chapter 6, DW Mount."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

13. Lecture WS 2004/05Bioinformatics III1 V13 Prediction of Phylogenies based on single genes Material of this lecture taken from - chapter 6, DW Mount.

Similar presentations

Presentation on theme: "13. Lecture WS 2004/05Bioinformatics III1 V13 Prediction of Phylogenies based on single genes Material of this lecture taken from - chapter 6, DW Mount."— Presentation transcript:

Similar presentations

About project

Feedback