V4: Prediction of Phylogenies Based on Single Genes. Lecture 4, winter semester 2005/06, Softwarewerkzeuge der Bioinformatik (Software Tools for Bioinformatics).


V4 Prediction of phylogenies based on single genes
Material of this lecture is taken from chapter 6 of D. W. Mount, "Bioinformatics", and from Joseph Felsenstein's book.
A phylogenetic analysis of a family of related nucleic acid or protein sequences determines how the family might have been derived during evolution. The sequences are placed as outer branches on a tree, which then depicts the evolutionary relationships among them.
Phylogenies, or evolutionary trees, are the basic structures used to describe differences between species and to analyze them statistically. They have been around for over 140 years. Statistical, computational, and algorithmic work on them is about 40 years old.

Three main approaches in single-gene phylogeny
- maximum parsimony
- distance matrix
- maximum likelihood (not covered here)
Popular programs:
- PHYLIP (Phylogeny Inference Package, J. Felsenstein)
- PAUP (Phylogenetic Analysis Using Parsimony, Sinauer Assoc.)

Methods for single-gene phylogeny
1. Choose a set of related sequences.
2. Obtain a multiple sequence alignment.
3. Is there strong sequence similarity? If yes, use maximum parsimony methods.
4. If no: is there clearly recognizable sequence similarity? If yes, use distance methods. If no, use maximum likelihood methods.
5. Analyze how well the data support the prediction.

Parsimony methods
Edwards & Cavalli-Sforza (1963): the evolutionary tree to be preferred is the one that involves "the minimum net amount of evolution".
→ Seek the phylogeny on which, when we reconstruct the evolutionary events leading to our data, there are as few events as possible.
(1) We must be able to make a reconstruction of events, involving as few events as possible, for any proposed phylogeny.
(2) We must be able to search among all possible phylogenies for the one or ones that minimize the number of events.

A simple example
Suppose that we have 5 species, each of which has been scored for 6 characters with two possible states (0, 1). We allow changes 0 → 1 and 1 → 0. The initial state at the root of a tree may be either state 0 or state 1.

Evaluating a particular tree
To find the most parsimonious tree, we must have a way of calculating how many changes of state are needed on a given tree. The first task is to reconstruct the evolution of character 1 on the tree shown on the slide.

Evaluating a particular tree
There are 2 equally good reconstructions for character 1, each involving just one change of character state. They differ in which state they assume at the root of the tree and in which branch they place the single change.

Evaluating a particular tree
There are 3 equally good reconstructions for character 2, which needs two changes of state.

Evaluating a particular tree
There is a single reconstruction for character 3, involving one change of state.

Evaluating a particular tree
There are 2 reconstructions each for characters 4 and 5, which have identical patterns. There is a single reconstruction for character 6, involving one change of state.

Evaluating a particular tree
The total number of changes of character state needed on this tree is 1 + 2 + 1 + 2 + 2 + 1 = 9. (Figure: reconstruction of the changes in state on this tree.)

Evaluating a particular tree
An alternative tree needs only 8 changes of state. The minimum conceivable number of changes would be 6, as there are 6 characters that each take 2 states and therefore need at least one change each. Thus, we have two "extra" changes, which are called "homoplasy".

Evaluating a particular tree
The figure on the right shows another tree that also requires 8 changes. These two most parsimonious trees are the same tree once their roots are removed.

Methods of rooting the tree
There are many rooted trees, one for each branch of this unrooted tree, and all have the same number of changes of state. The number of changes of state depends only on the unrooted tree, and not at all on where the tree is then rooted.
Biologists want to think of trees as rooted → we need a method to place the root in an otherwise unrooted tree:
(1) Outgroup criterion
(2) Molecular clock

Outgroup criterion
Assumes that we know the answer in advance. Suppose that we have a number of great apes, plus a single old-world monkey. Suppose that we know that the great apes are a monophyletic group. If we infer a tree of these species, we know that the root must be placed on the lineage that connects the old-world monkey (outgroup) to the great apes (ingroup).

Molecular clock
If equal amounts of change were observed on all lineages, there would be a point on the tree that has equal amounts of change (branch lengths) from there to all tips. With a molecular clock, only the expected amounts of change are equal; the observed amounts may not be.
→ Use one of various methods to find a root that makes the amounts of change approximately equal on all lineages.

Branch lengths
Having found an unrooted tree, locate the changes on it and find out how many occur in each of the branches. The location of the changes can be ambiguous.
→ Average over all possible reconstructions of each character for which there is ambiguity in the unrooted tree.
(Figure: fractional numbers on some branches of the left tree add up to the integer number of changes shown on the right.)

Open questions
* Particularly for larger data sets, we need an algorithm to count the number of changes of state.
* We need an algorithm for reconstructing states at the interior nodes of the tree.
* We need to know how to search among all possible trees for the most parsimonious ones, and how to infer branch lengths.
* So far we have only considered a simple model of 0/1 characters. DNA sequences have 4 states, protein sequences 20 states.
* Justification: is it reasonable to use the parsimony criterion? If so, what does it implicitly assume about the biology?
* What is the statistical status of the most parsimonious tree? Can we make statements about how well supported it is compared to other trees?

Counting evolutionary changes
Two related dynamic programming algorithms: Fitch (1971) and Sankoff (1975).
- Evaluate a phylogeny character by character.
- For each character, treat the tree as rooted, placing the root wherever seems appropriate.
- Propagate some information down the tree, from the tips toward the root; when we reach the bottom, the number of changes of state is available.
These algorithms do not actually locate the changes or reconstruct the interior states at the nodes of the tree.

Fitch algorithm
Intended to count the number of changes in a bifurcating tree with nucleotide sequence data, in which any one of the 4 bases (A, C, G, T) can change to any other. At the particular site considered here, we have observed the bases C, A, C, A, and G in the 5 species, given in the order in which they appear in the tree, left to right.

Fitch algorithm
For the left two species, at the node that is their immediate common ancestor, attempt to construct the intersection of the two sets. But as $\{C\} \cap \{A\} = \emptyset$, instead construct the union $\{C\} \cup \{A\} = \{A, C\}$ and count 1 change of state.
For the rightmost pair of species, assign the common ancestor the set $\{A, G\}$, since $\{A\} \cap \{G\} = \emptyset$, and count another change of state.
Proceed in this way to the bottom of the tree. Total number of changes = 3. The algorithm works on arbitrarily large trees.
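The counting pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the tip names `t1` to `t5` are hypothetical, and the topology `((t1, t2), (t3, (t4, t5)))` is an assumption that is merely consistent with the text (the left two tips form a pair, the rightmost two form a pair), since the slide's figure is not part of the transcript.

```python
def fitch(tree, bases):
    """Return (state set at this node, number of changes in the subtree).

    tree  : a tip name (str) or a 2-tuple of subtrees
    bases : dict mapping tip name -> observed base
    """
    if isinstance(tree, str):                # tip: the observed base, no change
        return {bases[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, bases)
    right_set, right_changes = fitch(right, bases)
    common = left_set & right_set
    if common:                               # non-empty intersection: no new change
        return common, left_changes + right_changes
    # empty intersection: take the union and count one change of state
    return left_set | right_set, left_changes + right_changes + 1


# Observed bases C, A, C, A, G in left-to-right order (hypothetical tip names).
bases = {"t1": "C", "t2": "A", "t3": "C", "t4": "A", "t5": "G"}
tree = (("t1", "t2"), ("t3", ("t4", "t5")))   # assumed topology
root_set, n_changes = fitch(tree, bases)
print(sorted(root_set), n_changes)
```

Under these assumptions the pass counts 3 changes, in agreement with the total given on the slide.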

Complexity of the Fitch algorithm
The Fitch algorithm can be carried out in a number of operations that is proportional to the number of species (tips) on the tree. Don't we need to multiply this by the number of sites n? Not quite:
- Any site that is invariant (has the same base in all species, e.g. AAAAA) can be dropped.
- Sites with a single variant base (e.g. ATAAA) require exactly one change of state on all trees. These too can be dropped.
- For sites with a pattern that we have already seen (e.g. CACAG), simply reuse the number of changes computed previously.
- Patterns with the same symmetry (e.g. TCTCA and CACAG) need the same number of changes.
→ The numerical effort rises more slowly than linearly with the number of sites.
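The shortcuts above amount to compressing the alignment into distinct site patterns before any tree is evaluated. The following sketch assumes an alignment given as one string per species; the function name `compress_sites` and the toy alignment are hypothetical. Patterns are relabeled by order of first appearance so that TCTCA and CACAG map to the same key, which is only valid when all base changes cost the same, as in the Fitch algorithm.

```python
from collections import Counter


def compress_sites(alignment):
    """Return (pattern counts, number of single-variant sites).

    alignment : list of equal-length strings, one per species.
    Invariant sites are dropped. Single-variant sites are dropped too,
    but counted, because each adds exactly one change on every tree.
    """
    patterns = Counter()
    constant_changes = 0
    for column in zip(*alignment):
        freq = Counter(column)
        if len(freq) == 1:                        # invariant, e.g. AAAAA
            continue
        if len(freq) == 2 and min(freq.values()) == 1:
            constant_changes += 1                 # single variant, e.g. ATAAA
            continue
        relabel = {}
        # canonical form: bases numbered in order of first appearance
        key = tuple(relabel.setdefault(base, len(relabel)) for base in column)
        patterns[key] += 1
    return patterns, constant_changes


# Toy alignment whose columns are AAAAA, ATAAA, CACAG, TCTCA, CACAG.
alignment = ["AACTC", "ATACA", "AACTC", "AAACA", "AAGAG"]
patterns, constant_changes = compress_sites(alignment)
print(patterns, constant_changes)
```

Here five sites collapse to a single pattern with multiplicity 3, so the tree needs to be evaluated once for that pattern; the result is multiplied by 3 and the constant from the single-variant site is added.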

Sankoff algorithm
The Fitch algorithm is very effective, but it is not obvious why it works. The Sankoff algorithm is more complex, but its structure is more apparent.
Assume that we have a table of the costs $c_{ij}$ of changes between each character state $i$ and each other state $j$. Compute the total cost of the most parsimonious combination of events by computing it for each character separately.
For a given character, compute for each node $k$ in the tree a quantity $S_k(i)$. This is interpreted as the minimal cost, given that node $k$ is assigned state $i$, of all the events upwards from node $k$ in the tree.

Sankoff algorithm
If we can compute these values for all nodes, we can also compute them for the bottom node (node 0) of the tree. The minimum of these values is the total cost we seek, the minimum cost of evolution for this character:
$$S = \min_i S_0(i)$$
At the tips of the tree, the $S(i)$ are easy to compute. The cost is 0 if the observed state is state $i$, and infinite otherwise. If we have observed an ambiguous state, the cost is 0 for all states that it could be, and infinite for the rest.
Now we just need an algorithm to calculate the $S(i)$ for the immediate common ancestor of two nodes.

Sankoff algorithm
Suppose that the two descendant nodes are called $l$ and $r$ (for "left" and "right"). For their immediate common ancestor, node $a$, we compute
$$S_a(i) = \min_j \left[ c_{ij} + S_l(j) \right] + \min_k \left[ c_{ik} + S_r(k) \right]$$
The smallest possible cost given that node $a$ is in state $i$ is the cost $c_{ij}$ of going from state $i$ to state $j$ in the left descendant lineage, plus the cost $S_l(j)$ of events further up in the subtree given that node $l$ is in state $j$. Select the value of $j$ that minimizes that sum. The same calculation applies to the right descendant lineage.
→ The sum of these two minima is the smallest possible cost for the subtree above node $a$, given that node $a$ is in state $i$.
Apply the equation successively to each node in the tree, working downwards. Finally compute all $S_0(i)$ and take their minimum over $i$ to find the minimum cost for the whole tree.

Sankoff algorithm
The array (6, 6, 7, 8) at the bottom of the tree has a minimum value of 6, which is the minimum total cost of the tree for this site.
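The recursion can be sketched in Python as follows. Both the tree topology and the cost table are assumptions, because the slide's figure is not part of the transcript: the topology is `((C, A), (C, (A, G)))` with hypothetical tip names, and the cost table charges 1 for transitions (A↔G, C↔T) and 2.5 for transversions. They were chosen because they reproduce the root array (6, 6, 7, 8) quoted above.

```python
INF = float("inf")
STATES = "ACGT"

# Assumed cost table c[i][j]: 0 on the diagonal, 1 for transitions
# (A<->G, C<->T), 2.5 for transversions. Rows and columns follow STATES.
COST = [
    [0.0, 2.5, 1.0, 2.5],   # from A
    [2.5, 0.0, 2.5, 1.0],   # from C
    [1.0, 2.5, 0.0, 2.5],   # from G
    [2.5, 1.0, 2.5, 0.0],   # from T
]


def sankoff(tree, observed, cost=COST):
    """Return the list S(i), one entry per state in STATES, for this node.

    tree     : a tip name (str) or a 2-tuple of subtrees
    observed : dict mapping tip name -> observed base
    """
    if isinstance(tree, str):
        # tip: cost 0 for the observed state, infinite otherwise
        return [0.0 if s == observed[tree] else INF for s in STATES]
    left, right = tree
    s_left = sankoff(left, observed, cost)
    s_right = sankoff(right, observed, cost)
    n = len(STATES)
    # S_a(i) = min_j [c_ij + S_l(j)] + min_k [c_ik + S_r(k)]
    return [
        min(cost[i][j] + s_left[j] for j in range(n))
        + min(cost[i][k] + s_right[k] for k in range(n))
        for i in range(n)
    ]


observed = {"t1": "C", "t2": "A", "t3": "C", "t4": "A", "t5": "G"}
tree = (("t1", "t2"), ("t3", ("t4", "t5")))   # assumed topology
root_costs = sankoff(tree, observed)
print(root_costs, min(root_costs))
```

An ambiguous observation could be handled at the tips by assigning cost 0 to every compatible state; that extension is left out to keep the sketch short.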

Finding the best tree by heuristic search
The obvious method of searching for the most parsimonious tree is to consider ALL trees and evaluate each one. Unfortunately, the number of possible trees is generally too large.
→ Use heuristic search methods that attempt to find the best trees without looking at all possible trees.
(1) Make an initial estimate of the tree and make small rearrangements of it, i.e. find "neighboring" trees.
(2) If any of these neighbors is better, move to it and continue the search from there.
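To make "too large" concrete, the sketch below counts bifurcating trees with labeled tips. The formulas are the standard ones and are not stated on the slide: $(2n-3)!!$ rooted trees and $(2n-5)!!$ unrooted trees for $n$ tips. The function names are hypothetical.

```python
def odd_double_factorial(m):
    """Product 1 * 3 * 5 * ... * m for odd m >= 1."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees with n >= 2 labeled tips."""
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees with n >= 3 labeled tips."""
    return odd_double_factorial(2 * n - 5)


for n in (5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Already for 10 species there are more than 34 million rooted trees, which is why exhaustive evaluation is limited to very small data sets.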

Distance matrix methods
Introduced by Cavalli-Sforza & Edwards (1967) and by Fitch & Margoliash (1967). The general idea "seems as if it would not work very well" (Felsenstein):
- calculate a measure of the distance between each pair of species
- find a tree that predicts the observed set of distances as closely as possible.
All information from higher-order combinations of character states is left out. But computer simulation studies show that the amount of lost information is remarkably small.
The best way to think about distance matrix methods is to consider each distance as an estimate of the branch length separating that pair of species.

Least squares method
- We have an observed table (matrix) of distances $D_{ij}$.
- Any particular tree leads to a predicted set of distances $d_{ij}$.

Least squares method
Measure of the discrepancy between the observed and expected distances:
$$Q = \sum_i \sum_{j \ne i} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$
where the weights $w_{ij}$ can be defined in different ways:
- $w_{ij} = 1$ (Cavalli-Sforza & Edwards, 1967)
- $w_{ij} = 1/D_{ij}^2$ (Fitch & Margoliash, 1967)
- $w_{ij} = 1/D_{ij}$ (Beyer et al., 1974)
Aim: find the tree topology and branch lengths that minimize $Q$. The equation above is quadratic in the branch lengths. Take the derivative with respect to the branch lengths, set it to 0, and solve the resulting system of linear equations. The solution minimizes $Q$ for the given topology.
(Source: Doug Brutlag's course.)

Least squares method
Number the species in alphabetical order.
The expected distance between species A and D is $d_{14} = v_1 + v_7 + v_4$.
The expected distance between species B and E is $d_{25} = v_5 + v_6 + v_7 + v_2$.
(Figure: five-species tree with branches labeled $v_1$ to $v_7$.)

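Least squares method
Number all branches of the tree and introduce an indicator variable $x_{ijk}$:
$x_{ijk} = 1$ if branch $k$ lies on the path from species $i$ to species $j$,
$x_{ijk} = 0$ otherwise.
The expected distance between $i$ and $j$ is then
$$d_{ij} = \sum_k x_{ijk} v_k$$
and
$$Q = \sum_i \sum_{j \ne i} w_{ij} \left( D_{ij} - \sum_k x_{ijk} v_k \right)^2$$
For the case $w_{ij} = 1$ for all $i, j$, setting the derivative of $Q$ with respect to $v_k$ to zero gives
$$\frac{dQ}{dv_k} = -2 \sum_i \sum_{j \ne i} x_{ijk} \left( D_{ij} - \sum_m x_{ijm} v_m \right) = 0$$
Note: this yields one equation for each branch $k$.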

Least squares method
$D_{AB} + D_{AC} + D_{AD} + D_{AE} = 4v_1 + v_2 + v_3 + v_4 + v_5 + 2v_6 + 2v_7$
$D_{AB} + D_{BC} + D_{BD} + D_{BE} = v_1 + 4v_2 + v_3 + v_4 + v_5 + 2v_6 + 3v_7$
$D_{AC} + D_{BC} + D_{CD} + D_{CE} = v_1 + v_2 + 4v_3 + v_4 + v_5 + 3v_6 + 2v_7$
$D_{AD} + D_{BD} + D_{CD} + D_{DE} = v_1 + v_2 + v_3 + 4v_4 + v_5 + 2v_6 + 3v_7$
$D_{AE} + D_{BE} + D_{CE} + D_{DE} = v_1 + v_2 + v_3 + v_4 + 4v_5 + 3v_6 + 2v_7$
$D_{AC} + D_{AE} + D_{BC} + D_{BE} + D_{CD} + D_{DE} = 2v_1 + 2v_2 + 3v_3 + 2v_4 + 3v_5 + 6v_6 + 4v_7$
$D_{AB} + D_{AD} + D_{BC} + D_{CD} + D_{BE} + D_{DE} = 2v_1 + 3v_2 + 2v_3 + 3v_4 + 2v_5 + 4v_6 + 6v_7$
Stack up the $5 \cdot 4 / 2 = 10$ distances $D_{ij}$, in alphabetical order, into a vector $\mathbf{D}$. The coefficients $x_{ijk}$ are arranged in a matrix $\mathbf{X}$, with each row corresponding to the $D_{ij}$ in the same row of $\mathbf{D}$ and containing a 1 in column $k$ if branch $k$ occurs on the path between species $i$ and $j$.

Least squares method
If we also stack up the 7 branch lengths $v_i$ into a vector $\mathbf{v}$, and write $\mathbf{D}$ for the vector of observed distances, the previous set of linear equations can be compactly expressed as
$$\mathbf{X}^T \mathbf{D} = \left( \mathbf{X}^T \mathbf{X} \right) \mathbf{v}$$
Multiplying from the left by the inverse of $\mathbf{X}^T \mathbf{X}$, one can solve for the least squares branch lengths:
$$\mathbf{v} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{D}$$
This is the standard way of expressing least squares problems in matrix notation and solving them.

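Least squares method
For weighted least squares, collect the weights in a diagonal matrix, in the same order as the $D_{ij}$:
$$\mathbf{W} = \mathrm{diag}\left( w_{AB}, w_{AC}, \ldots, w_{DE} \right)$$
Then the least squares equations can be written
$$\mathbf{X}^T \mathbf{W} \mathbf{D} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right) \mathbf{v}$$
and their solution is
$$\mathbf{v} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \mathbf{D}$$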
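The matrix formulation can be checked with a small pure-Python sketch that builds the indicator matrix, forms the (weighted) normal equations, and solves them exactly with fractions. The tree shape is inferred from the coefficients of the linear equations on the earlier slide: branches 1 to 5 lead to tips A to E, branch 6 separates {C, E} and branch 7 separates {B, D} from the rest. The branch lengths used to generate test distances are made up for illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

TIPS = "ABCDE"
# Tips on one side of each branch (branches 1..7); topology is an assumption.
SPLITS = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}, {"C", "E"}, {"B", "D"}]
PAIRS = list(combinations(TIPS, 2))          # AB, AC, ..., DE (alphabetical)

# x_ijk = 1 if branch k separates i from j, i.e. lies on the path between them
X = [[int((i in s) != (j in s)) for s in SPLITS] for i, j in PAIRS]


def solve(A, b):
    """Solve A v = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [row[-1] for row in M]


def least_squares_branch_lengths(D, weights=None):
    """Solve (X^T W X) v = X^T W D for a fixed topology; W defaults to identity."""
    w = weights if weights is not None else [1] * len(D)
    rows, m = range(len(D)), len(SPLITS)
    XtWX = [[sum(w[r] * X[r][a] * X[r][b] for r in rows) for b in range(m)]
            for a in range(m)]
    XtWD = [sum(w[r] * X[r][a] * D[r] for r in rows) for a in range(m)]
    return solve(XtWX, XtWD)


true_v = [2, 3, 1, 4, 2, 1, 3]               # made-up branch lengths v1..v7
D = [sum(x * v for x, v in zip(row, true_v)) for row in X]
v_unweighted = least_squares_branch_lengths(D)
v_fitch_margoliash = least_squares_branch_lengths(D, [Fraction(1, d * d) for d in D])
print(D, v_unweighted, v_fitch_margoliash)
```

Because the test distances fit the tree exactly, both the unweighted and the $1/D_{ij}^2$-weighted solutions return the generating branch lengths. With real data the two weightings would generally give different estimates, and some estimated lengths may come out negative.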

Finding the least squares tree topology
Now that we are able to assign branch lengths to each tree topology, we need to search among tree topologies. This can be done with the same heuristic search methods that were presented for the maximum parsimony method.
Note: so far no one has presented a branch-and-bound method for finding the least squares tree exactly. Day (1986) has shown that this problem is NP-complete. The search is not only among tree topologies, but also among branch lengths.

Neighbor-joining method
Introduced by Saitou and Nei (1987). The algorithm works by clustering. It does not assume a molecular clock, but approximates the "minimum evolution" model.
"Minimum evolution" model: among the possible tree topologies, choose the one with minimal total branch length.
Neighbor-joining, like the least squares method, is guaranteed to recover the true tree if the distance matrix is an exact reflection of the tree.

Neighbor-joining method
(1) For each tip $i$ of the $n$ current tips, compute
$$u_i = \sum_{j \ne i} \frac{D_{ij}}{n - 2}$$
(2) Choose the $i$ and $j$ for which $D_{ij} - u_i - u_j$ is smallest.
(3) Join items $i$ and $j$. Compute the branch length from $i$ to the new node ($v_i$) and from $j$ to the new node ($v_j$) as
$$v_i = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_i - u_j), \qquad v_j = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_j - u_i)$$
(4) Compute the distance between the new node $(ij)$ and each of the remaining tips $k$ as
$$D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2}$$
(5) Delete tips $i$ and $j$ from the tables and replace them by the new node $(ij)$, which is now treated as a tip.
(6) If more than 2 nodes remain, go back to step (1). Otherwise, connect the two remaining nodes (e.g. $l$ and $m$) by a branch of length $D_{lm}$.
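Steps (1) to (6) translate almost directly into code. The sketch below is a minimal illustration, not an optimized implementation: distances are kept in a dictionary keyed by unordered pairs, new nodes are labeled by the tuple of the two items they join, and the example distance matrix is made up (it was generated from a five-species tree with tip branches A=2, B=3, C=1, D=4, E=2 and interior branches 1 and 3). The function name and data layout are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the unrooted tree as a list of (node, node, branch length).

    names : list of tip labels
    dist  : dict mapping frozenset({a, b}) -> distance between a and b
    """
    dist = dict(dist)
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # step 1: u_i = sum over j != i of D_ij / (n - 2)
        u = {i: sum(dist[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
             for i in nodes}
        # step 2: pair minimizing D_ij - u_i - u_j
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist[frozenset(p)] - u[p[0]] - u[p[1]])
        d_ij = dist[frozenset((i, j))]
        new = (i, j)
        # step 3: branch lengths from i and j to the new node
        edges.append((i, new, d_ij / 2 + (u[i] - u[j]) / 2))
        edges.append((j, new, d_ij / 2 + (u[j] - u[i]) / 2))
        # steps 4 and 5: distances to the new node, then replace i and j by it
        rest = [k for k in nodes if k != i and k != j]
        for k in rest:
            dist[frozenset((new, k))] = (
                dist[frozenset((i, k))] + dist[frozenset((j, k))] - d_ij) / 2
        nodes = rest + [new]
    # step 6: connect the last two nodes
    l, m = nodes
    edges.append((l, m, dist[frozenset((l, m))]))
    return edges


pairs = {"AB": 8, "AC": 4, "AD": 9, "AE": 5, "BC": 8,
         "BD": 7, "BE": 9, "CD": 9, "CE": 3, "DE": 10}   # made-up distances
D = {frozenset(p): Fraction(d) for p, d in pairs.items()}
edges = neighbor_joining(list("ABCDE"), D)
for a, b, length in edges:
    print(a, b, length)
```

Since the example matrix fits a tree exactly, the branch lengths used to generate it are recovered, which illustrates the guarantee that neighbor-joining recovers the true tree when the distances are an exact reflection of it. `Fraction` is used only to keep the arithmetic exact in this toy case.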

Limitations of distance methods
Distance matrix methods are the easiest phylogeny methods to program, and they are very fast.
Distance methods have problems when the evolutionary rates vary strongly among sites. One can correct for this in distance methods as well as in likelihood methods. When the variation of rates is large, these corrections become important.
In likelihood methods, the correction can use information from changes in one part of the tree to inform the correction in others. Once a particular part of the molecule is seen to change rapidly in the primates, this will affect the interpretation of that part of the molecule among the rodents as well.
A distance matrix method, however, is inherently incapable of propagating information in this way. When it looks at changes within the rodents, it has forgotten where changes were seen among the primates.