Algorithms in Computational Biology: Building Phylogenetic Trees (Department of Mathematics & Computer Science)

Presentation transcript:

Slide 1: Building Phylogenetic Trees

Slide 2: Phylogeny
- All organisms on Earth had a common ancestor
  - Evidence comes from morphological, biochemical, and gene sequence data
- Phylogeny
  - The history of organismal lineages as they change through time
- Phylogenetic tree
  - A tree showing the evolutionary relationships among various biological species
- All living organisms today, from the smallest microbes to the largest plants and animals, are connected by the passage of genes along the branches of the phylogenetic tree

Slide 3: Phylogenetic Tree of Life

Slide 4: Inferring Phylogenies
- Traditionally
  - Uses morphological characters (from both living and fossilized organisms)
- 1962
  - Zuckerkandl & Pauling showed that molecular sequences can be used to infer phylogenies
  - Assumes that current sequences descended from a common ancestral gene in a common ancestral species

Slide 5: Major Tree Building Algorithms
- Distance based
- Parsimony
- Maximum likelihood

Slide 6: Orthologue vs Paralogue
- Both are homologous genes (homologues)
- Orthologues are a set of genes that diverged from a common ancestor through speciation
  - Homologous genes from different species
- Paralogues are a set of genes that diverged from a common ancestor through gene duplication
  - Homologous genes from the same species

Slide 7: A Tree of Orthologues
- A tree of orthologues based on a set of alpha hemoglobins

Slide 8: A Tree of Paralogues

Slide 9: Background on Trees
- Nodes and edges
  - Internal nodes: unobserved ancestors
- Edge length
  - On average, corresponds to an evolutionary time period
  - Variations
    - Different proteins can change at different rates
    - The same sequence can evolve much faster in some organisms than in others
- Root of a phylogenetic tree
  - Ultimate ancestor of all species
  - Some algorithms provide the location of the root, while others don't

Slide 10: Counting and Labeling Trees
- Counting
  - For a rooted tree with n leaves
    - As we move up the tree, the edges coalesce as each new node is reached
    - In addition to the n leaves, there are n-1 other nodes (internal nodes plus the root node), for a total of 2n-1 nodes
    - There are 2n-2 edges (not counting the edge above the root node)
  - For an unrooted tree with n leaves
    - Total number of nodes = 2n-2
    - Total number of edges = 2n-3
- Labeling (for a rooted tree)
  - Label the leaves 1 to n
  - Label the branch nodes n+1 to 2n-2
  - Label the root 2n-1
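
The node and edge counts on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes a strictly binary (bifurcating) tree; the function name `tree_counts` is made up for illustration and is not from the slides.

```python
def tree_counts(n: int, rooted: bool = True) -> tuple[int, int]:
    """Return (number of nodes, number of edges) for a binary tree with n leaves.

    Assumes every internal node has exactly two daughters (rooted case)
    or exactly three neighbours (unrooted case).
    """
    if n < 2:
        raise ValueError("need at least two leaves")
    if rooted:
        # n leaves + (n - 1) internal nodes including the root
        return 2 * n - 1, 2 * n - 2
    # Removing the root merges its two edges into one
    return 2 * n - 2, 2 * n - 3


print(tree_counts(4))                # rooted tree with 4 leaves
print(tree_counts(4, rooted=False))  # unrooted tree with 4 leaves
```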

Slide 11: Rooting an Unrooted Tree

Slide 12: How Many Possible Topologies?

# of leaves | Ways to add nth leaf | # of edges in the sub-tree | # of unrooted trees
4           | 3                    | 5                          | 3
5           | 5                    | 7                          | 3x5
6           | 7                    | 9                          | 3x5x7
7           | 9                    | 11                         | 3x5x7x9
…           | …                    | …                          | …
n           | 2n-5                 | 2n-3                       | 3x5x7x9x…x(2n-5) = (2n-5)!!

# of rooted trees: (2n-3)!!
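
To get a feel for how fast these counts grow, the double factorials can be evaluated directly. The sketch below assumes labeled leaves and binary trees, as on the slide; the function names are illustrative only.

```python
def odd_double_factorial(m: int) -> int:
    """Product 1 x 3 x 5 x ... x m for odd m (returns 1 when m <= 1)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def num_unrooted_trees(n: int) -> int:
    """(2n-5)!! unrooted binary topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least three leaves")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """(2n-3)!! rooted binary topologies on n >= 2 labeled leaves."""
    if n < 2:
        raise ValueError("need at least two leaves")
    return odd_double_factorial(2 * n - 3)


for n in range(3, 11):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Note that the number of rooted trees on n leaves equals the number of unrooted trees on n+1 leaves, since a root can be placed on any of the 2n-3 edges.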

Slide 13: Making a Tree from Pairwise Distances
- Distance measure
  - First find f, the fraction of differences between two sequences, presupposing an alignment of the two sequences
  - The fraction of differences expected by chance (by random substitution) is about 3/4
  - Jukes-Cantor distance (odds ratio)
- Clustering methods
  - UPGMA
  - Neighbor-joining
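
The transcript does not preserve the formula for the Jukes-Cantor distance. The sketch below assumes its standard form, d = -(3/4) ln(1 - 4f/3), which grows without bound as f approaches the chance level of 3/4. The function names and the choice to skip gap columns are assumptions made for illustration.

```python
import math


def fraction_differences(seq1: str, seq2: str) -> float:
    """Fraction f of aligned, gap-free columns at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(f: float) -> float:
    """Jukes-Cantor distance for a fraction of differences f (0 <= f < 3/4)."""
    if not 0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * f / 3)


f = fraction_differences("ACGTACGT", "ACGTACGA")
print(f, jukes_cantor(f))
```

For small f the corrected distance is close to f itself; the correction matters as sequences become more divergent.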

Slide 14: Unweighted Pair Group Method Using Arithmetic Average (UPGMA) [Sokal & Michener, 1958]
Overview
1. Cluster the sequences
2. Amalgamate two clusters at each stage, creating a new node on the tree
3. Assemble the tree upwards, each node being added above the others
4. The edge length is determined by the difference in the heights of the nodes at the top and bottom of an edge

Slide 15: Distance Measure Used in UPGMA
- The distance between two clusters $C_i$ and $C_j$ is the average distance between pairs of sequences, one from each cluster:
  $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$
- The distance between two clusters $C_k$ and $C_l$, if $C_k$ is the union of two clusters $C_i$ and $C_j$:
  $d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$

Slide 16: Algorithm: UPGMA
- Initialization
  - Assign each sequence i to its own cluster $C_i$
  - Define one leaf of T for each sequence, and place it at height zero
- Iteration
  - Determine the two clusters i, j for which $d_{ij}$ is minimal (if there are ties, pick one randomly)
  - Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l using the arithmetic average
  - Define a node k with daughter nodes i and j, and place it at height $d_{ij}/2$
  - Add k to the current clusters and remove i and j
- Termination
  - When only two clusters i, j remain, place the root at height $d_{ij}/2$
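
The steps above can be sketched in Python as follows. This is an illustrative implementation, not the author's code: the distance matrix is assumed to be a dict keyed by `frozenset({a, b})`, the tree is returned as nested tuples, internal node names such as `_node0` are invented, and ties are broken by taking the first minimal pair rather than a random one.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns (tree, root_height), where tree is a nested tuple of leaf labels.
    """
    d = dict(dist)                                    # working copy, extended as clusters merge
    clusters = {name: (name, 1) for name in names}    # id -> (subtree, size)
    height = 0.0
    next_id = 0
    while len(clusters) > 1:
        # Closest pair of current clusters (first one found wins ties)
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        k = f"_node{next_id}"
        next_id += 1
        # Size-weighted average keeps d_kl equal to the mean pairwise distance
        for l in clusters:
            d[frozenset((k, l))] = (
                n_i * d[frozenset((i, l))] + n_j * d[frozenset((j, l))]
            ) / (n_i + n_j)
        clusters[k] = ((tree_i, tree_j), n_i + n_j)
        height = dij / 2                              # new node sits at height d_ij / 2
    tree, _ = next(iter(clusters.values()))
    return tree, height


pairs = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
dist = {frozenset(p): v for p, v in pairs.items()}
print(upgma(["A", "B", "C"], dist))
```

The loop also performs the termination step: the last merge of the two remaining clusters places the root at half their distance.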

Slide 17: An Example

Slide 18: An Example (cont'd)

Slide 19: Molecular Clock Assumption in UPGMA
- UPGMA produces a rooted tree
- Edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  - The sum of times down a path to the leaves from any node is the same, whatever the path
- The distances $d_{ij}$ are said to be ultrametric if, for any triplet of sequences $x_i$, $x_j$, $x_k$, the distances $d_{ij}$, $d_{jk}$, $d_{ik}$ are either all equal, or two are equal and the remaining one is smaller
  - True for a tree with a molecular clock
- Implied additivity
  - The edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
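
The ultrametric condition is easy to test on a distance matrix: in every triplet, the two largest distances must be equal. The sketch below assumes distances stored in a dict keyed by `frozenset({a, b})` and uses a small tolerance for floating-point data; both choices are illustrative.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition for every triplet of leaves.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance.
    """
    for x, y, z in combinations(names, 3):
        _, middle, largest = sorted(
            [dist[frozenset((x, y))], dist[frozenset((y, z))], dist[frozenset((x, z))]]
        )
        # All equal, or two equal and the third smaller: the top two must tie
        if abs(largest - middle) > tol:
            return False
    return True


clock_like = {frozenset(p): v for p, v in
              {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4}.items()}
print(is_ultrametric(["A", "B", "C"], clock_like))
```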

Slide 20: Molecular Clocks
- Mutations may build up in any given stretch of DNA at a reliable rate
- If the rate of mutation of a gene is reliable, this gene can be used as a molecular clock
- Such a gene can be a powerful tool for estimating the dates of lineage-splitting events

Slide 21: Example
- The entire length of DNA of a gene changes at a rate of approximately one base per 25 million years

Slide 22: What If the Molecular Clock Property Fails?
[Figure: a tree that is reconstructed incorrectly by UPGMA (right)]

Slide 23: Additivity
- Given a tree, its edge lengths are additive
  - If the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
- Built-in assumption in UPGMA

Slide 24: Test for Additivity
- For every set of four leaves 1, 2, 3, and 4, two of the three distances $d_{12} + d_{34}$, $d_{13} + d_{24}$, and $d_{14} + d_{23}$ must be equal and larger than the third
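
This four-point test translates directly into code. The sketch below assumes the same dict-of-`frozenset` distance representation as a convenience and treats "equal" up to a small tolerance; like the ultrametric check, it only requires that the two largest sums tie, so a degenerate case where all three sums are equal is also accepted. The example distances come from a hypothetical tree with leaf edges 1, 2, 3, 4 and an internal edge of length 1 separating leaves {1, 2} from {3, 4}.

```python
from itertools import combinations


def is_additive(names, dist, tol=1e-9):
    """Check the four-point condition for every quartet of leaves."""
    def d(a, b):
        return dist[frozenset((a, b))]

    for w, x, y, z in combinations(names, 4):
        sums = sorted([d(w, x) + d(y, z), d(w, y) + d(x, z), d(w, z) + d(x, y)])
        # The two largest sums must coincide
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


pairs = {("1", "2"): 3, ("1", "3"): 5, ("1", "4"): 6,
         ("2", "3"): 6, ("2", "4"): 7, ("3", "4"): 7}
dist = {frozenset(p): v for p, v in pairs.items()}
print(is_additive(["1", "2", "3", "4"], dist))
```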

Slide 25: Joining a Pair of Neighboring Leaves
[Figure: node k joins leaf nodes i and j; m is another leaf]
- $d_{im} = d_{ik} + d_{km}$
- $d_{jm} = d_{jk} + d_{km}$
- $d_{ij} = d_{ik} + d_{jk}$
- Hence $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$

Slide 26: Closest Pairs of Leaves Are Not Necessarily Neighboring Leaves
[Figure: example tree with its table of distances d]

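Slide 27: Compensation for Long Edges
- Subtract the averaged distances to all other leaves: $D_{ij} = d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$
- In the example: $r_1 = 0.7$, $r_2 = 0.7$, $r_3 = 1$, $r_4 = 1$
[Table of corrected distances D]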

Slide 28: Algorithm: Neighbor-Joining
- Initialization
  - Define T to be the set of leaf nodes, one for each given sequence, and put L = T
- Iteration
  - Pick a pair i, j in L for which $D_{ij}$ is minimal
  - Define a new node k and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all m in L
  - Add k to T with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining k to i and j, respectively
  - Remove i and j from L and add k
- Termination
  - When L consists of two leaves i and j, add the remaining edge between i and j, with length $d_{ij}$
- Produces an unrooted tree
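
A compact Python sketch of neighbor-joining follows. It assumes the usual definitions $D_{ij} = d_{ij} - (r_i + r_j)$ and $r_i = \frac{1}{|L|-2}\sum_{k \in L} d_{ik}$, which the transcript does not spell out. The dict-of-`frozenset` distance representation, the `_nodeN` names for internal nodes, and the edge-list return value are illustrative choices, and ties are resolved by taking the first minimal pair.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree; returns a list of edges (node_a, node_b, length).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b.
    """
    d = dict(dist)                       # working copy, extended with new nodes

    def get(a, b):
        return d[frozenset((a, b))]

    active = list(names)                 # the set L of the slide
    edges = []
    next_id = 0
    while len(active) > 2:
        # r_i: averaged distance from i to all other active nodes
        r = {i: sum(get(i, m) for m in active if m != i) / (len(active) - 2)
             for i in active}
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j)
        i, j = min(combinations(active, 2),
                   key=lambda p: get(*p) - r[p[0]] - r[p[1]])
        k = f"_node{next_id}"
        next_id += 1
        d_ij = get(i, j)
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        active = [m for m in active if m not in (i, j)]
        for m in active:
            d[frozenset((k, m))] = 0.5 * (get(i, m) + get(j, m) - d_ij)
        active.append(k)
    i, j = active
    edges.append((i, j, get(i, j)))      # final edge joins the last two nodes
    return edges


pairs = {("1", "2"): 3, ("1", "3"): 5, ("1", "4"): 6,
         ("2", "3"): 6, ("2", "4"): 7, ("3", "4"): 7}
dist = {frozenset(p): v for p, v in pairs.items()}
for edge in neighbor_joining(["1", "2", "3", "4"], dist):
    print(edge)
```

The example distances are additive (leaf edges 1, 2, 3, 4 and an internal edge of length 1), so the edge lengths of the generating tree are recovered.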

Slide 29: Rooting Trees
- Outgroup
  - A species known to be more distantly related to each of the remaining species than they are to each other
- Find the root by adding an outgroup
  - The point in the tree where the edge to the outgroup joins is expected to be the best root candidate
- In the absence of a convenient outgroup, methods are quite ad hoc
  - E.g. picking the midpoint of the longest chain of consecutive edges, provided the deviation from a molecular clock is not too great

Slide 30: Assumptions Used by UPGMA and Neighbor-Joining
- UPGMA (molecular clock with implied additivity)
  - The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  - The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
  - The distance from an internal node to a leaf node will always be the same no matter which path is taken
- Neighbor-joining
  - It is possible for the molecular clock property to fail but for additivity to hold
  - Assumes additivity only

Slide 31: Parsimony
- Among the most widely used tree building algorithms
- It works by finding the tree that can explain the observed sequences with a minimum number of substitutions
- Two components to the algorithm
  1. The computation of a cost for a given tree T
  2. A search through all trees, to find the overall minimum of this cost

Slide 32: Notations Used in Weighted Parsimony
- $S_k(a)$ denotes the minimal cost for the assignment of a to node k
- $S(a, b)$: cost for each substitution of a by b

Slide 33: Algorithm: Weighted Parsimony
Computes the minimum cost at site u [Sankoff & Cedergren 1983]
- Initialization
  - Set k = 2n-1, the number of the root node
- Recursion: compute $S_k(a)$ for all a as follows
  - If k is a leaf node: set $S_k(a) = 0$ for $a = x_u^k$, and $S_k(a) = \infty$ otherwise
  - If k is not a leaf node: compute $S_i(b)$, $S_j(b)$ for all b at the daughter nodes i, j, and define $S_k(a) = \min_b (S_i(b) + S(a, b)) + \min_b (S_j(b) + S(a, b))$
- Termination
  - Minimal cost of the tree = $\min_a S_{2n-1}(a)$
- Weighted parsimony reduces to traditional parsimony if $S(a, a) = 0$ for all a and $S(a, b) = 1$ for all $a \neq b$
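
The recursion can be written as a short post-order traversal. In this sketch the tree for one site is a nested tuple whose leaves are single characters, and `cost(a, b)` plays the role of S(a, b). That representation and the function names are assumptions for illustration, not part of the slides.

```python
import math


def weighted_parsimony(tree, cost, alphabet):
    """Minimal substitution cost of a rooted binary tree at one site.

    tree:     a leaf is a one-character string; an internal node is a
              tuple (left_subtree, right_subtree).
    cost:     function cost(a, b) giving the price of substituting a by b.
    alphabet: iterable of allowed characters.
    """
    alphabet = list(alphabet)

    def S(node):
        if isinstance(node, str):
            # Leaf: only the observed character is free, all others are impossible
            return {a: 0 if a == node else math.inf for a in alphabet}
        left, right = (S(child) for child in node)
        return {
            a: min(left[b] + cost(a, b) for b in alphabet)
               + min(right[b] + cost(a, b) for b in alphabet)
            for a in alphabet
        }

    return min(S(tree).values())


def unit_cost(a, b):
    """Traditional parsimony as a special case: 0 on the diagonal, 1 elsewhere."""
    return 0 if a == b else 1


print(weighted_parsimony((("A", "B"), ("A", "B")), unit_cost, "AB"))
```

Passing a different `cost` function, for example one that charges more for transversions than for transitions, gives the weighted variant without changing the recursion.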

Slide 34: Algorithm: Traditional Parsimony [Fitch 1971]
- Initialization
  - Set C = 0 and k = 2n-1
- Recursion: to obtain the set $R_k$
  - If k is a leaf node: set $R_k = \{x_u^k\}$
  - If k is not a leaf node: compute $R_i$, $R_j$ for the daughter nodes i, j of k, and set $R_k = R_i \cap R_j$ if this intersection is not empty, or else set $R_k = R_i \cup R_j$ and increment C
- Termination
  - Minimal cost of the tree = C
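
Fitch's set-based recursion is even shorter in code. As an illustrative assumption, the tree for one site is a nested tuple whose leaves are single characters; the function returns both the cost C and the root set.

```python
def fitch(tree):
    """Traditional parsimony cost at one site.

    tree: a leaf is a one-character string; an internal node is a
          tuple (left_subtree, right_subtree).
    Returns (cost, root_set).
    """
    cost = 0

    def R(node):
        nonlocal cost
        if isinstance(node, str):
            return {node}
        left, right = (R(child) for child in node)
        common = left & right
        if common:
            return common          # daughters agree on at least one residue
        cost += 1                  # disagreement: one substitution is needed
        return left | right

    root_set = R(tree)
    return cost, root_set


print(fitch((("A", "B"), ("A", "B"))))
```

With unit substitution costs this gives the same minimal cost as the weighted algorithm, while only tracking sets of candidate residues.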

Slide 35: Parsimony Example
[Figure: a tree with leaves labeled A and B and root set {A, B}; minimum cost = 2; the labelings shown are those obtained by traditional parsimony]

Slide 36: Parsimony Example (cont'd)
[Figure: a minimum cost tree that is not obtained by traditional parsimony]

Slide 37: Enumeration of Unrooted Trees
- Enumerate all unrooted trees by an array $[i_3][i_5][i_7][i_9] \ldots [i_{2n-5}]$
- Take the unrooted tree with 3 sequences x1, x2, and x3, and add an edge for x4 on the edge labeled by $i_3$. Since the new edge divides the preexisting edge in two, the total number of edges is now 5
- The value of $i_5$ determines which of these edges x5 is added to
- Think of $[i_3][i_5][i_7][i_9] \ldots [i_{2n-5}]$ as an odometer …

Slide 38: Counting Trees (cont'd)
- Counting complete trees
  - The rightmost index advances until it reaches 2n-5
  - The next-to-rightmost index clicks forward by 1 when the rightmost index goes back to 1
  - The third index from the right clicks forward by 1 once the next-to-rightmost index has reached 2n-7 and goes back to 1
  - And so on
- Counting both complete and incomplete trees
  - Add 0 as a value for each array index, meaning that there is no edge of the order specified by the counter

Slide 39: Selecting Labeled Branching Patterns by Branch and Bound
- Start from the odometer setting [1][0][0]…[0]
- Let the smallest cost found so far for a complete tree be C
- Branch and bound
  - Adding more leaves can only increase the cost
  - There is no point branching out if the current cost is larger than the minimum cost so far
- Implementation trick
  - Whenever the cost of the current subtree T is more than C, we know that T is not part of the optimal tree
  - If all the counters to the right of a given non-zero counter are 0, instead of advancing them all to 1 we can click the rightmost non-zero counter one forward

Slide 40: An Example of Branch-and-Bound
- Skip 3…70001 to 3…7(2n-11)(2n-9)(2n-7)(2n-5) and go directly to 3…80000 if the cost of 3…70000 is higher than the minimum cost found so far
- ……

Slide 41: Assessing the Trees: the Bootstrap
- Bootstrapping (sampling with replacement)
  - Given a dataset consisting of an alignment of sequences, generate an artificial dataset by picking columns from the alignment at random with replacement
  - Generate a large number (on the order of thousands) of artificial alignment datasets
  - For each artificially generated dataset, build a tree
- Assessing phylogenetic features
  - Find the frequency with which each phylogenetic feature appears in the thousands of trees generated above
  - The higher the frequency, the more confidence we have in a phylogenetic feature
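
The resampling step can be sketched as below. The alignment is assumed to be a list of equal-length strings, and `infer_features` is a hypothetical callback (not from the slides) that builds a tree from an alignment by any method and returns the features, such as clades, found in it.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original width."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def bootstrap_support(alignment, infer_features, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each feature appears.

    infer_features: hypothetical user-supplied function mapping an alignment
                    to an iterable of hashable features of the inferred tree.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        replicate = bootstrap_alignment(alignment, rng)
        counts.update(set(infer_features(replicate)))
    return {feature: n / replicates for feature, n in counts.items()}


alignment = ["ACGTAC", "ACGAAC", "TCGAAG"]
print(bootstrap_alignment(alignment, random.Random(1)))
```

Whole columns are resampled, so the correspondence between the sequences at each site is preserved in every replicate.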

Slide 42: Describing a New Hampshire Standard Tree
- Tree file representation of the rooted tree shown on the slide, starting at the beginning of the file
  - Topology only: (B,(A,C,E),D);
  - With branch lengths: (B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
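
The two strings above can be produced from a simple in-memory tree. In this sketch each node is a hypothetical `(label, branch_length, children)` tuple, a representation chosen only for illustration; the writer nests children in parentheses, appends `:length` when requested, and ends the string with a semicolon.

```python
def to_newick(root, with_lengths=True):
    """Serialize a tree of (label, branch_length, children) tuples to a Newick string."""
    def fmt(node):
        label, length, children = node
        if children:
            text = "(" + ",".join(fmt(child) for child in children) + ")" + label
        else:
            text = label
        if with_lengths and length is not None:
            text += f":{length}"
        return text

    return fmt(root) + ";"


# The tree from the slide; the root has no label and no branch length
tree = ("", None, [
    ("B", 6.0, []),
    ("", 5.0, [("A", 5.0, []), ("C", 3.0, []), ("E", 4.0, [])]),
    ("D", 11.0, []),
])
print(to_newick(tree, with_lengths=False))
print(to_newick(tree))
```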

Slide 43: Visualize Trees: Phylip DrawTree

Slide 44: Visualize Trees: Cladogram

Slide 45: Visualize Trees: Phenogram

Slide 46: Visualize Trees: Curve-O-Gram

Slide 47: Visualize Trees: Eurogram

Slide 48: Programs to Build Phylogenetic Trees
- PAUP
  - Includes parsimony, maximum likelihood, and distance methods
- Phylip
  - Includes parsimony, distance matrix, and likelihood methods, as well as bootstrapping and consensus trees
- MrBayes
  - Bayesian estimation of phylogeny
  - Uses a simulation technique called Markov chain Monte Carlo (MCMC) to approximate the posterior probabilities of trees
- NoTung
  - Incorporates duplication/loss parsimony into phylogenetic tasks
- ……