Chapter 7: Building Phylogenetic Trees

Presentation transcript:

1 Chapter 7 Building Phylogenetic Trees

2 Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances –UPGMA method (+ an example) –Neighbor-Joining method (+ an example) Comparison of methods Conclusion

3 Phylogeny Phylogeny is the evolution of related species/genes. Phylogenetic tree: a diagram showing the evolutionary lineages of species/genes. The history of genes and the history of species may be very different. Genes can be homologous or analogous and still resemble each other.

4 Phylogeny The similarity of molecular mechanisms of the organisms that have been studied strongly suggests that all organisms on Earth had a common ancestor Any set of species is related, and this relationship is called a phylogeny The relationship can be represented by a phylogenetic tree

5 Phylogeny Traditionally, morphological characters (both from living and fossilized organisms) have been used for inferring phylogenies. Zuckerkandl & Pauling (1962) showed that molecular sequences provide sets of characters that can carry a large amount of information. If we have a set of sequences from different species, we may be able to use them to infer a likely phylogeny of the species in question. This assumes that the sequences have descended from some common ancestral gene in a common ancestral species.

6 Phylogeny The widespread occurrence of gene duplication means that the foregoing assumption needs to be checked carefully. The phylogenetic tree of a group of sequences does not necessarily reflect the phylogenetic tree of their host species, because gene duplication is another mechanism, in addition to speciation, by which two sequences can be separated and diverge from a common ancestor.

7 Phylogeny Genes which diverged because of speciation are called orthologues. Genes which diverged by gene duplication are called paralogues.

8 Phylogeny Homologous sequences can be divided into two classes –Orthologous sequences diverged by speciation from a common ancestor –Paralogous sequences evolved by gene duplication within a species. Analogous sequences may appear and function very similarly, but they do not have a common ancestor. WHEN WE WANT TO EXPLORE EVOLUTIONARY RELATIONSHIPS, WE NEED TO HANDLE ORTHOLOGOUS SEQUENCES

9 Genes: homologous (orthologous or paralogous) versus analogous

10 Orthologues / Paralogues

11 Orthology/paralogy Orthologous genes are homologous (corresponding) genes in different species (genomes) Paralogous genes are homologous genes within the same species (genome)

12 Phylogenetic Trees WHY construct a phylogenetic tree? –to understand the lineage of various species –to understand how various functions evolved –to inform multiple alignments. Trees can be rooted (a common ancestor is known) or unrooted. Leaves are the terminal nodes that correspond to the observed sequences of genes or species (A, B, C, D). Internal nodes are hypothetical ancestral nodes. All trees will be assumed to be binary, meaning that an edge that branches splits into two daughter edges. Each edge has a certain amount of evolutionary divergence associated with it, defined by some measure of distance between sequences, or from a model of substitution of residues over the course of evolution.

13

14 Phylogenetic Trees We adopt the general term “length” or “edge length” here, and represent this by the lengths of the edges in the figures we draw. A true biological phylogeny has a “root”, or ultimate ancestor of all the sequences. The leaves of trees have names or numbers. A tree with a given labelling will be called a labelled branching pattern. We refer to this as the tree topology and denote it by the symbol T. The lengths of its edges are denoted by $t_i$, with a suitable numbering scheme for the i's.

15 Rooted / Unrooted Tree

16 Types of trees An unrooted tree represents the same phylogeny without the root node. Depending on the model, data from present-day species often do not distinguish between different placements of the root.

17 Rooted versus unrooted trees [Figure: rooted trees a, b and c, and one unrooted tree that represents all three rooted trees]

18 Rooting the tree: To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D. [Figure: an unrooted tree on taxa A, B, C, D with the root position marked, next to the resulting rooted tree]

19 Counting Trees

20 Counting Trees $(2N-5)!!$ = number of unrooted trees for N taxa; $(2N-3)!!$ = number of rooted trees for N taxa. [Figure: example trees on taxa A through F]

21 How many trees? Number of unrooted trees = $\frac{(2n-5)!}{2^{n-3}\,(n-3)!} = 3 \times 5 \times \dots \times (2n-5)$. Number of rooted trees = $\frac{(2n-3)!}{2^{n-2}\,(n-2)!} = 3 \times 5 \times \dots \times (2n-3)$.

22 Combinatoric explosion
# sequences | # unrooted trees | # rooted trees
8 | 10,395 | 135,135
9 | 135,135 | 2,027,025
10 | 2,027,025 | 34,459,425
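The counts above can be reproduced with a few lines of Python. This is only a sketch of the double-factorial formulas from the previous slides; the function names are made up for illustration.

```python
def odd_product(m: int) -> int:
    """Product 1 * 3 * 5 * ... * m over odd integers (1 if m < 3)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def num_unrooted_trees(n: int) -> int:
    """(2n - 5)!! labelled unrooted binary trees on n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return odd_product(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """(2n - 3)!! labelled rooted binary trees on n >= 2 leaves."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return odd_product(2 * n - 3)


for n in (8, 9, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Note that the number of rooted trees on n leaves equals the number of unrooted trees on n + 1 leaves, which is why each row's second count reappears as the next row's first count.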

23 Phylogenetic trees Different ways to represent a phylogenetic tree (illustrated by Treeview)

24 Making a tree from pairwise distances Distances $d_{ij}$ between each pair of sequences i and j in the given dataset are calculated. Different ways of defining distances –For nucleotide sequences: Jukes-Cantor, Kimura 2-parameter (K2P), HKY (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general time-reversible model, general 12-parameter model –For amino acid sequences: PAM matrices, BLOSUM matrices. [Figure: empty distance matrix with rows and columns A, B, C, D]
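As one concrete case from the list above, the Jukes-Cantor distance corrects the observed fraction of differing sites p to $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The sketch below assumes two already-aligned nucleotide strings of equal length and does no gap handling; the function names are illustrative, not from the slides.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)


def jukes_cantor_distance(seq1: str, seq2: str) -> float:
    """Jukes-Cantor corrected distance; infinite once p reaches 3/4."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions hitting the same site.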

25 Distance matrix methods UPGMA –Algorithm introduced by Sokal and Michener 1958 Neighbor-Joining –Algorithm introduced by Saitou and Nei 1987 –Modified by Studier and Keppler 1988

26 Clustering method: UPGMA UPGMA = Unweighted Pair Group Method using Arithmetic averages. A simple method. It works by clustering the sequences, at each stage connecting two clusters and creating a new node on the tree. The method assumes an equal rate of evolutionary change along all branches ⇒ molecular clock assumption

27 UPGMA UPGMA produces a rooted tree. Branch lengths satisfy a molecular clock ⇒ the divergence of sequences is assumed to occur at the same constant rate at all points in the tree. Trees that are clocklike are rooted, and the total branch length from the root to any leaf is equal. Such trees are often called ultrametric. A distance measure is ultrametric if, for any three sequences i, j and k, either all three distances are equal, $d_{ij} = d_{ik} = d_{jk}$, or two of them are equal and the third is smaller, $d_{jk} < d_{ij} = d_{ik}$ ⇒ UPGMA is guaranteed to build the correct tree if the distances are ultrametric. The method can be used for reconstructing phylogenies if evolutionary rates are assumed to be the same in all lineages ⇒ criticism in the phylogeny literature –Suitable for closely related species. Running time $O(n^2)$. [Figure: rooted tree with leaves A, C, B, D]

28 Algorithm: UPGMA Initialisation: Assign each sequence i in the dataset to its own cluster $C_i$. Define one leaf of T for each sequence, and place it at height zero. Iteration: Find the two clusters i and j for which $d_{ij}$ is the smallest (pick randomly if several distances are equal). Define a new cluster ij by $C_{ij} = C_i \cup C_j$. Cluster ij has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$). Connect i and j on the tree to a new node v. The new node is placed at height $d_{ij}/2$, which fixes the branch lengths from v to i and j.

29 Algorithm: UPGMA (cont.) Iteration (cont.): Compute the distances between the new cluster ij and each remaining cluster k using $d_{(ij)k} = \frac{n_i d_{ik} + n_j d_{jk}}{n_i + n_j}$. Add ij to the current clusters and remove i and j. Termination: When only two clusters i and j remain, place the root at height $d_{ij}/2$.
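The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the usual size-weighted update $d_{(ij)k} = (n_i d_{ik} + n_j d_{jk})/(n_i + n_j)$ and node height $d_{ij}/2$, breaks ties by taking the first pair found rather than randomly, and uses a naive search instead of an optimised one. The function name and the bracketed cluster labels are choices made here.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return the merges as (cluster_i, cluster_j, node_height) tuples."""
    size = {name: 1 for name in names}  # n_i for each current cluster
    d = {frozenset((names[a], names[b])): float(matrix[a][b])
         for a, b in combinations(range(len(names)), 2)}
    merges = []
    while len(size) > 1:
        # Closest pair of clusters (first minimum wins on ties).
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        d_ij = d[frozenset((i, j))]
        new = f"({i},{j})"
        merges.append((i, j, d_ij / 2))  # new node sits at height d_ij / 2
        # Size-weighted average distance from the new cluster to the others.
        for k in size:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    size[i] * d[frozenset((i, k))]
                    + size[j] * d[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[new] = size.pop(i) + size.pop(j)
    return merges


example = [[0, 8, 7, 12],
           [8, 0, 9, 14],
           [7, 9, 0, 11],
           [12, 14, 11, 0]]
for step in upgma(["A", "B", "C", "D"], example):
    print(step)
```

On the four-item matrix used in the worked example that follows, this joins A and C at height 3.5, then B at height 4.25, and places the root at about 6.17.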

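30 UPGMA -- Unweighted Pair Group Method with Arithmetic mean. The simplest method; uses a sequential clustering algorithm (assumes rate constancy among lineages, which is often violated). Step 1: from the distance matrix with entries $d_{AB}$, $d_{AC}$ and $d_{BC}$, join A and B at height $d_{AB}/2$. Step 2: in the reduced matrix, $d_{(AB)C} = (d_{AC} + d_{BC})/2$, and C joins (AB) at height $d_{(AB)C}/2$. [Figure: distance matrices and trees for step 1 and step 2]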

31 UPGMA -- Illustrations

32 An example UPGMA (1) Distance matrix (arbitrary) for four items (sequences) A, B, C and D. These distances are in fact not ultrametric: for the triplet A, B, C, for example, the three distances 8, 7 and 9 are all different, so the two largest are not equal.
   A   B   C   D
A  0   8   7   12
B  8   0   9   14
C  7   9   0   11
D  12  14  11  0
Step 1. Find the smallest distance $d_{ij}$ between two clusters ⇒ A and C, where $d_{ij}$ is 7

33 An example UPGMA (2) Step 2. Define the new cluster ij, which has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$). New cluster ⇒ A and C, $n_{AC} = n_A + n_C = 2$. Step 3. Connect A and C on the tree to a new node v1. Step 4. The branch lengths from the new node v1 to A and C are both 3.5. [Figure: A and C joined at height 3.5, next to the upper triangle of the distance matrix]

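34 An example UPGMA (3) Step 5. Compute the distances between the new cluster AC and the remaining clusters (B and D): $d_{(AC)B} = (d_{AB} + d_{CB})/2 = (8 + 9)/2 = 8.5$ and $d_{(AC)D} = (d_{AD} + d_{CD})/2 = (12 + 11)/2 = 11.5$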

35 An example UPGMA (4) Step 6. Delete the columns and rows of the distance matrix that correspond to clusters A and C, and add a column and a row for cluster AC ⇒ new distance matrix:
    AC  B    D
AC  0   8.5  11.5
B       0    14
D            0

36 An example UPGMA (5) 2nd iteration. Step 1. Find the two clusters i and j for which $d_{ij}$ is the smallest (pick randomly if several distances are equal) ⇒ AC and B. Step 2. Define the new cluster (ij), which has $n_{ij} = n_i + n_j$ members. New cluster ⇒ AC and B, $n_{ACB} = n_{AC} + n_B = 2 + 1 = 3$. Step 3. Connect AC and B on the tree to a new node v2. Step 4. The new node v2 is placed at height $d_{(AC)B}/2 = 4.25$, which gives the branch lengths from v2 to AC and B. [Figure: B joined to the cluster (A, C) at height 4.25]

37 An example UPGMA (6) Step 5. Compute the distance between the new cluster and the remaining cluster (D): $d_{(ACB)D} = (2 \times 11.5 + 1 \times 14)/3 \approx 12.33$. Step 6. Delete the columns and rows of the distance matrix that correspond to clusters AC and B, and add a column and a row for cluster ACB ⇒ new distance matrix:
     ACB  D
ACB  0    12.33
D         0

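38 An example UPGMA (7) Termination: Only two clusters (ACB and D) remain. Place the root at height $d_{(ACB)D}/2 \approx 6.17$. Original distance matrix and final phylogenetic tree (including the branch lengths): A and C join at v1 (height 3.5), the branch from v1 up to v2 has length 0.75, B joins at v2 (height 4.25), the branch from v2 up to the root has length 1.92, and D joins at the root.
   A  B  C  D
A  0  8  7  12
B     0  9  14
C        0  11
D           0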

39 When UPGMA fails …

40 When UPGMA fails … The closest leaves are not neighboring leaves; they do not have a common parent node. A test of whether the reconstruction is likely to be correct is the ultrametric condition: a distance measure is ultrametric if, for any three leaves i, j and k, either all three distances are equal, $d_{ij} = d_{ik} = d_{jk}$, or two of them are equal and the third is smaller, $d_{jk} < d_{ij} = d_{ik}$

41 Ultrametric Distances Given three leaves, two distances are equal while a third is smaller: $d(i,j) \le d(i,k) = d(j,k)$, i.e. $a + a \le a + b = a + b$. Nodes i and j are at the same evolutionary distance from k; the dendrogram will therefore have ‘aligned’ leaves, i.e. they are all at the same distance from the root. [Figure: leaves i and j on edges of length a, leaf k on an edge of length b]
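The three-point condition can be checked directly on a distance matrix. The sketch below assumes a symmetric matrix given as a list of lists and compares the two largest distances of every triplet within a small tolerance; the function name and the tolerance parameter are illustrative.

```python
from itertools import combinations


def is_ultrametric(matrix, tol=1e-9):
    """True if, in every triplet, the two largest distances are equal."""
    n = len(matrix)
    for i, j, k in combinations(range(n), 3):
        _, second, largest = sorted(
            (matrix[i][j], matrix[i][k], matrix[j][k])
        )
        if abs(largest - second) > tol:
            return False
    return True


clocklike = [[0, 2, 6],
             [2, 0, 6],
             [6, 6, 0]]
arbitrary = [[0, 8, 7, 12],
             [8, 0, 9, 14],
             [7, 9, 0, 11],
             [12, 14, 11, 0]]
print(is_ultrametric(clocklike), is_ultrametric(arbitrary))
```

The second matrix is the one from the UPGMA example; it fails the test already on the triplet A, B, C (distances 7, 8, 9).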

42 Evolutionary clock speeds Uniform clock: Ultrametric distances lead to identical distances from root to leaves Non-uniform evolutionary clock: leaves have different distances to the root -- an important property is that of additive trees. These are trees where the distance between any pair of leaves is the sum of the lengths of edges connecting them. Such trees obey the so-called 4-point condition (next slide).

43 Additivity Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them. This property is built in automatically as the UPGMA tree is constructed. It is possible for the molecular clock property to fail but for additivity to hold, and in that case there are algorithms that can be used to reconstruct the tree correctly.

44 Neighbor Joining Very popular method Does not make molecular clock assumption : modified distance matrix constructed to adjust for differences in evolution rate of each taxon Produces unrooted tree Assumes additivity: distance between pairs of leaves = sum of lengths of edges connecting them Like UPGMA, constructs tree by sequentially joining subtrees

45 Neighbor Joining: Once we know the correct (i,j) pair

46 $d_{im} = d_{ik} + d_{km}$ and $d_{jm} = d_{jk} + d_{km}$, so $d_{im} + d_{jm} = d_{ik} + d_{jk} + 2 d_{km} = d_{ij} + 2 d_{km}$, and therefore $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$

47 Neighbour Joining: why not pick the smallest (i,j) pair?

48 Neighbour Joining (3) [Figure: leaves i and j]

49 Neighbour Joining: Algorithm

50 Neighbor-Joining: Complexity Each iteration uses $O(n^2)$ time to search for the pair to join and $O(n^2)$ time to update the distance matrix. Over the $O(n)$ iterations this gives a total time complexity of $O(n^3)$ and a space complexity of $O(n^2)$.

51 Neighbor-Joining We can use neighbor-joining even when the lengths are not additive, but reconstruction of the correct tree is then no longer guaranteed. We can test for additivity: for every set of four leaves i, j, k and l, two of the sums $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$ and $d_{il} + d_{jk}$ must be equal and not smaller than the third, e.g. $d_{ij} + d_{kl} = d_{ik} + d_{jl} \ge d_{il} + d_{jk}$

52 Additivity

53 Additivity Theorem: A set M of L objects is additive iff any subset of four objects can be labeled i,j,k,l so that: d(i,k) + d(j,l) = d(i,l) +d(k,j) ≥ d(i,j) + d(k,l)

54 Additive trees All distances satisfy the 4-point condition: for all leaves i, j, k, l: $d(i,j) + d(k,l) \le d(i,k) + d(j,l) = d(i,l) + d(j,k)$, i.e. $(a+b) + (c+d) \le (a+m+c) + (b+m+d) = (a+m+d) + (b+m+c)$. Result: all pairwise distances are obtained by traversing the tree. [Figure: leaves i and j on edges a and b, leaves k and l on edges c and d, joined by a middle edge m]
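The 4-point condition can be tested in the same way as the ultrametric condition. The sketch below assumes a symmetric matrix as a list of lists and checks that, for every quadruple, the two largest of the three pairwise sums agree within a tolerance; the function name is illustrative.

```python
from itertools import combinations


def satisfies_four_point(matrix, tol=1e-9):
    """True if every quadruple of leaves meets the 4-point condition."""
    n = len(matrix)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((
            matrix[i][j] + matrix[k][l],
            matrix[i][k] + matrix[j][l],
            matrix[i][l] + matrix[j][k],
        ))
        # The two largest sums must be equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


example = [[0, 8, 7, 12],
           [8, 0, 9, 14],
           [7, 9, 0, 11],
           [12, 14, 11, 0]]
print(satisfies_four_point(example))
```

For the four-item example matrix used in these slides the three sums are 19, 21 and 21, so the condition holds even though the matrix is not ultrametric.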

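55 An example N-J (1) Step 1. Compute $r_i = \frac{1}{N-2} \sum_k d_{ik}$ for each row i of the distance matrix (N is the number of items):
   A   B   C   D   Step 1: r_i
A  0   8   7   12  (8+7+12)/(4-2) = 13.5
B  8   0   9   14  (8+9+14)/(4-2) = 15.5
C  7   9   0   11  (7+9+11)/(4-2) = 13.5
D  12  14  11  0   (12+14+11)/(4-2) = 18.5
Step 2. Compute $D_{ij} = d_{ij} - (r_i + r_j)$ (the lower-diagonal matrix) and choose the smallest (most negative) entry:
   A                     B                     C                     D
A  0
B  8-(13.5+15.5) = -21   0
C  7-(13.5+13.5) = -20   9-(15.5+13.5) = -20   0
D  12-(13.5+18.5) = -20  14-(15.5+18.5) = -20  11-(13.5+18.5) = -21  0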

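56 An example N-J (2) Step 3. Join A and B together with a new node v1. Compute the edge lengths from A to node v1 and from B to node v1: $d_{A v_1} = (d_{AB} + r_A - r_B)/2 = (8 + 13.5 - 15.5)/2 = 3$ and $d_{B v_1} = d_{AB} - d_{A v_1} = 5$. Step 4. Compute the distances between the new node v1 and the remaining items (C and D): $d_{v_1 C} = (d_{AC} + d_{BC} - d_{AB})/2 = (7 + 9 - 8)/2 = 4$ and $d_{v_1 D} = (d_{AD} + d_{BD} - d_{AB})/2 = (12 + 14 - 8)/2 = 9$. [Figure: node v1 with an edge of length 3 to A and an edge of length 5 to B]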

57 An example N-J (3) Step 5. Delete A and B from the distance matrix and replace them by the new item AB. Step 6. Continue from step 1, because more than two items remain. Step 1. Compute $r_i$ for each row of the new reduced distance matrix:
    AB  C   D   Step 1: r_i
AB  0   4   9   (4+9)/1 = 13
C   4   0   11  (4+11)/1 = 15
D   9   11  0   (9+11)/1 = 20
Step 2. Compute $D_{ij}$ (the lower-diagonal matrix) and choose the smallest entry:
    AB                C                  D
AB  0
C   4-(13+15) = -24   0
D   9-(13+20) = -24   11-(15+20) = -24   0

58 An example N-J (4) Step 3. Join v1 and C together with a new node v2. Compute the edge lengths from v1 to node v2 and from C to node v2: (4 + 13 - 15)/2 = 1 and 4 - 1 = 3. Step 4. Compute the distance between the new node v2 and the remaining item (D): (9 + 11 - 4)/2 = 8. [Figure: v1 joined to A (3) and B (5); v2 joined to v1 (1) and C (3)]

59 An example N-J (5) Step 5. Delete AB and C from the distance matrix and replace them by ABC. Step 6. Only two nodes remain ⇒ connect them.
     ABC  D
ABC  0    8
D         0
Original distance matrix and final phylogenetic tree (including the edge lengths): A to v1 = 3, B to v1 = 5, v1 to v2 = 1, C to v2 = 3, v2 to D = 8.
   A  B  C  D
A  0  8  7  12
B     0  9  14
C        0  11
D           0
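The worked example can be replayed with a short neighbor-joining sketch. It is written under explicit assumptions: $r_i$ is the row sum divided by N - 2, the pair minimising $d_{ij} - (r_i + r_j)$ is joined, the edge from i to the new node has length $(d_{ij} + r_i - r_j)/2$, and distances to the new node are $(d_{ik} + d_{jk} - d_{ij})/2$. Ties go to the first pair found, and the function name and bracketed node labels are choices made here, not part of the slides.

```python
from itertools import combinations


def neighbor_joining(names, matrix):
    """Return the unrooted tree as a list of (node, node, length) edges."""
    nodes = list(names)
    d = {frozenset((names[a], names[b])): float(matrix[a][b])
         for a, b in combinations(range(len(names)), 2)}

    def dist(x, y):
        return d[frozenset((x, y))]

    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: r_i = (sum of row i) / (N - 2)
        r = {x: sum(dist(x, y) for y in nodes if y != x) / (n - 2)
             for x in nodes}
        # Step 2: pair with the smallest d_ij - (r_i + r_j)
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist(*p) - r[p[0]] - r[p[1]])
        new = f"({i},{j})"
        # Step 3: edge lengths from i and j to the new node
        length_i = (dist(i, j) + r[i] - r[j]) / 2
        edges.append((i, new, length_i))
        edges.append((j, new, dist(i, j) - length_i))
        # Step 4: distances from the new node to the remaining nodes
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    dist(i, k) + dist(j, k) - dist(i, j)
                ) / 2
        # Step 5: replace i and j by the new node
        nodes = [new] + [k for k in nodes if k not in (i, j)]
    # Step 6: connect the last two nodes
    edges.append((nodes[0], nodes[1], dist(nodes[0], nodes[1])))
    return edges


example = [[0, 8, 7, 12],
           [8, 0, 9, 14],
           [7, 9, 0, 11],
           [12, 14, 11, 0]]
for edge in neighbor_joining(["A", "B", "C", "D"], example):
    print(edge)
```

With this tie-breaking rule the edge lengths come out as 3, 5, 1, 3 and 8, and summing edges along any leaf-to-leaf path returns the original distance, as expected for an additive matrix.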

60 Comparison UPGMA –The total branch length from the root to any leaf is equal –Produces a rooted tree, where the root is the hypothesized ancestor of the sequences in the tree –Suitable for closely related sequences –Can be used to infer phylogenies if one can assume that evolutionary rates are the same in all lineages. Neighbor-joining –Unrooted tree, where the direction of evolution is unknown –Suitable for datasets with largely varying rates of evolution –Suitable for large datasets. [Figure: the unrooted N-J tree and the rooted UPGMA tree from the two examples]

61 Comparison The UPGMA method constructs a rooted phylogenetic tree correctly if there is a molecular clock with a constant rate of mutation. The UPGMA method is rarely used, because the molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within organisms, and regions within a gene. The N-J method produces an unrooted tree without the molecular clock hypothesis. The N-J method is one of the most popular methods and is widely used by molecular evolutionists. Distance methods are strongly dependent on the model of evolution used. Sequence information is reduced when transforming sequence data into distances. Distance methods are computationally fast.

62 Parsimony Find the tree which can explain the observed sequences with a minimal number of substitutions It assigns a cost to a tree, and it is necessary to search through all topologies, or to pursue a more efficient search strategy that achieves this effect, in order to identify the ‘best’ tree

63 Parsimony The method has two components: the computation of a cost for a given tree, and a search through all trees to find the overall minimum of this cost. Suppose we have the following four aligned nucleotide sequences: AAG, AAA, GGA, AGA

64 Parsimony

65 Cost of Evaluating Parsimony The score is evaluated at each position independently. Scores are then summed over all positions. If there are n nodes, m characters, and k possible values for each character, then the complexity is $O(nmk)$. By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node.

66 Evaluating Parsimony Scores How do we compute the Parsimony score for a given tree? Traditional Parsimony –Each base change has a cost of 1 Weighted Parsimony –Each change is weighted by the score c(a,b)

67 Traditional Parsimony Solved independently for each position. Linear time solution. [Figure: tree with leaves a, g, a; the internal node above g and a is labelled {a,g} and the root is labelled {a}]

68 Traditional Parsimony

69 Traditional Parsimony There is a traceback procedure for finding ancestral assignments in traditional parsimony. We choose a residue from $R_{2n-1}$, then proceed down the tree. Having chosen a residue from the set $R_k$, we pick the same residue from the daughter set $R_i$ if possible, and otherwise pick a residue at random from $R_i$.
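The bottom-up pass that produces the sets $R_k$ can be sketched as below. This is the set-based procedure usually attributed to Fitch, shown here as an illustration under assumptions: trees are written as nested 2-tuples with sequences at the leaves, every change costs 1, and the two topologies tried at the end are hypothetical choices for the four sequences AAG, AAA, GGA, AGA rather than trees taken from the slides.

```python
def fitch(site_tree):
    """Return (candidate residue set, minimal number of changes) for one site.

    A leaf is a one-character string; an internal node is a (left, right) tuple.
    """
    if isinstance(site_tree, str):
        return {site_tree}, 0
    left_set, left_cost = fitch(site_tree[0])
    right_set, right_cost = fitch(site_tree[1])
    common = left_set & right_set
    if common:
        # Daughters can agree: keep the intersection, no extra change.
        return common, left_cost + right_cost
    # Daughters disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def site_tree(tree, pos):
    """Project a tree of sequences onto the single column `pos`."""
    if isinstance(tree, str):
        return tree[pos]
    return (site_tree(tree[0], pos), site_tree(tree[1], pos))


def parsimony_score(tree, length):
    """Sum of per-site costs; sites are scored independently."""
    return sum(fitch(site_tree(tree, pos))[1] for pos in range(length))


tree_1 = (("AAG", "AAA"), ("GGA", "AGA"))  # hypothetical topology
tree_2 = (("AAG", "GGA"), ("AAA", "AGA"))  # hypothetical alternative
print(parsimony_score(tree_1, 3), parsimony_score(tree_2, 3))
```

Under these assumptions the first topology needs 3 changes and the second needs 4, so the first would be preferred. The traceback described above would then walk down from the root set to assign concrete residues to the internal nodes.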

70 Traditional Parsimony is not “complete”

71 Weighted Parsimony

72 Example
A (Aardvark): CAGGTA
B (Bison): CAGACA
C (Chimp): CGGGTA
D (Dog): TGCACT
E (Elephant): TGCGTA

73 Parsimony & Distance
Sequences:
Drosophila  t t a t t a a
fugu        a a t t t a a
mouse       a a a a a t a
human       a a a a a a t
Distances:
            human  mouse  fugu  Drosophila
human       x
mouse       2      x
fugu        4      4      x
Drosophila  5      5      3     x
[Figure: the parsimony tree and the distance tree for Drosophila, fugu, mouse and human]
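The distances on this slide are plain mismatch counts between the aligned sequences, which can be verified with a short sketch. The helper names are illustrative; the sequences are the ones listed on the slide.

```python
from itertools import combinations


def hamming(s, t):
    """Number of aligned positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))


def distance_matrix(seqs):
    """Pairwise mismatch counts, keyed by (name, name) in input order."""
    return {(x, y): hamming(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}


seqs = {
    "human": "aaaaaat",
    "mouse": "aaaaata",
    "fugu": "aatttaa",
    "Drosophila": "ttattaa",
}
dists = distance_matrix(seqs)
for pair, value in dists.items():
    print(pair, value)
```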

74 How to assess confidence in a tree Distance method – bootstrap: –Select multiple alignment columns with replacement –Recalculate the tree –Compare its branches with the original (target) tree –Repeat many times, calculating a different tree each time –How often is the branching (point between 3 nodes) preserved for each internal node? –Uses samples of the data
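The resampling step can be sketched as follows. This covers only the first bullet (drawing alignment columns with replacement); rebuilding a tree from each replicate and comparing branches is left out. The function name, the toy alignment and the use of a seeded `random.Random` are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    `alignment` is a list of equal-length strings, one per sequence.
    The replicate has the same number of rows and columns as the input.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in columns) for row in alignment]


toy_alignment = ["ACGT", "AGGT", "ACGA"]  # made-up data
rng = random.Random(0)                    # seeded for reproducibility
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(5)]
for rep in replicates:
    print(rep)
```

Because whole columns are drawn, each column of a replicate is an intact column of the original alignment; some columns appear several times and others not at all.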

75 The Bootstrap -- example [Figure: an original alignment and a resampled ("scrambled") alignment with the trees built from them; branches are annotated 2x, 3x and non-supportive]