Alignments and Phylogenetic Trees
Reading: Introduction to Bioinformatics, Arthur M. Lesk, Fourth Edition, Chapter 5.


Sequence Alignment

Dot Matrix
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: dot matrix with Sequence B (C G G A T C A T) across the top and Sequence A (C T T A A C T) down the side.]

Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT (Sequence A)
CGGATCA--T (Sequence B)

The columns of this alignment are of four kinds: an insertion gap (--- in the row of A), a match (T over T), a mismatch (T over C), and a deletion gap (-- in the row of B).

Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: the alignment C---TTAACT / CGGATCA--T drawn as a path through the grid with B across the top and A down the side.]

A simple scoring scheme
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
C---TTAACT
CGGATCA--T
Alignment score: 8 - 3 - 3 - 3 + 8 - 5 + 8 - 3 - 3 + 8 = +12
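The column-by-column sum above can be checked with a few lines of Python. This is a minimal sketch; the function name `score_alignment` and its parameter names are illustrative and do not come from the slides.

```python
def score_alignment(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Sum the column scores of a gapped alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # w(-, x) = w(x, -) = -3
        elif x == y:
            total += match      # w(x, y) = 8 if x = y
        else:
            total += mismatch   # w(x, y) = -5 if x != y
    return total


print(score_alignment("C---TTAACT", "CGGATCA--T"))  # 12
```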

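An optimal alignment: the alignment of maximum score
Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n.
S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j.
With proper initializations, S_{i,j} can be computed as follows:
S_{i,j} = max{ S_{i-1,j} + w(a_i, -), S_{i,j-1} + w(-, b_j), S_{i-1,j-1} + w(a_i, b_j) }

Computing S_{i,j}
[Figure: cell (i, j) of the matrix receives w(a_i, -) from the cell above, w(-, b_j) from the cell to the left, and w(a_i, b_j) from the diagonal cell; the optimal score is S_{m,n}.]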

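Initializations
[Figure: the matrix with B = CGGATCAT across the top and A = CTTAACT down the side, with row 0 and column 0 filled in.]

S_{3,5} = ?
[Figure: the partially filled matrix with cell (3, 5) highlighted.]

S_{3,5} =
[Figure: the completed matrix; the bottom-right entry is the optimal score.]

The optimal alignment recovered from the matrix:
CTTAAC-T
CGGATCAT
Score: 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14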
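The matrix fill and traceback can be sketched in Python as follows. It assumes the simple scoring scheme of the slides (+8, -5, -3) and the usual initialization in which row 0 and column 0 hold multiples of the gap score; the function name is illustrative. When several optimal alignments exist, the traceback returns one of them, not necessarily the one drawn on the slide.

```python
def global_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (optimal score, aligned row for a, aligned row for b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # a prefix of A against the empty string
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):          # the empty string against a prefix of B
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w,   # aligned pair
                          S[i - 1][j] + gap,     # a_i against a gap
                          S[i][j - 1] + gap)     # b_j against a gap
    # Trace back from S[m][n] to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + w:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, row_a, row_b = global_alignment("CTTAACT", "CGGATCAT")
print(score)   # 14
print(row_a)
print(row_b)
```

The fill takes time and space proportional to m·n, which is why the later slides turn to filtering heuristics for database search.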

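Now try this example in class
Sequence A: CAATTGA
Sequence B: GAATCTGC
Their optimal alignment?

Initializations
[Figure: the matrix with B = GAATCTGC across the top and A = CAATTGA down the side, with row 0 and column 0 filled in.]

S_{4,2} = ?
[Figure: the partially filled matrix with cell (4, 2) highlighted.]

S_{5,5} = ?
[Figure: the partially filled matrix with cell (5, 5) highlighted.]

S_{5,5} =
[Figure: the completed matrix; the bottom-right entry is the optimal score.]

The optimal alignment:
CAAT-TGA
GAATCTGC
Score: -5 + 8 + 8 + 8 - 3 + 8 + 8 - 5 = 27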

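Global Alignment vs. Local Alignment
A global alignment spans both sequences from end to end; a local alignment covers only a pair of substrings.
[Figure: a global alignment and a local alignment drawn side by side.]

An optimal local alignment
S_{i,j}: the score of an optimal local alignment ending at a_i and b_j.
With proper initializations, S_{i,j} can be computed as follows:
S_{i,j} = max{ 0, S_{i-1,j} + w(a_i, -), S_{i,j-1} + w(-, b_j), S_{i-1,j-1} + w(a_i, b_j) }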

Local alignment?
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: the empty local-alignment matrix with B = CGGATCAT across the top and A = CTTAACT down the side.]

Local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: the completed matrix with the best score highlighted.]

The best local alignment:
A-C-T
ATCAT
Score: 8 - 3 + 8 - 3 + 8 = 18
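A minimal Python sketch of the local-alignment fill is shown below. It assumes the standard Smith-Waterman form of the recurrence, in which every cell is floored at 0 and the answer is the largest entry anywhere in the matrix; the function name and return shape are illustrative.

```python
def local_alignment_score(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best local score, (i, j)) where (i, j) is the cell holding it."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,                       # start a fresh alignment here
                          S[i - 1][j - 1] + w,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell


print(local_alignment_score("CTTAACT", "CGGATCAT"))  # (18, (7, 8))
```

Tracing back from the best cell until a 0 is reached recovers the aligned substrings, in the same way as the global traceback.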

Now try this example in class
Sequence A: CAATTGA
Sequence B: GAATCTGC
Their optimal local alignment?

Did you get it right?
[Figure: the completed local-alignment matrix with B = GAATCTGC across the top and A = CAATTGA down the side.]

The best local alignment:
AAT-TG
AATCTG
Score: 8 + 8 + 8 - 3 + 8 + 8 = 37

Affine gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
Each gap is charged an extra gap-open penalty: -4
C---TTAACT
CGGATCA--T
Alignment score: 12 - 4 - 4 = 4

Affine gap penalties
A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.
Three cases for alignment endings:
1. an aligned pair (a symbol over a symbol)
2. a deletion (a symbol over a gap)
3. an insertion (a gap over a symbol)

Affine gap penalties
Let D(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with a deletion.
Let I(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with an insertion.
Let S(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j.

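Affine gap penalties
(A gap of length k is penalized x + k·y.)
D(i, j) = max{ D(i-1, j) - y, S(i-1, j) - x - y }
I(i, j) = max{ I(i, j-1) - y, S(i, j-1) - x - y }
S(i, j) = max{ S(i-1, j-1) + w(a_i, b_j), D(i, j), I(i, j) }

Affine gap penalties
[Figure: each cell holds three values S, I, D. Extending an open gap (D to D, or I to I) costs -y; opening a new gap from S costs -x-y; the diagonal step into S adds w(a_i, b_j).]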
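The three-matrix scheme can be sketched in Python as below. This is a sketch under assumptions: it uses the usual three-state (Gotoh-style) recurrences, with D extending or opening a deletion, I doing the same for an insertion, and S taking the best of the diagonal step, D, and I. The boundary initialization and the function name are also assumptions, since the slide's formula image did not survive. Penalties are passed as positive numbers x (`gap_open`) and y (`gap_symbol`).

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=8, mismatch=-5, gap_open=4, gap_symbol=3):
    """Optimal global score when a gap of length k costs gap_open + k * gap_symbol."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # alignments ending with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # alignments ending with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):      # a_1..a_i against nothing: one gap of length i
        D[i][0] = -gap_open - i * gap_symbol
        S[i][0] = D[i][0]
    for j in range(1, n + 1):      # nothing against b_1..b_j: one gap of length j
        I[0][j] = -gap_open - j * gap_symbol
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j] - gap_symbol,              # extend the gap: -y
                          S[i - 1][j] - gap_open - gap_symbol)   # open a gap: -x-y
            I[i][j] = max(I[i][j - 1] - gap_symbol,
                          S[i][j - 1] - gap_open - gap_symbol)
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[m][n]


# With gap_open=0 the scheme reduces to the simple linear gap penalty.
print(affine_global_score("CTTAACT", "CGGATCAT", gap_open=0))  # 14
print(affine_global_score("CTTAACT", "CGGATCAT"))
```

Keeping D and I separate from S is what lets the recurrence tell "extend an existing gap" from "open a new one" while still running in time proportional to m·n.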

Constant gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: 0 (w(-, x) = w(x, -) = 0)
Each gap is charged a constant penalty: -4
C---TTAACT
CGGATCA--T
Alignment score: 27 - 4 - 4 = 19

Constant gap penalties
Let D(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with a deletion.
Let I(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with an insertion.
Let S(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j.

Constant gap penalties
With a constant penalty x per gap and no charge per gap symbol:
D(i, j) = max{ D(i-1, j), S(i-1, j) - x }
I(i, j) = max{ I(i, j-1), S(i, j-1) - x }
S(i, j) = max{ S(i-1, j-1) + w(a_i, b_j), D(i, j), I(i, j) }

Restricted affine gap penalties
A gap of length k is penalized x + f(k)·y, where f(k) = k for k <= c and f(k) = c for k > c.
Five cases for alignment endings:
1. an aligned pair
2. a deletion
3. an insertion
4. and 5. a deletion and an insertion for long gaps

Restricted affine gap penalties
[Recurrences shown as a figure in the original slide.]

D(i, j) vs. D'(i, j)
Case 1: the best alignment ending at (i, j) with a deletion at the end has a last deletion gap of length <= c. Then D(i, j) >= D'(i, j).
Case 2: the best alignment ending at (i, j) with a deletion at the end has a last deletion gap of length >= c. Then D(i, j) <= D'(i, j).

Max{ S(i, j) - x - ky, S(i, j) - x - cy }
[Figure: the two penalty terms plotted against the gap length k; they meet at k = c.]

k best local alignments
Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
BLAST (Altschul et al., 1990; Altschul et al., 1997)

FASTA
1) Find runs of identities, and identify regions with the highest density of identities.
2) Re-score using a PAM matrix, and keep the top-scoring segments.
3) Eliminate segments that are unlikely to be part of the alignment.
4) Optimize the alignment in a band.

FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities.
[Figure: dot plot of Sequence A against Sequence B.]

FASTA Step 2: Re-score using a PAM matrix, and keep the top-scoring segments.

FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

FASTA Step 4: Optimize the alignment in a band.

BLAST
Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

The maximal segment pair measure
A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences. (For DNA: identities +5; mismatches -4.)
The MSP score may be computed in time proportional to the product of the two sequence lengths. (How?)
An exact procedure is too time-consuming for database search, so BLAST heuristically attempts to calculate the MSP score.
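One way to answer the "(How?)" is a gap-free version of the local-alignment recurrence, in which each cell depends only on its diagonal neighbour. The Python sketch below illustrates that idea under the DNA scores quoted on the slide; it is not taken from the slides, and the function name is illustrative.

```python
def msp_score(a, b, identity=5, mismatch=-4):
    """Score of the best pair of equal-length segments (no gaps), in O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            s = identity if x == y else mismatch
            # Extend the segment pair ending on the previous diagonal cell,
            # or start over if the running score would drop below zero.
            cur[j] = max(0, prev[j - 1] + s)
            best = max(best, cur[j])
        prev = cur
    return best


print(msp_score("AGATCGAT", "GATC"))  # 20
```

Only two rows are kept at a time, so the space used is proportional to the length of the second sequence.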

BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.

BLAST Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences: Seq. A = AGATCGAT
AAA
AAC
...
AGA 1
...
ATC 3
...
CGA 5
...
GAT 2 6
...
TCG 4
...
TTT
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
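For the DNA case, the table can be sketched with a dictionary that maps each 3-tuple to its 1-based start positions in Sequence A. The function name is illustrative. The protein case would also need a substitution matrix and the threshold T to enumerate neighbouring words, which this sketch leaves out.

```python
from collections import defaultdict


def build_word_table(seq, k=3):
    """Map every k-tuple of seq to the list of its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(seq) - k + 1):
        table[seq[start:start + k]].append(start + 1)
    return dict(table)


print(build_word_table("AGATCGAT"))
# {'AGA': [1], 'GAT': [2, 6], 'ATC': [3], 'TCG': [4], 'CGA': [5]}
```

Words that never occur in Sequence A (AAA, AAC, ..., TTT in the slide's listing) are simply absent from the dictionary.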

BLAST Step 2: Scan Sequence B for hits.

BLAST Step 2: Scan Sequence B for hits.
Step 3: Extend hits.
[Figure: a hit extended in both directions along its diagonal.]
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
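Steps 2 and 3 can be sketched as follows for DNA. This is an illustration, not BLAST's actual implementation: positions are 0-based, the scores +5/-4 are the ones quoted for the MSP measure, and the cut-off `drop` (how far the running score may fall below the best seen before extension stops) is an assumed parameter with an arbitrary default.

```python
def word_hits(seq_a, seq_b, k=3):
    """Return (i, j) pairs such that seq_a[i:i+k] == seq_b[j:j+k] (0-based)."""
    index = {}
    for i in range(len(seq_a) - k + 1):
        index.setdefault(seq_a[i:i + k], []).append(i)
    return [(i, j)
            for j in range(len(seq_b) - k + 1)
            for i in index.get(seq_b[j:j + k], [])]


def extend_hit(seq_a, seq_b, i, j, k=3, identity=5, mismatch=-4, drop=10):
    """Extend a hit without gaps; return (score, start in A, end in A, exclusive)."""
    def score(x, y):
        return identity if x == y else mismatch

    def one_direction(step, ia, jb):
        best = running = 0
        best_len = length = 0
        while 0 <= ia < len(seq_a) and 0 <= jb < len(seq_b):
            running += score(seq_a[ia], seq_b[jb])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop:
                break                      # the extension has faded away
            ia += step
            jb += step
        return best, best_len

    seed = sum(score(seq_a[i + t], seq_b[j + t]) for t in range(k))
    right, right_len = one_direction(+1, i + k, j + k)
    left, left_len = one_direction(-1, i - 1, j - 1)
    return seed + left + right, i - left_len, i + k + right_len


a, b = "AGATCGAT", "GATCGA"
print(word_hits(a, b))          # [(1, 0), (5, 0), (2, 1), (3, 2), (4, 3)]
print(extend_hit(a, b, 1, 0))   # (30, 1, 7)
```

The hit at (1, 0) grows into the segment `a[1:7] == "GATCGA"`, six identities worth 30.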

Remarks
Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.
The idea of filtration was used in both FASTA and BLAST.

Phylogenetic trees

Benjamin Loyle 2004 Cse 397
DNA Sequence Evolution
[Figure: a tree of 7-base DNA sequences on a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). The ancestor AAGACTT evolves into TGGACTT and AAGGCCT, then into AGGGCAT, TAGCCCT and AGCACTT, and finally into today's sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA and AGGGCAT.]

Problem Definition
The Tree of Life: connecting all living organisms. All-encompassing. Find evolution from simple beginnings.
Even smaller relations are tough.
Impossible. Instead, infer a possible ancestral history.

So what?
Genome sequencing provides an entire map of a species, so why link them?
We can understand evolution.
Viable drug testing and design.
Predict the function of genes.
Influenza evolution.

Why is that a problem?
Over 8 million organisms.
Current approaches solve NP-hard problems.
Computing a few hundred species takes years.
Error is a very large factor.

What do we want?
Input: a collection of nodes, such as taxa or protein strings, to compare in a tree.
Output: a topological link to compare those nodes to each other.
When do we want it? FAST!

Preparing the input
Create a distance matrix: sum up all of the known distances into an n x n matrix, where n is the number of nodes or taxa.
Distances are found by sequence comparison.

Distance Matrix
Take 5 separate DNA strings:
A: GATCCATGA
B: GATCTATGC
C: GTCCCATTT
D: AATCCGATC
E: TCTCGATAG
The distance between A and B is 2. The distance between A and C is 4. (Here the distance is the number of positions at which two strings differ.)
This is subjective, based on what your criteria are.
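The two distances quoted above are consistent with counting mismatched positions (the Hamming distance). A minimal Python sketch under that assumption builds the full matrix for the five strings; the function names are illustrative.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(s, t))


def distance_matrix(seqs):
    """Return (sorted names, n x n matrix of pairwise Hamming distances)."""
    names = sorted(seqs)
    return names, [[hamming(seqs[p], seqs[q]) for q in names] for p in names]


seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}
names, matrix = distance_matrix(seqs)
print(matrix[0][1], matrix[0][2])  # 2 4
for name, row in zip(names, matrix):
    print(name, row)
```

Other criteria, such as alignment scores or model-corrected distances, would give a different matrix for the same strings, which is the sense in which the slide calls the measure subjective.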

Distance Matrix
Let's start with an example matrix.
[Table: a 5 x 5 distance matrix over A, B, C, D, E.]

Let's make it simple (constrain the input)
Let's keep the distance between nodes within a certain limit.
From F -> G: F and G have the largest distance; they are the most dissimilar of any nodes. This is called the diameter of the tree.
Let's keep the length of the input (the length of the strings) polynomial.

ERROR?!
All trees are inferred, so how do you ever know if you're right?
How accurate do we have to be?
We can create data sets to test the trees that we create, and assume that a method that works on them will also work in the real world.

Data Sets
JC Model (Jukes-Cantor)
Sites evolve independently.
Sites change with the same probability.
Changes are single-character changes, i.e. A -> G or T -> C.
The expected number of changes on an edge e is a Poisson variable λ(e).

More Data Sets
K2P Model (Kimura 2-parameter)
Based on the JC Model.
Allows transitions and transversions to have different probabilities.
It is more likely for A and G to switch and for C and T to switch (transitions) than for a purine to switch with a pyrimidine (transversions).
Transitions are normally set to be twice as likely.

Data Use
Using these models we can create our own evolution of data.
Start with one "ancestor" and create evolutions.
Plug the evolved sequences back in and see if you get what you started with.
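A test data set of this kind can be generated with a short simulation. The Python sketch below is a simplification of the JC model: each edge is given a single per-site change probability `p_change` instead of a Poisson rate, and a changed site moves to one of the other three bases with equal probability. The function names, the two-generation shape of the tree, and the seed are illustrative choices.

```python
import random

BASES = "ACGT"


def evolve_jc(seq, p_change, rng):
    """Copy seq along one edge; each site changes independently with probability p_change."""
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_two_generations(ancestor, p_change, seed=0):
    """Evolve an ancestor into 2 children and then 4 leaves; return (children, leaves)."""
    rng = random.Random(seed)
    children = [evolve_jc(ancestor, p_change, rng) for _ in range(2)]
    leaves = [evolve_jc(child, p_change, rng) for child in children for _ in range(2)]
    return children, leaves


children, leaves = simulate_two_generations("AAGACTT", p_change=0.2, seed=1)
print(children)
print(leaves)
```

Because the true tree is known by construction, the leaves can be fed to a reconstruction method and its output compared against that tree.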

Aspects of Trees
Topology: the way in which nodes are connected to each other.
"Are we really connected to apes directly, or just linked long before we could be considered mammals?"
Distance: the sum of the weighted edges on the path from one node to another.

What can distance tell us?
The distance between nodes IS the evolutionary distance between the nodes.
The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary 'steps' that occurred.

Current Techniques
Maximum Parsimony: minimize the total number of evolutionary events. Find the tree that requires the minimum number of changes from ancestors.
Maximum Likelihood: probability based. Which tree is most likely to have produced the current data?

More Techniques
Neighbor Joining: repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization.
It shrinks the distance matrix by treating two 'neighbors' as one node.

Learning Neighbor Joining
It will become apparent later on why, but let's learn how to do Neighbor Joining (NJ).
[Table: the example 5 x 5 distance matrix over A, B, C, D, E.]

NJ Part 1
First start with a "star tree".
[Figure: a star tree with leaves A, B, C, D, E.]

NJ Part 2
Combine the closest two nodes (from the distance matrix).
In our case it is nodes A and B, at distance 3.
[Figure: the star tree with A and B joined.]

NJ Part 3
Repeat this until you have added n - 2 nodes (3 in our case).
n - 2 internal nodes make the tree binary, so we only have to include one more node.
[Figure: the resulting tree on A, B, C, D, E.]
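The joining loop can be sketched in Python as below. Two caveats apply. First, the slides describe the choice as "the closest two nodes", whereas standard neighbor joining (Saitou and Nei) picks the pair minimizing Q(a, b) = (n - 2)·d(a, b) - r(a) - r(b), where r is the row sum; the sketch uses that standard criterion. Second, the slide's matrix did not survive, so the distances below are illustrative values, not the slide's. Branch lengths are omitted to keep the sketch short.

```python
def neighbor_joining(pair_dist):
    """Return the sequence of joins (a, b) made by neighbor joining.

    pair_dist maps (label, label) pairs to distances; it is treated as symmetric.
    """
    d = {}
    for (a, b), value in pair_dist.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    nodes = sorted(d)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for idx, a in enumerate(nodes):
            for b in nodes[idx + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = f"({a},{b})"                 # label of the new internal node
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                # Distance from the new node to every remaining node.
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b))
    return joins


# Illustrative (hypothetical) distances for five taxa.
example = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
print(neighbor_joining(example))
# [('a', 'b'), ('c', '(a,b)'), ('d', 'e')]
```

With five taxa the loop performs n - 2 = 3 joins, matching the count on the slide. Note that the first join is (a, b) even though d and e are the closest pair in raw distance, which is the effect of the row-sum correction.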

Are we done?
ML and MP, even in heuristic form, take too long for large data sets.
NJ has poor topological accuracy, especially for large-diameter trees.
We need something that works for large-diameter trees and can be run fast.

Here's what we want
Our goal: an "Absolute Fast Converging" (afc) method.
A method Φ is afc if, for all positive f, g, ε, on the model M, there is a polynomial p such that, for all (T, {λ(e)}) in the set M_{f,g}, on a set S of n sequences of length at least p(n) generated on T, we have Pr[Φ(S) = T] > 1 - ε.
Simply: let's make it work in polynomial time within a degree of error.

A DCM* - NJ Solution
Two-phase construction of a final phylogenetic tree, given a distance matrix d.
Phase 1: Create a set of plausible trees for the distance matrix.
Phase 2: Find the best-fitting tree.

Phase 1
For each q in {d_ij}, compute a tree t_q.
Let T = { t_q : q in {d_ij} }.

Finding t_q
Step 1: Compute Thresh(d, q).
Step 2: Triangulate Thresh(d, q).
Step 3: Compute an NJ tree for each maximal clique.
Step 4: Merge the subtrees into a supertree.

What does that mean?
Breaking the problem up: create a threshold of diameters to break the problem into a bunch of smaller-diameter trees (cliques).
Apply NJ to those cliques.
Merge them back.

Finding t_q (terms)
Threshold graph: Thresh(d, q) is the graph in which (i, j) is an edge if and only if d_ij <= q.
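The definition translates directly into code. In the Python sketch below, only d(A, B) = 3 and d(A, E) = 67 are taken from the slides; every other entry of the matrix is hypothetical, chosen so that the threshold q = 67 gives a graph whose maximal cliques are {A, B, E} and {C, D}. The function name is illustrative.

```python
def threshold_graph(d, q):
    """Edges (i, j), i < j, of Thresh(d, q): pairs whose distance is at most q."""
    names = sorted(d)
    return {(i, j)
            for idx, i in enumerate(names)
            for j in names[idx + 1:]
            if d[i][j] <= q}


# Hypothetical symmetric matrix; only A-B = 3 and A-E = 67 come from the slides.
pairs = {
    ("A", "B"): 3, ("A", "C"): 90, ("A", "D"): 95, ("A", "E"): 67,
    ("B", "C"): 85, ("B", "D"): 88, ("B", "E"): 60,
    ("C", "D"): 20, ("C", "E"): 80, ("D", "E"): 75,
}
d = {}
for (i, j), value in pairs.items():
    d.setdefault(i, {})[j] = value
    d.setdefault(j, {})[i] = value

print(sorted(threshold_graph(d, q=67)))
# [('A', 'B'), ('A', 'E'), ('B', 'E'), ('C', 'D')]
```

Phase 1 would call this once for every distinct value q that occurs in the matrix.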

Threshold
Let's bring back our distance matrix and create a threshold with q equal to d_15, the distance between A and E.
So q = 67.

Distance Matrix
Our old example matrix.
[Table: the example 5 x 5 distance matrix over A, B, C, D, E.]

With q = d_15 = 67
[Figure: the threshold graph on A, B, C, D, E.]

Triangulating
A graph is triangulated if every cycle with four or more vertices has a chord, that is, an edge joining two nonconsecutive vertices of the cycle.
Our example is already triangulated, but let's look at another.

Triangulating
[Figure: a four-cycle on W, X, Y, Z with edge lengths.]
Let's say this is for q = … and 15 would not be in the graph.
To triangulate this graph you add the edge of length 10.

Maximal Cliques
A maximal clique is a clique that cannot be enlarged by the addition of another vertex.
Recall our original threshold graph, which is triangulated:

Triangulated Threshold Graph
Our old graph.
[Figure: the threshold graph on A, B, C, D, E.]

Clique
Our maximal cliques would be {A, B, E} and {C, D}.
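Maximal cliques can be enumerated with the basic Bron-Kerbosch recursion, sketched below in Python. The edge set is an assumption: the slide's figure is not available, so the graph is taken to have the edges A-B, A-E, B-E and C-D, which is the smallest graph consistent with the two cliques listed. Bron-Kerbosch is exponential in the worst case; on triangulated graphs a polynomial method based on a perfect elimination ordering exists, which this sketch does not implement.

```python
def maximal_cliques(vertices, edges):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already explored
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(vertices), set())
    return cliques


# Assumed edges of the triangulated threshold graph.
edges = [("A", "B"), ("A", "E"), ("B", "E"), ("C", "D")]
for clique in maximal_cliques("ABCDE", edges):
    print(sorted(clique))
```

Each clique returned here would then be handed to NJ to build one subtree.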

Create Trees for the Cliques
We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}.
How do we make these trees? Remember NJ?

Tree {A, B, E} and {C, D}
[Figure: one tree on A, B, E and one tree on C, D.]

Merge your separate trees together
Create one supertree.
This is done by creating a minimum set of edges in the trees and calling that the "backbone".
This is its own doctoral thesis, so let's do a little hand-waving.

That sounds like NP-hard!
Computing the threshold graph is polynomial.
Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
Finding the maximal cliques is only polynomial if the input graph is triangulated (which it is!).
If all previous steps are done, creating a supertree can be done in polynomial time as well.

Where are we now?
We now have a finalized phylogeny created from smaller trees in our matrix joined together.
Remember, we started from all possible sizes of smaller trees.

Phase 2
Which one is right?
Found using the SQS (Short Quartet Support) method.
Let T be a tree in S (made in Phase 1).
Break the data into sets of four taxa: {A, B, C, D}, {A, C, D, E}, {A, B, D, E}, etc.
Reduce the larger tree to hold only one such set. These are called quartets.

SQS - A Guide
Q(T) is the set of trees induced by T on each set of four leaves.
Let Q_w (a different Q) be the set of quartets with diameter less than or equal to w.
Find the maximum w where the quartets are inclusive of the nodes of the tree.
This w is the "support" of that tree.

SQS - Rephrased
Q_w is the set of quartet trees which have a diameter <= w.
The support of T is the max w where Q_w is a subset of Q(T).
Support is our "quality measure".
What exactly are we measuring?

Q_w =
[Figure: quartet trees on {A, B, C, D} and {A, B, D, E}, shown next to the tree on A, B, C, D, E.]

SQS Method
Return the tree whose support is the maximum.
If more than one such tree exists, return the tree found first. This is the tree with the smallest original diameter (remember from Phase 1).

How do we know we're right?
Compare it to the data set we created.
Look at Robinson-Foulds accuracy:
Remove one edge in the tree we've created. We now have two trees.
Is there any way to create the same sets of leaves by removing one edge in the tree from our data set?
If no, add a 'point' of error.
Repeat this for all edges.
When the value is not zero, the trees are not identical.
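The edge-removal test amounts to comparing the leaf bipartitions (splits) of the two trees. The Python sketch below assumes trees are written as nested tuples of leaf names, which is an illustrative representation rather than anything from the slides. `edge_errors` counts the one-sided 'points' of error described above, and `robinson_foulds` returns the usual symmetric count.

```python
def splits(tree):
    """Non-trivial leaf bipartitions of a tree written as nested tuples of names."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves_below(tree)
    result = set()
    for side in found:
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:    # skip edges that cut off one leaf
            result.add(frozenset([side, other]))
    return result


def edge_errors(inferred, true_tree):
    """Internal edges of the inferred tree whose split is absent from the true tree."""
    return len(splits(inferred) - splits(true_tree))


def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


true_tree = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))
print(edge_errors(inferred, true_tree))      # 1
print(robinson_foulds(inferred, true_tree))  # 2
```

A value of zero from `robinson_foulds` means the two trees have the same unrooted topology, regardless of where each tuple happens to be rooted.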

Performance of DCM* - NJ
Outperforms the NJ method at sequence lengths above 4000 and with more taxa.
[Figure: error rate versus number of taxa for NJ and DCM-NJ.]

Improvements
Improvement possibilities lie in Phase 2.
Include a test of Maximum Parsimony (MP): try to minimize the overall size of the tree.
Test using statistical evidence: Maximum Likelihood (ML).

Performance gains
Simply changing Phase 2 gives massive gains in accuracy!
DCM-NJ+MP and DCM-NJ+ML are VERY accurate for sequence lengths greater than 4000 and are NOT NP-hard.
DCM-NJ+MP finished its analysis on a 107-taxon tree in under three minutes.

Comparing Improvements
[Figure: error rate versus number of leaves for DCM-NJ+SQS, NJ, DCM-NJ+MP and HGT-FP.]