Solving Phylogenetic Trees
Benjamin Loyle
March 16, 2004
CSE 397: Intro to MBIO


Table of Contents
Problem & Term Definitions
A DCM*-NJ Solution
Performance Measurements
Possible Improvements

Phylogeny
[Figure: phylogenetic tree of orangutan, gorilla, chimpanzee, and human. From the Tree of Life website, University of Arizona.]

DNA Sequence Evolution
[Figure: a tree of 7-base DNA sequences (AAGACTT, TGGACTT, AAGGCCT, ...) diverging along a time axis in millions of years, ending at today.]

Problem Definition
The Tree of Life
  Connecting all living organisms
  All-encompassing
  Find evolution from simple beginnings
Even smaller relations are tough
  Impossible
  Infer possible ancestral history

So what...
Genome sequencing provides an entire map of a species, so why link them?
We can understand evolution
Viable drug testing and design
Predict the function of genes
Influenza evolution

Why is that a problem?
Over 8 million organisms
The tree-finding problems behind current solutions are NP-hard
Computing a few hundred species takes years
Error is a very large factor

What do we want?
Input: a collection of nodes, such as taxa or protein strings, to compare in a tree
Output: a topological link to compare those nodes to each other
When do we want it? FAST!

Preparing the input
Create a distance matrix
Collect all of the known pairwise distances into an n x n matrix
n is the number of nodes or taxa
Distances are found with sequence comparison

Distance Matrix
Take 5 separate DNA strings
A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG
The distance between A and B is 2
The distance between A and C is 4
Here the distance counts the positions at which two strings differ. The measure is subjective and depends on what your criteria are.
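The two distances quoted on this slide match a simple mismatch count (Hamming distance). The sketch below assumes that criterion; the function names `hamming` and `distance_matrix` are illustrative and not from the slides.

```python
from itertools import combinations

# The five DNA strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Symmetric dict-of-dicts matrix of pairwise Hamming distances."""
    d = {name: {name: 0} for name in seqs}
    for x, y in combinations(seqs, 2):
        d[x][y] = d[y][x] = hamming(seqs[x], seqs[y])
    return d

d = distance_matrix(seqs)
print(d["A"]["B"], d["A"]["C"])  # 2 4, as on the slide
```

Other criteria (for example, a model-corrected distance) would fill the same n x n structure with different numbers.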

Distance Matrix
Let's start with an example matrix
[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]

Let's make it simple (constrain the input)
Let's keep the distance between nodes within a certain limit
If F and G have the largest distance of any pair, they are the most dissimilar of any nodes
The distance from F to G is called the diameter of the tree
Let's keep the length of the input (the length of the strings) polynomial in the number of nodes

ERROR?!
All trees are inferred, so how do you ever know if you're right?
How accurate do we have to be?
We can create data sets to test the trees that we build, and assume that a method that works on them will then work in the real world

Data Sets
JC Model
Sites evolve independently
Sites change with the same probability
Changes are single-character changes, i.e. A -> G or T -> C
The number of changes on an edge e is a Poisson variable with expectation λ(e)

More Data Sets
K2P Model
Based on the JC Model
Allows transitions and transversions to have different probabilities
Transitions (A <-> G and C <-> T) are more likely than transversions (any other change)
Normally set to twice as likely
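A minimal sketch of how a single substitution could be drawn under these two models. It assumes "twice as likely" means the transition partner gets weight 2 against weight 1 for each of the two transversions; that reading, the function name `substitute`, and the parameter `ts_weight` are assumptions made for illustration. Setting `ts_weight=1` gives the JC behaviour, where all three replacements are equally likely.

```python
import random

# Transition partners: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def substitute(base, rng, ts_weight=2.0):
    """Pick a replacement for `base`.

    The transition partner gets weight `ts_weight`; each of the two
    transversions gets weight 1. ts_weight=1 is the JC case.
    """
    partner = TRANSITION[base]
    transversions = [b for b in "ACGT" if b not in (base, partner)]
    choices = [partner] + transversions
    weights = [ts_weight, 1.0, 1.0]
    return rng.choices(choices, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [substitute("A", rng) for _ in range(10000)]
# Expected share of transitions is ts_weight / (ts_weight + 2).
share = draws.count("G") / len(draws)
print(round(share, 2))
```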

Data Use
Using these models we can create our own evolution of data.
Start with one "ancestor" and create evolutions
Plug the evolved sequences back in and see if you get the tree you started with
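The idea can be sketched as a small simulator. This is an illustration under simplifying assumptions, not the generator used for the talk's experiments: every edge uses the same per-site change probability `p` (instead of a Poisson number of changes per edge), the tree is a complete binary tree of a chosen `depth`, and all names are hypothetical.

```python
import random

def evolve(seq, p, rng):
    """JC-style copy of seq: each site changes with probability p to
    one of the three other bases, chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

def simulate(ancestor, depth, p, rng):
    """Evolve an ancestor down a complete binary tree.

    Returns nested tuples of leaf sequences, so the true topology
    is recorded alongside the data."""
    if depth == 0:
        return ancestor
    left = simulate(evolve(ancestor, p, rng), depth - 1, p, rng)
    right = simulate(evolve(ancestor, p, rng), depth - 1, p, rng)
    return (left, right)

def leaves(tree):
    """Flatten the nested tuples into a list of leaf sequences."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

rng = random.Random(1)
true_tree = simulate("AAGACTT" * 10, depth=2, p=0.05, rng=rng)
for seq in leaves(true_tree):
    print(seq)
```

The leaf sequences are what a reconstruction method would receive; `true_tree` is the answer it is later scored against.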

Aspects of Trees
Topology: the way in which nodes are connected to each other
"Are we really connected to apes directly, or just linked long before we could be considered mammals?"
Distance: the sum of the weighted edges on the path from one node to another

What can distance tell us?
The distance between nodes IS the evolutionary distance between the nodes
The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.
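As a small illustration of distance as a sum of weighted edges, the sketch below walks the unique path between two nodes of a tree. The tree, its edge weights, and the name `path_distance` are made up for the example.

```python
def path_distance(tree, start, goal):
    """Sum of edge weights on the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, weight in tree[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + weight))
    raise ValueError("nodes are not connected")

# Hypothetical tree: leaves A, B, C joined at one internal node X.
tree = {
    "A": {"X": 1.0},
    "B": {"X": 2.0},
    "C": {"X": 4.0},
    "X": {"A": 1.0, "B": 2.0, "C": 4.0},
}
print(path_distance(tree, "A", "B"))  # 3.0
```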

Current Techniques
Maximum Parsimony
  Minimize the total number of evolutionary events
  Find the tree that has the minimum number of changes from ancestors
Maximum Likelihood
  Probability based
  Which tree makes the current data most probable

More Techniques
Neighbor Joining
Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
It shrinks the distance matrix by treating two ‘neighbors’ as one node

Learning Neighbor Joining
The reason will become apparent later on, but let's learn how to do Neighbor Joining (NJ)
[Table: the example 5 x 5 distance matrix with rows and columns A, B, C, D, E]

NJ Part 1
First start with a "star tree"
[Figure: star tree with leaves A, B, C, D, E]

NJ Part 2
Combine the closest two nodes (from the distance matrix)
In our case it is nodes A and B, at distance 3
[Figure: tree on A, B, C, D, E with A and B joined]

NJ Part 3
Repeat this until you have added n-2 internal nodes (here 3)
n-2 internal nodes make it a binary tree, so we only have to include one more node.
[Figure: tree on A, B, C, D, E after further joins]
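The three NJ slides can be sketched in code. Note one difference from the slide's wording: standard neighbor joining does not simply pick the smallest entry of the distance matrix. It picks the pair minimising the Q-criterion (n-2)·d(i,j) − r(i) − r(j), where r(i) is the sum of distances from i to all other nodes. The sketch below implements that standard rule and records only the topology (branch lengths are omitted). The 5 x 5 matrix is a made-up example with lowercase labels, not the matrix from the slides.

```python
from itertools import combinations

def neighbor_joining(names, d):
    """Standard neighbor joining (topology only).

    names: list of leaf labels; d: symmetric dict-of-dicts distances.
    Returns (joins, remaining): the joins as (node_a, node_b, new_node)
    tuples and the three nodes left around the final centre.
    """
    d = {x: dict(row) for x, row in d.items()}  # work on a copy
    nodes = list(names)
    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        r = {x: sum(d[x][y] for y in nodes if y != x) for x in nodes}
        # Q-criterion: minimise (n - 2) * d(i, j) - r(i) - r(j).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"({a},{b})"  # label of the new internal node
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes = [k for k in nodes if k not in (a, b)] + [u]
        joins.append((a, b, u))
    return joins, nodes

# Hypothetical distance matrix on five taxa.
rows = {"a": [0, 5, 9, 9, 8], "b": [5, 0, 10, 10, 9], "c": [9, 10, 0, 8, 7],
        "d": [9, 10, 8, 0, 3], "e": [8, 9, 7, 3, 0]}
names = list(rows)
d = {x: dict(zip(names, row)) for x, row in rows.items()}

joins, remaining = neighbor_joining(names, d)
print(joins[0])  # ('a', 'b', '(a,b)')
```

Each join replaces two nodes by one, which is how the distance matrix shrinks from step to step.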

Are we done?
ML and MP, even in heuristic form, take too long for large data sets
NJ has poor topological accuracy, especially for large-diameter trees
We need something that works for large-diameter trees and can be run fast.

Here's what we want
Our Goal: an "Absolute Fast Converging" (afc) method
A method Φ is afc if, for all positive f, g, ε, on the model M, there is a polynomial p such that, for all (T, {λ(e)}) in the set M_f,g, on a set S of n sequences of length at least p(n) generated on T, we have Pr[Φ(S) = T] > 1 - ε.
Simply: let's make it work in polynomial time, from polynomial-length sequences, within a degree of error.

A DCM*-NJ Solution
Two-phase construction of a final phylogenetic tree given a distance matrix d.
Phase 1: Create a set of plausible trees for the distance matrix
Phase 2: Find the best-fitting tree

Phase 1
For each q in {d_ij}, compute a tree t_q
Let T = { t_q : q in {d_ij} }

Finding t_q
Step 1: Compute Thresh(d,q)
Step 2: Triangulate Thresh(d,q)
Step 3: Compute an NJ tree for each maximal clique
Step 4: Merge the subtrees into a supertree

What does that mean?
Breaking the problem up
Use a distance threshold to break the problem into a bunch of smaller-diameter subproblems (cliques)
Apply NJ to those cliques
Merge them back

Finding t_q (terms)
Threshold Graph
Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if d_ij <= q.

Threshold
Let's bring back our distance matrix and create a threshold graph with q equal to d_15, the distance between A and E
So q = 67

Distance Matrix
Our old example matrix
[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]

With q = d_15 = 67
[Figure: threshold graph on the nodes A, B, C, D, E]
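A minimal sketch of the threshold graph construction. Only d(A,B) = 3 and d(A,E) = 67 come from the slides; every other entry of `pairs` is invented so that the resulting graph is consistent with the maximal cliques {A, B, E} and {C, D} named later in the talk.

```python
# Pairwise distances; all values except A-B and A-E are hypothetical.
pairs = {
    ("A", "B"): 3,  ("A", "C"): 90,  ("A", "D"): 95, ("A", "E"): 67,
    ("B", "C"): 88, ("B", "D"): 93,  ("B", "E"): 65,
    ("C", "D"): 40, ("C", "E"): 120, ("D", "E"): 125,
}

def threshold_graph(pairs, q):
    """Thresh(d, q): the edges (i, j) with d_ij <= q."""
    return {edge for edge, dist in pairs.items() if dist <= q}

# Phase 1 tries every distinct entry of the matrix as a threshold.
thresholds = sorted(set(pairs.values()))

edges = threshold_graph(pairs, 67)
print(sorted(edges))  # [('A', 'B'), ('A', 'E'), ('B', 'E'), ('C', 'D')]
```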

Triangulating
A graph is triangulated if every cycle with four or more vertices has a chord, that is, an edge joining two nonconsecutive vertices of the cycle.
Our example is already triangulated, but let's look at another

Triangulating
[Figure: weighted graph on the vertices W, X, Y, Z]
Let's say this is for q = … ; … and 15 would not be in the graph
To triangulate this graph you add the edge of length 10.
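One common greedy way to triangulate a graph is vertex elimination: repeatedly remove the vertex whose neighbours need the fewest extra edges to become a clique, and add those edges. The sketch below uses that unweighted minimum fill-in rule, which is a swapped-in heuristic and not necessarily the weight-based greedy rule used by DCM-NJ. The 4-cycle on W, X, Y, Z is an assumed stand-in for the slide's figure.

```python
def missing_edges(adj, v):
    """Pairs of neighbours of v that are not yet joined by an edge."""
    nb = sorted(adj[v])
    return [(a, b) for i, a in enumerate(nb) for b in nb[i + 1:]
            if b not in adj[a]]

def greedy_triangulate(vertices, edges):
    """Return the set of fill-in edges that makes the graph triangulated."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    fill = set()
    while adj:
        # Eliminate the vertex that needs the fewest new edges.
        v = min(sorted(adj), key=lambda x: len(missing_edges(adj, x)))
        for a, b in missing_edges(adj, v):
            adj[a].add(b)
            adj[b].add(a)
            fill.add(frozenset((a, b)))
        for n in adj[v]:
            adj[n].discard(v)
        del adj[v]
    return fill

# A chordless 4-cycle W - X - Z - Y - W needs exactly one chord.
square = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]
fill = greedy_triangulate("WXYZ", square)
print(fill)
```

A graph that is already triangulated, such as a triangle, comes back with no fill-in edges.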

Maximal Cliques
A maximal clique is a clique that cannot be enlarged by the addition of another vertex.
Recall our original threshold graph, which is triangulated:

Triangulated Threshold Graph
Our old graph
[Figure: threshold graph on the nodes A, B, C, D, E]

Clique
Our maximal cliques would be:
{A, B, E}
{C, D}
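The cliques can be checked with a short enumeration. The sketch uses the general-purpose Bron-Kerbosch algorithm (without pivoting) for brevity; triangulated graphs admit faster specialised methods. The adjacency lists are an assumption: edges A-B, A-E, B-E, and C-D, chosen to be consistent with the cliques listed on the slide.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Assumed threshold graph for q = 67.
adj = {
    "A": {"B", "E"},
    "B": {"A", "E"},
    "E": {"A", "B"},
    "C": {"D"},
    "D": {"C"},
}
cliques = maximal_cliques(adj)
print([sorted(c) for c in cliques])  # [['A', 'B', 'E'], ['C', 'D']]
```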

Create Trees for the Cliques
We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}
How do we make these trees?
Remember NJ?

Tree {A, B, E} and {C, D}
[Figure: one tree on the leaves A, B, E and one tree on the leaves C, D]

Merge your separate trees together.
Create one supertree
This is done by creating a minimum set of edges in the trees and calling that the "backbone"
This is its own doctoral thesis, so let's do a little hand waving

That sounds like NP-hard!
Computing the threshold graph is polynomial
Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
Finding maximal cliques is polynomial only if the input graph is triangulated (which it is!).
If all of the previous steps are done, creating a supertree can be done in polynomial time as well.

Where are we now?
We now have a finished phylogeny for each threshold q, built from smaller trees in our matrix joined together
Remember that we started from every possible size of smaller trees, so there is one candidate tree per threshold.

Phase 2
Which one is right?
Found using the SQS (Short Quartet Support) method
Let T be a tree in the set made in Phase 1
Break the data into sets of four taxa: {A, B, C, D}, {A, C, D, E}, {A, B, D, E}, ... etc.
Reduce the larger tree to hold only "one set"
These are called quartets

SQS - A Guide
Q(T) is the set of trees induced by T on each set of four leaves.
Let Q_w (a different Q) be the set of quartets with diameter less than or equal to w
Find the maximum w for which every quartet in Q_w also appears in Q(T)
This w is the "support" of that tree

SQS - Rephrased
Q_w is the set of quartet trees which have a diameter <= w
The support of T is the max w where Q_w is a subset of Q(T)
Support is our "quality measure"
What exactly are we measuring?

Q_w =
[Figure: quartet trees on four-leaf subsets of A, B, C, D, E]

SQS Method
Return the tree whose support is the maximum.
If more than one such tree exists, return the tree found first.
This is the tree with the smallest original diameter (remember Phase 1)
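A rough sketch of the support computation, under stated assumptions. Each candidate tree is represented only by its internal-edge bipartitions (one side of each split). The quartet trees in Q_w are taken to be those that the four-point method infers from the distance matrix for quartets whose largest pairwise distance is at most w. The distance values and both candidate trees are made up for illustration.

```python
from itertools import combinations

def pairings(q):
    """The three ways to split four leaves into two pairs."""
    a, b, c, d = q
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def topology(left, right):
    return frozenset((frozenset(left), frozenset(right)))

def diameter(q, dist):
    return max(dist[frozenset(p)] for p in combinations(q, 2))

def quartet_from_distances(q, dist):
    """Four-point method: choose xy|zw minimising d(x,y) + d(z,w)."""
    left, right = min(pairings(q), key=lambda p: dist[frozenset(p[0])]
                      + dist[frozenset(p[1])])
    return topology(left, right)

def quartet_from_tree(q, splits):
    """Topology the tree induces on q, or None if it is unresolved."""
    for left, right in pairings(q):
        for s in splits:
            if (set(left) <= s and not set(right) & s) or \
               (set(right) <= s and not set(left) & s):
                return topology(left, right)
    return None

def support(leaves, dist, splits):
    """Largest distance value w such that every quartet of diameter
    <= w has the same topology in Q_w and in Q(T)."""
    quartets = list(combinations(sorted(leaves), 4))
    best = 0
    for w in sorted(set(dist.values())):
        short = [q for q in quartets if diameter(q, dist) <= w]
        if all(quartet_from_distances(q, dist) == quartet_from_tree(q, splits)
               for q in short):
            best = w
        else:
            break  # Q_w only grows with w, so larger w fails too
    return best

# Hypothetical distances on five taxa, keyed by unordered pair.
raw = {"ab": 5, "ac": 9, "ad": 9, "ae": 8, "bc": 10,
       "bd": 10, "be": 9, "cd": 8, "ce": 7, "de": 3}
dist = {frozenset(k): v for k, v in raw.items()}

wrong_tree = [frozenset("ac"), frozenset("de")]  # ((a,c),b,(d,e))
right_tree = [frozenset("ab"), frozenset("de")]  # ((a,b),c,(d,e))
candidates = [wrong_tree, right_tree]

# max() keeps the first tree on ties, matching "return the tree found first".
best = max(candidates, key=lambda t: support("abcde", dist, t))
print(support("abcde", dist, wrong_tree), support("abcde", dist, right_tree))
```

In this toy case the tree that agrees with every quartet gets the larger support and is the one returned.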

How do we know we're right?
Compare it to the data set we created
Look at Robinson-Foulds accuracy
Remove one edge in the tree we've created. We now have two trees
Is there any way to create the same two sets of leaves by removing one edge in the true tree from our data set?
If no, add a ‘point’ of error.
Repeat this for all edges
When the value is not zero, the trees are not identical
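The edge-removal test can be sketched with bipartitions: removing an edge splits the leaves into two sets, and two trees share an edge when they produce the same split. The sketch below computes the usual symmetric Robinson-Foulds count (splits found in one tree but not the other, in both directions), which for two fully resolved trees is twice the one-sided count described on the slide. Trees are given as lists of internal-edge splits; the example trees are made up.

```python
def canonical(split, leaves):
    """Name a bipartition by the side that lacks the smallest leaf,
    so that either side of an edge gives the same key."""
    side = frozenset(split)
    return frozenset(leaves) - side if min(leaves) in side else side

def rf_distance(splits_a, splits_b, leaves):
    """Number of bipartitions present in exactly one of the two trees."""
    a = {canonical(s, leaves) for s in splits_a}
    b = {canonical(s, leaves) for s in splits_b}
    return len(a - b) + len(b - a)

leaves = "abcde"
true_tree = [frozenset("ab"), frozenset("de")]  # ((a,b),c,(d,e))
inferred = [frozenset("ac"), frozenset("de")]   # ((a,c),b,(d,e))

print(rf_distance(true_tree, inferred, leaves))   # 2
print(rf_distance(true_tree, true_tree, leaves))  # 0
```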

Performance of DCM*-NJ
Outperforms the NJ method at sequence lengths above 4000 and with more taxa.
[Figure: error rate versus number of taxa for NJ and DCM-NJ]

Improvements
Improvement possibilities lie in Phase 2
Include a test of Maximum Parsimony (MP): try to minimize the overall size of the tree
Test using statistical evidence: Maximum Likelihood (ML)

Performance gains
Simply changing Phase 2 gives massive gains in accuracy!
DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets with sequence lengths greater than 4000 and are NOT NP-hard.
DCM-NJ+MP finished its analysis of a 107-taxon tree in under three minutes.

Comparing Improvements
[Figure: error rate versus number of leaves for DCM-NJ+SQS, NJ, DCM-NJ+MP, and HGT-FP]