394C, Spring 2013 Sept 4, 2013 Tandy Warnow

394C, Spring 2013 Sept 4, 2013 Tandy Warnow

DNA Sequence Evolution [Figure: a rooted tree drawn over a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). The ancestral sequence AAGACTT evolves through the intermediate sequences TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, and AGCACTT into the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT.]

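Phylogeny Problem [Figure: five sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labelled by the taxa U, V, W, X, Y, together with a tree whose leaves are U, V, W, X, Y.]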

Phylogenetic reconstruction methods
1. Heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) [Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian MCMC methods.

Today
– Newick representations of trees
– Characterizations of trees using distances, clades, splits (bipartitions), and quartets
– Computing trees from dissimilarity matrices: the “naïve” quartet method
– (Hints) Connections to estimation of phylogenies from empirical data

Newick representations For a rooted tree, we represent the tree as a string made of the taxon names, commas, and nested parentheses. For example, what tree is represented by (a,(b,(c,((d,e),(f,g)))))? How do we represent an unrooted tree? (Easy: root it somewhere, and write down the Newick representation of the rooted version.)
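The correspondence between a Newick string and a rooted tree can be made concrete with a small parser. The following is a minimal sketch, not a full Newick implementation: it assumes strings containing only taxon names, commas, and parentheses (no branch lengths, internal labels, or whitespace), and the function names `parse_newick` and `to_newick` are illustrative choices. A tree is returned as nested tuples, with each leaf a string.

```python
def parse_newick(s):
    """Parse a simplified Newick string into nested tuples (leaves are strings).

    Assumption: only names, commas, and parentheses appear in the string.
    """
    s = s.strip().rstrip(";")
    pos = 0

    def peek():
        # Current character, or "" at the end of the string.
        return s[pos] if pos < len(s) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("missing ')' at position %d" % pos)
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while peek() and peek() not in ",()":
            pos += 1
        if start == pos:
            raise ValueError("empty taxon name at position %d" % pos)
        return s[start:pos]

    tree = parse_node()
    if pos != len(s):
        raise ValueError("unbalanced parentheses or trailing characters")
    return tree


def to_newick(tree):
    """Write nested tuples back out as a simplified Newick string."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


tree = parse_newick("(a,(b,(c,((d,e),(f,g)))))")
print(tree)
print(to_newick(tree))
```

Because the parser checks that the whole string is consumed, a string with one closing parenthesis too many is rejected with a `ValueError` rather than silently accepted.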

Rooted vs. unrooted Task: be able to move between rooted and unrooted representations of trees Task: be able to compare two trees and see if they are different or the same

Clades Definition: Let T be a rooted tree leaf-labelled by S, let v be an internal node in T, and let X_v be the set of leaves in T below v. Let Clades(T) = {X_v : v in V(T)}. Note: X_v is also called the “cluster” at node v, so this is sometimes called Clusters(T). Question: Given Clades(T), can we compute T?
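As a hedged illustration of the definition, the sketch below computes the clades of a rooted tree stored as nested tuples (an assumed representation in which each leaf is a string and each internal node is a tuple of its children). The function name `clades` is illustrative. It records X_v for every internal node v, including the root, whose clade is the full leaf set S.

```python
def clades(tree):
    """Return the clades of a rooted tree as a set of frozensets.

    Assumption: the tree is given as nested tuples with string leaves.
    Only internal nodes contribute a clade, following the definition.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        # X_v is the union of the leaf sets below the children of v.
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    leaves_below(tree)
    return found


example = (("a", "b"), ("c", ("d", "e")))
for clade in sorted(clades(example), key=lambda c: (len(c), sorted(c))):
    print(sorted(clade))
```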

Bipartitions Given an edge e in a leaf-labelled unrooted tree T, the removal of the edge e (but not its endpoints) defines a bipartition on the leaves of the tree T. We denote by c_e the bipartition defined by the edge e. We let C(T) = {c_e : e in E(T)}. Question: Given C(T), can we compute T?
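The following sketch shows one way to compute C(T), under the assumption that the unrooted tree is stored as nested tuples after rooting it at an arbitrary place. Every non-root node v corresponds to the edge above it, and that edge splits the leaves into those below v and all the rest. The name `bipartitions` is illustrative, and the result includes the trivial bipartitions defined by leaf edges.

```python
def bipartitions(tree):
    """Return C(T) as a set of bipartitions, each a frozenset of two sides.

    Assumption: the unrooted tree has been rooted arbitrarily and is given
    as nested tuples with string leaves.
    """
    clusters = []

    def leaves_below(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(leaves_below(child) for child in node))
        clusters.append(below)
        return below

    everything = leaves_below(tree)
    # The root has no edge above it, so its cluster (all leaves) is skipped.
    return {
        frozenset([cluster, everything - cluster])
        for cluster in clusters
        if cluster != everything
    }


# Two different rootings of the same unrooted tree give the same set C(T).
rooting_1 = (("a", "b"), ("c", "d"))
rooting_2 = ("a", ("b", ("c", "d")))
print(bipartitions(rooting_1) == bipartitions(rooting_2))
```

Storing each bipartition as an unordered pair of sides means the two edges incident to a degree-two root collapse into a single bipartition, which matches the unrooted tree.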

Quartet subtrees Given a tree T leaf-labelled by S, and a quartet a, b, c, d of leaves, we let T|{a,b,c,d} denote the minimal homeomorphic subtree of T restricted to {a,b,c,d}. We let Q(T) denote {T|X : X is a four-taxon subset of S}. Question: Given Q(T), can we compute T?

Computing trees Given Q(T) (the quartet subtrees of T), can we determine T? Given C(T) (the bipartitions of S defined by the edges of T), can we determine T? Given Clades(T) (the sets of leaves defined by internal nodes in the rooted tree T), can we determine T?

Quartet-based reconstruction Definition: Let T be a tree leaf-labelled by a set S, and let Q(T) be the set of quartet subtrees of T (derived from each of the four-taxon subsets of S). Question: can we reconstruct T from Q(T)?

Computing T from Q(T): Naïve Quartet Method Given Q(T):
– Find a sibling pair A, B (a pair of leaves which are always together in every quartet in which they both appear)
– Compute the tree T' for S-{A} by recursing on the subset of Q(T) that doesn't include taxon A
– Insert A into T' by making A sibling to B, and return the tree obtained
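The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the input is assumed to contain exactly one resolved quartet tree for every four-taxon subset of S, all consistent with a single binary tree, and the names `quartet`, `attach_sibling`, and `naive_quartet_method` are hypothetical. A quartet ab|cd is stored as an unordered pair of pairs, and the recursion bottoms out at three taxa, where the only unrooted tree is the star.

```python
from itertools import combinations


def quartet(a, b, c, d):
    """The quartet tree ab|cd as a hashable value."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])


def attach_sibling(tree, new_leaf, old_leaf):
    """Return a copy of the tree with new_leaf attached as sibling of old_leaf."""
    if tree == old_leaf:
        return (new_leaf, old_leaf)
    if isinstance(tree, str):
        return tree
    return tuple(attach_sibling(child, new_leaf, old_leaf) for child in tree)


def naive_quartet_method(taxa, quartets):
    """Rebuild an unrooted binary tree (as nested tuples) from its quartets."""
    taxa = sorted(taxa)
    if len(taxa) <= 3:
        return tuple(taxa)  # base case: the star tree
    for a, b in combinations(taxa, 2):
        pair = frozenset([a, b])
        # Quartets in which both a and b appear.
        shared = [q for q in quartets if {a, b} <= frozenset().union(*q)]
        if shared and all(pair in q for q in shared):
            # a and b are siblings: remove a, recurse, then reattach a next to b.
            rest = [t for t in taxa if t != a]
            remaining = {q for q in quartets if a not in frozenset().union(*q)}
            subtree = naive_quartet_method(rest, remaining)
            return attach_sibling(subtree, a, b)
    raise ValueError("no sibling pair found; quartets fit no binary tree")


# Quartets of the unrooted tree with nontrivial bipartitions ab|cde and abc|de.
Q = {
    quartet("a", "b", "c", "d"),
    quartet("a", "b", "c", "e"),
    quartet("a", "b", "d", "e"),
    quartet("a", "c", "d", "e"),
    quartet("b", "c", "d", "e"),
}
print(naive_quartet_method("abcde", Q))
```

The returned nested tuples describe an unrooted tree drawn from an arbitrary degree-three node; the search for a sibling pair scans all quartets for each candidate pair, so this sketch favours clarity over running time.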

Analysis of the algorithm Questions: Accuracy? Running time? But: how are we to compute quartet subtrees?

Clade compatibility Definition: Let T be a rooted tree leaf-labelled by S, v an internal node in T, and X_v the leaves in T below v. Let Clades(T) = {X_v : v in V(T)}. Theorem: Let X be a set of subsets of S. Then there exists a tree T leaf-labelled by S such that X = Clades(T) if and only if for all A, B in X, either A and B are disjoint, or one contains the other.
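The pairwise condition in the theorem is easy to test directly. The sketch below is one possible check; the function name `clades_are_compatible` is an illustrative choice, and the input is assumed to be any collection of sets of taxa.

```python
from itertools import combinations


def clades_are_compatible(clade_sets):
    """True if every two sets are disjoint or one contains the other."""
    sets = [frozenset(c) for c in clade_sets]
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for a, b in combinations(sets, 2)
    )


print(clades_are_compatible([{"a", "b"}, {"a", "b", "c"}, {"d", "e"}]))
print(clades_are_compatible([{"a", "b"}, {"b", "c"}]))
```

The second call fails the condition because {a, b} and {b, c} overlap without either containing the other.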

Proof of the theorem One direction is easy. The other direction is a proof by construction!

Computing rooted trees from clades Partially order the set of clades by containment, add in the full set S, and compute the Hasse diagram of the resultant poset.
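A minimal sketch of this construction follows, assuming the input clades are nonempty and pairwise compatible (disjoint or nested). In the Hasse diagram of the containment order, the parent of each set is its smallest strict superset, so the sketch sorts the sets by size and links each one to the first strict superset that follows it. The function name `tree_from_clades` is hypothetical, and the singleton sets are added here as an extra step so that every leaf appears in the output.

```python
def tree_from_clades(taxa, clade_sets):
    """Build a rooted tree (nested tuples) from a compatible set of clades.

    Assumptions: clades are nonempty subsets of taxa and pairwise compatible.
    """
    taxa = frozenset(taxa)
    nodes = {frozenset(c) for c in clade_sets}
    nodes |= {taxa}                              # add in the full set S
    nodes |= {frozenset([t]) for t in taxa}      # add the leaves
    ordered = sorted(nodes, key=len)
    children = {node: [] for node in ordered}
    for index, node in enumerate(ordered):
        if node == taxa:
            continue
        # The first strict superset in size order is the Hasse-diagram parent.
        parent = next(m for m in ordered[index + 1:] if node < m)
        children[parent].append(node)

    def build(node):
        if len(node) == 1:
            return next(iter(node))
        # Sort children by their leaf names to make the output deterministic.
        return tuple(build(c) for c in sorted(children[node], key=sorted))

    return build(taxa)


print(tree_from_clades("abcde", [{"a", "b"}, {"d", "e"}, {"c", "d", "e"}]))
```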

Tree construction from clades Questions: Accuracy? Running time? But, how are we to compute clades?

Bipartition compatibility Definition: Let X be a set of bipartitions on a set S. Then X is said to be compatible if there exists a tree T leaf-labelled by S such that X = C(T), where C(T) = {c_e : e in E(T)}. Question: Can we construct the tree T from C(T)?

Computing trees from bipartitions Given the set of bipartitions on the leaf-set induced by the edges of a tree T, how can we compute the tree T? Hint: “root” the tree T by picking it up at a leaf, and then consider the set of bipartitions as a set of “clades”, and apply the previous algorithm. (Note: the choice of leaf does not matter!)
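One way to read the hint is sketched below: after picking a leaf to hold the tree by, each bipartition contributes the side that does not contain that leaf, and those sides form a set of clades for the rooted view of the tree. The names `split` and `clades_from_bipartitions` are hypothetical, and each bipartition is assumed to be stored as an unordered pair of frozensets.

```python
def split(side, taxa):
    """The bipartition of taxa with the given set on one side."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])


def clades_from_bipartitions(bipartitions, root_leaf):
    """Turn bipartitions into clades by rooting the tree at root_leaf.

    For each bipartition, keep the side that does not contain root_leaf.
    """
    return {
        side
        for bipartition in bipartitions
        for side in bipartition
        if root_leaf not in side
    }


taxa = "abcde"
C = {split("ab", taxa), split("de", taxa)}
for clade in sorted(clades_from_bipartitions(C, "a"), key=len):
    print(sorted(clade))
```

The resulting sets can then be handed to any procedure that builds a rooted tree from compatible clades.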

Computing trees from bipartitions Question: How are we to obtain bipartitions?

Additive Distance Matrices

Four-point condition Theorem (Buneman and others): A matrix D is additive if and only if for every four indices i, j, k, l, the maximum and median of the three pairwise sums D_ij + D_kl, D_ik + D_jl, and D_il + D_jk are identical; that is, after relabelling the indices if necessary, D_ij + D_kl ≤ D_ik + D_jl = D_il + D_jk. Proof: one direction is easy. The other direction requires some work!
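The condition can be checked mechanically. The sketch below is a hedged illustration that tests every set of four distinct indices; it assumes D is a symmetric square matrix (a list of lists) with zeros on the diagonal, and both the function name `satisfies_four_point` and the tolerance parameter `tol` for floating-point input are assumptions of this example.

```python
from itertools import combinations


def satisfies_four_point(D, tol=1e-9):
    """True if the two largest pairwise sums agree for all four distinct indices."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            D[i][j] + D[k][l],
            D[i][k] + D[j][l],
            D[i][l] + D[j][k],
        ])
        # sums[1] is the median and sums[2] is the maximum.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path distances in the tree ((0,1),(2,3)) with leaf edge lengths 1, 2, 3, 4
# and an internal edge of length 5.
D = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(satisfies_four_point(D))
```

For this matrix the three sums are 10, 20, and 20, so the maximum and the median coincide.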

Four-point method The Four-Point Method computes trees on quartets using the ideas in the four-point condition. Given a “dissimilarity” matrix D (which may not satisfy the triangle inequality, but will be symmetric and zero on the diagonal), we compute a tree on four leaves s_i, s_j, s_k, s_l as follows: if D_ij + D_kl is less than both D_ik + D_jl and D_il + D_jk, then set the tree to be ((s_i,s_j),(s_k,s_l)). The other two pairings are handled symmetrically.
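A minimal sketch of the Four-Point Method follows. It compares the three pairwise sums and returns the pairing with the strictly smallest sum. The function name `four_point_method` is illustrative, and returning `None` when the smallest sum is not unique is an assumption of this sketch; the slide does not say how ties should be handled.

```python
def four_point_method(D, i, j, k, l):
    """Return the quartet tree on indices i, j, k, l as a pair of pairs.

    Assumption: D is a symmetric list-of-lists matrix with zero diagonal.
    Returns None when no pairing has a strictly smallest sum.
    """
    pairings = [
        (D[i][j] + D[k][l], ((i, j), (k, l))),
        (D[i][k] + D[j][l], ((i, k), (j, l))),
        (D[i][l] + D[j][k], ((i, l), (j, k))),
    ]
    pairings.sort(key=lambda entry: entry[0])
    if pairings[0][0] == pairings[1][0]:
        return None  # tie: the quartet is left unresolved in this sketch
    return pairings[0][1]


D = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(four_point_method(D, 0, 1, 2, 3))
```

Applying this to every four-taxon subset yields a set of quartet trees, which is the input required by a quartet-based reconstruction method.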

So? We can compute a tree from its set of clades, bipartitions, or quartets. But how do we get these sets?
– Primary data are generally characters (columns within alignments of biomolecular sequences, morphological features, or other such features). These don't directly produce clades, bipartitions, or quartets.
We can compute a tree from an additive distance matrix. But how do we get these distances?
– Evolutionary biologists have techniques for estimating “evolutionary distances” between taxa. How do they do this?

Comparing two trees using bipartition sets To see if two trees T and T' are the same, write down C(T) and C(T') and see if they are the same set. When computing the error in an estimated tree T with respect to a true tree T*, we set
– C(T)-C(T*) = false positives, and
– C(T*)-C(T) = false negatives (missing branches)
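The two set differences can be computed directly once each tree's bipartitions are available. The sketch below assumes both trees are on the same leaf set and are stored as nested tuples with string leaves (rooted arbitrarily); the function names `bipartitions` and `bipartition_errors` are illustrative.

```python
def bipartitions(tree):
    """C(T) for a tree given as nested tuples, rooted at an arbitrary place."""
    clusters = []

    def leaves_below(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(leaves_below(child) for child in node))
        clusters.append(below)
        return below

    everything = leaves_below(tree)
    return {
        frozenset([cluster, everything - cluster])
        for cluster in clusters
        if cluster != everything
    }


def bipartition_errors(estimated, true):
    """Return (false positives, false negatives) of an estimated tree."""
    est, ref = bipartitions(estimated), bipartitions(true)
    return est - ref, ref - est


true_tree = (("a", "b"), ("c", ("d", "e")))
estimated_tree = (("a", "c"), ("b", ("d", "e")))
false_pos, false_neg = bipartition_errors(estimated_tree, true_tree)
print(len(false_pos), len(false_neg))
```

Here the estimated tree contains ac|bde, which is not in the true tree, and misses ab|cde; the trivial bipartitions are shared by both trees and cancel out.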

Consensus Trees Given a set of trees, we compute consensus trees to represent what they have in common. For example:
– The strict consensus
– The majority consensus
– The maximum agreement subtree
– The maximum compatible subset tree
Not all of these problems are solvable in polynomial time.
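For the first two items, a common formulation works on bipartition sets: the strict consensus keeps the bipartitions present in every input tree, and the majority consensus keeps those present in more than half of them. The sketch below computes only these bipartition sets, under the assumption that each input tree is already given as its set of bipartitions over the same taxa; the names `split`, `strict_consensus`, and `majority_consensus` are hypothetical.

```python
from collections import Counter


def split(side, taxa):
    """The bipartition of taxa with the given set on one side."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])


def strict_consensus(bipartition_sets):
    """Bipartitions that appear in every input tree."""
    return set.intersection(*(set(s) for s in bipartition_sets))


def majority_consensus(bipartition_sets):
    """Bipartitions that appear in more than half of the input trees."""
    counts = Counter(bp for s in bipartition_sets for bp in s)
    half = len(bipartition_sets) / 2
    return {bp for bp, count in counts.items() if count > half}


taxa = "abcde"
trees = [
    {split("ab", taxa), split("de", taxa)},
    {split("ab", taxa), split("de", taxa)},
    {split("ab", taxa), split("ce", taxa)},
]
print(len(strict_consensus(trees)), len(majority_consensus(trees)))
```

In this example ab|cde appears in all three trees, while de|abc appears in two of the three, so it survives only in the majority consensus.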

Compatibility trees The compatibility tree of a set of trees (all on the same set of leaves) is the minimal common refinement, if it exists. Determining whether a set of trees has a common refinement is solvable in polynomial time.

Homework assignments
– Wednesday, Sept 11: Read Chapters 2-5. Do problems 2.3(2), 2.5(1), 2.5(2), 2.5(3), 2.5(5), and 2.7(1).
– Wednesday, Sept 18: Do problems 5.2(1), 5.2(3), 5.2(4), and 5.2(5). Extra credit: 5.2(6) and 5.2(7).
Undergrads or biology students: you have the option of replacing any problems above with other problems (see webpage).