394C, Spring 2012 Jan 23, 2012 Tandy Warnow.

Slides:



Advertisements
Similar presentations
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Advertisements

Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
The (Supertree) of Life: Procedures, Problems, and Prospects Presented by Usman Roshan.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter : Strings and.
. Phylogenetic Trees (2) Lecture 12 Based on: Durbin et al Section 7.3, 7.8, Gusfield: Algorithms on Strings, Trees, and Sequences Section 17.
Phylogenetic trees Sushmita Roy BMI/CS 576
Phylogenetic Analysis. 2 Introduction Intension –Using powerful algorithms to reconstruct the evolutionary history of all know organisms. Phylogenetic.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogenetic Trees Tutorial 5. Agenda How to construct a tree using Neighbor Joining algorithm Phylogeny.fr tool Cool story of the day: Horizontal gene.
Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.
The bootstrap, consenus-trees, and super-trees Phylogenetics Workhop, August 2006 Barbara Holland.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
Tutorial 5 Phylogenetic Trees.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
SupreFine, a new supertree method Shel Swenson September 17th 2009.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Distance-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS 576 Fall 2015.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Chapter AGB. Today’s material Maximum Parsimony Fixed tree versions (solvable in polynomial time using dynamic programming) Optimal tree search.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
CS 466 and BIOE 498: Introduction to Bioinformatics
Constrained Exact Optimization in Phylogenetics
Distance-based phylogeny estimation
The Disk-Covering Method for Phylogenetic Tree Reconstruction
Phylogenetic basis of systematics
New Approaches for Inferring the Tree of Life
Statistical tree estimation
Distance based phylogenetics
Multiple Sequence Alignment Methods
Tandy Warnow Department of Computer Sciences
Challenges in constructing very large evolutionary trees
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
Multiple Alignment and Phylogenetic Trees
CIPRES: Enabling Tree of Life Projects
Mathematical and Computational Challenges in Reconstructing Evolution
Hierarchical clustering approaches for high-throughput data
Phylogenetic Trees.
Mathematical and Computational Challenges in Reconstructing Evolution
BNFO 602 Phylogenetics Usman Roshan.
Absolute Fast Converging Methods
CS 581 Tandy Warnow.
CS 581 Tandy Warnow.
Why Models of Sequence Evolution Matter
CS 581 Algorithmic Computational Genomics
Tandy Warnow Department of Computer Sciences
New methods for simultaneous estimation of trees and alignments
Texas, Nebraska, Georgia, Kansas
CS 581 Tandy Warnow.
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
The Most General Markov Substitution Model on an Unrooted Tree
Phylogeny.
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
Tandy Warnow The University of Texas at Austin
Computational Genomics Lecture #3a
Tandy Warnow The University of Texas at Austin
New methods for simultaneous estimation of trees and alignments
Presentation transcript:

394C, Spring 2012 Jan 23, 2012 Tandy Warnow

DNA Sequence Evolution -3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AGCGCTT AGCACAA TAGACTT TAGCCCA AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AAGGCCT TGGACTT TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT TAGCCCT AGCACTT

Phylogeny Problem U V W X Y X U Y V W AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT X U Y V W

Phylogenetic reconstruction methods Heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Phylogenetic trees Cost Global optimum Local optimum Polynomial time distance-based methods: Neighbor Joining, FastME, etc. 3. Bayesian MCMC methods.

Course outline Basics: phylogenies, data, stochastic models of evolution, and representations of trees Phylogeny reconstruction methods: distance-based and character-based (MP, ML, and Bayesian), and their performance issues Multiple sequence alignment, and the connections (both ways) between MSA and phylogenetics Special topics: reticulate evolution, whole genome evolution, metagenomics, etc. (Student interest will impact this.)

Today Newick Representations of trees Characterizations of trees using distances, clades, splits (bipartitions), and quartets Computing trees from dissimilarity matrices: the “naïve” quartet method (Hints) Connections to estimation of phylogenies from empirical data

Newick representations For a rooted tree, we represent a graph with a string with the taxa, commas, and nested parentheses. For example, what tree is represented by (a,(b,(c,((d,e),(f,g))))))? How do we represent an unrooted tree? (Easy - root it somewhere, and write down the Newick representation of the rooted version.

Rooted vs. unrooted Task: be able to move between rooted and unrooted representations of trees Task: be able to compare two trees and see if they are different or the same

Clades Definition: Let T be a rooted tree leaf-labelled by S, let v an internal node in T, and let Xv be the set of leaves in T below v. Let Clades(T) = {Xv: v in V(T)}. Note: Xv is also called the “cluster” at node v, so this is sometimes called Clusters(T). Question: Given Clades(T), can we compute T?

Bipartitions Given an edge e in a leaf-labelled unrooted tree T, the removal of the edge e (but not its endpoints) defines a bipartition on the leaves of the tree T. We denote by ce the bipartition defined by the edge e. We let C(T)={ce: e in E(T)}. Questions: Given C(T), can we compute T?

Quartet subtrees Given tree T leaf-labelled by S, and quartet a,b,c,d of leaves, we let T|{a,b,c,d} denote the minimal homeomorphic subtree of T restricted to {a,b,c,d}. We let Q(T) denote {T|X: X is a four taxon subset of S}. Question: Given Q(T), can we compute T?

Computing trees Given Q(T) (the quartet subtrees of T), can we determine T? Given C(T) (the bipartitions of S defined by the edges of T), can we determine T? Given Clades(T) (the sets of leaves defined by internal nodes in the rooted tree T), can we determine T?

Quartet-based reconstruction Definition: Let T be a tree leaf-labelled by a set S, and let Q(T) be the set of quartet subtrees of T (derived from each of the four-taxon subsets of S). Question: can we reconstruct T from Q(T)?

Computing T from Q(T): Naïve Quartet Method Given Q(T): Find a sibling pair A, B (a pair of leaves which are always together in every quartet in which they both appear) Compute the tree T’ for S-{A} by recursing on the subset of Q(T) that doesn’t include taxon A Insert A into T’ by making A sibling to B, and return the tree obtained

Analysis of the algorithm Questions: Accuracy? Running time? But: how are we to compute quartet subtrees?

Clade compatibility Definition: Let T be a rooted tree leaf-labelled by S, v an internal node in T, and Xv the leaves in T below v. Let Clades(T)={Xv: v in V(T)}. Theorem: Let X be a set of subsets of S. Then there exists a tree T leaf-labelled by S such that X = Clades(T) if and only if for all A, B in X, either A and B are disjoint, or one contains the other.

Proof of the theorem One direction is easy The other direction is a proof by construction!

Computing rooted trees from clades Partially order the set of clades by containment, add in the full set S, and compute the Hasse Diagram of the resultant poset

Tree construction from clades Questions: Accuracy? Running time? But, how are we to compute clades?

Bipartition compatibility Definition: Let X be a set of bipartitions on a set S. Then X is said to be compatible if there exists a tree T leaf-labelled by S such that X = C(T), where C(T) = {ce: e in E(T)}. Question: Can we construct the tree T from C(T)?

Computing trees from bipartitions Given the set of bipartitions on the leaf-set induced by the edges of a tree T, how can we compute the tree T? Hint: “root” the tree T by picking it up at a leaf, and then consider the set of bipartitions as a set of “clades”, and apply the previous algorithm. (Note: the choice of leaf does not matter!)

Computing trees from bipartitions Questions: How are we to obtain bipartitions?

Additive Distance Matrices

Four-point condition Theorem (Buneman and others): A matrix D is additive if and only if for every four indices i,j,k,l, the maximum and median of the three pairwise sums are identical Dij+Dkl < Dik+Djl = Dil+Djk Proof: one direction is easy. The other direction requires some work!

Four-point method then set the tree to be ((si,sj),(sk,sl)). The Four-Point Method computes trees on quartets using the ideas in the Four-point condition Given a “dissimilarity” matrix D (may not satisfy the triangle inequality, but will be symmetric and zero on the diagonal), we compute a tree on four leaves si,sj,sk,sl as follows: If Dij+Dkl is less than both Dik+Djl, and Dil+Djk then set the tree to be ((si,sj),(sk,sl)).

So? We can compute a tree from its set of clades, bipartitions, or quartets. But how do we get these sets? Primary data are generally characters (columns within alignments of biomolecular sequences, morphological features, or other such features). These don’t directly produce clades, bipartitions, or quartets. We can compute a tree from an additive distance matrix. But how do we get these distances? Evolutionary biologists have techniques for estimating “evolutionary distances” between taxa. How do they do this?

Comparing two trees using bipartition sets To see if two trees T and T’ are the same, write down C(T) and C(T’) and see if they are the same set. When computing the error in an estimated tree T with respect to a true tree T*, we set C(T)-C(T*) = false positives, and C(T*)-C(T) = false negatives (missing branches)

Homework assignments Monday Jan 30: Read chapters 1-5. Do problems 2.3(2), 2.5(1), 2.5(2), 2.5(3), and 2.5(5). Monday Feb 6: Do problems 5.2(3), 5.2(4), and 5.2(5). Extra credit: 5.2(6) and 5.2(7). Undergrads or biology students: you have the option of replacing any problems above with other problems (see webpage).