CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.

Phylogeny
[Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.]

Reconstructing the “Tree” of Life
Handling large datasets: millions of species

Data
– Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment
– Molecular markers (e.g., SNPs, RFLPs)
– Morphology
– Gene order and content
These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character.
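To make the "character as a function on taxa" idea concrete, here is a minimal sketch that treats one column of an alignment as a character and groups the taxa into the equivalence classes it induces. The taxon names t1 to t4 and the function name `character_classes` are hypothetical; the four toy sequences are the ones used in the maximum parsimony example later in the slides.

```python
from collections import defaultdict

def character_classes(alignment, site):
    """Group taxa by their state at one alignment column (one character)."""
    classes = defaultdict(set)
    for taxon, seq in alignment.items():
        classes[seq[site]].add(taxon)
    return dict(classes)

# Hypothetical taxon names attached to the toy sequences.
alignment = {"t1": "ACT", "t2": "ACA", "t3": "GTT", "t4": "GTA"}

for site in range(3):
    # Each column maps every taxon to a state; taxa sharing a state form one class.
    print(site, character_classes(alignment, site))
```

Each printed dictionary is one character: its keys are the states and its values are the sets of taxa that share that state.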

Standard Phylogenetic Analyses
Step 1: Gather sequence data, and estimate the multiple alignment of the sequences.
Step 2: Estimate the evolutionary history from the multiple alignment. (This can result in many trees.)
Step 3: Apply consensus methods (and other techniques) to the set of trees to figure out what is reliable.

DNA Sequence Evolution (greatly simplified!)
[Figure: a tree of evolving DNA sequences drawn against a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown, in the order they appear: AAGACTT; TGGACTT, AAGGCCT; AGGGCAT, TAGCCCT, AGCACTT; TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT.]

Phylogeny Problem
[Figure: the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT, labeled with the taxa U, V, W, X, Y, next to a tree with leaves U, V, W, X, Y.]

Issues in reconstructing evolutionary histories
In almost all cases, we don’t know the true evolutionary history of a dataset; how can we tell if we have the right answer? Biostatisticians have addressed this difficulty by modelling the evolutionary process as an unknown stochastic process operating on an unknown tree. This modelling allows us to study phylogeny reconstruction as a statistical inverse problem.

The Jukes-Cantor model of site evolution
– Each “site” is a position in a sequence.
– The state (i.e., nucleotide) of each site at the root is random.
– The sites evolve independently and identically (i.i.d.).
– If a site changes its state on an edge, it changes with equal probability to each of the other states.
– For every edge e, p(e) is defined as the probability of change for a random site on the edge e.
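The rules above can be sketched as a small simulation of one edge. This is an illustration under stated assumptions, not code from the course: the function names are hypothetical, the root state is taken to be uniform over the four nucleotides, and `p` plays the role of p(e).

```python
import random

NUCLEOTIDES = "ACGT"

def random_root(length, rng):
    """Draw a root sequence; each site is uniform over the four nucleotides (assumed)."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

def evolve_on_edge(seq, p, rng):
    """Return the sequence at the far end of an edge with change probability p."""
    out = []
    for state in seq:
        if rng.random() < p:
            # On a change, pick uniformly among the three other nucleotides.
            out.append(rng.choice([s for s in NUCLEOTIDES if s != state]))
        else:
            out.append(state)
    return "".join(out)

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
root = random_root(20, rng)
child = evolve_on_edge(root, 0.1, rng)
print(root)
print(child)
```

Sites are handled one at a time with the same `p`, which is what the i.i.d. assumption amounts to in this sketch. Applying `evolve_on_edge` down every edge of a tree would generate sequences at the leaves.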

Quantifying Topological Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with the FN and FP edges marked; 50% error rate.]
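One common way to compute these rates is to compare the bipartitions of the taxa induced by the internal edges of each tree. The sketch below assumes that representation; the helper names and the two five-taxon trees are made up for illustration and are not the trees in the figure.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that omits a fixed reference taxon."""
    side = frozenset(side)
    return side if min(taxa) not in side else frozenset(taxa) - side

def error_rates(true_splits, est_splits, taxa):
    """Return (FN rate, FP rate) from the internal-edge bipartitions of two trees."""
    true_set = {canonical(s, taxa) for s in true_splits}
    est_set = {canonical(s, taxa) for s in est_splits}
    fn = len(true_set - est_set) / len(true_set)  # true edges missing from the estimate
    fp = len(est_set - true_set) / len(est_set)   # estimated edges not in the true tree
    return fn, fp

taxa = {"U", "V", "W", "X", "Y"}
true_tree = [{"U", "V"}, {"X", "Y"}]  # splits UV|WXY and XY|UVW
est_tree = [{"U", "V"}, {"W", "Y"}]   # splits UV|WXY and WY|UVX
print(error_rates(true_tree, est_tree, taxa))  # (0.5, 0.5)
```

In this toy case one of the two true internal edges is missing and one of the two estimated internal edges is wrong, so both rates are 50%.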

Statistical consistency and convergence rates

Statistical performance issues
– Statistical consistency: an estimation method is statistically consistent under a model if the probability that the method returns the true tree goes to 1 as the sequence length goes to infinity.
– Convergence rate: the amount of data that a method needs to return the true tree with high probability, as a function of the model tree.
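The two definitions can be written symbolically as follows. This is only one possible notation, assumed here for illustration: Φ is the estimation method, (T, θ) is the model tree with its parameters, S_k is a set of sequences of length k generated on (T, θ), and ε is the allowed failure probability.

```latex
% Statistical consistency of \Phi under the model:
\lim_{k \to \infty} \Pr\left[\Phi(S_k) = T\right] = 1

% Convergence rate: the sequence length needed for accuracy 1 - \varepsilon,
% viewed as a function of the model tree (T, \theta):
k_{\Phi}(T, \theta, \varepsilon)
  = \min\left\{ k : \Pr\left[\Phi(S_{k'}) = T\right] \ge 1 - \varepsilon
    \text{ for all } k' \ge k \right\}
```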

Absolute fast convergence vs. exponential convergence

Performance criteria for phylogeny reconstruction methods
– Speed
– Space
– Optimality criterion accuracy
– “Topological accuracy” (specifically statistical consistency, convergence rate, and performance on finite data)
These criteria can be established theoretically, or evaluated on real or simulated data.

Statistical research in phylogenetics
– For a given stochastic model of evolution, prove theorems about the statistical consistency and convergence rates of phylogeny reconstruction methods.
– Propose a new stochastic model of evolution, and study the theoretical performance of new or existing phylogeny reconstruction methods.
– For each of the above, the same issues can be studied in simulation.

Statistical performance
– Standard distance-based methods and Maximum Likelihood (solved exactly) are statistically consistent under standard models.
– Maximum Parsimony is not always statistically consistent, even for the simplest DNA model.
– But under more complex models, statistical consistency is impossible!
– Simulations tell a more complex story…

Methods for phylogenetic inference
– Statistically consistent methods: some polynomial time distance-based methods, exact solutions to maximum likelihood, and Bayesian MCMC methods, provided that the stochastic model is simple enough.
– Statistically inconsistent methods: heuristics for hard optimization problems (such as maximum parsimony and maximum likelihood), and exact solutions to maximum parsimony.

Empirical and experimental research shows…
– Statistical inconsistency does not necessarily imply that a method won’t produce a highly accurate estimate of evolution: performance on finite data is different from performance on infinite data.
– “Real” molecular evolution isn’t as simple as the models used in either estimating or simulating evolution.
– The optimization problems behind all the “best” methods can be extremely hard to solve on large real datasets!

Grand Challenges
– The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets.
– MCMC methods are increasingly used (often as a surrogate for a decent ML analysis), but it is not clear how to evaluate MCMC methods.

Combinatorial algorithms research in phylogenetics
– For a given optimization problem, determine its computational complexity.
– Since most problems in this research area are NP-hard, seek approximation algorithms and prove error bounds.
– Since approximation algorithms aren’t that useful in practice, develop heuristics and study their performance on real or simulated data.
– Reconsider whether your optimization problem was worth solving!

Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– with additional sequences of length k labeling the internal nodes of T
such that the sum, over all edges (u, v) of T, of the Hamming distance H(u, v) between the sequences at u and v is minimized.
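The objective being minimized can be sketched directly: given a tree whose nodes are all labeled with sequences, add up the Hamming distances across its edges. The node names and the edge-list representation below are hypothetical choices for illustration; the sequences are the four from the example that follows in the slides, with ACA and GTA as internal labels.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def tree_length(edges, label):
    """Sum of Hamming distances over the edges of a fully labeled tree."""
    return sum(hamming(label[u], label[v]) for u, v in edges)

# Hypothetical node names: leaves a-d, internal nodes x and y.
label = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA", "x": "ACA", "y": "GTA"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
print(tree_length(edges, label))  # 4
```

Maximum parsimony asks for the tree and the internal labels that make this sum as small as possible.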

Maximum parsimony (example)
Input: Four sequences
– ACT
– ACA
– GTT
– GTA
Question: which of the three trees has the best MP score?

Maximum Parsimony
[Figure: three candidate trees on the four sequences. Leaves as drawn: (ACT, GTT, ACA, GTA); (ACA, ACT, GTA, GTT); (ACT, ACA, GTT, GTA).]

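Maximum Parsimony
[Figure: the same three trees with internal node labels and scores. Tree 1 (leaves ACT, GTT, ACA, GTA; internal label GTA shown): MP score = 5. Tree 2 (leaves ACA, ACT, GTA, GTT; internal labels ACA, ACT): MP score = 7. Tree 3 (leaves ACT, ACA, GTT, GTA; internal labels ACA, GTA): MP score = 4, the optimal MP tree.]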

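Maximum Parsimony: computational complexity
[Figure: the optimal tree on ACT, ACA, GTT, GTA with internal labels ACA and GTA; MP score = 4.]
– Finding the optimal MP tree is NP-hard (and real datasets can take a very long time).
– For a fixed tree, an optimal labeling of the internal nodes can be computed in linear time, O(nk).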
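The fixed-tree computation can be sketched with Fitch's algorithm, a standard method that the slides do not name; it is used here as an assumption about what the linear-time claim refers to. Trees are written as nested pairs of leaf sequences, rooted on the internal edge, which does not change the score. The leaf pairings of the three trees are inferred from the figure text, so treat them as an assumption too.

```python
def fitch(tree, site):
    """Return (state set, change count) for one site on a rooted binary tree.

    A tree is either a leaf sequence (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_set, left_cost = fitch(tree[0], site)
    right_set, right_cost = fitch(tree[1], site)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no change is needed here.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, k):
    """Minimum number of changes on a fixed tree, summed over k sites."""
    return sum(fitch(tree, s)[1] for s in range(k))

trees = [
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
print([parsimony_score(t, 3) for t in trees])  # [5, 6, 4]
```

Each site visits every node once, which gives the O(nk) bound for a fixed alphabet. For the second topology this sketch reports 6; the 7 shown on the earlier slide appears to be the cost of the particular internal labels drawn there (ACA and ACT) rather than of an optimal labeling. The ranking of the three trees is the same either way.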

Evaluating MP heuristics with respect to MP scores
[Figure (fake study): MP score of best trees plotted against time, comparing the performance of Heuristic 1 and Heuristic 2.]

A sample of some algorithmic questions we will discuss
– Large-scale phylogeny reconstruction (thousands of sequences)
– Multiple sequence alignment, and especially simultaneous estimation of trees and alignments
– Reticulate evolution (e.g., detection and reconstruction of horizontal gene transfer)
– Whole genome phylogenetics (e.g., using rearrangement events)
– New consensus methods
– Supertree methods
– Historical linguistics (i.e., reconstructing the evolutionary history of a language family)

Summary (?)
Research in this area combines mathematical modelling, probability theory, combinatorial optimization, and a lot of graph theory, in order to develop algorithms with good performance. Performance is evaluated theoretically, on real data (for optimization criteria), and on simulated data (for topological accuracy).

Other stuff…
– No biology, probability, or statistics background required.
– Basic background in algorithm design and analysis is expected. However, if you don’t have that background (e.g., if you are a biology student), please come see me.
– Please look at my class homepage for office hours and other announcements: