SuperFine, Enabling Large-Scale Phylogenetic Estimation Shel Swenson University of Southern California and Georgia Institute of Technology.



Phylogeny (evolutionary tree)
[Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.]
“Nothing in Biology makes sense except in the light of evolution” – Dobzhansky

Tree of Life, Importance to Biology
– Biomedical applications
– Mechanisms of evolution
– Tracking ancient migrations
– Protein structure and function
– Drug design
[Figure label: "We are here". Image credits: 1) Nature Reviews (Genetics), 2) Howard Hughes Medical Institute (BioInteractive), 3) 1000 Genomes Project.]

DNA sequence evolution (idealized)
[Figure: a tree of DNA sequences evolving from the ancestral sequence AAGACTT at -3 million yrs, through intermediate sequences at -2 million yrs and -1 million yrs, to the sequences observed today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]

Phylogeny Problem
[Figure: five sequences (AGATTA, AGACTA, TGGACA, TGCGACT, AGGTCA) for the taxa U, V, W, X, Y, and a tree with leaves U, V, W, X, Y.]

Two basic approaches for tree estimation on multi-gene datasets
– Apply phylogeny estimation methods to concatenated (“combined”) sequence alignments for different genes
– Compute trees on individual genes and apply a supertree method
This talk: SuperFine boosts supertree methods, enabling faster, more accurate estimation for large-scale problems.

Using multiple genes

gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA

gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC

gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA

Concatenation

    gene 1      gene 2      gene 3
S1  TCTAATGGAA  ??????????  TATTGATACA
S2  GCTAAGGGAA  ??????????  ??????????
S3  TCTAAGGGAA  ??????????  TCTTGATACC
S4  TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5  ??????????  GCTAAACCTC  ??????????
S6  ??????????  GGTGACCATC  ??????????
S7  TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8  TATAACGGAA  ??????????  TAGTGATGCA
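A minimal sketch of how such a concatenated matrix could be assembled, assuming each gene alignment is held as a dictionary from taxon name to aligned sequence. The function name concatenate and the variable names are illustrative only; the sequences are the ones from the "Using multiple genes" slide, and a taxon missing from a gene is padded with "?" characters.

```python
# Gene alignments transcribed from the "Using multiple genes" slide.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}


def concatenate(genes, missing="?"):
    """Join per-gene alignments into one row per taxon (hypothetical helper).

    A taxon absent from a gene receives a run of `missing` characters
    as long as that gene's alignment.
    """
    taxa = sorted(set().union(*genes))  # every taxon seen in any gene
    rows = {}
    for taxon in taxa:
        parts = []
        for gene in genes:
            length = len(next(iter(gene.values())))  # alignment length
            parts.append(gene.get(taxon, missing * length))
        rows[taxon] = "".join(parts)
    return rows


matrix = concatenate([gene1, gene2, gene3])
for taxon, row in matrix.items():
    print(taxon, row)
```

Only S4 and S7 are present in all three genes, so most rows consist largely of missing data. This is the situation that motivates supertree methods.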

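Two competing approaches
[Figure: a species-by-gene data set (gene 1, gene 2, ..., gene k). Either the genes are concatenated and analyzed together (Concatenation), or each gene is analyzed separately and the resulting trees are combined by a supertree method.]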

Why use supertree methods?
– Missing data
– Large dataset sizes
– Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)
– Unavailable sequence data (only trees)

Many Supertree Methods
– MRP (Matrix Representation with Parsimony): most commonly used and among the most accurate
– weighted MRP
– Min-Cut
– Modified Min-Cut
– Semi-strict Supertree
– MRF
– MRD
– QILI
– SDM
– Q-imputation
– PhySIC
– Majority-Rule Supertrees
– Maximum Likelihood Supertrees
– and many more...

Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with the FN and FP edges marked; 50% error rate.]
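A minimal sketch of how FN and FP rates could be computed, assuming each tree is represented by its set of non-trivial bipartitions and that each rate is normalized by the number of internal edges in the true tree (FN) or the estimated tree (FP). The helper names bip and error_rates and the two toy trees are illustrative, not taken from the talk.

```python
def bip(text):
    """Parse 'ab|cdef' into an unordered pair of leaf sets (single-letter leaves)."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})


def error_rates(true_bips, est_bips):
    """Return (FN rate, FP rate) for an estimated tree against the true tree."""
    missing = true_bips - est_bips      # true edges absent from the estimate
    incorrect = est_bips - true_bips    # estimated edges not in the true tree
    fn_rate = len(missing) / len(true_bips) if true_bips else 0.0
    fp_rate = len(incorrect) / len(est_bips) if est_bips else 0.0
    return fn_rate, fp_rate


# Toy example on leaves a..f (hypothetical trees).
true_tree = {bip("ab|cdef"), bip("abc|def"), bip("abcd|ef")}
estimated = {bip("ab|cdef"), bip("abd|cef"), bip("abcd|ef")}

fn, fp = error_rates(true_tree, estimated)
print(f"FN rate = {fn:.2f}, FP rate = {fp:.2f}")
```

For two fully resolved trees on the same leaves the FN and FP rates coincide. They differ once the estimate contains polytomies: an unresolved tree can have a low FP rate and a high FN rate.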

FN Rate: MRP vs. Concatenation
[Plot: FN rate (%) against scaffold density (%) for MRP and concatenation.]
Concatenation is not always an option.
We need better supertree methods.

FN Rate: SuperFine vs. MRP and Concatenation
[Plot: FN rate (%) against scaffold density (%) for MRP, SuperFine, and concatenation.]

Running Time: SuperFine vs. MRP (Concatenation is much slower)
MRP 8-12 sec. SuperFine 2-3 sec.
[Plot: running time (minutes) against scaffold density (%) for MRP and SuperFine.]

Idea behind SuperFine
1. Construct a supertree with a low false positive rate
2. Reduce false negatives by resolving areas of uncertainty using a supertree method (e.g., Quartet Max Cut)
(Swenson et al., Systematic Biology, 2011)

Bipartitions and refinement
Let B(T) denote the set of (non-trivial) bipartitions induced by the edges of T.
T refines T’ (T’ ≤ T) if B(T) ⊇ B(T’).
[Figure: trees T and T’ on leaves a, b, c, d, e, f. T’ contains a polytomy; T is a refinement of T’.]
B(T) = {ab|cdef, abc|def, abcd|ef}
B(T’) = {ab|cdef, abc|def}
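The definition can be checked mechanically. The sketch below assumes trees are written as nested tuples of leaf names (a representation chosen here for illustration), computes B(T) by walking the tree, and tests refinement as set containment. The two trees are one possible drawing of T and T’ that yields the bipartition sets listed on the slide.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of leaf-name strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """B(T): the non-trivial bipartitions induced by the edges of the tree."""
    all_leaves = leaves(tree)
    result = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        above = all_leaves - below
        if len(below) >= 2 and len(above) >= 2:  # skip trivial splits
            result.add(frozenset({below, above}))
        for child in node:
            visit(child)

    visit(tree)
    return result


def refines(t, t_prime):
    """T refines T' exactly when B(T) contains every bipartition of B(T')."""
    return bipartitions(t_prime) <= bipartitions(t)


T = ((("a", "b"), "c"), "d", ("e", "f"))   # B(T) = {ab|cdef, abc|def, abcd|ef}
T_prime = ((("a", "b"), "c"), "d", "e", "f")  # polytomy joining d, e, f

print(refines(T, T_prime))   # T refines T'
print(refines(T_prime, T))   # but not the other way round
```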

Idea behind SuperFine
1. Construct a supertree with low FP using the Strict Consensus Merger (SCM) (Huson et al. 1999)
2. Reduce FN by resolving each polytomy using a supertree method (e.g., Quartet Max Cut)

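Strict Consensus Merger (SCM)
[Figure: two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, are merged through their shared leaves a, b, c, d into a single tree on a, b, c, d, e, f, g, h, i, j.]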
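As a partial illustration only, the sketch below covers the first ingredient of a strict consensus merger: restricting two source trees to their shared leaves and keeping the bipartitions on which they agree. Reattaching the leaves unique to each tree and collapsing edges where the attachments collide is not shown. The tree topologies, the helper names, and the bipartition-set representation are assumptions made for this example; the slide's actual trees are not recoverable from the transcript.

```python
def bip(text):
    """Parse 'ab|cdefg' into an unordered pair of leaf sets (single-letter leaves)."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})


def restrict(bips, keep):
    """Restrict each bipartition to the leaves in `keep`, dropping trivial ones."""
    restricted = set()
    for bipartition in bips:
        sides = [side & keep for side in bipartition]
        if all(len(side) >= 2 for side in sides):
            restricted.add(frozenset(sides))
    return restricted


def shared_backbone(bips1, leaves1, bips2, leaves2):
    """Strict consensus of the two trees restricted to their common leaves."""
    common = leaves1 & leaves2
    return restrict(bips1, common) & restrict(bips2, common)


# Hypothetical source trees on the leaf sets shown in the figure.
leaves1 = frozenset("abcdefg")
tree1 = {bip("ab|cdefg"), bip("abe|cdfg"), bip("cd|abefg"), bip("fg|abcde")}
leaves2 = frozenset("abcdhij")
tree2 = {bip("ab|cdhij"), bip("abh|cdij"), bip("cd|abhij"), bip("ij|abcdh")}
# A second tree on the same leaves that disagrees about a, b, c, d.
tree3 = {bip("ac|bdhij"), bip("ach|bdij"), bip("bd|achij"), bip("ij|abcdh")}

print(shared_backbone(tree1, leaves1, tree2, leaves2))  # agreement: ab|cd
print(shared_backbone(tree1, leaves1, tree3, leaves2))  # conflict: nothing kept
```

When the source trees disagree on the shared leaves, the backbone loses resolution rather than choosing a side, which is consistent with the low FP rate and high FN rate reported for SCM.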

Property of SCM: Bipartitions in the SCM tree correspond to bipartitions in the source trees (Swenson, Ph.D. Thesis, 2009)
[Figure: the two source trees (on a, b, c, d, e, f, g and on a, b, c, d, h, i, j) and their SCM tree.]

Performance of SCM
– Low false positive (FP) rate (estimated supertree has few false edges)
– High false negative (FN) rate (estimated supertree is missing many true edges)
– Runs in polynomial time (in the number of source trees and total number of species)

Idea behind SuperFine
1. Construct a supertree with low FP using SCM
2. Refine the tree to reduce FN by resolving each polytomy using a supertree method (e.g., MRP or Quartet Max Cut)

Resolving a single polytomy, v
Step 1: Reduce each source tree to a tree on {1,2,...,d}, where d = degree(v)
Step 2: Apply MRP to the collection of reduced trees, to produce a tree t on leaf set {1,2,...,d}
Step 3: Replace the star tree at v by tree t
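One way Step 1 could be expressed on bipartition sets is sketched below. It assumes that removing v splits the leaves of the SCM tree into d components, numbered 1 to d, and that each source-tree leaf is relabeled by its component. A source bipartition is kept only if no component appears on both sides and at least two components appear on each side. This encoding, the function name reduce_tree, and the toy data are illustrative assumptions and are not taken from the SuperFine implementation.

```python
def bip(text):
    """Parse 'ab|cdf' into an unordered pair of leaf sets (single-letter leaves)."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})


def reduce_tree(source_bips, component_of):
    """Relabel a source tree by polytomy component and keep informative splits."""
    reduced = set()
    for bipartition in source_bips:
        side_a, side_b = bipartition
        labels_a = frozenset(component_of[leaf] for leaf in side_a)
        labels_b = frozenset(component_of[leaf] for leaf in side_b)
        if labels_a & labels_b:
            continue  # the edge lies inside a single component
        if len(labels_a) >= 2 and len(labels_b) >= 2:
            reduced.add(frozenset({labels_a, labels_b}))
    return reduced


# Hypothetical polytomy of degree d = 4 with components {a,b}, {c}, {d,e}, {f}.
component_of = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3, "f": 4}

# Hypothetical source tree on leaves a, b, c, d, f.
source = {bip("ab|cdf"), bip("abc|df")}

reduced = reduce_tree(source, component_of)
print(reduced)  # the single split {1,2} | {3,4}
```

The reduced trees have at most d leaves each, however large the original source trees were. This matches the slide's remark that the speed-up comes from re-encoding source trees as smaller trees.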

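Back to Our Example
[Figure: the two source trees (on a, b, c, d, e, f, g and on a, b, c, d, h, i, j) and their SCM tree on a, b, c, d, e, f, g, h, i, j, which contains a polytomy.]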

Where We Use the Property
[Figure: the SCM tree and the two source trees (on a, b, c, d, e, f, g and on a, b, c, d, h, i, j).]

Step 1: Reduce each source tree to a tree on the set {1,2,...,d}
[Figure: the two source trees (on a, b, c, d, e, f, g and on a, b, c, d, h, i, j) before reduction.]

Step 2: Apply MRP to the collection of reduced trees
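MRP works on a binary matrix built from the source trees. The sketch below builds that matrix under the usual matrix-representation encoding: one column per internal edge of each source tree, with "0" and "1" for the two sides of the edge and "?" for taxa absent from that tree. The parsimony search that would then be run on the matrix is not shown. The function name mrp_matrix, the choice of which side is coded "0", and the two small input trees are assumptions made for illustration.

```python
def mrp_matrix(source_trees):
    """Build MRP character rows from (leaf_set, bipartitions) pairs.

    Each bipartition becomes one column. The side containing the tree's
    smallest leaf is coded '0', the other side '1', and taxa missing
    from that tree are coded '?'.
    """
    taxa = sorted(set().union(*(leaf_set for leaf_set, _ in source_trees)))
    rows = {taxon: [] for taxon in taxa}
    for leaf_set, bips in source_trees:
        anchor = min(leaf_set)
        # Sort so that the column order is deterministic.
        for bipartition in sorted(bips, key=lambda b: sorted(map(sorted, b))):
            zero_side = next(side for side in bipartition if anchor in side)
            for taxon in taxa:
                if taxon not in leaf_set:
                    rows[taxon].append("?")
                elif taxon in zero_side:
                    rows[taxon].append("0")
                else:
                    rows[taxon].append("1")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two hypothetical reduced trees on labels drawn from {1, ..., 5}.
tree_a = ({1, 2, 3, 4}, {frozenset({frozenset({1, 2}), frozenset({3, 4})})})
tree_b = ({1, 3, 4, 5}, {frozenset({frozenset({1, 3}), frozenset({4, 5})})})

matrix = mrp_matrix([tree_a, tree_b])
for taxon, row in matrix.items():
    print(taxon, row)
```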

Replace polytomy using tree from MRP
[Figure: the polytomy in the SCM tree is replaced by the tree returned by MRP, giving a more resolved supertree on a, b, c, d, e, f, g, h, i, j.]

FN Rate: SuperFine vs. MRP and Concatenation
[Plot: FN rate (%) against scaffold density (%) for MRP, SuperFine, and concatenation.]

Running Time: SuperFine vs. MRP (Concatenation is much slower)
MRP 8-12 sec. SuperFine 2-3 sec.
[Plot: running time (minutes) against scaffold density (%) for MRP and SuperFine.]

SuperFine: Boosting supertree methods
SuperFine+MRP vs. MRP (Swenson et al. 2011)
– SuperFine combines the features of the SCM method (polynomial time, low false positive rates) with the lower false negative rate of MRP, to achieve greater accuracy in less time.
– Speed-up results from the re-encoding of source trees as smaller trees.
SuperFine+QMC vs. QMC (quartet-based)
– QMC (Snir 2008): polynomial time, but infeasible for 500+ taxa
– SuperFine+QMC runs where QMC cannot (Swenson et al. 2010)
SuperFine+MRL vs. MRL (likelihood) (Nguyen et al. 2012)
– SuperFine+MRL: faster and more accurate, with similar likelihood scores
DACTAL (Nelesen et al. 2012)
– Boosts concatenation methods; uses SuperFine in its divide-and-conquer strategy

Ongoing and Future Work
Supertree Methods
– Exploring algorithm design space for SuperFine
– Incorporate confidence values from source trees
Species trees from incongruent gene trees
– ASTRAL (in progress): statistically consistent method, genome-scale
– Species tree estimation in the presence of missing data
Collaborators: Randy Linder, Rahul Suri, Tandy Warnow
Acknowledgements: Kevin Liu, Serita Nelesen, Li-San Wang
Funding: NSF DEB , NSF ITR

Research Outside of Phylogenetics: Pattern Discovery over Many Weighted Co-Expression Networks (using graph algorithms on distributed cyber-infrastructure)
Collaborators: Wenyuan Li, Viktor Prasanna, Santosh Ravi, Yogesh Simmhan, and Xianghong Jasmine Zhou
Funding: NSF CCF , NSF CNS (EAGER)