Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.

Slides:



Advertisements
Similar presentations
New methods for simultaneous estimation of trees and alignments Tandy Warnow The University of Texas at Austin.
Advertisements

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
Molecular Evolution Revised 29/12/06
Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
IPAM Workshop on Multiple Sequence Alignment Tandy Warnow The University of Illinois at Urbana-Champaign.
Computing the Tree of Life The University of Texas at Austin Department of Computer Sciences Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
New techniques that “boost” methods for large-scale multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign
SuperFine, Enabling Large-Scale Phylogenetic Estimation Shel Swenson University of Southern California and Georgia Institute of Technology.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
394C, October 2, 2013 Topics: Multiple Sequence Alignment
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
Family of HMMs Nam Nguyen University of Texas at Austin.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Progress and Challenges for Large-Scale Phylogeny Estimation Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for.
The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.
Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.
Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.
CS 466 and BIOE 498: Introduction to Bioinformatics
Advances in Ultra-large Phylogeny Estimation
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Multiple Sequence Alignment Methods
Tandy Warnow Department of Computer Sciences
The ideal approach is simultaneous alignment and tree estimation.
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
New methods for simultaneous estimation of trees and alignments
CS 581 Algorithmic Computational Genomics
New methods for simultaneous estimation of trees and alignments
Texas, Nebraska, Georgia, Kansas
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
Recent Breakthroughs in Mathematical and Computational Phylogenetics
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
New methods for simultaneous estimation of trees and alignments
Presentation transcript:

Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and the Smithsonian Institution

Species phylogeny Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona

Project Components Algorithms and Software Simulations Outreach to ATOL and the scientific community Undergraduate training

Personnel Tandy Warnow (UT-Austin) Mark Holder (Kansas) Jim Leebens-Mack (UGA) Randy Linder (UT-Austin) Etsuko Moriyama (UNL) Michael Braun (Smithsonian) Webb Miller (PSU) Usman Roshan (NJIT) Postdocs: Derrick Zwickl (NESCENT) PhD Students: Cory Strope (UNL), Serita Nelesen (UT-Austin), Kevin Liu (UT-Austin), Sindhu Raghavan (UT-Austin), Michael McKain (UGA) Undergraduates: from Huston-Tillotson and the University of Georgia

Step 1: Gather data S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Step 2: Multiple Sequence Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Step 3: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

Multiple Sequence Alignment -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-- TAG-CT GACCGC-- Notes: 1. We insert gaps (dashes) to each sequence to make them “line up”. 2. Nucleotides in the same column are presumed to have a common ancestor (i.e., they are “homologous”). AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC

Alignment methods The standard alignment method for phylogeny is Clustal (or one of its derivatives) Others: ProbCons, MAFFT, Muscle, POA, POY, T-Coffee, Di-Align On the basis of various tests, ProbCons, Mafft, and Muscle are generally considered the “best”.

Basic Questions Using simulations: Does improving the alignment lead to an improved phylogeny? Using Tree of Life (real) datasets: –How much does changing the alignment method change the resultant alignments? –How much does changing the alignment method change the estimated tree? –What gap patterns do we see on hand-curated alignments, and what biological processes created them?

Basic Questions Using simulations: Does improving the alignment lead to an improved phylogeny? Using Tree of Life (real) datasets: –How much does changing the alignment method change the resultant alignments? –How much does changing the alignment method change the estimated tree? –What gap patterns do we see on hand-curated alignments, and what biological processes created them?

Simulation study Simulate sequence evolution down a tree Estimate alignments on each set of sequences Compare estimated alignments to the true alignment Estimate trees on each alignment Compare estimated trees to the true tree

Simulation study Simulate sequence evolution down a tree Estimate alignments on each set of sequences Compare estimated alignments to the true alignment Estimate trees on each alignment Compare estimated trees to the true tree

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

indels (insertions and deletions) also occur! …ACGGTGCAGTTACCA… …ACCAGTCACCA… MutationDeletio n

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion …ACCAGTCACCA…

…ACGGTGCAGTTACCA… …ACCAGTCACCA… Mutation Deletion The true pairwise alignment is: …ACGGTGCAGTTACCA… …AC----CAGTCACCA… The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.

Alignment Error Calculation A C A T G True alignment C A A - G A T G A C A T G Est. alignment - C A A G A T G

Alignment Error Calculation A C A T G True alignment C A A - G A T G A C A T G Est. alignment - C A A G A T G 75% of the correct pairs are missing!

FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FN FP

Our simulation studies (using ROSE*) Amino-acid evolution (Wang et al., unpublished): –BaliBase and birth-death model trees, 12 taxa to 100 taxa. –Average gap length 3.4. –Average identity 23% to 57%. –Average gappiness 3% to 60%. DNA sequence evolution (Liu et al., unpublished): –Birth-death trees, 25 to 500 taxa. –Two gap length distributions (short and long). –Average p-distance 43% to 63%. –Average gappiness 40% to 80%. *ROSE has limitations!

Non-coding DNA evolution Models 1-4 have “long gaps”, and models 5-8 have “short gaps”

Observations Phylogenetic tree accuracy is positively correlated with alignment accuracy (measured using SP), but the degree of improvement in tree accuracy is much smaller. The best two-phase methods are generally (but not always!) obtained by using either ProbCons or MAFFT, followed by Maximum Likelihood. However, even the best two-phase methods don ’ t do well enough.

Progress so far Experimental evaluation of existing alignment methods (Wang, Leebens-Mack, de Pamphilis and Warnow) - submitted Impact of guide trees (Nelesen, Liu, Linder, and Warnow): Pacific Symp. Biocomputing 2008 Better ways to run POY: Liu, Nelesen, Raghavan, Linder, and Warnow (submitted) SATé: new technique for Simultaneous Alignment and Tree Estimation: Liu, Nelesen, Linder and Warnow (in preparation)

SATe: (Simultaneous Alignment and Tree Estimation) Developers: Warnow, Linder, Liu, and Nelesen. Technique: search through tree/alignment space (align sequences on each tree by heuristically estimating ancestral sequences and compute ML trees on the resultant multiple alignments). SATe returns the alignment/tree pair that optimizes maximum likelihood under GTR+Gamma+I.

Our method (SAT é ) vs. other methods 100 taxon model trees, GTR+Gamma+gap, Long gap models 1-4, short gap models 5-8

Undergraduate Training Two institutions involved: UT-Austin partnership with Huston- Tillotson, and the University of Georgia Training via: –Research projects –Summer training with the project members –Participation in the project meeting –Participation at a conference –Lectures by project participants at the collaborating institutions Focus group leader(s): Jim Leebens-Mack and Randy Linder

Undergraduate Research Programs at the University of Georgia

Louis Stokes Alliance for STEM Research

University of Texas Collaboration with Huston- Tillotson University

Research projects for undergrads Studying the AToL (Assembling the Tree of Life) project datasets: –Produce alignments on each dataset, (using existing alignment methods and our new SATe method), and compute trees on each alignment –Study differences between alignments and between trees Evaluating the simulation software Creating a webpage about alignment research Others?