CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Profile HMMs Tandy Warnow BioE/CS 598AGB. Profile Hidden Markov Models Basic tool in sequence analysis Look more complicated than they really are Used.
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Computing the Tree of Life The University of Texas at Austin Department of Computer Sciences Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
New techniques that “boost” methods for large-scale multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
SuperFine, Enabling Large-Scale Phylogenetic Estimation Shel Swenson University of Southern California and Georgia Institute of Technology.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
394C, October 2, 2013 Topics: Multiple Sequence Alignment
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
SupreFine, a new supertree method Shel Swenson September 17th 2009.
Family of HMMs Nam Nguyen University of Texas at Austin.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.
Absolute Fast Converging Methods CS 598 Algorithmic Computational Genomics.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
CS 466 and BIOE 498: Introduction to Bioinformatics
Advances in Ultra-large Phylogeny Estimation
Multiple Sequence Alignment Methods
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
CS 581 Algorithmic Computational Genomics
Tandy Warnow Founder Professor of Engineering
New methods for simultaneous estimation of trees and alignments
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
CS 394C: Computational Biology Algorithms
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
Tandy Warnow The University of Texas at Austin
New methods for simultaneous estimation of trees and alignments
Presentation transcript:

CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012

Biology: 21st Century Science! “When the human genome was sequenced seven years ago, scientists knew that most of the major scientific discoveries of the 21st century would be in biology.” January 1, 2008, guardian.co.uk

Genome Sequencing Projects: Started with the Human Genome Project

Whole Genome Shotgun Sequencing: Graph Algorithms and Combinatorial Optimization!

The 1000 Genome Project: using human genetic variation to better treat diseases Where did humans come from, and how did they move throughout the globe?

Other Genome Projects! (Neandertals, Wooly Mammoths, and more ordinary creatures…)

Metagenomics: C. Ventner et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

How did life evolve on earth? Courtesy of the Tree of Life project Current methods often use months to estimate trees on 1000 DNA sequences Our objective: More accurate trees and alignments on 500,000 sequences in under a week We prove theorems using graph theory and probability theory, and our algorithms are studied on real and simulated data.

This course Fundamental mathematics of phylogeny and alignment estimation Applied research problems: –Metagenomics –Simultaneous estimation of alignments and trees –Ultra-large alignment and tree estimation –Phylogenomics –De novo genome assembly –Historical linguistics

Phylogenetic trees can be based upon morphology

But some estimations need DNA! OrangutanGorillaChimpanzeeHuman

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

1.Polynomial time distance-based methods (e.g., Neighbor- Joining) 2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Phylogenetic reconstruction methods Phylogenetic trees Cost Global optimum Local optimum 3.Bayesian methods

The neighbor joining method has high error rates on large trees Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] NJ No. Taxa Error Rate

And solving NP-hard optimization problems in phylogenetics is … unlikely # of Taxa # of Unrooted Trees x x x

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion …ACCAGTCACCA…

UVWXYUVWXY U VW X Y AGTGGAT TATGCCCA TATGACTT AGCCCTA AGCCCGCTT

…ACGGTGCAGTTACCA… …ACCAGTCACCA… Mutation Deletion The true pairwise alignment is: …ACGGTGCAGTTACCA… …AC----CAGTCACCA… The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 2: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

Simulation Studies S1S2 S3S4 S1 = - AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = - AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA Compare True tree and alignment S1S4 S3S2 Estimated tree and alignment Unaligned Sequences

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Problems Large datasets with high rates of evolution are hard to align accurately, and phylogeny estimation methods produce poor trees when alignments are poor. Many phylogeny estimation methods have poor accuracy on large datasets (even if given correct alignments) Potentially useful genes are often discarded if they are difficult to align. These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)

Major Challenges Current phylogenetic datasets contain hundreds to thousands of taxa, with multiple genes. Future datasets will be substantially larger (e.g., iPlant plans to construct a tree on 500,000 plant species) Current methods have poor accuracy or cannot run on large datasets.

Theoretical Challenges: NP-hard problems Model violations The Tree of Life Empirical Challenges: Alignment estimation Data insufficient OR too much data Heuristics insufficient

Phylogenetic “boosters” (meta-methods) Goal: improve accuracy, speed, robustness, or theoretical guarantees of base methods Examples: DCM-boosting for distance-based methods (1999) DCM-boosting for heuristics for NP-hard problems (1999) SATé-boosting for alignment methods (2009) SuperFine-boosting for supertree methods (2011) DACTAL-boosting for all phylogeny estimation methods (2011) SEPP-boosting for metagenomic analyses (2011)

Disk-Covering Methods (DCMs) (starting in 1998)

DCMs “boost” the performance of phylogeny reconstruction methods. DCM Base method MDCM-M

The neighbor joining method has high error rates on large trees Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] NJ No. Taxa Error Rate

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] Theorem: DCM1-NJ converges to the true tree from polynomial length sequences NJ DCM1-NJ No. Taxa Error Rate

SATé: Simultaneous Alignment and Tree Estimation (Liu et al., Science 2009, and Liu et al. Systematic Biology, in press) DACTAL: Divide-and-Conquer Trees (Almost) without alignments (Nelesen et al., submitted) SEPP: SATé-enabled Phylogenetic Placement (Mirarab, Nguyen and Warnow, to appear, PSB 2012) Other “boosters”

SATé Algorithm (Liu et al. Science 2009) T A Use new tree (T) to compute new alignment (A) Estimate ML tree on new alignment Obtain initial alignment and estimated ML tree T SATé = Simultaneous Alignment and Tree Estimation

A B D C Merge subproblems Estimate ML tree on merged alignment Decompose based on input tree AB CD Align subproblems AB CD ABCD One SATé iteration (really 32 subsets) e

Results on 1000-taxon datasets 24 hour SATé analysis Other simultaneous estimation methods cannot run on large datasets

Limitations of SATé A B D C Merge sub- alignments Estimate ML tree on merged alignment Decompose dataset AB CD Align subproblem s AB CD ABCD

Part II: DACTAL (Divide-And-Conquer Trees (Almost) without alignments) Input: set S of unaligned sequences Output: tree on S (but no alignment) (Nelesen, Liu, Wang, Linder, and Warnow, submitted)

DACTAL New supertree method: SuperFine Existing Method: RAxML(MAFFT) pRecDCM3 BLAST- based Overlapping subsets A tree for each subset Unaligned Sequences A tree for the entire dataset

Average of 3 Largest CRW Datasets CRW: Comparative RNA database, Three 16S datasets with 6,323 to 27,643 sequences Reference alignments based on secondary structure Reference trees are 75% RAxML bootstrap trees DACTAL (shown in red) run for 5 iterations starting from FT(Part) FastTree (FT) and RAxML are ML methods

Observations DACTAL gives more accurate trees than all other methods on the largest datasets DACTAL is much faster than SATé (and can analyze datasets that SATé cannot) DACTAL is robust to starting trees and other algorithmic parameters

Taxon Identification in Metagenomics Input: set of shotgun sequences (very short) Output: a tree on the set of sequences, indicating the species identification of each sequence Issues: the sequences are not globally alignable, they are very short, and there are millions of them

Phylogenetic Placement Input: Backbone alignment and tree on full-length sequences, and a set of query sequences (short fragments) Output: Placement of query sequences on backbone tree Applications: taxon identification of metagenomic data, phylogenetic analyses of NGS data.

Align Sequence S1 S4 S2 S3 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

Align Sequence S1 S4 S2 S3 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC

Place Sequence S1 S4 S2 S3 Q1 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC

HMMER vs. PaPaRa Alignments Increasing rate of evolution 0.0

Divide-and-conquer with HMMER+pplacer

SEPP (10%-rule) on simulated data 0.0 Increasing rate of evolution

Historical linguistics Languages evolve, just like biological species. How can we determine how languages evolve? How can we use information on language evolution, to determine how human populations moved across the globe?

Questions about Indo-European (IE) How did the IE family of languages evolve? Where is the IE homeland? When did Proto-IE “end”? What was life like for the speakers of proto- Indo-European (PIE)?

Estimating the date and homeland of the proto-Indo-Europeans Step 1: Estimate the phylogeny Step 2: Reconstruct words for proto- Indo-European (and for intermediate proto-languages) Step 3: Use archaeological evidence to constrain dates and geographic locations of the proto-languages

“Perfect Phylogenetic Network” (Nakhleh et al., Language)

Reticulate evolution Not all evolution is tree-like: –Horizontal gene transfer –Hybrid speciation How can we detect reticulate evolution?

Course Details Phylogeny and multiple sequence alignment are the basis of almost everything in the course The first 1/3 of the class will provide the basics of the material The next 2/3 will go into depth into selected topics

Course details There is no textbook; I will provide notes. Homeworks: basic material and critical review of papers from the scientific literature Course project: either a research project (two students per project) or a literature survey (one student per project). The best projects should be submitted for publication in a journal or conference. Final exam: comprehensive, take home.

Grading Homework: 20% Class participation: 20% Final exam: 30% Class project: 30%

Combined Analysis Methods gene 1 S1S1 S2S2 S3S3 S4S4 S7S7 S8S8 TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC TATAACGGAA gene 3 TATTGATACA TCTTGATACC TAGTGATGCA CATTCATACC TAGTGATGCA S1S1 S3S3 S4S4 S7S7 S8S8 gene 2 GGTAACCCTC GCTAAACCTC GGTGACCATC GCTAAACCTC S4S4 S5S5 S6S6 S7S7

Combined Analysis gene 1 S1S1 S2S2 S3S3 S4S4 S5S5 S6S6 S7S7 S8S8 gene 2gene 3 TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC TATAACGGAA GGTAACCCTC GCTAAACCTC GGTGACCATC GCTAAACCTC TATTGATACA TCTTGATACC TAGTGATGCA CATTCATACC TAGTGATGCA ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

... Analyze separately Supertree Method Two competing approaches gene 1 gene 2... gene k... Combined Analysis Species