Chalk Talk Tandy Warnow

Presentation transcript:

Chalk Talk Tandy Warnow Departments of Computer Science and Bioengineering University of Illinois at Urbana-Champaign

The Tree of Life: Multiple Challenges. Large datasets: 100,000+ sequences, 10,000+ genes, “BigData” complexity. Large-scale statistical phylogeny estimation; ultra-large multiple-sequence alignment; estimating species trees from incongruent gene trees; supertree estimation; genome rearrangement phylogeny; reticulate evolution; visualization of large trees and alignments; data mining techniques to explore multiple optima.

The Tree of Life: Multiple Challenges. Large datasets: 100,000+ sequences, 10,000+ genes, “BigData” complexity. Application areas: metagenomics; protein structure and function prediction; trait evolution; detection of co-evolution; systems biology.

The Tree of Life: Multiple Challenges. Large datasets: 100,000+ sequences, 10,000+ genes, “BigData” complexity. Techniques: graph theory (especially chordal graphs); probability theory and statistics; hidden Markov models; combinatorial optimization; heuristics; supercomputing.

Overview. Theory: combining probability theory, graph theory, and optimization. Simulations: evaluating methods under stochastic models of sequence evolution. Biological data analysis: refining methods and enabling discovery. Open source software development. High performance computing. Applications outside biology (e.g., historical linguistics, big data problems in general).

Past Work (highlights): gene tree estimation (theoretical results under stochastic models of sequence evolution); multiple sequence alignment on large datasets, and co-estimation of alignments and trees; phylogenetic networks and species trees from multi-locus datasets; genome rearrangement phylogeny; supertree methods; metagenomics; historical linguistics.

Future work. Theory, methods, and empirical studies for: genome-scale phylogeny estimation addressing multiple sources of gene tree heterogeneity; microbiome analysis; ultra-large multiple sequence alignment and tree estimation. Also, applications of these techniques outside biology.

Current NSF grants. Graph-theoretic methods to improve phylogenomic analyses (joint with Chandra Chekuri and Satish Rao): NSF CCF-1535977. Multiple Sequence Alignment: NSF ABI-1458652. Metagenomics (joint with Mihai Pop and Bill Gropp): NSF grant III:AF:1513629.

Major Areas
Phylogenomics: species tree and network estimation using whole genomes (and gene tree estimation in the context of whole genomes).
Multiple Sequence Alignment: inferring relationships between letters in molecular sequences, especially on very large datasets (up to 1,000,000 sequences).
Metagenomics: analysis of molecular sequences obtained from environmental samples (joint with Mihai Pop and Bill Gropp).
Scaling computationally intensive methods to large datasets: combining discrete math and statistical methods to enable highly accurate analysis of ultra-large datasets (joint with Chandra Chekuri and Satish Rao).

Phylogenomics = Species trees from whole genomes. “Nothing in biology makes sense except in the light of evolution” - Dobzhansky

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Incomplete Lineage Sorting (ILS) Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Main competing approaches (figure labels): gene 1, gene 2, . . . , gene k; Species; Concatenation; Analyze separately . . . ; Summary Method. Note: supertree methods take overlapping trees and produce a tree, and the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Statistical Consistency (figure: error versus data)
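For reference, one standard way to state the property, using notation introduced here only for illustration: $\hat{T}_k$ is the species tree a method estimates from $k$ genes, and $T^{*}$ is the true species tree under the assumed model. A method is statistically consistent if the probability of returning the true tree approaches 1 as the amount of data grows:

```latex
\lim_{k \to \infty} \Pr\bigl[\hat{T}_k = T^{*}\bigr] = 1
```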

Constrained MQST (Maximum Quartet Support Tree)
Input: Set T = {t1, t2, …, tk} of unrooted gene trees, with each tree on set S of n species, and set X of allowed bipartitions.
Output: Unrooted tree T* on leaf set S, maximizing the total quartet tree similarity to T, subject to T* drawing its bipartitions from X.
Theorems (Mirarab et al., 2014): If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the multi-species coalescent (MSC). The constrained MQST problem can be solved in O(|X|^2 nk) time. (We use dynamic programming, and build the unrooted tree from the bottom up, based on “allowed clades”: halves of the allowed bipartitions.)
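As a rough illustration of the quantity being maximized, the following Python sketch counts how many quartet topologies a candidate species tree shares with each gene tree. It is a brute-force count over all four-taxon subsets, not the dynamic programming algorithm referred to in the theorem. Representing an unrooted tree by one side of each nontrivial bipartition, and the function names, are assumptions made for this example.

```python
from itertools import combinations

def quartet_topology(splits, quartet):
    """Return the 2-vs-2 pairing a tree induces on four taxa, or None if unresolved.

    `splits` holds one side of each nontrivial bipartition of an unrooted tree.
    """
    q = set(quartet)
    for side in splits:
        inside = q & set(side)
        if len(inside) == 2:
            # This bipartition separates the four taxa two against two.
            return frozenset([frozenset(inside), frozenset(q - inside)])
    return None

def quartet_support(candidate, gene_trees, taxa):
    """Total number of (quartet, gene tree) pairs on which the candidate agrees."""
    total = 0
    for quartet in combinations(sorted(taxa), 4):
        induced = quartet_topology(candidate, quartet)
        if induced is None:
            continue
        for gene_tree in gene_trees:
            if quartet_topology(gene_tree, quartet) == induced:
                total += 1
    return total

# Toy example on five taxa (hypothetical data).
taxa = {"A", "B", "C", "D", "E"}
candidate = [{"A", "B"}, {"D", "E"}]   # ((A,B),C,(D,E))
gene_tree_1 = [{"A", "B"}, {"D", "E"}]  # same topology as the candidate
gene_tree_2 = [{"A", "C"}, {"D", "E"}]  # ((A,C),B,(D,E))
score = quartet_support(candidate, [gene_tree_1, gene_tree_2], taxa)
```

In this toy case the candidate agrees with the first gene tree on all five quartets and with the second on three, so `score` is 8. Enumerating all quartets is only practical for small inputs.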

ASTRAL is fairly robust to HGT + ILS Davidson et al., RECOMB-CG, BMC Genomics 2015

Contributions (sample)
Methods for estimating species trees from genome-scale data:
  ASTRAL (Mirarab et al., Bioinformatics 2014, 2015) and ASTRID (Vachaspati and Warnow, BMC Genomics 2015): polynomial time methods that are statistically consistent under the MSC. Both can analyze very large datasets (1000 species and 1000 genes, or more) with high accuracy.
  Statistical binning (Mirarab et al., Science 2014; Bayzid et al., PLOS One 2015) can reduce gene tree estimation error and lead to improved species tree estimates (topology, branch lengths, and incidence of false positives).
  BBCA (Zimmermann et al., BMC Genomics 2014) enables Bayesian co-estimation methods to scale to large numbers of genes.
  DCM-boosting (Bayzid et al., BMC Genomics 2014) enables computationally intensive methods to scale to large numbers of species.
Mathematical theory:
  Roch and Warnow (Systematic Biology 2015), regarding statistical consistency under the MSC given finite length sequences.
  Uricchio et al. (BMC Bioinformatics 2016), on the number of loci needed to recover all the splits with high probability.
Biological data analyses:
  Avian phylogenomics project (Jarvis, Mirarab et al., Science 2014).
  Thousand Plant Transcriptome Project (Wickett, Mirarab et al., PNAS 2014).
  Mammalian phylogeny (Tarver et al., Genome Biology and Evolution 2016).

Metagenomic taxonomic identification and phylogenetic profiling Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Basic Questions
1. What is this fragment? (Classify each fragment as well as possible.)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?

This talk. SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs). Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014).

Phylogenetic Placement
Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene).
Output: Placement of query sequences on backbone tree.
Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.
Notes: The two basic steps in phylogenetic placement are similar to the two-phase methods we typically use: align, and then place the fragment. The difference is that we align each fragment independently to the backbone alignment, and then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a profile hidden Markov model for the backbone alignment and then aligns the query sequence to the model, and PaPaRa, which estimates ancestral parsimony state vectors for each edge in the reference tree, aligns the query sequence against each state vector, and selects the best-scoring alignment. The methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment.

Marker-based Taxon Identification
Fragmentary sequences from some gene: ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT
Full-length sequences for same gene, and an alignment and a tree: AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

Align Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC
(Backbone tree leaves: S1 S2 S3 S4)

Align Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
(Backbone tree leaves: S1 S2 S3 S4)

Place Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
(Tree leaves after placement: S1 S2 S3 S4 Q1)
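A small Python check of the align step, using the sequences from the slides above. It verifies two properties one would expect of an extended alignment: the aligned query row has the same length as every backbone row, and removing its gaps gives back the original query. The function name is an assumption made for this sketch; it does not compute an alignment or a placement.

```python
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
query = "TAAAAC"
aligned_query = "-------T-A--AAAC--------"

def is_valid_extension(backbone, query, aligned_query):
    """Check that aligned_query fits the backbone and still spells the query."""
    row_lengths = {len(row) for row in backbone.values()}
    same_length = row_lengths == {len(aligned_query)}
    same_letters = aligned_query.replace("-", "") == query
    return same_length and same_letters

ok = is_valid_extension(backbone, query, aligned_query)
```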

Phylogenetic Placement
Align each query sequence to backbone alignment: HMMALIGN (Eddy, Bioinformatics 1998); PaPaRa (Berger and Stamatakis, Bioinformatics 2011).
Place each query sequence into backbone tree: pplacer (Matsen et al., BMC Bioinformatics, 2011); EPA (Berger and Stamatakis, Systematic Biology 2011).
Note: pplacer and EPA use maximum likelihood.

HMMER vs. PaPaRa Alignments (figure; axis label: increasing rate of evolution)

One Hidden Markov Model for the entire alignment?

Or 2 HMMs? HMM 1 HMM 2

Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4. Note: the bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence.

SEPP Parameter Exploration. Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP. The 10% rule (subset sizes 10% of backbone) had the best overall performance.
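A minimal Python sketch of two ideas from the preceding slides, under stated assumptions. The 10% rule is read here as capping each subset at 10% of the number of backbone sequences, and a query is assumed to be assigned to whichever HMM in the ensemble gives it the highest bit score. The function names and the example scores are hypothetical; real bit scores would come from an HMM tool, and this is not SEPP's code.

```python
def subset_size(backbone_size, percent=10):
    """Largest subset allowed under a percent-of-backbone rule (at least 1)."""
    return max(1, backbone_size * percent // 100)

def best_hmm(bit_scores):
    """Name of the HMM with the highest bit score for one query."""
    return max(bit_scores, key=bit_scores.get)

# Hypothetical bit scores of one query against an ensemble of four HMMs.
scores = {"HMM 1": 12.5, "HMM 2": 31.0, "HMM 3": -4.2, "HMM 4": 8.9}
chosen = best_hmm(scores)
size = subset_size(1000)
```

Because a bit score depends only on the profile HMM and the target sequence, scores from different HMMs in the ensemble can be compared directly for the same query.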

SEPP (10%-rule) on simulated data (figure; axis label: increasing rate of evolution)

TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) is a marker-based method that only characterizes those reads that map to MetaPhyler’s marker genes.
TIPP pipeline:
  Uses BLAST to assign reads to marker genes
  Computes UPP/PASTA reference alignments
  Uses reference taxonomies, refined to binary trees using reference alignment
  Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree

Abundance Profiling. Objective: distribution of the species (or genera, or families, etc.) within the sample. For example, the distribution of the sample at the species level is: 50% species A, 20% species B, 15% species C, 14% species D, 1% species E.
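To make the objective concrete, here is a small Python sketch that turns per-read species labels into a percentage profile. It assumes every read has already been assigned a label by some classifier and that unclassified reads were left out; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def abundance_profile(labels):
    """Map each taxon to its percentage of the classified reads."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}

# Toy input reproducing the example distribution from the slide.
labels = ["A"] * 50 + ["B"] * 20 + ["C"] * 15 + ["D"] * 14 + ["E"] * 1
profile = abundance_profile(labels)
```

The same function works at any taxonomic rank, provided the labels are given at that rank.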

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP vs. other abundance profilers. TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates). Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

SEPP and eHMMs. An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to (1) detect homology between full-length sequences and fragmentary sequences and (2) add fragmentary sequences into an existing alignment, especially when there are many indels and/or substitutions.

Our Publications using eHMMs
S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258.
N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555.
N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124.
N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17(Suppl 10):765.
All codes are available in open source form at https://github.com/smirarab/sepp

Computational Phylogenomics. NP-hard problems; large datasets; complex statistical estimation problems; metagenomics; protein structure and function prediction; medical forensics; systems biology; population genetics.

Future Work - Phylogenomics. Better theory, addressing impact of gene tree estimation error and missing data; fast genome-scale phylogenetic tree estimation (high performance computing, statistically-based estimation taking multiple sources of discord into account); phylogenetic network construction on large datasets (statistical methods within divide-and-conquer framework); better statistical models of sequence evolution, addressing heterotachy; co-estimation of gene trees and species trees/networks.

Future work - Metagenomics. Improved marker-based analyses, and addressing gene tree heterogeneity; rigorous methods for detecting novel genes and species; high throughput analysis with high sensitivity; metagenome assembly; HPC implementations; collaborations with biologists and biomedical researchers.

Future work - Multiple Sequence Alignment. Improved large-scale MSA (e.g., PASTA and UPP); extending statistical co-estimation of trees and MSA to large datasets (e.g., Nute and Warnow 2016); efficient and useful sampling of MSAs; MSA estimation in the presence of duplications and rearrangements (e.g., whole genome alignment); better HMM+phylogeny models that are useful for estimating alignments and trees.

Future work - Theory. Basic algorithmic challenges: supertrees; computing trees from distance matrices; using chordal graphs for divide-and-conquer; consensus trees. Applied probability: trade-off between data quality and quantity (e.g., statistical binning); identifiability of tree models with noisy data; understanding ensembles of HMMs.