TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Profile HMMs Tandy Warnow BioE/CS 598AGB. Profile Hidden Markov Models Basic tool in sequence analysis Look more complicated than they really are Used.
Lichens and Ascomycota broadly Alternative markers to COI ITS.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
Molecular Evolution Revised 29/12/06
J AMES A. FOSTER And Luke Sheneman 1 October 2008 I NITIATIVE FOR B IOINFORMATICS AND E VOLUTIONARY S TUDIES (IBEST) Guide Trees and Progressive Multiple.
Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Recent breakthroughs in mathematical and computational phylogenetics
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.
Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Computational Phylogenomics and Metagenomics Tandy Warnow Departments of Bioengineering and Computer Science The University of Illinois at Urbana-Champaign.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.
Hidden Markov Model and Its Application in Bioinformatics Liqing Department of Computer Science.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
Family of HMMs Nam Nguyen University of Texas at Austin.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Ultra-large alignments using Ensembles of HMMs Nam-phuong Nguyen Institute for Genomic Biology University of Illinois at Urbana-Champaign.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Canadian Bioinformatics Workshops
MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res
Population sequencing using short reads: HIV as a case study Vladimir Jojic et.al. PSB 13: (2008) Presenter: Yong Li.
Computational Characterization of Short Environmental DNA Fragments Jens Stoye 1, Lutz Krause 1, Robert A. Edwards 2, Forest Rohwer 2, Naryttza N. Diaz.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.
Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.
Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.
TIPP: Taxonomic Identification And Phylogenetic Profiling
Metagenomic Species Diversity.
Introduction to Bioinformatics Resources for DNA Barcoding
Advances in Ultra-large Phylogeny Estimation
Chalk Talk Tandy Warnow
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Multiple Sequence Alignment Methods
Techniques for MSA Tandy Warnow.
Research in Computational Molecular Biology , Vol (2008)
Tandy Warnow The University of Illinois
Large-Scale Multiple Sequence Alignment
TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Tandy Warnow Founder Professor of Engineering
Taxonomic identification and phylogenetic profiling
Ultra-large Multiple Sequence Alignment
Advances in Phylogenomic Estimation
Advances in Phylogenomic Estimation
TIPP and SEPP (plus PASTA)
Nidhi Shah University of Maryland
Scaling Species Tree Estimation to Large Datasets
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign

Metagenomic taxonomic identification and phylogenetic profiling Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic Taxon Identification Objective: classify short reads in a metagenomic sample

1. What is this fragment? (Classify each fragment as well as possible.) 2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.) 3. What are the organisms in this metagenomic sample doing together? Basic Questions

This talk SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs) Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014)

Phylogenetic Placement Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene) Output: Placement of query sequences on backbone tree Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.

Marker-based Taxon Identification ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG. ACCT Fragmentary sequences from some gene Full-length sequences for same gene, and an alignment and a tree

Align Sequence S1 S4 S2 S3 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

Align Sequence S1 S4 S2 S3 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC

Place Sequence S1 S4 S2 S3 Q1 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC

Phylogenetic Placement Align each query sequence to backbone alignment – HMMALIGN (Eddy, Bioinformatics 1998) – PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree – Pplacer (Matsen et al., BMC Bioinformatics, 2011) – EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA use maximum likelihood

HMMER vs. PaPaRa Alignments Increasing rate of evolution 0.0

One Hidden Markov Model for the entire alignment?

HMM 1

Or 2 HMMs? HMM 1 HMM 2

HMM 1 HMM 3 HMM 4 HMM 2 Or 4 HMMs?

SEPP Parameter Exploration  Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP  10% rule (subset sizes 10% of backbone) had best overall performance

SEPP (10%-rule) on simulated data 0.0 Increasing rate of evolution

SEPP and eHMMs An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to detect homology between full length sequences and fragmentary sequences add fragmentary sequences into an existing alignment especially when there are many indels and/or substitutions.

TIPP: high accuracy taxonomic identification of metagenomic data TIPP: taxon identification using phylogeny-aware profiles (Mirarab et al., Bioinformatics, 2014), marker-based method that only characterizes those reads that map to the marker genes Pipeline that: Computes UPP/PASTA reference alignments Uses reference taxonomies Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can be used for taxon identification or abundance profiling Available in open source form on github

Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example: The distribution of the sample at the species-level is: 50% species A 20% species B 15% species C 14% species D 1% species E Abundance Profiling

TIPP Design 1.Uses selected marker genes (we show results using Metaphyler’s 30 marker genes, by Mihai Pop and his group), with reference alignments that we compute using UPP/PASTA. 2.Assigns reads to bins (homologous to marker genes, or unbinned). 3.For each marker gene, and its associated bin of reads: i.Taxonomically characterizes each read in the gene’s bin, by using an eHMM for a reference alignment for the gene, and a reference taxonomy. ii.This produces a set of placements within the taxonomy, each with statistical support, and allows the user to trade-off sensitivity and specificity. iii.The outcome is finally one taxonomic characterization (though not necessarily at the species level) for each read. 4.Pools all the classifications together.

TIPP Design (Step 3) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, and support threshold (typically 95%) For each marker gene, and its associated bin of reads: – Builds eHMM to represent the MSA – For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at this MRCA.

Objective: Distribution of the species (or genera, or families, etc.) within the sample. Leading techniques: PhymmBL (Brady & Salzberg, Nature Methods 2009) NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011) MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard mOTU (Bork et al., Nature Methods 2013) MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes). Marker gene are single-copy, universal, and resistant to horizontal transmission. Abundance Profiling

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP vs. other abundance profilers TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates). Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

Comments Marker-based abundance profiling methods can be more accurate than analyses that classify all the reads. Improved results possible using: – Better sets of marker genes – Better ways of assigning reads to marker genes – Better taxonomies, or use of estimated gene trees and species trees – Better ways of combining classifications from multiple markers

Publications using eHMMs S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012).(PSB 2012) N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow "TIPP:Taxonomic Identification and Phylogenetic Profiling." Bioinformatics, 2014; /bioinformatics.btu721.full.pdf+html /bioinformatics.btu721.full.pdf+html N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow, "Ultra-large alignments using phylogeny aware profiles". Proceedings RECOMB 2015 and Genome Biology (2015) (PDF)(PDF) SEPP, TIPP, and UPP are all available in open source form on github (smirarab/sepp)

Acknowledgments PhD students: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), and Bo Liu (now at Square) Mihai Pop, University of Maryland NSF grants to TW: DBI: , DEB , III:AF: NIH grant to MP: R01-A Also: Guggenheim Foundation Fellowship (to TW), Microsoft Research New England (to TW), David Bruton Jr. Centennial Professorship (to TW), Grainger Foundation (to TW), HHMI Predoctoral Fellowship (to SM) TACC, UTCS, and UIUC computational resources