TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles

Tandy Warnow and Erin Molloy
Department of Computer Science
The University of Illinois at Urbana-Champaign

Metagenomic taxonomic identification and phylogenetic profiling
Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Basic Questions
1. What is this fragment? (Classify each fragment as well as possible.)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?

This talk
SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs)
TIPP (Bioinformatics 2014): Applications of the eHMM technique to metagenomic abundance classification
Both available at https://github.com/smirarab/sepp

Phylogenetic Placement
Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene)
Output: Placement of query sequences on backbone tree
Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.
Notes: The two basic steps in phylogenetic placement are similar to the two-phase methods we typically use: align the fragment, then place it. The difference is that we align each fragment independently to the backbone alignment, and then use each extended alignment to place that fragment independently into the backbone tree. Two methods for aligning the query sequences are HMMALIGN, which estimates a profile hidden Markov model for the backbone alignment and then aligns the query sequence to the model, and PaPaRa, which estimates ancestral parsimony state vectors for each edge in the reference tree, aligns the query sequence against each state vector, and selects the best-scoring alignment. The two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment.

Marker-based Taxon Identification
Fragmentary sequences from some gene: ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT
Full-length sequences for same gene, and an alignment and a tree: AGG...GCAT, TAGC...CCA, TAGA...CTT, AGC...ACA, ACT..TAGA..A

Input
[Figure: backbone tree with leaves S1, S2, S3, S4]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC

Align Sequence
[Figure: backbone tree with leaves S1, S2, S3, S4]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
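
The extended alignment on this slide can be sanity-checked mechanically: the new row must have the same width as the backbone rows, and removing its gaps must give back the original query. The following is a minimal sketch of that check; the function names are hypothetical and are not part of HMMER or SEPP.

```python
GAP = "-"

def degap(row):
    """Remove gap characters from an aligned row."""
    return row.replace(GAP, "")

def is_valid_extension(backbone, query_row, query):
    """True if query_row has the backbone's width and spells out query."""
    widths = {len(row) for row in backbone.values()}
    return (
        len(widths) == 1                 # backbone rows all have one width
        and len(query_row) in widths     # the new row has that same width
        and degap(query_row) == query    # no residues were lost or changed
    )

# Backbone alignment and extended row copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
q1_row = "-------T-A--AAAC--------"

print(is_valid_extension(backbone, q1_row, "TAAAAC"))
```

The check says nothing about whether the alignment is a good one; it only confirms that the row is a legal extension of the backbone.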

Place Sequence
[Figure: backbone tree with leaves S1, S2, S3, S4, and Q1 added to the tree]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Phylogenetic Placement in 2011
Align each query sequence to backbone alignment
  HMMER (Finn et al., NAR 2011)
  PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
Place each query sequence into backbone tree
  pplacer (Matsen et al., BMC Bioinformatics 2010)
  EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA solve the same problem (maximum likelihood placement under standard sequence evolution models)

HMMER vs. PaPaRa Alignments
[Figure: x-axis shows increasing rate of evolution]

Align Sequence using HMMER
[Figure: backbone tree with leaves S1, S2, S3, S4]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Place Sequence using pplacer
[Figure: backbone tree with leaves S1, S2, S3, S4, and Q1 added to the tree]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

What is HMMER+pplacer?
HMMER (Finn et al., NAR 2011) is used to add the read s into the backbone alignment, thus producing an “extended alignment”. HMMER is based on profile Hidden Markov Models (profile HMMs).
pplacer (Matsen et al., BMC Bioinformatics 2010) is used to add the read s into the best location in the tree T. pplacer is based on phylogenetic sequence evolution models (e.g., GTR), and uses maximum likelihood.

A general topology for a profile HMM From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Profile Hidden Markov Models
Profile HMMs are probabilistic generative models to represent multiple sequence alignments.
The HMMER software suite can
  Build a profile HMM given a multiple sequence alignment A
  Use the profile HMM to add a sequence s into A, and return the “probability” that the HMM generated s
In other words, profile HMMs can be used to compute extended alignments, and score them!

Align Sequence using HMMER
[Figure: backbone tree]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
Build a profile HMM for the backbone alignment.
Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment.
Note the maximum likelihood score for the alignment!

What is pplacer?
pplacer: software developed by Erick Matsen and colleagues. See http://matsen.fhcrc.org/pplacer/
Input: read s, alignment A (on S and s), tree T on S
Output:
  “Best” location to add s in T (under maximum likelihood).
  For every edge e in T, the value p(e) for the probability of s being placed on e (these probabilities add up to 1).

Place Sequence using pplacer
[Figure: backbone tree]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
For every edge e in T, let Te be the tree created by adding Q1 to that edge.
Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
Return the Te that has the best ML score.
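
The "for every edge in T" step can be made concrete with a small enumeration. The sketch below is illustrative only and is not pplacer's implementation: it assumes an unrooted tree written as a nested tuple whose top level is an internal node, and the topology used for S1 to S4 is hypothetical, since the slide's tree figure is not recoverable from the transcript.

```python
def attach_on_edge_above(node, query):
    """Yield every replacement for `node` with `query` attached on the edge
    above it or on any edge inside its subtree."""
    # Attach the query on the edge directly above this node.
    yield (node, query)
    # Recurse into the children of an internal node (leaves are strings).
    if isinstance(node, tuple):
        for i, child in enumerate(node):
            for new_child in attach_on_edge_above(child, query):
                yield node[:i] + (new_child,) + node[i + 1:]

def candidate_trees(tree, query):
    """Yield one tree Te per edge e of the unrooted tree `tree`."""
    # The top-level tuple is an internal node with no edge above it,
    # so only the edges leading to its children are visited.
    for i, child in enumerate(tree):
        for new_child in attach_on_edge_above(child, query):
            yield tree[:i] + (new_child,) + tree[i + 1:]

# Hypothetical backbone topology: S1 and S2 on one side, S3 and S4 on the other.
backbone_tree = ("S1", "S2", ("S3", "S4"))
trees = list(candidate_trees(backbone_tree, "Q1"))
for t in trees:
    print(t)
```

A four-leaf unrooted binary tree has five edges, so five candidate trees are produced. Scoring each candidate under a sequence evolution model is the expensive part and is not shown here.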

Place Sequence using pplacer
[Figure: backbone tree with edge probabilities p(e) shown on its edges: 0.4, 0.03, 0.05, 0.5, 0.02]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
For every edge e in T, let Te be the tree created by adding Q1 to that edge.
Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
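
One simple way to turn per-edge ML scores into probabilities that sum to 1 is to normalize the likelihoods across edges. The sketch below assumes that convention (each edge weighted by its likelihood divided by the total over all edges) and uses made-up log-likelihood values; it is not taken from pplacer's source, and the edge names are hypothetical.

```python
import math

def placement_probabilities(log_likelihoods):
    """Normalize per-edge log-likelihoods into probabilities p(e)."""
    # Subtract the maximum before exponentiating to avoid underflow,
    # since log-likelihoods of real alignments are large negative numbers.
    top = max(log_likelihoods.values())
    weights = {e: math.exp(ll - top) for e, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

# Hypothetical ML scores (log-likelihoods) for the five edges of a 4-leaf tree.
scores = {"e1": -1200.2, "e2": -1202.8, "e3": -1202.3, "e4": -1200.0, "e5": -1203.2}

probs = placement_probabilities(scores)
best_edge = max(probs, key=probs.get)
print(best_edge, probs)
```

The edge with the best ML score always receives the largest probability, so returning the best-scoring Te and returning the most probable edge are the same choice under this convention.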

One Hidden Markov Model for the entire alignment? HMM 1

One HMM works beautifully for small-diameter trees

One HMM works poorly for large-diameter trees

One Hidden Markov Model for the entire alignment?

Or 2 HMMs? HMM 1 HMM 2

Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4
Note: the bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence.

SEPP Design
To insert query sequence Q1 into backbone tree T:
  Represent the backbone MSA with a collection of profile HMMs, based on maximum “alignment subset size”
  Score Q1 against every profile HMM in the collection
  The best scoring HMM is used to compute the extended alignment
  Use pplacer to add Q1 into tree T (restricted to “subtree” based on maximum “placement subset size”)
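
The first step, splitting the backbone into subsets no larger than the alignment subset size, can be sketched as a recursive tree decomposition. The version below cuts the most balanced edge until every piece is small enough. The choice of the balanced (centroid-style) edge is an assumption made for illustration, since the slide does not say how the subsets are chosen, and the tree, node names, and function names are hypothetical.

```python
from collections import defaultdict

def side_of(adj, start, avoid):
    """Nodes reachable from `start` without crossing the edge (start, avoid)."""
    seen = {start, avoid}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    seen.discard(avoid)
    return seen

def decompose(nodes, edges, leaf_names, max_size):
    """Split a tree into leaf subsets with at most `max_size` leaves each."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    leaves = nodes & leaf_names
    if len(leaves) <= max_size:
        return [sorted(leaves)]
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Pick the edge whose removal splits the leaves most evenly.
    best = None
    for u, v in edges:
        u_side = side_of(adj, u, v)
        imbalance = abs(2 * len(u_side & leaf_names) - len(leaves))
        if best is None or imbalance < best[0]:
            best = (imbalance, u_side)
    u_side = best[1]
    v_side = nodes - u_side

    def inside(part):
        return [(a, b) for a, b in edges if a in part and b in part]

    return (decompose(u_side, inside(u_side), leaf_names, max_size)
            + decompose(v_side, inside(v_side), leaf_names, max_size))

# Hypothetical 4-leaf backbone tree with internal nodes "a" and "b".
edges = [("S1", "a"), ("S2", "a"), ("a", "b"), ("b", "S3"), ("b", "S4")]
nodes = {"S1", "S2", "S3", "S4", "a", "b"}
leaf_names = {"S1", "S2", "S3", "S4"}

subsets = decompose(nodes, edges, leaf_names, max_size=2)
print(subsets)
```

Each resulting subset would get its own profile HMM, built from the rows of the backbone alignment for those sequences. Because the cut edges need not isolate clades of a rooted tree, the subsets are not required to be clade-based.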

SEPP Parameter Exploration
Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP:
  Small alignment subset sizes best
  Large placement subset size best
10% rule (both subset sizes 10% of backbone) had best overall performance

SEPP (10%-rule) on simulated data
[Figure: x-axis shows increasing rate of evolution]

SEPP and eHMMs
An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to
  detect homology between full-length sequences and fragmentary sequences
  add fragmentary sequences into an existing alignment
especially when there are many indels and/or substitutions.

Metagenomic Taxon Identification
Objective: classify short reads in a metagenomic sample

Abundance Profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species-level is:
  50% species A
  20% species B
  15% species C
  14% species D
  1% species E
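
Given one taxonomic label per classified read, a profile like the species-level example on this slide is a normalized count. The sketch below assumes unclassified reads are marked with None and left out of the denominator, which is a simplifying assumption for illustration and not necessarily how TIPP reports abundances.

```python
from collections import Counter

def abundance_profile(labels):
    """Return the percentage of classified reads assigned to each taxon."""
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}

# Hypothetical per-read labels: 100 classified reads plus 10 unclassified.
labels = (["species A"] * 50 + ["species B"] * 20 + ["species C"] * 15
          + ["species D"] * 14 + ["species E"] * 1 + [None] * 10)

profile = abundance_profile(labels)
for taxon, pct in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{pct:.0f}% {taxon}")
```

The same function works at any rank (genus, family, and so on) if the labels are first mapped to that rank.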

TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) is a marker-based method that only characterizes those reads that map to MetaPhyler’s marker genes.
TIPP pipeline:
  Uses BLAST to assign reads to marker genes
  Computes UPP/PASTA reference alignments
  Uses reference taxonomies, refined to binary trees using the reference alignment
  Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree:
    Can consider more than one extended alignment
    Can consider more than one placement in the tree for each extended alignment
    Assigns taxonomic label based on MRCA of all selected placements for all selected extended alignments

TIPP Design (Step 4)
Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%), and placement support threshold (default 95%)
For each marker gene, and its associated bin of reads:
  Build an eHMM to represent the MSA
  For each read:
    Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold.
    For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood, and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note: this will be a single clade if the support threshold is strictly greater than 50%.)
    Taxonomically characterize each read at the MRCA of these clades.
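
The per-read logic above (collect candidates until a support threshold is met, then label the read at their MRCA) can be sketched in a few lines. This is one reading of the slide, not TIPP's code: it assumes candidates are taken in decreasing order of probability until their cumulative probability reaches the threshold, and it represents each candidate by a root-to-tip lineage. All names, probabilities, and lineages are hypothetical.

```python
def select_by_support(probabilities, threshold):
    """Take the most probable items until their total reaches `threshold`."""
    chosen, total = [], 0.0
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    for item, p in ranked:
        chosen.append(item)
        total += p
        if total >= threshold:
            break
    return chosen

def mrca(lineages):
    """Longest shared prefix of root-to-tip lineages."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared

# Hypothetical placements of one read, with probabilities summing to 1.
placement_probs = {"x": 0.6, "y": 0.3, "z": 0.08, "w": 0.02}
# Hypothetical lineages (domain, phylum, class, order) for each placement.
lineage = {
    "x": ["domainD", "phylumP1", "classC1", "orderO1"],
    "y": ["domainD", "phylumP1", "classC1", "orderO2"],
    "z": ["domainD", "phylumP1", "classC2", "orderO3"],
    "w": ["domainD", "phylumP2", "classC3", "orderO4"],
}

selected = select_by_support(placement_probs, threshold=0.95)
label = mrca([lineage[p] for p in selected])
print(selected, label)
```

Raising the threshold pulls in more candidates and so tends to push the label toward higher, less specific ranks; lowering it gives more specific labels backed by less support. In this example a 95% threshold labels the read at the phylum, while a 50% threshold would keep only placement "x" and label it at the order.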

Abundance Profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
We compared TIPP to
  PhymmBL (Brady & Salzberg, Nature Methods 2009)
  NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
  MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
  MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
  mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).
Marker genes are single-copy, universal, and resistant to horizontal transmission.

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP vs. other abundance profilers TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates). Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

Still to do
Evaluate TIPP in comparison to newer methods (e.g., Kraken)
Evaluate TIPP with respect to taxonomic identification and identification of novel taxa
Update TIPP’s design!

HIPPI: HIerarchical Profile HMMs for Protein family Identification
Nguyen, Nute, Mirarab, and Warnow, RECOMB-CG and BMC Genomics 2016
Uses an ensemble of HMMs to classify protein sequences
Tested on HMMER

Protein Family Assignment
Input: new AA sequence (might be fragmentary) and database of protein families (e.g., PFAM)
Output: assignment (if justified) of the sequence to an existing family in the database

TIPPI: Replacing BLAST by HIPPI within TIPP

Our Publications using eHMMs
S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258.
N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555.
N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny-aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124.
N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17 (Suppl 10):765.
All code is available in open source form at https://github.com/smirarab/sepp

Summary
Using an ensemble of HMMs tends to improve accuracy, at a cost in running time.
Applications so far: taxonomic placement (SEPP), multiple sequence alignment (UPP), and protein family classification (HIPPI).
Improvements are mostly noticeable for large, diverse datasets.
Phylogenetically-based construction of the ensemble helps accuracy (note: the decompositions we produce are not clade-based), but the design and use of these ensembles is still in its infancy. (Many relatively similar approaches have been used by others, including FlowerPower by Sjolander.)
The basic idea can be used with any kind of probabilistic model; it doesn’t have to be restricted to profile HMMs.
Basic question: why does it help?

Using SEPP
SEPP algorithmic parameters:
  Alignment subset size (how many sequences for each profile HMM in the ensemble?)
  Placement subset size (how much of the tree to search for optimal placement?)
Default settings are acceptable, but you can improve accuracy by increasing placement subset size (and maybe decreasing alignment subset size), for a running time cost.

Using TIPP
TIPP algorithmic parameters (other than SEPP parameters):
  Reference markers
  Alignment threshold (default 95%)
  Placement threshold (default 95%)
The alignment threshold and placement threshold were optimized for abundance profiling, not for Taxon ID.

Acknowledgments
PhD students: Nam Nguyen (now postdoc at UCSD), Siavash Mirarab (now faculty at UCSD), and Bo Liu (now at Square)
Mihai Pop, University of Maryland
NSF grants to TW: DBI:1062335, DEB 0733029, III:AF:1513629
NIH grant to MP: R01-A1-100947
Also: Guggenheim Foundation Fellowship (to TW), Microsoft Research New England (to TW), David Bruton Jr. Centennial Professorship (to TW), Grainger Foundation (to TW), HHMI Predoctoral Fellowship (to SM)
TACC, UTCS, and UIUC computational resources