TIPP and SEPP (plus PASTA)

Slides:

Advertisements

Similar presentations

Profile HMMs Tandy Warnow BioE/CS 598AGB. Profile Hidden Markov Models Basic tool in sequence analysis Look more complicated than they really are Used.

Advertisements

Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.

Molecular Evolution Revised 29/12/06

Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.

CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.

Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,

Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003

Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.

Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.

Bioinformatics 2011 Molecular Evolution Revised 29/12/06.

Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.

Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign

More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.

Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.

TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.

Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.

Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

Family of HMMs Nam Nguyen University of Texas at Austin.

Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.

Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.

TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.

Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.

Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.

Ultra-large alignments using Ensembles of HMMs Nam-phuong Nguyen Institute for Genomic Biology University of Illinois at Urbana-Champaign.

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

Multiple Sequence Alignment with PASTA Michael Nute Austin, TX June 17, 2016.

Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.

Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.

394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.

Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.

CS 466 and BIOE 498: Introduction to Bioinformatics

TIPP: Taxonomic Identification And Phylogenetic Profiling

Introduction to Bioinformatics Resources for DNA Barcoding

Advances in Ultra-large Phylogeny Estimation

Phylogenetic basis of systematics

394C, Spring 2012 Jan 23, 2012 Tandy Warnow.

Chalk Talk Tandy Warnow

Distance based phylogenetics

TIPP: Taxon Identification using Phylogeny-Aware Profiles

Multiple Sequence Alignment Methods

Tandy Warnow Department of Computer Sciences

Techniques for MSA Tandy Warnow.

Algorithm Design and Phylogenomics

Tandy Warnow The University of Illinois

New methods for simultaneous estimation of trees and alignments

Large-Scale Multiple Sequence Alignment

TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles

CS 581 Tandy Warnow.

Dr Tan Tin Wee Director Bioinformatics Centre

CS 581 Algorithmic Computational Genomics

TIPP: Taxon Identification using Phylogeny-Aware Profiles

Tandy Warnow Founder Professor of Engineering

New methods for simultaneous estimation of trees and alignments

Texas, Nebraska, Georgia, Kansas

Ultra-Large Phylogeny Estimation Using SATé and DACTAL

CS 394C: Computational Biology Algorithms

September 1, 2009 Tandy Warnow

Taxonomic identification and phylogenetic profiling

Algorithms for Inferring the Tree of Life

Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.

Tandy Warnow The University of Texas at Austin

New methods for simultaneous estimation of trees and alignments

Ultra-large Multiple Sequence Alignment

Advances in Phylogenomic Estimation

Advances in Phylogenomic Estimation

Presentation transcript:

TIPP and SEPP (plus PASTA) Tandy Warnow Department of Computer Science The University of Illinois at Urbana-Champaign

TIPP https://github.com/smirarab/sepp TIPP (Bioinformatics 2014) performs marker-gene based: taxonomic identification (what is this read?) and metagenomic abundance profiling (at some taxonomic level) TIPP uses PASTA (J. Comp. Biol. 2015) to compute large-scale multiple sequence alignments and phylogenetic trees, and SEPP (Pacific Symposium on Biocomputing 2012) to add short reads into taxonomies (more generally, “phylogenetic placement”) TIPP and SEPP each use an “ensemble of profile Hidden Markov Models” (eHMMs) to obtain high accuracy

A general topology for a profile HMM D: deletion state I: insertion state M: match state (correspond to sites in the alignment) Insertion and Match states emit letters (nucleotides, amino acids, other) from a distribution Edges have transition probabilities A path through the profile HMM (with random selection of letters from D and I states) generates a sequence Given a sequence, you can find the maximum likelihood path through the model in polynomial time (dynamic programming) From DOI:10.1109/ICPR.2004.1334187

https://www. slideshare https://www.slideshare.net/EastBayWPMeetup/custom-post-types-and-custom-taxonomies

Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example: distributions at the species-level True Distribution Estimated Distribution 50% species A 42% species A 20% species B 18% species B 15% species C 11% species C 14% species D 10% species D 1% species E 0% species E 19% unclassified The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignemnt, then aligns query sequence. PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Testing TIPP in Bioinformatics 2014 We compared TIPP to PhymmBL (Brady & Salzberg, Nature Methods 2009) NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011) MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard mOTU (Bork et al., Nature Methods 2013) MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes). Marker gene are single-copy, universal, and resistant to horizontal transmission. The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignemnt, then aligns query sequence. PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP vs. other abundance profilers TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. The other tested methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates). Improved accuracy is due to the use of ensembles of profile Hidden Markov Models (eHMMs); single HMMs do not provide the same advantages, especially in the presence of high indel rates.

This talk Basic concepts Taxonomies, Multiple sequence alignments, and phylogenies Phylogenetic placement and taxonomic ID Ensembles of Hidden Markov Models (eHMMs) PASTA (J. Comp Biol. 2015): Computing alignments and trees on large datasets (used for the reference alignments and trees) SEPP (PSB 2012): SATé-enabled Phylogenetic Placement TIPP (Bioinformatics 2014): Applications of the eHMM technique to (a) taxonomic identification and (b) metagenomic abundance classification After my talk, Mike Nute will teach a tutorial on PASTA, SEPP, and TIPP

Phylogenies and Taxonomies Rooted, labels at every node for each taxonomic level More or less based on phylogenies Phylogenies: Usually unrooted (time-reversible models), but outgroups can be used to root estimated phylogenies Estimated from sequences (usually) Branch lengths reflect amount of change Edges/nodes sometimes given with support (typically bootstrap) The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Phylogeny Estimation AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT U V W X Y

Rooted neighbor-joining 16S rRNA phylogenetic tree of uncultured bacteria https://www.researchgate.net/Rooted-neighbor-joining-16S-rRNA-gene-based-phylogenetic-tree-of-uncultured-bacteria-The_fig1_279565247 [accessed 26 Jul, 2018]

https://www. slideshare https://www.slideshare.net/EastBayWPMeetup/custom-post-types-and-custom-taxonomies

How are Phylogenies Estimated? Input: Unaligned sequences (DNA, RNA, or AA) Output: Tree with sequences at leaves Standard approach uses two steps: (1) align (2) compute a tree on the alignment Many different techniques for each step The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Two-phase estimation Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. RAxML: heuristic for large-scale ML optimization 20

1000-taxon models, ordered by difficulty (Liu et al., Science 2009) 1. 2 classes of MC: easy, moderate-to-difficult 2. true alignment 3. 2 classes: ClustalW, everything else Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way. 1000-taxon models, ordered by difficulty (Liu et al., Science 2009) 21

Estimate ML tree on merged alignment Re-aligning on a tree A B D C A B Decompose dataset C D Align subsets A B Comment on subset size C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

Estimate ML tree on merged alignment Re-aligning on a tree A B D C A B Decompose dataset Algorithmic parameter: how to align subsets. Default: MAFFT L-INS-i. C D Align subsets A B Comment on subset size C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

SATé and PASTA Algorithms Tree Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score

SATé-1 (Science 2009) performance 1000-taxon models, ordered by difficulty – rate of evolution generally increases from left to right For moderate-to-difficult datasets, SATe gets better trees and alignments than all other estimated methods. Close to what you might get if you had access to true alignment. Opens up a new realm of possibility: Datasets currently considered “unalignable” can in fact be aligned reasonably well. This opens up the feasibility of accurate estimations of deep evolutionary histories using a wider range of markers. TRANSITION: can we do better? What about smaller simulated datasets? And what about biological datasets? SATé-1 24-hour analysis, on desktop machines (using MAFFT on subsets) (Similar improvements for biological datasets) SATé-1 can analyze up to about 8,000 sequences.

1000-taxon models ranked by difficulty SATé-1 and SATé-2 (Systematic Biology, 2012) SATé-1: up to 8K SATé-2: up to ~50K 1000-taxon models ranked by difficulty

PASTA: better than SATé-1 and SATé-2

SATé and PASTA Algorithms Tree Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score

Estimate ML tree on merged alignment Re-aligning on a tree A B D C A B Decompose dataset C D Align subsets A B Comment on subset size C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

PASTA: easy to use GUI https://github.com/smirarab/pasta

The Tutorial (by Mike Nute) PASTA for large-scale MSA and tree estimation SEPP for taxon ID Will show you how to run SEPP Will show you how to use branch lengths in SEPP’s placement of reads to get interesting insights TIPP for taxon ID and abundance profiling

This talk

Phylogenetic Placement Input: Backbone alignment and backbone tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene) Output: Placement of query sequences on backbone tree Note: if the backbone tree is a Taxonomy, then the placement gives taxonomic information about the query sequences (i.e., reads)! The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Input S1 S2 S3 S4 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S1 S2 S3 S4

Align Sequence S1 S2 S3 S4 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S2 S3 S4

Place Sequence S1 S2 S3 S4 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S2 S3 S4 Q1

Phylogenetic Placement The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Marker-based Taxon Identification Fragmentary sequences from some gene Full-length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

Phylogenetic Placement in 2011 Align each query sequence to backbone alignment HMMER (Finn et al., NAR 2011) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA solve same problem (maximum likelihood placement under standard sequence evolution models)

HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

What is HMMER+pplacer? HMMER (Finn et al., NAR 2011) (specifically, HMMAlign) is used to add the read s into the backbone alignment, thus producing an “extended alignment”. HMMAlign is based on profile Hidden Markov Models (profile HMMs). pplacer (Matsen et al. BMC Bioinformatics 2010) is used to add read s into the best location in the tree T. pplacer is based on phylogenetic sequence evolution models (e.g., GTR), and uses maximum likelihood.

A general topology for a profile HMM D: deletion state I: insertion state M: match state (correspond to sites in the alignment) Insertion and Match states emit letters (nucleotides, amino acids, other) from a distribution Edges have transition probabilities A path through the profile HMM (with random selection of letters from D and I states) generates a sequence Given a sequence, you can find the maximum likelihood path through the model in polynomial time (dynamic programming) From DOI:10.1109/ICPR.2004.1334187

Profile Hidden Markov Models 47 Profile Hidden Markov Models Profile HMMs are probabilistic generative models to represent multiple sequence alignments. HMMER software suite can Build a profile HMM given a multiple sequence alignment A Use the profile HMM to add a sequence s into A, and return the “probability” that the HMM generated s (the “score”) Select between different profile HMMs based on score

Input Build a profile HMM for the backbone alignment S1 S2 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S3 S4 Build a profile HMM for the backbone alignment Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment.

Align Q1 using HMMER Build a profile HMM for the backbone alignment S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S3 S4 Build a profile HMM for the backbone alignment Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment.

Align Q1 using HMMER Build a profile HMM for the backbone alignment S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S3 S4 Build a profile HMM for the backbone alignment Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment. Note the maximum likelihood score for the alignment!

Align Q1 using HMMER Build a profile HMM for the backbone alignment S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S3 S4 Build a profile HMM for the backbone alignment Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment. Note the maximum likelihood score for the alignment!

What is pplacer? pplacer: software developed by Erick Matsen and colleagues. See http://matsen.fhcrc.org/pplacer/ Input: read s, alignment A (on S and s), tree on S Output: “Best” location to add s in T (under maximum likelihood). For every edge e in T, the value p(e) for the probability for s being placed on e (these probabilities add up to 1)

Place Sequence using pplacer S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S3 S4 For every edge in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return Te that has the best ML score.

Place Sequence using pplacer S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S3 S4 For every edge in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return Te that has the best ML score.

Place Sequence using pplacer S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- 0.4 0.03 0.05 0.5 0.02 S3 S4 For every edge in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return Te that has the best ML score.

Place Sequence using pplacer S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- 0.4 0.03 0.05 0.5 0.02 S3 S4 For every edge in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return Te that has the best ML score.

Place Sequence using pplacer S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- 0.4 0.03 0.05 0.5 0.02 S3 S4 For every edge in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return Te that has the best ML score.

Place Sequence using pplacer S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S3 S4 Q1 For every edge in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return Te that has the best ML score.

HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

SEPP vs. HMMER, PaPaRa alignments 0.0 0.0 Increasing rate of evolution

One Hidden Markov Model for the entire alignment? HMM 1

One HMM works beautifully for small-diameter trees

One HMM works poorly for large-diameter trees

One Hidden Markov Model for the entire alignment?

Or 2 HMMs? HMM 1 HMM 2

Or 4 HMMs? HMM 1 HMM 2 the bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence HMM 3 HMM 4

SEPP Ensemble of HMMs (eHMMs) Construct an eHMM, given an alignment A and tree T on A: Divide the leaves of T into subsets (by deleting centroid edges) until every subset is small enough Build a profile HMM on each subset using HMMER

SEPP Design To insert query sequence Q1 into backbone tree T Represent the backbone MSA with an eHMM, based on maximum alignment subset size Score Q1 against every profile HMM in the collection The best scoring HMM is used to compute the extended alignment Use pplacer on the extended alignment to add Q1 into tree T (restricted to subtree based on maximum placement subset size)

SEPP Parameter Exploration Alignment subset size and placement subset size impact the accuracy: Small alignment subset sizes best Large placement subset size best But running time and memory problems… Compromise 10% rule (both subset sizes 10% of backbone) had best overall performance

SEPP (10%-rule) on simulated data 0.0 0.0 Increasing rate of evolution

The Tutorial (by Mike Nute) PASTA for large-scale MSA and tree estimation SEPP for taxon ID Will show you how to run SEPP Will show you how to use branch lengths in SEPP’s placement of reads to get interesting insights TIPP for taxon ID and abundance profiling

TIPP https://github.com/smirarab/sepp TIPP (Bioinformatics 2014) performs marker gene-based: taxonomic identification (what is this read?) and metagenomic abundance profiling (at some taxonomic level) TIPP uses PASTA (J. Comp. Biol. 2015) to compute large-scale multiple sequence alignments (one for each marker gene) and SEPP (Pacific Symposium on Biocomputing 2012) to add short reads into refined taxonomies (one for each marker gene)

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), MSA-based method that only characterizes those reads that map to Metaphyler’s marker genes TIPP pipeline Uses BLAST to assign reads to marker genes (discards the others) For each marker: Computes PASTA reference alignments Computes reference taxonomies on the PASTA reference alignment Build eHMM for the PASTA reference alignment Places each read into the appropriate refined taxonomy, using a modification of SEPP (to consider statistical uncertainty in the extended alignment and placement within the refined taxonomy). Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment

TIPP for Taxonomic ID – output file

54100 = NCBI taxon ID

TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), MSA-based method that only characterizes those reads that map to the Metaphyler’s marker genes TIPP pipeline Uses BLAST to assign reads to marker genes For each marker: Computes PASTA reference alignments Computes reference taxonomies, refined to binary trees using reference alignment Computes eHMM on the PASTA reference alignment Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment

TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), MSA-based method that only characterizes those reads that map to the Metaphyler’s marker genes TIPP pipeline Uses BLAST to assign reads to marker genes For each marker: Computes PASTA reference alignments Computes reference taxonomies, refined to binary trees using reference alignment Computes eHMM on the PASTA reference alignment Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment

TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), MSA-based method that only characterizes those reads that map to the Metaphyler’s marker genes TIPP pipeline Uses BLAST to assign reads to marker genes For each marker: Computes PASTA reference alignments Computes reference taxonomies, refined to binary trees using reference alignment Computes eHMM on the PASTA reference alignment Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment

TIPP (https://github.com/smirarab/sepp) TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), MSA-based method that only characterizes those reads that map to the Metaphyler’s marker genes TIPP pipeline Uses BLAST to assign reads to marker genes For each marker: Computes PASTA reference alignments Computes reference taxonomies, refined to binary trees using reference alignment Computes eHMM on the PASTA reference alignment Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree. Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

TIPP Design (Step 4) Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%) For each marker gene, and its associated bin of reads: Builds eHMM to represent the MSA For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy optimizing maximum likelihood and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note – this will be a single clade if the support threshold is at strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.

The Tutorial (by Mike Nute) PASTA for large-scale MSA and tree estimation SEPP for taxon ID Will show you how to run SEPP Will show you how to use branch lengths in SEPP’s placement of reads to get interesting insights TIPP for taxon ID and abundance profiling

Using SEPP SEPP algorithmic parameters: Alignment subset size (how many sequences for each profile HMM in the ensemble?) Placement subset size (how much of the tree to search for optimal placement?) Default settings are acceptable, but you can improve accuracy (but increase running time) by: increasing placement subset size and decreasing alignment subset size

Using TIPP TIPP algorithmic parameters (other than SEPP parameters) Reference markers, alignments, and refined taxonomy Alignment threshold (default 95%) Placement threshold (default 95%) Note: The default alignment and placement thresholds were optimized for abundance profiling, not for Taxon ID. Reducing the placement threshold will increase probability of taxonomic classification at the species level (but could also increase the false positive rate)

Using PASTA Main algorithmic parameters in PASTA: Decomposition edge Alignment subset size Subset aligner Alignment merger Tree estimator and ML model Number of iterations Note: type of data (AA or nucleotide) affects subset alignment method (e.g., Muscle is particularly bad choice for AA but not too bad for DNA, MAFFT L-INS-i among best for both) Ask Mike about using BAli-Phy (Bayesian alignment estimation method) within PASTA

TIPP is under development! We are modifying TIPP’s design to improve taxonomic identification and abundance profiling on shotgun sequencing data Stay tuned! Developers: Erin Molloy, Mike Nute, Nidhi Shah, Mihai Pop, and Tandy Warnow

Profile HMMs vs. eHMMs An eHMM is better able to: detect homology between full length sequences and fragmentary sequences add fragmentary sequences into an existing alignment especially when there are many indels and/or substitutions (e.g., in the twilight zone)

Our Publications using eHMMs S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258. N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow "TIPP:Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548- 3555. N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow, "Ultra-large alignments using phylogeny aware profiles". Proceedings RECOMB 2015 and Genome Biology (2015) 16:124 N. Nguyen, M. Nute, S. Mirarab, and T. Warnow, HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics (2016): 17 (Suppl 10):765 All codes are available in open source form at https://github.com/smirarab/sepp

Acknowledgments PhD students: Nam Nguyen (now postdoc at UCSD), Siavash Mirarab (now faculty at UCSD), Bo Liu (now at Square), Erin Molloy, Nidhi Shah (Maryland), and Mike Nute Mihai Pop, University of Maryland NSF grants to TW: DBI:1062335, DEB 0733029, III:AF:1513629 NIH grant to MP: R01-A1-100947 Also: Guggenheim Foundation Fellowship (to TW), Microsoft Research New England (to TW), David Bruton Jr. Centennial Professorship (to TW), Grainger Foundation (to TW), HHMI Predoctoral Fellowship (to SM) TACC, UTCS, and UIUC computational resources