TIPP and SEPP (plus PASTA)

1 TIPP and SEPP (plus PASTA)
Tandy Warnow Department of Computer Science The University of Illinois at Urbana-Champaign

2 TIPP https://github.com/smirarab/sepp
TIPP (Bioinformatics 2014) performs marker-gene-based taxonomic identification (what is this read?) and metagenomic abundance profiling (at some taxonomic level). TIPP uses PASTA (J. Comp. Biol. 2015) to compute large-scale multiple sequence alignments and phylogenetic trees, and SEPP (Pacific Symposium on Biocomputing 2012) to add short reads into taxonomies (more generally, “phylogenetic placement”). TIPP and SEPP each use an “ensemble of profile Hidden Markov Models” (eHMM) to obtain high accuracy.

3

4 A general topology for a profile HMM
D: deletion state. I: insertion state. M: match state (corresponds to a site in the alignment). Insertion and match states emit letters (nucleotides, amino acids, other) from a distribution; deletion states emit nothing. Edges have transition probabilities. A path through the profile HMM (with random selection of letters from the M and I states) generates a sequence. Given a sequence, you can find the maximum likelihood path through the model in polynomial time (dynamic programming). From DOI: /ICPR
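
The dynamic programming step mentioned above is usually the Viterbi algorithm. The sketch below is a minimal log-space Viterbi for a generic HMM, not the full match/insert/delete topology of a profile HMM; the two-state model (`AT_rich`, `GC_rich`) and all of its probabilities are invented for illustration.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most likely path for obs."""
    # best[s]: log-probability of the best path that ends in state s so far
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of s: path score so far plus transition score.
            r = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[r] + log_trans[r][s] + log_emit[s][symbol]
            pointers[s] = r
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path

def logs(table):
    """Convert a table of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy model (assumption, not taken from the slides).
states = ("AT_rich", "GC_rich")
log_start = logs({"AT_rich": 0.5, "GC_rich": 0.5})
log_trans = {
    "AT_rich": logs({"AT_rich": 0.9, "GC_rich": 0.1}),
    "GC_rich": logs({"AT_rich": 0.1, "GC_rich": 0.9}),
}
log_emit = {
    "AT_rich": logs({"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}),
    "GC_rich": logs({"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}),
}

score, path = viterbi("AATTGGCC", states, log_start, log_trans, log_emit)
print(path)
```

The running time is proportional to the sequence length times the square of the number of states, which is why the maximum likelihood path can be found in polynomial time.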

5 https://www. slideshare

6 Abundance Profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example, distributions at the species level:
True distribution: 50% species A, 20% species B, 15% species C, 14% species D, 1% species E
Estimated distribution: % species A, % species B, % species C, % species D, % species E, 19% unclassified
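
As a rough illustration of what an abundance profile is, the sketch below turns per-read taxonomic labels into percentages. It is a simplification under stated assumptions: the function name `abundance_profile` is hypothetical, reads without a label are counted as "unclassified", and nothing here reflects how TIPP itself aggregates reads across marker genes.

```python
from collections import Counter

def abundance_profile(read_labels):
    """Map each label to its percentage of all reads (None = unclassified)."""
    counts = Counter(
        label if label is not None else "unclassified" for label in read_labels
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

# Reads labeled to match the "true distribution" in the example above.
reads = (
    ["species A"] * 50 + ["species B"] * 20 + ["species C"] * 15
    + ["species D"] * 14 + ["species E"] * 1
)
profile = abundance_profile(reads)
print(profile)

# A sample in which one read could not be classified.
small_profile = abundance_profile(["species A", "species A", None, "species B"])
print(small_profile)
```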

7 Testing TIPP in Bioinformatics 2014
We compared TIPP to:
PhymmBL (Brady & Salzberg, Nature Methods 2009)
NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes). Marker genes are single-copy, universal, and resistant to horizontal transmission.

8 High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

9 “Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

10 TIPP vs. other abundance profilers
TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. The other tested methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates). Improved accuracy is due to the use of ensembles of profile Hidden Markov Models (eHMMs); single HMMs do not provide the same advantages, especially in the presence of high indel rates.

11 This talk
Basic concepts: taxonomies, multiple sequence alignments, and phylogenies; phylogenetic placement and taxonomic ID; ensembles of Hidden Markov Models (eHMMs).
PASTA (J. Comp. Biol. 2015): Computing alignments and trees on large datasets (used for the reference alignments and trees).
SEPP (PSB 2012): SATé-enabled Phylogenetic Placement.
TIPP (Bioinformatics 2014): Applications of the eHMM technique to (a) taxonomic identification and (b) metagenomic abundance classification.
After my talk, Mike Nute will teach a tutorial on PASTA, SEPP, and TIPP.

12 Phylogenies and Taxonomies
Taxonomies: Rooted, with labels at every node for each taxonomic level. More or less based on phylogenies.
Phylogenies: Usually unrooted (time-reversible models), but outgroups can be used to root estimated phylogenies. Estimated from sequences (usually). Branch lengths reflect the amount of change. Edges/nodes are sometimes given with support (typically bootstrap).

13 Phylogeny Estimation
[Figure: a tree on taxa U, V, W, X, Y, with the sequences AGGGCATGA, AGAT, TAGACTT, TGCACAA, TGCGCTT]

14 Rooted neighbor-joining 16S rRNA phylogenetic tree of uncultured bacteria
[accessed 26 Jul, 2018]

15 https://www. slideshare

16 How are Phylogenies Estimated?
Input: Unaligned sequences (DNA, RNA, or AA). Output: Tree with sequences at leaves. The standard approach uses two steps: (1) align; (2) compute a tree on the alignment. There are many different techniques for each step.

17 Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

18 Phase 1: Alignment
Unaligned:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

19 Phase 2: Construct tree
Unaligned:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree with leaves S1, S2, S4, S3]

20 Two-phase estimation
Phylogeny methods: Bayesian MCMC, maximum parsimony, maximum likelihood, neighbor joining, FastME, UPGMA, quartet puzzling, etc.
Alignment methods: Clustal, POY (and POY*), Probcons (and Probtree), Probalign, MAFFT, Muscle, Di-align, T-Coffee, Prank (PNAS 2005, Science 2008), Opal (ISMB and Bioinf. 2007), FSA (PLoS Comp. Bio. 2009), Infernal (Bioinf. 2009), etc.
RAxML: heuristic for large-scale ML optimization

21 1000-taxon models, ordered by difficulty (Liu et al., Science 2009)
1. 2 classes of MC: easy, moderate-to-difficult. 2. True alignment. 3. 2 classes: ClustalW, everything else. Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way.

22 Re-aligning on a tree
Decompose dataset (subsets A, B and C, D). Align subsets. Merge sub-alignments (ABCD). Estimate ML tree on merged alignment. (Comment on subset size.)
[Figure: a tree on A, B, C, D; its decomposition into the subsets {A, B} and {C, D}; the merged alignment ABCD]

23 Re-aligning on a tree
Decompose dataset (subsets A, B and C, D). Align subsets. Merge sub-alignments (ABCD). Estimate ML tree on merged alignment. (Comment on subset size.)
Algorithmic parameter: how to align subsets. Default: MAFFT L-INS-i.
[Figure: a tree on A, B, C, D; its decomposition into the subsets {A, B} and {C, D}; the merged alignment ABCD]

24 SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree. Use tree to compute new alignment. Estimate ML tree on new alignment. Repeat until termination condition, and return the alignment/tree pair with the best ML score.
[Figure: cycle between Tree and Alignment]
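
The loop above can be sketched as a small driver function. Everything named here is a placeholder: `estimate_ml_tree` and `realign_on_tree` stand in for real tools (an ML tree estimator and the tree-guided re-alignment step), and a fixed iteration count is assumed as the termination condition.

```python
def iterate_alignment_and_tree(sequences, initial_alignment,
                               estimate_ml_tree, realign_on_tree,
                               max_iterations=3):
    """Alternate re-alignment and tree estimation; keep the best-scoring pair.

    estimate_ml_tree(alignment) -> (tree, ml_score)     # hypothetical callable
    realign_on_tree(sequences, tree) -> new alignment   # hypothetical callable
    """
    alignment = initial_alignment
    tree, score = estimate_ml_tree(alignment)
    best = (score, alignment, tree)
    for _ in range(max_iterations):
        alignment = realign_on_tree(sequences, tree)
        tree, score = estimate_ml_tree(alignment)
        if score > best[0]:  # higher log-likelihood is better
            best = (score, alignment, tree)
    return best

# Toy stand-ins so the loop can be run end to end (invented scores).
toy_scores = {"aln0": -120.0, "aln1": -100.0, "aln2": -105.0, "aln3": -101.0}

def toy_tree(alignment):
    return "tree_for_" + alignment, toy_scores[alignment]

def toy_realign(sequences, tree):
    return "aln" + str(int(tree[-1]) + 1)

best_score, best_alignment, best_tree = iterate_alignment_and_tree(
    ["S1", "S2", "S3", "S4"], "aln0", toy_tree, toy_realign, max_iterations=3
)
print(best_score, best_alignment, best_tree)
```

Returning the best pair rather than the last one matters because the ML score need not improve at every iteration.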

25 SATé-1 (Science 2009) performance
1000-taxon models, ordered by difficulty; the rate of evolution generally increases from left to right. SATé-1 24-hour analysis, on desktop machines (using MAFFT on subsets). (Similar improvements for biological datasets.) SATé-1 can analyze up to about 8,000 sequences.
For moderate-to-difficult datasets, SATé gets better trees and alignments than all the other methods, close to what you might get if you had access to the true alignment. This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well, which makes accurate estimation of deep evolutionary histories feasible using a wider range of markers. TRANSITION: can we do better? What about smaller simulated datasets? And what about biological datasets?

26 1000-taxon models ranked by difficulty
SATé-1 and SATé-2 (Systematic Biology, 2012). SATé-1: up to 8K sequences. SATé-2: up to ~50K sequences.

27 PASTA: better than SATé-1 and SATé-2

28 SATé and PASTA Algorithms
Obtain initial alignment and estimated ML tree. Use tree to compute new alignment. Estimate ML tree on new alignment. Repeat until termination condition, and return the alignment/tree pair with the best ML score.
[Figure: cycle between Tree and Alignment]

29 Re-aligning on a tree
Decompose dataset (subsets A, B and C, D). Align subsets. Merge sub-alignments (ABCD). Estimate ML tree on merged alignment. (Comment on subset size.)
[Figure: a tree on A, B, C, D; its decomposition into the subsets {A, B} and {C, D}; the merged alignment ABCD]

30 PASTA: easy to use GUI

31 The Tutorial (by Mike Nute)
PASTA for large-scale MSA and tree estimation. SEPP for taxon ID: will show you how to run SEPP, and how to use branch lengths in SEPP’s placement of reads to get interesting insights. TIPP for taxon ID and abundance profiling.

32 This talk

33 Phylogenetic Placement
Input: Backbone alignment and backbone tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene). Output: Placement of query sequences on the backbone tree. Note: if the backbone tree is a taxonomy, then the placement gives taxonomic information about the query sequences (i.e., reads)!

34 Input
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC
[Figure: backbone tree on S1, S2, S3, S4]

35 Align Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC
[Figure: backbone tree on S1, S2, S3, S4]

36 Place Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC
[Figure: backbone tree on S1, S2, S3, S4, with Q1 added]

37 Phylogenetic Placement
The two basic steps in phylogenetic placement are similar to the two-phase methods we typically use: align, and then place the fragment. The difference is that we align each fragment independently to the backbone alignment. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a profile hidden Markov model for the backbone alignment and then aligns the query sequence to the model, and PaPaRa, which estimates ancestral parsimony state vectors for each edge in the reference tree, aligns the query sequence against each state vector, and selects the best-scoring alignment. The two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment.

38 Marker-based Taxon Identification
Fragmentary sequences from some gene: ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT
Full-length sequences for the same gene, and an alignment and a tree: AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

39 Phylogenetic Placement in 2011
Align each query sequence to the backbone alignment: HMMER (Finn et al., NAR 2011); PaPaRa (Berger and Stamatakis, Bioinformatics 2011).
Place each query sequence into the backbone tree: pplacer (Matsen et al., BMC Bioinformatics 2010); EPA (Berger and Stamatakis, Systematic Biology 2011).
Note: pplacer and EPA solve the same problem (maximum likelihood placement under standard sequence evolution models).

40 HMMER vs. PaPaRa Alignments
[Figure; x-axis: increasing rate of evolution]

41 What is HMMER+pplacer?
HMMER (Finn et al., NAR 2011) (specifically, HMMAlign) is used to add the read s into the backbone alignment, thus producing an “extended alignment”. HMMAlign is based on profile Hidden Markov Models (profile HMMs). pplacer (Matsen et al., BMC Bioinformatics 2010) is used to add the read s into the best location in the tree T. pplacer is based on phylogenetic sequence evolution models (e.g., GTR), and uses maximum likelihood.

42

43 A general topology for a profile HMM
D: deletion state. I: insertion state. M: match state (corresponds to a site in the alignment). Insertion and match states emit letters (nucleotides, amino acids, other) from a distribution; deletion states emit nothing. Edges have transition probabilities. A path through the profile HMM (with random selection of letters from the M and I states) generates a sequence. Given a sequence, you can find the maximum likelihood path through the model in polynomial time (dynamic programming). From DOI: /ICPR

44 Profile Hidden Markov Models
Profile HMMs are probabilistic generative models to represent multiple sequence alignments. The HMMER software suite can: build a profile HMM given a multiple sequence alignment A; use the profile HMM to add a sequence s into A, and return the “probability” that the HMM generated s (the “score”); select between different profile HMMs based on score.
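
To give a feel for how an alignment becomes a probabilistic model, the sketch below estimates per-column letter distributions from a small alignment, using a flat pseudocount. This is only the emission part of the idea and is not how HMMER builds a model: real tools also decide which columns become match states and estimate transition probabilities. The function name `match_emissions` and the toy alignment are assumptions.

```python
from collections import Counter

def match_emissions(alignment, alphabet="ACGT", pseudocount=1.0):
    """Per-column letter probabilities for an alignment, ignoring gaps ('-')."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all alignment rows must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        # Add the pseudocount to every letter so no probability is zero.
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Hypothetical four-sequence alignment (not the one from the slides).
toy_alignment = ["ACG-", "ACGT", "A-GT", "TCGT"]
emissions = match_emissions(toy_alignment)
for column, dist in enumerate(emissions):
    print(column, dist)
```

In column 0 the letter A is seen three times and T once, so with a pseudocount of 1 the estimate for A is (3 + 1) / (4 + 4) = 0.5.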

45 Input
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC
[Figure: backbone tree on S1, S2, S3, S4]

46 Align Q1 using HMMER
Build a profile HMM for the backbone alignment. Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment.

47 Align Q1 using HMMER
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC
Build a profile HMM for the backbone alignment. Compute a maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment. Note the maximum likelihood score for the alignment!

49 What is pplacer?
pplacer: software developed by Erick Matsen and colleagues. Input: read s, alignment A (on S and s), tree T on S. Output: “Best” location to add s in T (under maximum likelihood), and, for every edge e in T, the value p(e) for the probability of s being placed on e (these probabilities add up to 1).

50 Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC
For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return the Te that has the best ML score.

52 Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC
[Figure: the edges of the backbone tree are labeled with the probabilities 0.4, 0.03, 0.05, 0.5, 0.02]
For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return the Te that has the best ML score.
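
One common way to turn per-edge ML scores into probabilities p(e) that sum to 1 is to normalize the likelihoods (often called likelihood weight ratios). The sketch below assumes that is what is meant here; the edge names `e1` to `e5` and the log-likelihood values are invented, chosen so that the result reproduces the five probabilities shown on the slide.

```python
import math

def placement_probabilities(log_likelihoods):
    """Normalize per-edge log-likelihoods into weights that sum to 1."""
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_likelihoods.values())
    weights = {e: math.exp(ll - top) for e, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

# Hypothetical log-likelihoods for the five edges of a four-leaf tree.
slide_values = {"e1": 0.4, "e2": 0.03, "e3": 0.05, "e4": 0.5, "e5": 0.02}
log_likelihoods = {e: -1000.0 + math.log(p) for e, p in slide_values.items()}

p = placement_probabilities(log_likelihoods)
best_edge = max(p, key=p.get)
print(best_edge, p)
```

Only differences between log-likelihoods matter, so the large shared offset of -1000 has no effect on the probabilities.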

55 Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC
[Figure: backbone tree with Q1 added]
For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!) Return the Te that has the best ML score.

56 HMMER vs. PaPaRa Alignments
[Figure; x-axis: increasing rate of evolution]

57 SEPP vs. HMMER, PaPaRa alignments
[Figure; x-axis: increasing rate of evolution]

58 One Hidden Markov Model for the entire alignment?
HMM 1

59 One HMM works beautifully for small-diameter trees

60 One HMM works poorly for large-diameter trees

61 One Hidden Markov Model for the entire alignment?

62 Or 2 HMMs? HMM 1 HMM 2

63 Or 4 HMMs?
HMM 1, HMM 2, HMM 3, HMM 4
The bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence.

64 SEPP Ensemble of HMMs (eHMMs)
Construct an eHMM, given an alignment A and tree T on A: Divide the leaves of T into subsets (by deleting centroid edges) until every subset is small enough. Build a profile HMM on each subset using HMMER.
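
The decomposition step can be sketched as follows, taking a centroid edge to be an edge whose removal splits the leaves as evenly as possible. This is a simple quadratic-time illustration on an adjacency-set tree, not SEPP's implementation; the function names and the six-leaf example tree are assumptions.

```python
def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def component(adj, start, allowed, cut):
    """Nodes reachable from start inside `allowed` without crossing edge `cut`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in adj[node]:
            if nbr in allowed and nbr not in seen and {node, nbr} != set(cut):
                seen.add(nbr)
                stack.append(nbr)
    return seen

def decompose(adj, leaves, max_size, allowed=None):
    """Split the leaf set by deleting centroid edges until subsets are small."""
    allowed = set(adj) if allowed is None else allowed
    mine = sorted(leaf for leaf in leaves if leaf in allowed)
    if len(mine) <= max_size:
        return [mine]
    best = None  # (imbalance, nodes on one side of the chosen edge)
    for u in sorted(allowed):
        for v in sorted(adj[u]):
            if v in allowed and u < v:  # visit each edge once
                side = component(adj, u, allowed, (u, v))
                k = sum(1 for leaf in mine if leaf in side)
                imbalance = abs(len(mine) - 2 * k)
                if best is None or imbalance < best[0]:
                    best = (imbalance, side)
    side = best[1]
    return (decompose(adj, leaves, max_size, side)
            + decompose(adj, leaves, max_size, allowed - side))

# Hypothetical unrooted tree: leaves A-F, internal nodes w, x, y, z.
tree = adjacency([("A", "w"), ("B", "w"), ("w", "x"), ("C", "x"), ("x", "y"),
                  ("D", "y"), ("y", "z"), ("E", "z"), ("F", "z")])
subsets = decompose(tree, ["A", "B", "C", "D", "E", "F"], max_size=3)
print(subsets)
```

In this example the edge between x and y is the only one that splits the six leaves three and three, so it is deleted first and both resulting subsets are already small enough. A profile HMM would then be built on the sub-alignment for each subset.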

65 SEPP Design
To insert query sequence Q1 into backbone tree T: Represent the backbone MSA with an eHMM, based on the maximum alignment subset size. Score Q1 against every profile HMM in the collection. Use the best-scoring HMM to compute the extended alignment. Use pplacer on the extended alignment to add Q1 into tree T (restricted to a subtree based on the maximum placement subset size).
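
The control flow of this design is short enough to write down directly. In the sketch below, `score`, `align`, and `place` are hypothetical callables standing in for HMMER scoring, HMMER alignment, and pplacer; the toy stand-ins at the bottom exist only so the function can be run.

```python
def sepp_place(query, ehmm, score, align, place):
    """Score the query against every HMM, align with the best, then place.

    score(hmm, query) -> number, higher is better   # hypothetical callable
    align(hmm, query) -> extended alignment          # hypothetical callable
    place(extended_alignment, query) -> placement    # hypothetical callable
    """
    best_hmm = max(ehmm, key=lambda hmm: score(hmm, query))
    extended_alignment = align(best_hmm, query)
    return place(extended_alignment, query)

# Toy stand-ins with invented bit scores for two subset HMMs.
toy_bit_scores = {("hmm_S1_S2", "Q1"): 12.5, ("hmm_S3_S4", "Q1"): 31.0}

def toy_score(hmm, query):
    return toy_bit_scores[(hmm, query)]

def toy_align(hmm, query):
    return ("extended alignment from", hmm)

def toy_place(extended_alignment, query):
    return {"query": query, "alignment": extended_alignment}

placement = sepp_place("Q1", ["hmm_S1_S2", "hmm_S3_S4"],
                       toy_score, toy_align, toy_place)
print(placement)
```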

66 SEPP Parameter Exploration
Alignment subset size and placement subset size impact the accuracy: small alignment subset sizes are best, and large placement subset sizes are best, but large placement subsets cause running time and memory problems. Compromise: the 10% rule (both subset sizes 10% of the backbone) had the best overall performance.

67 SEPP (10%-rule) on simulated data
[Figure; x-axis: increasing rate of evolution]

68 The Tutorial (by Mike Nute)
PASTA for large-scale MSA and tree estimation. SEPP for taxon ID: will show you how to run SEPP, and how to use branch lengths in SEPP’s placement of reads to get interesting insights. TIPP for taxon ID and abundance profiling.

69 TIPP https://github.com/smirarab/sepp
TIPP (Bioinformatics 2014) performs marker-gene-based taxonomic identification (what is this read?) and metagenomic abundance profiling (at some taxonomic level). TIPP uses PASTA (J. Comp. Biol. 2015) to compute large-scale multiple sequence alignments (one for each marker gene) and SEPP (Pacific Symposium on Biocomputing 2012) to add short reads into refined taxonomies (one for each marker gene).

70 High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

71 “Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

72 TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) is an MSA-based method that only characterizes those reads that map to MetaPhyler’s marker genes.
TIPP pipeline:
Uses BLAST to assign reads to marker genes (discards the others).
For each marker: computes PASTA reference alignments; computes reference taxonomies on the PASTA reference alignment; builds an eHMM for the PASTA reference alignment.
Places each read into the appropriate refined taxonomy, using a modification of SEPP (to consider statistical uncertainty in the extended alignment and placement within the refined taxonomy): can consider more than one extended alignment; can consider more than the optimal placement in the tree for each extended alignment.
Assigns a taxonomic label based on the MRCA of all selected placements for all selected extended alignments.

73 TIPP for Taxonomic ID – output file

74 54100 = NCBI taxon ID

75 TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) is an MSA-based method that only characterizes those reads that map to MetaPhyler’s marker genes.
TIPP pipeline:
Uses BLAST to assign reads to marker genes.
For each marker: computes PASTA reference alignments; computes reference taxonomies, refined to binary trees using the reference alignment; computes an eHMM on the PASTA reference alignment.
Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree: can consider more than one extended alignment; can consider more than the optimal placement in the tree for each extended alignment.
Assigns a taxonomic label based on the MRCA of all selected placements for all selected extended alignments.

79 TIPP Design (Step 4)
Input: marker gene reference alignment (computed using PASTA, RECOMB 2014), species taxonomy, alignment support threshold (default 95%), and placement support threshold (default 95%).
For each marker gene, and its associated bin of reads: Build an eHMM to represent the MSA. For each read: Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold. For each extended MSA, use pplacer to place the read into the taxonomy, optimizing maximum likelihood, and identify all the clades in the tree with sufficiently high likelihood to meet the specified placement support threshold. (Note: this will be a single clade if the support threshold is strictly greater than 50%.) Taxonomically characterize each read at the MRCA of these clades.
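
The thresholding and MRCA steps can be illustrated with a small sketch. It covers only one extended alignment and treats candidate placements as taxa with support values; the function names, the toy lineages, and the support values are all assumptions and are not TIPP's actual data structures.

```python
def select_until_support(weights, threshold):
    """Take items in decreasing weight until their total reaches threshold."""
    chosen, total = [], 0.0
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for item, weight in ranked:
        chosen.append(item)
        total += weight
        if total >= threshold:
            break
    return chosen

def mrca(lineages):
    """Deepest taxon shared by all root-to-taxon lineages (None if no overlap)."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common[-1] if common else None

# Hypothetical taxonomy: root-to-species lineages for three species.
lineage = {
    "species A1": ["root", "family F", "genus A", "species A1"],
    "species A2": ["root", "family F", "genus A", "species A2"],
    "species B1": ["root", "family F", "genus B", "species B1"],
}
# Hypothetical placement support for one read.
support = {"species A1": 0.60, "species A2": 0.36, "species B1": 0.04}

chosen_95 = select_until_support(support, 0.95)
label_95 = mrca([lineage[t] for t in chosen_95])
chosen_99 = select_until_support(support, 0.99)
label_99 = mrca([lineage[t] for t in chosen_99])
print(label_95, label_99)
```

With a 95% threshold the two species in genus A are enough, so the read is labeled at the genus level; raising the threshold to 99% pulls in the third species and pushes the label up to the family. This is the trade-off the support thresholds control: higher thresholds give more conservative, less specific labels.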

86 The Tutorial (by Mike Nute)
PASTA for large-scale MSA and tree estimation. SEPP for taxon ID: will show you how to run SEPP, and how to use branch lengths in SEPP’s placement of reads to get interesting insights. TIPP for taxon ID and abundance profiling.

87 Using SEPP
SEPP algorithmic parameters: Alignment subset size (how many sequences for each profile HMM in the ensemble?) Placement subset size (how much of the tree to search for optimal placement?) The default settings are acceptable, but you can improve accuracy, at the cost of increased running time, by increasing the placement subset size and decreasing the alignment subset size.
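To see why a smaller alignment subset size means more profile HMMs in the ensemble, here is a minimal sketch of a recursive tree decomposition. It assumes the backbone tree is split repeatedly at the edge that divides the taxa most evenly until every subset is at most the chosen size, with one profile HMM then built per final subset. The tree encoding (an undirected adjacency dictionary) and the names `side` and `decompose` are hypothetical and chosen for illustration; this is not SEPP's code.

```python
def side(adj, start, blocked, nodes):
    """Nodes of component `nodes` reachable from `start` without
    crossing the edge (start, blocked)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in adj[node]:
            if nbr not in nodes or nbr in seen:
                continue
            if node == start and nbr == blocked:
                continue  # do not cross the removed edge
            seen.add(nbr)
            stack.append(nbr)
    return seen


def decompose(adj, taxa, max_size, nodes=None):
    """Split the taxa into subsets of at most `max_size` (>= 1) by
    recursively removing the most balanced edge."""
    nodes = set(adj) if nodes is None else nodes
    current = sorted(taxa & nodes)
    if len(current) <= max_size:
        return [current]
    best = None
    for u in nodes:
        for v in adj[u]:
            if v in nodes and u < v:  # visit each edge once
                part = side(adj, u, v, nodes)
                imbalance = abs(2 * len(taxa & part) - len(current))
                if best is None or imbalance < best[0]:
                    best = (imbalance, part)
    part = best[1]
    return (decompose(adj, taxa, max_size, part)
            + decompose(adj, taxa, max_size, nodes - part))


# Hypothetical 8-taxon tree: ((a,b),(c,d)) joined to ((e,f),(g,h)).
EDGES = [("a", "n1"), ("b", "n1"), ("c", "n2"), ("d", "n2"),
         ("n1", "n3"), ("n2", "n3"), ("e", "n4"), ("f", "n4"),
         ("g", "n5"), ("h", "n5"), ("n4", "n6"), ("n5", "n6"),
         ("n3", "n6")]
ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)
TAXA = set("abcdefgh")

for size in (8, 4, 2):
    print(size, len(decompose(ADJ, TAXA, size)))
```

Under these assumptions, halving the subset size roughly doubles the number of subsets, and hence the number of HMMs that each query must be scored against, which is where the extra running time comes from.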

88 Using TIPP
TIPP algorithmic parameters (other than SEPP parameters): Reference markers, alignments, and refined taxonomy. Alignment threshold (default 95%). Placement threshold (default 95%). Note: The default alignment and placement thresholds were optimized for abundance profiling, not for taxon ID. Reducing the placement threshold increases the probability of taxonomic classification at the species level (but could also increase the false positive rate).

89 Using PASTA
Main algorithmic parameters in PASTA: Decomposition edge. Alignment subset size. Subset aligner. Alignment merger. Tree estimator and ML model. Number of iterations. Note: the type of data (AA or nucleotide) affects the choice of subset alignment method (e.g., Muscle is a particularly bad choice for AA but not too bad for DNA; MAFFT L-INS-i is among the best for both). Ask Mike about using BAli-Phy (a Bayesian alignment estimation method) within PASTA.

90 TIPP is under development!
We are modifying TIPP’s design to improve taxonomic identification and abundance profiling on shotgun sequencing data. Stay tuned! Developers: Erin Molloy, Mike Nute, Nidhi Shah, Mihai Pop, and Tandy Warnow

91 Profile HMMs vs. eHMMs
An eHMM is better able to: detect homology between full-length sequences and fragmentary sequences; add fragmentary sequences into an existing alignment; especially when there are many indels and/or substitutions (e.g., in the twilight zone).

92 Our Publications using eHMMs
S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:
N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):
N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124
N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17 (Suppl 10):765
All codes are available in open source form at

93 Acknowledgments PhD students: Nam Nguyen (now postdoc at UCSD), Siavash Mirarab (now faculty at UCSD), Bo Liu (now at Square), Erin Molloy, Nidhi Shah (Maryland), and Mike Nute Mihai Pop, University of Maryland NSF grants to TW: DBI: , DEB , III:AF: NIH grant to MP: R01-A Also: Guggenheim Foundation Fellowship (to TW), Microsoft Research New England (to TW), David Bruton Jr. Centennial Professorship (to TW), Grainger Foundation (to TW), HHMI Predoctoral Fellowship (to SM) TACC, UTCS, and UIUC computational resources

