Presentation is loading. Please wait.

Presentation is loading. Please wait.

Taxonomic identification and phylogenetic profiling

Similar presentations


Presentation on theme: "Taxonomic identification and phylogenetic profiling"— Presentation transcript:

1 Taxonomic identification and phylogenetic profiling
Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois at Urbana-Champaign Joint work with Siavash Mirarab, Mihai Pop, and Tandy Warnow

2 Metagenomics Culture-independent method for studying a microbiome
Extract genetic material directly from the environment Applications to biofuel production, agriculture, human health Sequencing technology produces millions of short reads from unknown species Fundamental steps in analysis is identifying taxa of read and estimating a population profile of a sample Metagenomics is the study of sampling genetic material directly from an environmental sample Courtesy of Human Microbiome Project

3 Taxonomic Identification and Profiling
Objective: Given a query sequence, identify the taxon (species, genus, family, etc...) of the sequence Classification problem Taxonomic profiling Objective: Given a set of query sequences collected from a sample, estimate the population profile of the sample Estimation problem Can be solved via taxonomic identification The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignment, then aligns query sequence. PaPaRa, which estimates ancestral parsimony state vectors for each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

4 Taxonomic Identification Methods
Sequence similarity search Classifies by finding most similar sequence Classifies fragments from any region of genome BLAST Composition-based methods Typically uses k-mers PhymmBL, NBC Phylogeny-based methods Classifies fragments by using a phylogeny The human body contains 10 times more bacterial cells than human cells

5 Phylogeny-based taxonomic identification
5 Phylogeny-based taxonomic identification Known full-length gene sequences, and an alignment and a tree Fragmentary Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT Remove internal nodes TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

6 Phylogeny-based taxonomic identification
6 Phylogeny-based taxonomic identification Known full-length gene sequences, and an alignment and a tree Fragmentary Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT Remove internal nodes TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

7 Phylogeny-based taxonomic identification
7 Phylogeny-based taxonomic identification Known full-length gene sequences, and an alignment and a tree Fragmentary Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT Remove internal nodes TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

8 Phylogeny-based taxonomic identification
8 Phylogeny-based taxonomic identification Known full-length gene sequences, and an alignment and a tree Fragmentary Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT Remove internal nodes TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

9 Phylogeny-based taxonomic identification
9 Phylogeny-based taxonomic identification Known full-length gene sequences, and an alignment and a tree Fragmentary Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT Remove internal nodes TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

10 Phylogeny-based taxonomic identification
10 Phylogeny-based taxonomic identification Known full-length gene sequences, and an alignment and a tree Fragmentary Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT Remove internal nodes TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

11 Phylogenetic Placement
Input: (Backbone) Alignment and tree on full-length sequences and a query sequence (short read) Output: Placement of the query sequence on the backbone tree Use placement to infer relationship between query sequence and full-length sequences in backbone tree Applications in metagenomic analysis Millions of reads Reads from different genomes mixed together Use placement to identify read Metagenomic analysis, include a slide

12 Align Sequence S1 S2 S4 S3 S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S2 S4 S3

13 Align Sequence S1 S2 S4 S3 S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC S2 S4 S3

14 Place Sequence S1 S2 S4 S3 S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC S2 S4 Q1 S3

15 Phylogenetic Placement
Align each query sequence to backbone alignment: HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree, using extended alignment: pplacer (Matsen et al., BMC Bioinformatics 2010) EPA (Berger et al., Systematic Biology 2011) The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignment, then aligns query sequence. PaPaRa, which estimates ancestral parsimony state vectors for each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

16 Phylogenetic Placement
Align each query sequence to backbone alignment: HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree, using extended alignment: pplacer (Matsen et al., BMC Bioinformatics 2010) EPA (Berger et al., Systematic Biology 2011) The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignment, then aligns query sequence. PaPaRa, which estimates ancestral parsimony state vectors for each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

17 HMMER and PaPaRa results
Backbone size: 500 5000 fragments 20 replicates 0.0 Increasing rate evolution

18 Old approach using single HMM
Leaves

19 Old approach using single HMM

20 Old approach using single HMM
Large evolutionary diameter

21 New approach

22 New approach

23 New approach Smaller evolutionary diameter

24 New approach HMM 1 HMM 2

25 New approach HMM 1 HMM 2 HMM 3 HMM 4
the bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence HMM 3 HMM 4

26 SEPP (10% rule) Simulated Results
Backbone size: 500 5000 fragments 20 replicates 0.0 0.0 Increasing rate evolution

27 Using SEPP Known Full length Sequences, Fragmentary Unknown Reads:
27 Using SEPP Known Full length Sequences, and an alignment and a tree Fragmentary Unknown Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT ML placement 40% The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestral parsimony state vectors for each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

28 Taxonomic Identification using Phylogenetic Placement
28 Taxonomic Identification using Phylogenetic Placement Adding Uncertainty Known Full length Sequences, and an alignment and a tree Fragmentary Unknown Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT 2nd highest likelihood placement 38% ML placement 40% Motivate the placement with numerical numbers 95% TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

29 Taxonomic Identification using Phylogenetic Placement
29 Taxonomic Identification using Phylogenetic Placement Adding Uncertainty Known Full length Sequences, and an alignment and a tree Fragmentary Unknown Reads: ( bp long) (500-10,000 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG . ACCT 2nd highest likelihood placement 38% ML placement 40% Motivate the placement with numerical numbers 95% TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5) AGG...GCAT (species1) TAGC...CCA (species2)

30 TIPP Nguyen et al. Bioinformatics 2014

31 TIPP for Taxonomic Profiling
Motivate 31 TIPP for Taxonomic Profiling Marker-based abundance profiler Uses a collection of single copy housekeeping genes Only fragments binned to marker genes classified Profiling algorithm Bins fragments to marker genes Classify fragments binned to each marker Pool all classified reads Estimate abundance profile on pooled reads

32 Taxonomic Profiling Experimental Design
5/14/14 Taxonomic Profiling Experimental Design Datasets Easy conditions (low error rates, known genomes) Hard conditions (novel genomes, high error rates) Methods Marker-based – TIPP, Metaphyler, mOTU, Metaphlan Genome-based – NBC, PhymmBL Measured distance to true profile as error metric

33 “Easy” genome datasets

34 High indel datasets containing known genomes
“Hard” genome datasets High indel datasets containing known genomes Note: NBC, MetaPhlAn, and Metaphyler cannot classify any sequences from at least of the high indel long sequence datasets. mOTU terminates with an error message on all the high indel datasets.

35 “Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

36 Summary TIPP: marker-based taxonomic identification and classification method through phylogenetic placement Very robust to sequencing errors and novel genomes Results in overall more accurate profiles Accurate profiles can be obtained by classifying reads from the marker genes

37 Acknowledgements Siavash Mirarab Mihai Pop Tandy Warnow Bo Liu
Supported by NSF DEB University of Alberta

38 SEPP/TIPP/UPP SEPP/UPP/TIPP site: https://github.com/smirarab/sepp/
Instructions for installing UPP: Instructions for installing TIPP: References: 1) N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. Ultra-large alignments using phylogeny-aware profiles, Proceedings of Research in Computational Biology (RECOMB) 2015 and to appear in Genome Biology 2015. 1) N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. TIPP:Taxonomic Identification and Phylogenetic Profiling. Bioinformatics, 2014, 30 (24): 2) Mirarab, S., N. Nguyen, and T. Warnow, SEPP: SATe-Enabled Phylogenetic Placement. Pacific Symposium on Biocomputing.

39 Phylogenetic Placement
Align each query sequence to backbone alignment: HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree, using extended alignment: pplacer (Matsen et al., BMC Bioinformatics 2010) EPA (Berger et al., Systematic Biology 2011) The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignment, then aligns query sequence. PaPaRa, which estimates ancestral parsimony state vectors for each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

40 16S Identification A A A A B B 16S gene

41 16S Identification True Abundance A: 67% B: 33% A A
Estimated Abundance A: 50% B: 50% A A B B 16S gene

42 Single copy gene True Abundance A: 67% B: 33% A A Estimated Abundance


Download ppt "Taxonomic identification and phylogenetic profiling"

Similar presentations


Ads by Google