Published by Leon Bond. Modified over 8 years ago.
2
Ensembles of HMMs and their use in biomolecular sequence analysis
Nam-phuong Nguyen
Carl R. Woese Institute for Genomic Biology
University of Illinois at Urbana-Champaign
3
Human Microbiome
– 10 times more bacterial cells than human cells
– Important role in regulating health
– Disruption associated with risk factors for diseases
4
Metagenomics
– Analyzing DNA sequences from an environmental sample
– Sequencing technology produces short fragments of DNA
– Typical datasets contain millions of reads
5
Phylogenetic pipeline
Unaligned sequences:
Hu = AGGCTATCACCGACTCCA
Ch = TAGCTATCACGACCGC
Go = TAGCTGACCGC
Or = TCACGACCGACA
Aligned sequences (MSA):
Hu = -AGGCTATCACGACCTCCA
Ch = TAG-CTATCACGACCGC--
Go = TAG-CT-----GACCGC--
Or = -------TCACGACCGACA
6
Using the MSA and tree to identify reads
Backbone MSA:
Hu = -AGGCTATCACGACCTCCA
Ch = TAG-CTATCACGACCGC--
Go = TAG-CT-----GACCGC--
Or = -------TCACGACCGACA
Query read (Q):
Qu = TCACCCC
Query aligned to the MSA:
Qu = -------TCACC-CC----
Represent MSA using a profile Hidden Markov Model (HMM)
7
Phylogenetic Placement
Align each query sequence to backbone alignment:
– HMMALIGN (Eddy, Bioinformatics 1998)
– PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
Place each query sequence into backbone tree, using extended alignment:
– pplacer (Matsen et al., BMC Bioinformatics 2010)
– EPA (Berger et al., Systematic Biology 2011)
9
HMMER and PaPaRa results
Figure axis: increasing rate of evolution
Backbone size: 500; 5000 fragments; 20 replicates
10
Standard approach (single HMM)
Large evolutionary diameter
Figure label: HMM 1
11
New approach
Smaller evolutionary diameter
Figure labels: HMM 1, HMM 2, HMM 3, HMM 4
12
Ensemble of HMMs (eHMMs)
Figure labels: HMM 1, HMM 2, HMM 3, HMM 4
13
SEPP (10% rule) Simulated Results
Figure axis: increasing rate of evolution
Backbone size: 500; 5000 fragments; 20 replicates
14
Summary so far
– Use DNA sequences to build an MSA and tree
– Use an existing MSA and tree to identify a sequence
– eHMMs for aligning a sequence to an existing MSA
15
Metagenomic taxon identification Objective: classify short reads in a metagenomic sample
16
Abundance profiling
Objective: distribution of the species (or genera, or families, etc.) within the sample
For example, the distribution of a sample at the species level might be:
– Species A: 10%
– Species B: 25%
– Species C: 55%
– Species D: 1%
– Species E: 9%
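A minimal sketch of how such a distribution could be computed once every read carries a taxon label. The function name `abundance_profile` and the read labels are illustrative assumptions, not part of TIPP or any other tool named in this talk.

```python
from collections import Counter

def abundance_profile(labels):
    """Return {taxon: percentage of reads} from a list of per-read taxon labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}

# Hypothetical per-read species labels (made-up data that mirrors the example).
reads = ["A"] * 10 + ["B"] * 25 + ["C"] * 55 + ["D"] * 1 + ["E"] * 9
profile = abundance_profile(reads)
```

The same function works at any rank (genus, family, and so on) if the labels are given at that rank.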
17
Genome-based profiling
– Population of 2 bacteria, A and B (cells shown in the figure: A, A, B)
– B has a genome twice as large as A's
– True profile: 67% A, 33% B
– Profile estimated from reads: 50% A, 50% B
18
Single-copy marker-based profiling
– Population of 2 bacteria, A and B (cells shown in the figure: A, A, B)
– B has a genome twice as large as A's
– Each has a single copy of gene C
– True profile: 67% A, 33% B
– Profile estimated from reads: 67% A, 33% B
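The arithmetic behind the two profiling examples can be checked with a short sketch. It assumes that the expected number of reads from a taxon is proportional to (number of cells) x (length of the sequenced region); the cell counts (two A cells, one B cell) are read off the figure labels, and all names are illustrative.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {taxon: w / total for taxon, w in weights.items()}

cells = {"A": 2, "B": 1}            # two cells of A, one cell of B
genome_len = {"A": 1.0, "B": 2.0}   # B's genome is twice as long as A's
marker_copies = {"A": 1, "B": 1}    # each genome has one copy of gene C

true_profile = normalize(cells)
# Whole-genome reads: longer genomes contribute proportionally more reads.
genome_based = normalize({t: cells[t] * genome_len[t] for t in cells})
# Reads from a single-copy marker: each cell contributes equally.
marker_based = normalize({t: cells[t] * marker_copies[t] for t in cells})
```

Here `genome_based` is 50% A and 50% B, while `marker_based` matches the true profile of about 67% A and 33% B.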
19
TIPP: Taxonomic Identification and Phylogenetic Profiling
Fragmentary unknown reads for a gene:
ACCG
CGAG
CGG
GGCT
…
ACCT
Known full length sequences for a gene, and an alignment and a tree:
ACT..TAGAA (species5)
AGC...ACA (species4)
TAGA...CTT (species3)
TAGC...CCA (species2)
AGG...GCAT (species1)
20
TIPP: Taxonomic Identification and Phylogenetic Profiling
Nguyen et al., Bioinformatics, 2014
Pipeline (uses marker genes):
1. Reads
2. Assign to marker genes
3. Classify reads
4. Compute profile
21
Abundance profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
Leading techniques:
–PhymmBL (Brady & Salzberg, Nature Methods 2009)
–NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
–MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
–MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
–mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).
22
“Hard” genome datasets (known genomes and high indel error)
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets. mOTU terminates with an error message on all the high indel datasets.
23
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
24
TIPP compared to other profiling methods
– TIPP is highly accurate, even in the presence of novel genomes and high sequencing error
– All other methods are less robust
– Accurate profiles can be estimated using only a portion of the reads
25
Ensemble of HMMs
Represent MSA using many HMMs
Modifications enable
–Fast and accurate alignment of fragmentary and ultra-large datasets (Nguyen et al., Genome Biology 2015)
–Improved protein homology detection (in preparation)
Currently in use for
–Vertebrate nuclear receptor evolution (in preparation)
–1KP Plant phylogenomics study (in preparation)
–Identification of cardioviruses in rats (in preparation)
–Identification of microbial sample (in preparation)
–and many others…
27
Real biological data is messy
– Full-length P450 gene: ~500 amino acid residues
– Total sequences before filtering: ~225K
Challenge: How do we align large datasets with fragmentary sequences?
28
HMMs for MSA
Given seed alignment and a collection of sequences for the protein family:
– Represent seed alignment using a profile HMM
– Align each additional sequence to the HMM
– Use transitivity to obtain MSA
Drawbacks:
– Requires seed alignment
– Poor accuracy on evolutionarily divergent datasets
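The transitivity step can be illustrated with a small sketch. It assumes each sequence has already been aligned to the profile HMM and that the result is available as (residue, match-state index) pairs; residues assigned to the same match state land in the same MSA column. Insertion states are left out to keep the example short, and the function name and input layout are hypothetical rather than the output format of HMMER or any other tool.

```python
def merge_by_transitivity(hits, n_states):
    """Build MSA rows from per-sequence alignments to a shared profile HMM.

    hits: {name: [(residue, match_state_index), ...]}
    n_states: number of match states (columns) in the HMM
    """
    msa = {}
    for name, pairs in hits.items():
        row = ["-"] * n_states          # start with an all-gap row
        for residue, state in pairs:
            row[state] = residue        # same state -> same column
        msa[name] = "".join(row)
    return msa

# Hypothetical alignments of two sequences to a 5-state HMM.
hits = {
    "s1": [("A", 0), ("C", 1), ("G", 3)],
    "s2": [("C", 1), ("T", 2), ("G", 3), ("A", 4)],
}
msa = merge_by_transitivity(hits, n_states=5)
```

Because no pair of sequences is aligned directly, the quality of the resulting MSA depends entirely on how well each sequence fits the HMM, which is the accuracy drawback listed for divergent datasets.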
29
Old approach using single HMM HMM 1
30
SEPP/TIPP approach HMM 1 HMM 3 HMM 4 HMM 2
31
How small should the subsets be?
Figure labels: HMM 1, HMM 2, HMM 3, HMM 4
32
Keep all HMMs HMM 1
33
HMM 2 HMM 3 Keep all HMMs
34
Nested Hierarchical Ensemble of HMMs
Figure labels: HMM 1, HMM 2, HMM 3, HMM 4, HMM 5, HMM 6, HMM 7
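A rough sketch of the "keep all HMMs" idea: decompose the backbone recursively and retain the subset from every level, so that one HMM can be built per subset. This is a simplification that halves an ordered list of sequence names; it stands in for, and is not, a decomposition of the backbone tree. The names `nested_subsets` and `max_size` are assumptions made for the example.

```python
def nested_subsets(leaves, max_size):
    """Return every subset visited while recursively halving `leaves`
    until subsets have at most `max_size` members. All levels are kept."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    subsets = [list(leaves)]            # keep the current (parent) subset
    if len(leaves) > max_size:
        mid = len(leaves) // 2
        subsets += nested_subsets(leaves[:mid], max_size)
        subsets += nested_subsets(leaves[mid:], max_size)
    return subsets

# Eight hypothetical backbone sequences, ordered as they might appear in a tree.
backbone = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
ensemble = nested_subsets(backbone, max_size=2)
sizes = sorted(len(s) for s in ensemble)
```

With eight sequences and a smallest subset size of two, this yields seven nested subsets (one of size 8, two of size 4, four of size 2), the same count as the seven HMM labels in the figure.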
35
UPP: Ultra-large alignment using phylogeny-aware profiles
41
Experimental Design (5/14/14)
– Examined both simulated and biological DNA, RNA, and AA datasets
– Generated fragmentary datasets from the full-length datasets
– Explored impact of algorithmic design
– Compared Clustal-Omega, MAFFT, MUSCLE, PASTA, and UPP
– ML trees estimated on alignments
– Scored alignment and tree error
– Alignment error measured as average of SPFN and SPFP
– Tree error measured as FN rate or Delta FN rate
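A minimal sketch of the alignment error measure, using the usual sum-of-pairs definitions: SPFN is the fraction of residue pairs aligned in the reference alignment that are missing from the estimated alignment, and SPFP is the fraction of residue pairs in the estimated alignment that are absent from the reference. The function names and the toy alignments are illustrative; this is not the scoring code used in the study.

```python
from itertools import combinations

def homology_pairs(alignment):
    """Set of residue pairs that share a column.
    alignment: {name: gapped row}, all rows the same length."""
    names = sorted(alignment)
    next_idx = {n: 0 for n in names}    # next ungapped position per sequence
    pairs = set()
    for col in range(len(alignment[names[0]])):
        residues = []
        for n in names:
            if alignment[n][col] != "-":
                residues.append((n, next_idx[n]))
                next_idx[n] += 1
        pairs.update(combinations(residues, 2))
    return pairs

def sp_errors(reference, estimated):
    """Return (SPFN, SPFP, average of the two)."""
    ref, est = homology_pairs(reference), homology_pairs(estimated)
    spfn = len(ref - est) / len(ref) if ref else 0.0
    spfp = len(est - ref) / len(est) if est else 0.0
    return spfn, spfp, (spfn + spfp) / 2

# Toy alignments: the estimate misplaces the last residue of x.
reference = {"x": "AC-G", "y": "ACTG"}
estimated = {"x": "ACG-", "y": "ACTG"}
spfn, spfp, error = sp_errors(reference, estimated)
```

In this toy case one of the three reference pairs is lost and one of the three estimated pairs is spurious, so SPFN, SPFP, and their average are all 1/3.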
42
UPP Algorithmic Parameters Decompose or not? Use an ensemble of HMMs or just a single HMM? Use a small (100 sequence) or large (1000 sequence) backbone?
43
RNASim Alignment Error
44
Full-length datasets
45
Alignment error on fragmentary 16S.T
46
Fragmentary datasets
47
Running time on simulated RNA datasets
UPP has close to linear runtime scaling
48
UPP compared to other alignment methods
– PASTA and UPP result in accurate alignments and trees on full-length sequences (PASTA trees are slightly more accurate)
– UPP is more robust on fragmentary data
– Using a combination of UPP+PASTA can give the best overall result
49
Summary
– Ensemble of HMMs
– TIPP for identification and profiling
– UPP for ultra-large alignment
50
Acknowledgements
Illinois: Tandy Warnow, Rebecca Stumpf, Bryan White, Mike Nute, Brenda Wilson
UCSD: Siavash Mirarab
UMD: Mihai Pop, Bo Liu
U of Copenhagen: Alonzo Alfaro-Núñez, Tom Hansen, Anders Hansen
Funding: NSF 09-35347, NSF 08-20709, NSF 0733029, University of Alberta
51
Questions? Available at https://github.com/smirarab/sepp