Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois at Urbana-Champaign
Human Microbiome 10 times more bacterial cells than human cells Important role in regulating health Disruption associated with risk factors for diseases
Metagenomics Analyzing DNA sequences from an environmental sample Sequencing technology produces short fragments of DNA Typical datasets contain millions of reads
Phylogenetic pipeline Hu = AGGCTATCACCGACTCCA Ch = TAGCTATCACGACCGC Go = TAGCTGACCGC Or = TCACGACCGACA Hu = -AGGCTATCACGACCTCCA Ch = TAG-CTATCACGACCGC-- Go = TAG-CT-----GACCGC-- Or = TCACGACCGACA
Using the MSA and tree to identify reads Hu = -AGGCTATCACGACCTCCA Ch = TAG-CTATCACGACCGC-- Go = TAG-CT-----GACCGC-- Or = TCACGACCGACA Qu = TCACCCC Qu = TCACC-CC---- Represent MSA using a profile Hidden Markov Model (HMM)
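As a rough illustration of what representing an MSA with a profile HMM involves, the sketch below estimates only the per-column match emission probabilities from a toy alignment, using add-one pseudocounts. The toy alignment, the function name `column_emissions`, and the pseudocount value are assumptions made for illustration. A real profile HMM also models insert and delete states and the transitions between them, none of which is shown here.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_emissions(msa, pseudocount=1.0):
    """Return one {nucleotide: probability} dict per alignment column.

    Gaps ('-') are ignored when counting; this is a simplification,
    not how a full profile HMM treats them.
    """
    n_cols = len(msa[0])
    if any(len(row) != n_cols for row in msa):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in msa if row[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile

# Hypothetical toy alignment (not the one on the slide).
toy_msa = ["ACG-T", "ACGAT", "TCG-T"]
profile = column_emissions(toy_msa)
```

Each entry of `profile` sums to 1, and columns that are mostly gaps stay close to uniform because the pseudocounts dominate.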
Phylogenetic Placement Align each query sequence to backbone alignment: HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree, using extended alignment: pplacer (Matsen et al., BMC Bioinformatics 2010) EPA (Berger et al., Systematic Biology 2011)
HMMER and PaPaRa results [Figure: x-axis: increasing rate of evolution; 20 replicates]
HMM 1 Standard approach (single HMM) Large evolutionary diameter
New approach Smaller evolutionary diameter HMM 1 HMM 2 HMM 1 HMM 3 HMM 4 HMM 2
HMM 1 HMM 3 HMM 4 HMM 2 Ensemble of HMMs (eHMMs)
SEPP (10% rule) Simulated Results [Figure: x-axis: increasing rate of evolution; 20 replicates]
Summary so far Use DNA sequences to build an MSA and tree Use an existing MSA and tree to identify a sequence eHMMs for aligning a sequence to an existing MSA
Metagenomic taxon identification Objective: classify short reads in a metagenomic sample
Abundance profiling Objective: distribution of the species (or genera, or families, etc.) within the sample For example, the distribution of a sample at the species level might be: Species A: 10% Species B: 25% Species C: 55% Species D: 1% Species E: 9%
Genome-based profiling Population of 2 bacterial species, A and B (cells shown: A, A, B). B's genome is twice as large as A's. True profile: 67% A, 33% B Profile estimated from reads: 50% A, 50% B
Single copy marker-based profiling Population of 2 bacterial species, A and B (cells shown: A, A, B). B's genome is twice as large as A's. Each has a single copy of gene C True profile: 67% A, 33% B Profile estimated from reads: 67% A, 33% B
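The arithmetic behind the two profiling slides can be checked with a small sketch. It assumes reads are sampled in proportion to (number of cells) x (length of the sequenced region); the function name `profile_from_reads` and the relative length units are illustrative choices, not part of any tool.

```python
def profile_from_reads(cell_counts, region_lengths):
    """Expected share of reads per species.

    Assumes each species contributes reads in proportion to
    cells * length of the region being sequenced.
    """
    weights = {s: cell_counts[s] * region_lengths[s] for s in cell_counts}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

cells = {"A": 2, "B": 1}             # two A cells, one B cell
genome_len = {"A": 1.0, "B": 2.0}    # B's genome is twice as large (relative units)
marker_len = {"A": 1.0, "B": 1.0}    # one copy of marker gene C in each

genome_profile = profile_from_reads(cells, genome_len)
marker_profile = profile_from_reads(cells, marker_len)
```

Under these assumptions the genome-based estimate is 50% A and 50% B, while the marker-based estimate recovers the true 67% A and 33% B, because a single-copy marker contributes the same amount of sequence per cell regardless of genome size.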
TIPP: Taxonomic Identification and Phylogenetic Profiling ACT..TAGAA (species5) AGC...ACA (species4) TAGA...CTT (species3) TAGC...CCA (species2) AGG...GCAT (species1) ACCG CGAG CGG GGCT … ACCT Fragmentary unknown reads for a gene Known full length sequences for a gene, and an alignment and a tree
Marker genes Nguyen et al., Bioinformatics, 2014 TIPP: Taxonomic Identification and Phylogenetic Profiling Reads Assign to marker genes Classify reads Compute profile
Abundance profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. Leading techniques: –PhymmBL (Brady & Salzberg, Nature Methods 2009) –NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011) –MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland –MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard –mOTU (Bork et al., Nature Methods 2013) MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets. mOTU terminates with an error message on all the high indel datasets. “Hard” genome datasets (known genomes and high indel error)
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets. “Novel” genome datasets
TIPP compared to other profiling methods TIPP is highly accurate, even in the presence of novel genomes and high sequencing error All other methods are less robust Accurate profiles can be estimated using only a portion of the reads
Ensemble of HMMs Represent MSA using many HMMs Modifications enable –Fast and accurate alignment of fragmentary and ultra-large datasets (Nguyen et al., Genome Biology 2015 ) –Improved protein homology detection (in preparation) Currently in use for –Vertebrate nuclear receptor evolution (in preparation) –1KP Plant phylogenomics study (in preparation) –Identification of cardioviruses in rats (in preparation) –Identification of microbial sample (in preparation) –and many others…
Real biological data is messy Full-length P450 gene ~500 amino acid residues Total sequences before filtering ~225K Challenge: How do we align large datasets with fragmentary sequences?
HMMs for MSA Given seed alignment and a collection of sequences for the protein family: Represent seed alignment using a profile HMM Align each additional sequence to the HMM Use transitivity to obtain MSA Drawbacks: Requires seed alignment Poor accuracy on evolutionarily divergent datasets
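The transitivity step can be pictured with a minimal sketch: once each additional sequence has been aligned to the HMM, its residues are tied to backbone columns, so placing them in those columns yields an alignment of all sequences to one another. The function name `extend_alignment` and the input format (a list of `(residue, column)` pairs per query) are hypothetical, and the sketch assumes every query residue lands in an existing backbone column, so insertions relative to the backbone are not handled.

```python
def extend_alignment(backbone, query_hits):
    """Add query rows to a backbone alignment.

    backbone:   dict of name -> gapped string, all of equal length
    query_hits: dict of name -> list of (residue, backbone column index)
    Assumption: no insertions relative to the backbone.
    """
    n_cols = len(next(iter(backbone.values())))
    extended = dict(backbone)
    for name, hits in query_hits.items():
        row = ["-"] * n_cols
        for residue, col in hits:
            row[col] = residue
        extended[name] = "".join(row)
    return extended

# Hypothetical toy data.
backbone = {"x": "ACGT", "y": "A-GT"}
hits = {"q": [("C", 1), ("G", 2)]}
extended = extend_alignment(backbone, hits)
```

Here the query row becomes `-CG-`, which aligns its residues with those of `x` and `y` without the backbone rows changing.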
Old approach using single HMM HMM 1
SEPP/TIPP approach HMM 1 HMM 3 HMM 4 HMM 2
How small should the subsets be? HMM 1 HMM 3 HMM 4 HMM 2
Keep all HMMs HMM 1
HMM 2 HMM 3 Keep all HMMs
Nested Hierarchical Ensemble of HMMs HMM 1 HMM 2 HMM 3 HMM 4 HMM 5 HMM 6 HMM 7
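One way to picture the nested hierarchy is as a recursive decomposition that keeps the subset from every level rather than only the smallest ones. The sketch below halves an ordered list of sequence names as a stand-in for decomposing the backbone tree; the halving rule, the function name `nested_subsets`, and the `min_size` parameter are assumptions for illustration, not the procedure used by SEPP, TIPP, or UPP. One HMM would then be built per returned subset.

```python
def nested_subsets(names, min_size):
    """Return every subset produced by recursive halving.

    The full set comes first, followed by the subsets of its left half,
    then those of its right half. Recursion stops once a subset has at
    most min_size members.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    subsets = [list(names)]
    if len(names) > min_size:
        mid = len(names) // 2
        subsets += nested_subsets(names[:mid], min_size)
        subsets += nested_subsets(names[mid:], min_size)
    return subsets

seqs = [f"s{i}" for i in range(8)]
ensemble = nested_subsets(seqs, min_size=2)
sizes = [len(s) for s in ensemble]
```

With 8 sequences and `min_size=2`, this yields 7 nested subsets (one of size 8, two of size 4, four of size 2).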
UPP: Ultra-large alignment using phylogeny aware profiles
5/14/14 Experimental Design Examined both simulated and biological DNA, RNA, and AA datasets Generated fragmentary datasets from the full-length datasets Explored impact of algorithmic design Compared Clustal-Omega, MAFFT, MUSCLE, PASTA, and UPP ML trees estimated on alignments Scored alignment and tree error Alignment error measured as the average of SPFN and SPFP Tree error measured as FN rate or Delta FN rate
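The alignment error measure can be sketched as follows, taking SPFN as the fraction of residue pairs aligned in the reference alignment that are missing from the estimated alignment, and SPFP as the fraction of residue pairs aligned in the estimated alignment that are absent from the reference. The function names and the toy alignments are assumptions for illustration; this is not the scoring code used in the study.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Set of residue pairs that share a column.

    alignment: dict of name -> gapped string, all of equal length.
    A residue is identified as (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)  # fixed order so pairs compare across alignments
    n_cols = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(n_cols):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs

def sp_error(reference, estimated):
    """Return (SPFN, SPFP, average of the two)."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref) if ref else 0.0
    spfp = len(est - ref) / len(est) if est else 0.0
    return spfn, spfp, (spfn + spfp) / 2

# Hypothetical toy alignments of the same two sequences.
reference = {"a": "AC-G", "b": "ACTG"}
estimated = {"a": "ACG-", "b": "ACTG"}
spfn, spfp, alignment_error = sp_error(reference, estimated)
```

In this toy case each alignment contains three aligned pairs and they differ in one, so SPFN, SPFP, and their average are all one third.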
UPP Algorithmic Parameters Decompose or not? Use an ensemble of HMMs or just a single HMM? Use a small (100 sequence) or large (1000 sequence) backbone?
RNASim Alignment Error
Full-length datasets
Alignment error on fragmentary 16S.T
Fragmentary datasets
Running time on simulated RNA datasets UPP has close to linear runtime scaling
UPP compared to other alignment methods PASTA and UPP result in accurate alignments and trees on full-length sequences (PASTA gives slightly more accurate trees) UPP is more robust on fragmentary data Using a combination of UPP and PASTA can give the best overall result
Summary Ensemble of HMMs TIPP for identification and profiling UPP for ultra-large alignment
Acknowledgements Illinois Tandy Warnow Rebecca Stumpf Bryan White Mike Nute Brenda Wilson UCSD Siavash Mirarab UMD Mihai Pop Bo Liu U of Copenhagen Alonzo Alfaro-Núñez Tom Hansen Anders Hansen Funding NSF NSF NSF University of Alberta
Questions? Available at