Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.

Presentation transcript:

Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois at Urbana-Champaign

Human Microbiome 10 times more bacteria cells than human cells Important role in regulating health Disruption associated with risk factors for diseases

Metagenomics Analyzing DNA sequences from environmental sample Sequencing technology produces short fragments of DNA Typical datasets contain millions of reads

Phylogenetic pipeline
Unaligned sequences:
  Hu = AGGCTATCACCGACTCCA
  Ch = TAGCTATCACGACCGC
  Go = TAGCTGACCGC
  Or = TCACGACCGACA
Multiple sequence alignment:
  Hu = -AGGCTATCACGACCTCCA
  Ch = TAG-CTATCACGACCGC--
  Go = TAG-CT-----GACCGC--
  Or = TCACGACCGACA

Using the MSA and tree to identify reads
  Hu = -AGGCTATCACGACCTCCA
  Ch = TAG-CTATCACGACCGC--
  Go = TAG-CT-----GACCGC--
  Or = TCACGACCGACA
Query read:
  Qu = TCACCCC
Query read aligned to the MSA:
  Qu = TCACC-CC----
Represent MSA using a profile Hidden Markov Model (HMM)

Phylogenetic Placement
Align each query sequence to backbone alignment:
  – HMMALIGN (Eddy, Bioinformatics 1998)
  – PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
Place each query sequence into backbone tree, using extended alignment:
  – pplacer (Matsen et al., BMC Bioinformatics 2010)
  – EPA (Berger et al., Systematic Biology 2011)

HMMER and PaPaRa results [Figure: x-axis, increasing rate of evolution; 20 replicates. Backbone size and number of fragments are missing from the transcript.]

Standard approach (single HMM): large evolutionary diameter [Figure: HMM 1]

New approach: smaller evolutionary diameter [Figure: HMM 1, HMM 2, HMM 3, HMM 4]

Ensemble of HMMs (eHMMs) [Figure: HMM 1, HMM 2, HMM 3, HMM 4]
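
A toy sketch of how an ensemble might be used: one model is built per subset of the backbone alignment, and a query is assigned to whichever model scores it best. This is a heavily simplified stand-in, not the SEPP implementation. It assumes gap-free DNA rows, models emissions only (a real profile HMM also has insert and delete states and transition probabilities), and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def column_profile(rows, pseudocount=1.0):
    """Per-column emission probabilities for a gap-free block of aligned rows."""
    profile = []
    for col in zip(*rows):
        counts = {c: pseudocount for c in ALPHABET}
        for ch in col:
            counts[ch] += 1
        total = sum(counts.values())
        profile.append({c: counts[c] / total for c in ALPHABET})
    return profile

def log_odds(profile, query, background=0.25):
    """Sum over columns of log2(emission / background) for an ungapped query."""
    return sum(math.log2(col[ch] / background) for col, ch in zip(profile, query))

def best_model(profiles, query):
    """Index of the highest-scoring profile in the ensemble."""
    scores = [log_odds(p, query) for p in profiles]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical subsets of a backbone alignment, one profile per subset.
subsets = [["ACGT", "ACGA"], ["TTGC", "TTGG"]]
ensemble = [column_profile(rows) for rows in subsets]
choice = best_model(ensemble, "TTGC")  # index of the subset the query fits best
```

Because each subset spans a smaller evolutionary diameter, its profile is more specific than a single profile built from all rows, which is the motivation given on the preceding slides.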

SEPP (10% rule) Simulated Results [Figure: x-axis, increasing rate of evolution; 20 replicates. Backbone size and number of fragments are missing from the transcript.]

Summary so far
  – Use DNA sequences to build an MSA and tree
  – Use an existing MSA and tree to identify a sequence
  – eHMMs for aligning a sequence to an existing MSA

Metagenomic taxon identification Objective: classify short reads in a metagenomic sample

Abundance profiling
Objective: distribution of the species (or genera, or families, etc.) within the sample.
For example, the distribution of a sample at the species level might be:
  Species A: 10%
  Species B: 25%
  Species C: 55%
  Species D: 1%
  Species E: 9%

Genome-based profiling [Figure: cells A, A, B]. Population of 2 bacteria, A and B. B has a genome twice as large as A's. True profile: 67% A, 33% B. Profile estimated from reads: 50% A, 50% B.

Single copy marker-based profiling [Figure: cells A, A, B]. Population of 2 bacteria, A and B. B has a genome twice as large as A's. Each has a single copy of gene C. True profile: 67% A, 33% B. Profile estimated from reads: 67% A, 33% B.
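
The arithmetic behind these two slides can be checked with a short sketch. It assumes two cells of A and one of B (as implied by the 67%/33% true profile), that reads are sampled in proportion to the amount of sequence each taxon contributes, and that the marker gene has the same length in both taxa. The function name is hypothetical.

```python
from fractions import Fraction

def profile_from_reads(cell_counts, region_lengths):
    """Expected read-based profile when reads are drawn in proportion to
    (number of cells) x (length of the sequenced region per cell)."""
    weights = {t: cell_counts[t] * region_lengths[t] for t in cell_counts}
    total = sum(weights.values())
    return {t: Fraction(w, total) for t, w in weights.items()}

cells = {"A": 2, "B": 1}        # two cells of A, one cell of B
genome_len = {"A": 1, "B": 2}   # B's genome is twice as long (relative units)
marker_len = {"A": 1, "B": 1}   # one copy of marker gene C in each (assumed equal length)

genome_based = profile_from_reads(cells, genome_len)   # A: 1/2, B: 1/2
marker_based = profile_from_reads(cells, marker_len)   # A: 2/3, B: 1/3
```

Counting reads from the whole genome over-represents the taxon with the larger genome, while counting only reads from a single-copy marker recovers the cell proportions.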

TIPP: Taxonomic Identification and Phylogenetic Profiling
Known full length sequences for a gene, and an alignment and a tree:
  ACT..TAGAA (species5)
  AGC...ACA (species4)
  TAGA...CTT (species3)
  TAGC...CCA (species2)
  AGG...GCAT (species1)
Fragmentary unknown reads for a gene:
  ACCG CGAG CGG GGCT … ACCT

TIPP: Taxonomic Identification and Phylogenetic Profiling (Nguyen et al., Bioinformatics, 2014). Pipeline: Reads -> Assign to marker genes -> Classify reads -> Compute profile

Abundance profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
Leading techniques:
  – PhymmBL (Brady & Salzberg, Nature Methods 2009)
  – NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
  – MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
  – MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
  – mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).

“Hard” genome datasets (known genomes and high indel error). Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets. mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets. Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP compared to other profiling methods
  – TIPP is highly accurate, even in the presence of novel genomes and high sequencing error
  – All other methods are less robust
  – Accurate profiles can be estimated using only a portion of the reads

Ensemble of HMMs
Represent MSA using many HMMs
Modifications enable
  – Fast and accurate alignment of fragmentary and ultra-large datasets (Nguyen et al., Genome Biology 2015)
  – Improved protein homology detection (in preparation)
Currently in use for
  – Vertebrate nuclear receptor evolution (in preparation)
  – 1KP Plant phylogenomics study (in preparation)
  – Identification of cardioviruses in rats (in preparation)
  – Identification of microbial sample (in preparation)
  – and many others…

Real biological data is messy. Full-length P450 gene: ~500 amino acid residues. Total sequences before filtering: ~225K. Challenge: How do we align large datasets with fragmentary sequences?

HMMs for MSA
Given seed alignment and a collection of sequences for the protein family:
  – Represent seed alignment using a profile HMM
  – Align each additional sequence to the HMM
  – Use transitivity to obtain MSA
Drawbacks:
  – Requires seed alignment
  – Poor accuracy on evolutionarily divergent datasets
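
The transitivity step can be illustrated with a small sketch: once every sequence has been aligned to the same HMM, residues assigned to the same match state land in the same column of the merged MSA. The input format here (a list of `(state, residue)` pairs per sequence, with `None` marking an inserted residue) and the function name are assumptions made for illustration, not the output format of any particular tool.

```python
def merge_by_transitivity(n_states, paths):
    """paths: {name: [(state, residue), ...]} where state is a match-state
    index in 0..n_states-1, or None for a residue inserted after the
    previous match state. Returns {name: gapped row}."""
    match, inserts = {}, {}
    for name, path in paths.items():
        match[name] = {}
        # Slot -1 holds insertions that occur before the first match state.
        inserts[name] = {k: [] for k in range(-1, n_states)}
        last = -1
        for state, residue in path:
            if state is None:
                inserts[name][last].append(residue.lower())
            else:
                match[name][state] = residue
                last = state
    # Each insertion slot is as wide as the longest insertion seen there.
    width = {k: max(len(inserts[n][k]) for n in paths)
             for k in range(-1, n_states)}
    rows = {}
    for name in paths:
        row = []
        for k in range(-1, n_states):
            if k >= 0:
                row.append(match[name].get(k, "-"))  # gap if state was skipped
            ins = inserts[name][k]
            row.append("".join(ins) + "-" * (width[k] - len(ins)))
        rows[name] = "".join(row)
    return rows

paths = {
    "s1": [(0, "A"), (1, "C"), (2, "G")],
    "s2": [(0, "A"), (2, "G")],                 # skips match state 1
    "s3": [(0, "A"), (None, "T"), (1, "C"), (2, "G")],  # one insertion
}
rows = merge_by_transitivity(3, paths)
# rows == {"s1": "A-CG", "s2": "A--G", "s3": "AtCG"}
```

Inserted residues are written in lower case and left-justified in their slot; they are not considered aligned to one another.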

Old approach using single HMM HMM 1

SEPP/TIPP approach HMM 1 HMM 3 HMM 4 HMM 2

How small of a subset size do we go to? HMM 1 HMM 3 HMM 4 HMM 2

Keep all HMMs HMM 1

Keep all HMMs [Figure: HMM 2, HMM 3]

Nested Hierarchical Ensemble of HMMs [Figure: HMM 1, HMM 2, HMM 3, HMM 4, HMM 5, HMM 6, HMM 7]
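
A minimal sketch of the "keep all HMMs" idea: decompose recursively and retain the subset from every level, not only the smallest ones. The split rule below (halving an ordered list) is a placeholder assumption; the slides decompose along the tree. The function name and the stopping size are hypothetical.

```python
def nested_subsets(seq_ids, max_size):
    """Return every subset visited while recursively halving seq_ids until
    subsets have at most max_size members. One HMM would be built per subset."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    subsets = [list(seq_ids)]
    if len(seq_ids) > max_size:
        mid = len(seq_ids) // 2
        subsets += nested_subsets(seq_ids[:mid], max_size)
        subsets += nested_subsets(seq_ids[mid:], max_size)
    return subsets

# Eight hypothetical backbone sequences, stopping at subsets of size 2.
ensemble = nested_subsets(list(range(8)), max_size=2)
sizes = [len(s) for s in ensemble]  # [8, 4, 2, 2, 4, 2, 2]: seven subsets
```

With eight sequences and a stopping size of two, the hierarchy contains seven subsets, which happens to match the seven HMMs drawn on the slide.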

UPP: Ultra-large alignment using phylogeny aware profiles

5/14/14 Experimental Design
  – Examined both simulated and biological DNA, RNA, and AA datasets
  – Generated fragmentary datasets from the full-length datasets
  – Explored impact of algorithmic design
  – Compared Clustal-Omega, MAFFT, MUSCLE, PASTA, and UPP
  – ML trees estimated on alignments
  – Scored alignment and tree error
  – Alignment error measured as average of SPFN and SPFP
  – Tree error measured in FN rate or Delta FN rate
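
The alignment error measure can be sketched as follows, using the usual sum-of-pairs definitions: SPFN is the fraction of residue pairs aligned in the reference that are missing from the estimated alignment, and SPFP is the fraction of pairs aligned in the estimate that are absent from the reference. The function names are hypothetical, and the sketch assumes "-" is the only gap character and that both alignments contain the same sequences.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """alignment: {name: gapped string}. Returns the set of residue pairs
    ((name1, i), (name2, j)) that share a column, where i and j index
    residues in the ungapped sequences."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    counters = {n: 0 for n in names}
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if alignment[n][col] != "-":
                present.append((n, counters[n]))
                counters[n] += 1
        pairs.update(combinations(present, 2))
    return pairs

def alignment_error(reference, estimated):
    """Return (SPFN, SPFP, average of the two)."""
    ref, est = homologous_pairs(reference), homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref)
    spfp = len(est - ref) / len(est)
    return spfn, spfp, (spfn + spfp) / 2

reference = {"a": "AC-G", "b": "ACTG"}
estimated = {"a": "ACG-", "b": "ACTG"}
spfn, spfp, avg = alignment_error(reference, estimated)
```

In this toy case the estimate misplaces the final G of sequence `a`, so one of three reference pairs is lost and one of three estimated pairs is spurious.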

UPP Algorithmic Parameters Decompose or not? Use an ensemble of HMMs or just a single HMM? Use a small (100 sequence) or large (1000 sequence) backbone?

RNASim Alignment Error

Full-length datasets

Alignment error on fragmentary 16S.T

Fragmentary datasets

Running time on simulated RNA datasets: UPP has close to linear runtime scaling

UPP compared to other alignment methods
  – PASTA and UPP result in accurate alignments and trees on full-length sequences (PASTA trees are slightly more accurate)
  – UPP is more robust on fragmentary data
  – Using a combination of UPP+PASTA can give the best overall result

Summary
  – Ensemble of HMMs
  – TIPP for identification and profiling
  – UPP for ultra-large alignment

Acknowledgements
Illinois: Tandy Warnow, Rebecca Stumpf, Bryan White, Mike Nute, Brenda Wilson
UCSD: Siavash Mirarab
UMD: Mihai Pop, Bo Liu
U of Copenhagen: Alonzo Alfaro-Núñez, Tom Hansen, Anders Hansen
Funding: NSF, University of Alberta

Questions? Available at