The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments Isaam Saeed & Saman K Halgamuge MERIT,

Presentation transcript:

The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments Isaam Saeed & Saman K Halgamuge MERIT, Biomedical Engineering Melbourne School of Engineering

Outline
- What is metagenomics?
- Introducing OFDEG
- Application to metagenomics
- Benchmarking results
- Concluding remarks

Metagenomics: a brief introduction
- Environmental niches
- Microorganisms working together as a community
- Example: nitrogen fixation in soil
Notes: Metagenomics is relatively recent... Dealing primarily with... These microorganisms work together and interact... that are necessary. As an example, consider soil. It may seem mundane, but it is one of the most complex. What makes this so-called interesting is that they exist in harsh and extreme environments, such as... In harnessing the knowledge of how these exist and function, we can expand our knowledge (biosphere, biotech). Before we can perform detailed analysis, such as reconstructing metabolic pathways or investigating their biogeochemistry, we need to ask two fundamental questions.

Metagenomics: a brief introduction (cont’d)
- Isolate each constituent organism in pure culture, then for each organism: clone → sequence → analyse
- BUT, we only know about laboratory culturing methods for ~1% of extant microbiota
Notes: Early attempts at analysing ... relied ...
Modified and adapted from: Keller, M. & Zengler, K.: Tapping into microbial diversity. Nature Reviews Microbiology 2, 141-150 (February 2004).

Novel microbes and the binning problem
Metagenomics approach → Binning
- Conserved marker genes
  * high accuracy
  * low coverage
- Sequence similarity
  * very short sequences
  * computationally intensive
  * biased
- Sequence composition
  * unbiased (?)
  * long sequence length
Notes: How do we handle novel microbes that resist lab cultivation? So now we arrive at the metagenomics approach. So we have our environmental sample of microorganisms. Once we have extracted ... contained, we blindly ... The question then becomes ...

Sequence composition: oligonucleotide frequency (OF) Pride D, Meinersmann R, Wassenaar T.: Evolutionary Implications of Microbial Genome Tetranucleotide Frequency Biases. Genome Research 2003, 13:145-158. Teeling H, Meyerdierks A, Bauer M, Amann R, Glockner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology 2004, 6(9):938-47
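As a rough illustration of what an oligonucleotide frequency (OF) profile is, the sketch below counts overlapping tetranucleotides in a DNA string and normalises them to relative frequencies. The function name `of_profile` and the choice to skip windows containing non-ACGT characters are assumptions made for this example; the cited papers apply further steps (for instance reverse-complement handling and statistical normalisation) that are not reproduced here.

```python
from collections import Counter
from itertools import product


def of_profile(seq, k=4):
    """Return relative frequencies of all 4**k k-mers over A, C, G, T.

    Hypothetical helper for illustration only. Windows containing any
    other character (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


profile = of_profile("ACGTACGTACGT")
print(len(profile), sum(profile))
```

With k = 4 the profile has 256 entries, so every fragment, whatever its length, maps to a vector of the same dimension.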

The oligonucleotide frequency derived error gradient (OFDEG)
Flowchart: draw sample i of length l → compute OF profiles → samples ≥ N? (No: draw another sample; Yes: l = l + step.size) → linear regression → OFDEG
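The flowchart can be sketched roughly as follows. The slide does not state how the error between OF profiles is measured, so this sketch assumes the Euclidean distance between a subsequence's tetranucleotide profile and the profile of the whole fragment, averaged over N random subsequences per length. All names (`ofdeg`, `start_len`, `step_size`, `n_samples`) are hypothetical, and this is not presented as the authors' implementation.

```python
import random
from collections import Counter
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def of_profile(seq):
    """Relative tetranucleotide frequencies of seq (illustrative helper)."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[m] for m in KMERS) or 1
    return [counts[m] / total for m in KMERS]


def ofdeg(seq, start_len, step_size, n_samples, rng):
    """Least-squares slope of mean OF error versus subsequence length.

    Needs at least two distinct lengths, otherwise the slope is undefined.
    """
    reference = of_profile(seq)
    lengths, errors = [], []
    length = start_len
    while length <= len(seq):
        total_err = 0.0
        for _ in range(n_samples):
            start = rng.randrange(len(seq) - length + 1)
            sub = of_profile(seq[start:start + length])
            # ASSUMPTION: error = Euclidean distance to the full-fragment profile.
            total_err += sum((a - b) ** 2 for a, b in zip(sub, reference)) ** 0.5
        lengths.append(length)
        errors.append(total_err / n_samples)
        length += step_size  # l = l + step.size
    # Ordinary least-squares slope of error against length.
    mean_x = sum(lengths) / len(lengths)
    mean_y = sum(errors) / len(errors)
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, errors))
    return sxy / sxx


rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(2000))
gradient = ofdeg(fragment, start_len=200, step_size=200, n_samples=20, rng=rng)
print(gradient)
```

Under this assumed error measure, longer subsequences approximate the full profile more closely, so the fitted gradient is negative; each fragment is summarised by that single number.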

OFDEG in relation to microbial phylogeny Family: Enterobacteriaceae Family: Xanthomonadaceae Class: Gammaproteobacteria

Benchmarking procedure: metagenomic data
- simLC: biophosphorus removing sludge
  * Dominant species: Rhodopseudomonas palustris HaA2 strain
  * Coverage: 5.19x
- simMC: acid mine drainage biofilm
  * Xylella fastidiosa Dixon
  * Rhodopseudomonas palustris BisB5
  * Bradyrhizobium sp. BTAi1
  * Coverage: 3.48 to 2.77x
- simHC: agricultural soil
  * Dominant species: none
Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, et al.: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature Methods 2007, 4(6):495-500.

Benchmarking procedure: assemblers simMC contigs ≥ 8,000 bp Phrap 8000 bp* Arachne major contigs 230 bp* 1334 bp* * Cutoff length

Benchmarking procedure: algorithms simMC contigs ≥ 8,000 bp Phrap 8000 bp U* SS* Arachne major contigs 230 bp 1334 bp For: - Tetranucleotide Frequency (TF) - OFDEG - OFDEG + GC Content * U – unsupervised SS – semi-supervised

Benchmarking procedure: algorithms
- Unsupervised: Partitioning Around Medoids (PAM)
  * Silhouette width governs optimal class selection
- Semi-supervised: SGSOM (1)
  * Based on Self-organising Maps
  * Cluster-then-label strategy
  * Labels (“seeds”): upstream/downstream flanking sequences of the 16S rRNA gene, subject to selection criteria
  * CP set at 55% and 75% as per recommendations
(1) Chan CKK, Hsu A, Halgamuge SK, Tang SL: Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics 2008, 9(215)
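To make the role of the silhouette width concrete, here is a minimal sketch that scores a given clustering of one-dimensional feature values (such as one scalar per fragment). It does not implement PAM itself; the function name, the absolute-difference distance, and the toy values are assumptions for illustration. In practice one would run the clustering for several candidate numbers of clusters and keep the one with the largest mean silhouette width.

```python
def mean_silhouette(values, labels):
    """Mean silhouette width for 1-D values under the given cluster labels."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(v - u) for u in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy feature values, invented for this example.
values = [0.10, 0.11, 0.12, 0.50, 0.52, 0.51]
good = mean_silhouette(values, [0, 0, 0, 1, 1, 1])
bad = mean_silhouette(values, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

A partition that separates the two obvious groups scores close to 1, while a partition that mixes them scores much lower.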

Benchmarking procedure: accuracy
- Taxonomy definition: NCBI
- All results taken at the rank of Order
- Standard definitions of
  * Sensitivity: TP / (TP + FN)
  * Specificity: TN / (TN + FP)
- Bins containing predominantly one organism are considered reference bins, i.e. TPs.
- SS accuracy is measured based on assigned label vs actual label.
Domain: Bacteria > Phylum: Proteobacteria > Class: Gammaproteobacteria > Order: Xanthomonadales > Family: Xanthomonadaceae > Genus: Xylella > Species: Xylella fastidiosa > Strain: Xylella fastidiosa Dixon
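The two definitions translate directly into code. The counts used below are invented purely to show the arithmetic and are not taken from the benchmark.

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)


# Illustrative counts only (not from the slides).
sens = sensitivity(tp=40, fn=10)
spec = specificity(tn=45, fp=5)
print(sens, spec)
```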

Results: overall comparison
Feature            | Algorithm type* | Assigns. (%) | Spec.  | Sens.  | Disc. ability
TF                 | U               | 97.33        | 0.9905 | 0.6565 | 0.8235
OFDEG              |                 | 97.32        | 0.9100 | 0.8300 | 0.8700
TF (CP=55%)        | SS              | 69.28        | 1.0000 | 0.7450 | 0.8725
OFDEG+GC (CP=75%)  |                 | 77.75        | 0.8000 | 0.9625 | 0.8813
TF (CP=75%)        |                 | 83.44        | 0.9925 | 0.8925 | 0.9425
OFDEG+GC           |                 |              | 0.9513 | 0.9525 | 0.9519
OFDEG+GC (CP=55%)  |                 | 63.65        | 0.9400 | 0.9950 | 0.9675
* U: unsupervised; SS: semi-supervised
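The slide does not define "Disc. Ability". The tabulated values are, however, numerically consistent with the arithmetic mean of specificity and sensitivity, which the short check below confirms for every row as transcribed. This is an observation about the numbers, not a definition taken from the authors.

```python
# (specificity, sensitivity, reported discriminative ability), in table order.
rows = [
    (0.9905, 0.6565, 0.8235),
    (0.9100, 0.8300, 0.8700),
    (1.0000, 0.7450, 0.8725),
    (0.8000, 0.9625, 0.8813),
    (0.9925, 0.8925, 0.9425),
    (0.9513, 0.9525, 0.9519),
    (0.9400, 0.9950, 0.9675),
]

# ASSUMPTION under test: Disc. Ability = (Spec. + Sens.) / 2, up to rounding.
consistent = all(abs((spec + sens) / 2 - disc) < 1e-4 for spec, sens, disc in rows)
print(consistent)
```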

Conclusions
- Novel representation of short DNA sequence
- Increase in binning fidelity vs TF
- Need to break away from single-genome assemblers
- Development of composition-based assignment in the right direction
  * More beneficial than developing intricate ML algorithms
- Potentially captures phylogenetic signal
- Still in its early stages:
  * Theoretical framework (?)
  * True biological meaning (?)

Thank you. Questions?

Results: at least 8,000bp in length

Results: contigs composed of at least 10 reads
