Metagenomics Binning and Machine Learning

Slides:



Advertisements
Similar presentations
Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1.
Advertisements

RNA-Seq based discovery and reconstruction of unannotated transcripts
Data Mining Classification: Alternative Techniques
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Metabarcoding 16S RNA targeted sequencing
Next Generation Sequencing, Assembly, and Alignment Methods
Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey What is Metagenomics?  Traditional microbial genomics 
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
MACHINE LEARNING 9. Nonparametric Methods. Introduction Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
1 Application of Metamorphic Testing to Supervised Classifiers Xiaoyuan Xie, Tsong Yueh Chen Swinburne University of Technology Christian Murphy, Gail.
Ensemble Learning: An Introduction
Annotating Metagenomes Using the NMPDR Rob Edwards Department of Computer Sciences, San Diego State University Mathematics and Computer Sciences Division,
Pattern Recognition. Introduction. Definitions.. Recognition process. Recognition process relates input signal to the stored concepts about the object.
A fuzzy video content representation for video summarization and content-based retrieval Anastasios D. Doulamis, Nikolaos D. Doulamis, Stefanos D. Kollias.
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
Zachary Bendiks. Jonathan Eisen  UC Davis Genome Center  Lab focus: “Our work focuses on genomic basis for the origin of novelty in microorganisms (how.
Automatic methods for functional annotation of sequences Petri Törönen.
Metagenomic Analysis Using MEGAN4
From Metagenomic Sample to Useful Visual Anna Shcherbina 01/10/ Anna Shcherbina Bioinformatics Challenge Day 02/02/2013 From Metagenomic Sample to.
H = -Σp i log 2 p i. SCOPI Each one of the many microbial communities has its own structure and ecosystem, depending on the body environment it exists.
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
The iPlant Collaborative
Chapter 21 Eukaryotic Genome Sequences
Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.
Introduction to Bioinformatics Dr. Rybarczyk, PhD University of North Carolina-Chapel Hill
CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji *, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
A Multiresolution Symbolic Representation of Time Series Vasileios Megalooikonomou Qiang Wang Guo Li Christos Faloutsos Presented by Rui Li.
Metagenome analysis Natalia Ivanova MGM Workshop February 2, 2012.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
No reference available
Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features 王荣 14S
__________________________________________________________________________________________________ Fall 2015GCBA 815 __________________________________________________________________________________________________.
A Robust and Accurate Binning Algorithm for Metagenomic Sequences with Arbitrary Species Abundance Ratio Zainab Haydari Dr. Zelikovsky Summer 2011.
Shruthi Prabhakara, Raj Acharya Department of Computer Science and Engineering, Pennsylvania State University We propose a two-pass semi-supervised fuzzy.
DNA Sequences Analysis Hasan Alshahrani CS6800 Statistical Background : HMMs. What is DNA Sequence. How to get DNA Sequence. DNA Sequence formats. Analysis.
Canadian Bioinformatics Workshops
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models Arthur Brady and Steven L. Salzberg Nature Methods 6(9):
MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res
Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness Patric D. Schloss and Jo Handelsman Department.
Date of download: 6/23/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A)
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
Computational Characterization of Short Environmental DNA Fragments Jens Stoye 1, Lutz Krause 1, Robert A. Edwards 2, Forest Rohwer 2, Naryttza N. Diaz.
Date of download: 7/7/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A) DNA.
De Novo Assembly of Mitochondrial Genomes from Low Coverage Whole-Genome Sequencing Reads Fahad Alqahtani and Ion Mandoiu University of Connecticut Computer.
bacteria and eukaryotes
Canadian Bioinformatics Workshops
Metagenomic Species Diversity.
Introduction to Bioinformatics Resources for DNA Barcoding
Preprocessing Data Rob Schmieder.
Quality Control & Preprocessing of Metagenomic Data
Metagenomic assembly Cedric Notredame
Research in Computational Molecular Biology , Vol (2008)
Unraveling the microbial profile of the rhizosphere of SDS-suppressive soils in Soybean fields Ali Y. Srour1, Jason Bond1, Leonor Leandro2, Dean Malvick3.
Metagenomics Image: Iverson et al. 2012, Science.
H = -Σpi log2 pi.
Basic Local Alignment Search Tool (BLAST)
Taxonomic identification and phylogenetic profiling
Sequence Analysis - RNA-Seq 2
Schematic representation of a transcriptomic evaluation approach.
Evaluating Classifiers for Disease Gene Discovery
Overview of Shotgun Sequence Analysis
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

Metagenomics Binning and Machine Learning By: Abdulrhman Aljouie

Introduction Image source: http://www.mgrc.com.my/ Metagenomics defined by National Human Genome Research Institute as “the study of a collection of genetic material (genomes) from a mixed community of organisms”.  Usually refers to the study of microbial communities. Facts: Single gram of soil contains between 5,000 and 40,000 species (conservatively defined) of microbes. This community is one of the most difficult to study because the vast majority of members — probably more than 99.9 percent — cannot be cultured by existing methods (Handelsman LAB, retrieved 4/2015) Bacteria responsible for 25% deaths. (Handelsman, 2005). Antibacterial Discovery Void: no major classes of antibiotics were introduced between 1962 and 2000 and refers to the interim as an innovation gap (Lynn L., 2011).

Steps of Seq-based metagenomics project Fig source: Thomas et al. Microbial Informatics and Experimentation 2012, 2:3 Steps of Seq-based metagenomics project DNA extraction:  process of purification of DNA from sample using a combination of physical and chemical methods. (e.g. removing membrane lipids, protein, RNA). DNA Sequencing: determining the order of nucleotides within DNA molecule (i.e. A G C T). DNA sequencers example (Illumina HiSeq 2000).

Cont. Steps of Seq-based metagenomics project Fig source: Thomas et al. Microbial Informatics and Experimentation 2012, 2:3 Cont. Steps of Seq-based metagenomics project DNA Assembly: merging DNA fragments to reconstruct original sequence. There are two methods: De novo: short-reads transcriptome assembly e.g. Trans- ABySS. Mapping: assembly against reference genome e.g. BWA. Annotation: identify genes location e.g. RAST.

Binning Metagenomics binning is a classification of the sequence data according to the contributing species. Algorithms developed utilizing two types of information within the DNA sequence: similarity-based binning. Compositional binning algorithms.

Similarity-based binning The unknown DNA fragment might encode for a gene and the similarity of this gene with known genes in a reference database can be used to classify and hence bin the sequence (Metagenomics - a guide from sampling to data analysis, 2012).

Example of Similarity-based binning MEGAN: a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results (Daniel et al. , 2007). 

MEGAN loads the complete NCBI taxonomy, currently containing >280,000 taxa, which can then be interactively explored using customized tree- navigation features. However, the main application of MEGAN is to process the results of a comparison of reads against a database of known sequences. The program parses files generated by BLASTX, BLASTN, or BLASTZ, and saves the results as a series of read–taxon matches in a program-specific metafile. (Additional parsers may be added to process the results generated by other sequence comparison methods.) The program assigns reads to taxa using the LCA (Lowest common ancestor) algorithm and then displays the induced taxonomy. Nodes in the taxonomy can be collapsed or expanded to produce summaries at different levels of the taxonomy.

Compositional binning algorithms compositional binning makes use of the fact that genomes have conserved nucleotide composition (e.g. a certain GC or the particular abundance distribution of k-mers) and this will be also reflected in sequence fragments of the genomes (Metagenomics - a guide from sampling to data analysis, 2012).

K-mer

GC content GC content expressed as a percentage value:

Example of compositional binning algorithms TACOA: A multi-class taxonomic classifier was developed for environmental genomic fragments. It applies the intuitive idea of the k-nearest neighbor (k-NN) approach a genomic fragment is defined as a DNA sequence of a given length. The total number of oligonucleotides of length l, from the alphabet ∑ = {a, t, c, g} is given by 4l. Each genomic fragment is represented as a vector (i.e. GFV). For each of the possible four oligonucleotides in a sequence, the vector stores the ratio between the observed frequency of that oligonucleotide to the expected frequency given the GC-content of that genomic fragment.

Cont. Example of compositional binning algorithms It use a smoother kernel function with Gaussian density to profit from its implicit weighting scheme, thus allowing more flexibility on setting the neighborhood width and in handling high-dimensional input data. The weights given by a smoother kernel function decrease as the Euclidean distance between the classified GFV and the reference vector increases. The rate at which the weights decreases is controlled by the neighborhood width λ

Genome is selected from the data set comprising 373 genomes and fragmented subsequently. The collection of genomic fragments is regarded as the test set from which each fragment is drawn and subsequently classified. Classification of each test fragment is carried out using the remaining 372 organisms as a reference.

Accuracy

Hybrid, a combination of both methods PhymmBL : Phymm, a classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a significant leap forward over previous composition-based classification methods. PhymmBL (rhymes with "thimble"), the hybrid classifier included in this distribution which combines analysis from both Phymm and BLAST, produces even higher accuracy.

Metagenomics Binning Technique Via Using Random Forest It use Oligonucleotide frequencies from 2-mers to 4-mers as features to differentiate between the sequences. By using forward sequential feature selection. Then, based on the reduced feature set compared the classification accuracy between Random Forest classifier and Naïve Bayes classifier. Also, analyze and compare the forward sequential feature selection to the popular t test feature selection method, and analyze the effect of these two feature reduction and selection methods on the classification accuracy of the DNA sequences. It found, as sequence feature representation, that using the existence or non-existence of K-mers rather than using the actual frequency of K-mers in a particular reads, result in better and more accurate classification (Helal S. & Dalila M., 2013)

Pipeline for Binning Technique Via Using Random Forest

Results Using Forward Sequential Feature Seletion using Naïve Bayes And Random Forest