CompostBin: A DNA composition based metagenomic binning algorithm
Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen, UC Davis

Presentation transcript:

CompostBin: A DNA composition based metagenomic binning algorithm
Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen, UC Davis

Overview of Talk
- Metagenomics and the binning problem
- CompostBin

The Microbial World

Exploring the Microbial World
- Culturing
  - Majority of microbes currently unculturable.
  - No ecological context.
- Molecular Surveys (e.g. 16S rRNA)
  - "who is out there?"
  - "what are they doing?"

Metagenomics

Interpreting Metagenomic Data
- Nature of Metagenomic Data
  - Mosaic
  - Intraspecies polymorphism
  - Fragmentary
- New Sequencing Technologies
  - Enormous amount of data
  - Short reads

Metagenomic Binning
Classification of sequences by taxa

Why Bin at all?

Binning in Action
- Glassy Winged Sharpshooter (Homalodisca coagulata).
- Feeds on plant xylem (poor in organic nutrients).
- Microbial endosymbionts

Current Binning Methods
- Assembly
- Align with reference genome
- Database search [MEGAN, BLAST]
- Phylogenetic analysis
- DNA composition [TETRA, PhyloPythia]

Current Binning Methods
- Need closely related reference genomes.
- Poor performance on short fragments.
- Sanger sequence reads bp long.
- Current assembly methods unreliable.
- Complex communities hard to bin.

Overview of Talk
- Metagenomics and the binning problem
- CompostBin

Genome Signatures
- Does genomic sequence from an organism have a unique "signature" that distinguishes it from genomic sequence of other organisms?
  - Yes [Karlin et al. 1990s]
- What is the minimum length of sequence required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Imperfect World
- Horizontal Gene Transfer
  - Recent estimates [Ge et al. 2005]
    - Varies between 0-6% of genes.
    - Typically ~2%.
- But…
  - Amelioration

DNA-composition metrics
- The K-mer frequency metric
- CompostBin uses hexamers
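As an illustration of the K-mer frequency metric, the following is a minimal Python sketch that turns one read into a hexamer frequency vector. The function name `kmer_frequencies`, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are assumptions made for this example; the slides do not specify them, and strand or reverse-complement handling is ignored here.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq: str, k: int = 6) -> np.ndarray:
    """Return the relative k-mer frequency vector (length 4**k) of a DNA sequence."""
    # Map every k-mer over {A, C, G, T} to a fixed position in the vector.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(4 ** k)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # assumption: skip windows with N or other ambiguity codes
            counts[index[kmer]] += 1
    total = counts.sum()
    # Normalize so that reads of different lengths are comparable.
    return counts / total if total > 0 else counts
```

With k = 6 each read becomes a point in a 4096-dimensional space, which is the setting the next slide refers to when it mentions the curse of dimensionality.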

DNA-composition metrics
- Working with K-mers for binning.
  - Curse of dimensionality: O(4^K) independent dimensions.
  - Statistical noise increases with decreasing fragment lengths.
- Project data into a lower dimensional space to decrease noise.
  - Principal Component Analysis.

PCA separates species
Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]
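The projection step can be sketched in a few lines of Python. This is ordinary (unweighted) PCA computed through a singular value decomposition; the function name `pca_project` is hypothetical, and the default of three components is borrowed from the later Weighted PCA slide rather than stated here.

```python
import numpy as np


def pca_project(X: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project the rows of X (one row per read) onto the top principal components."""
    # Center each feature (k-mer frequency) on its mean.
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Each row of the result gives the coordinates of one read in the reduced space, which is what a scatter plot of two species would display.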

Effect of Skewed Relative Abundance
B. anthracis and L. monocytogenes
[Figure panels: Abundance 1:1; Abundance 20:1]

A Weighting Scheme
- For each read, find overlap with other sequences.
- Calculate the redundancy of each position.
- Weight is inverse of average redundancy.
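The weighting rule can be sketched as follows. The overlap detection itself is not shown: the `overlaps` argument is a hypothetical input listing the stretches of a read that other reads cover. The sketch also assumes that redundancy counts the read itself, so a read with no overlaps gets weight 1; the slides do not say how this is defined.

```python
from typing import Sequence, Tuple

import numpy as np


def read_weight(read_len: int, overlaps: Sequence[Tuple[int, int]]) -> float:
    """Weight of one read: the inverse of its average per-position redundancy.

    overlaps holds half-open (start, end) intervals, in this read's own
    coordinates, that are covered by other reads.
    """
    # Assumption: every position is covered at least once, by the read itself.
    redundancy = np.ones(read_len)
    for start, end in overlaps:
        redundancy[max(start, 0):min(end, read_len)] += 1
    return 1.0 / redundancy.mean()
```

Under these assumptions a read fully covered by one other read gets weight 0.5, so reads from an abundant, deeply sampled species count for less than reads from a rare one.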

Weighted PCA
- Calculate the weighted mean $\mu_w$:
  $\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$
- Calculate the weighted covariance matrix $M_w$:
  $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$
- The principal components (PCs) are the eigenvectors of $M_w$.
- Use the first three PCs for further analysis.
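A minimal sketch of the weighted PCA steps on this slide is given below. The function name `weighted_pca` is hypothetical. One assumption is made explicit in the code: the weights are rescaled to sum to N, so that dividing the weighted sum by N yields a proper weighted mean; the slides do not state how the weights are normalized.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Return (projected data, principal components) for weighted PCA."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Assumption: rescale the weights so that they sum to N.
    w = w * n / w.sum()
    # Weighted mean: mu_w = (1/N) * sum_i w_i X_i
    mu_w = (w[:, None] * X).sum(axis=0) / n
    centered = X - mu_w
    # Weighted covariance: M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; keep the eigenvectors with largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components
```

With all weights equal this reduces to ordinary PCA, and giving a point weight 2 has the same effect on the principal directions as including it twice.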

Weighted PCA separates species
B. anthracis and L. monocytogenes, abundance 20:1
[Figure panels: PCA; Weighted PCA]

Unsupervised Classification?

Semi-Supervised Classification
- 31 marker genes [courtesy Martin Wu]
  - Omnipresent
  - Relatively immune to lateral gene transfer
- Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm.

The Semi-supervised Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph from the point set.
2. Update the graph with marker information.
   - If two nodes are from the same species, add an edge between them.
   - If two nodes are from different species, remove any edge between them.
3. Bisect the graph using the normalized-cut algorithm.
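The three steps above can be sketched with NumPy alone. This is an illustration under several assumptions that the slides do not confirm: the function name `semi_supervised_ncut` is hypothetical, graph edges are unweighted, the bisection uses the usual spectral relaxation of normalized cut (the second-smallest eigenvector of the normalized Laplacian), and nodes are split by the sign of that eigenvector. The sketch also assumes the graph stays connected after the marker update.

```python
import numpy as np


def semi_supervised_ncut(points, labels, k=5):
    """Bisect a point set; labels[i] is a species id for marker reads, else None."""
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Step 1: symmetric k-nearest-neighbor graph with unweighted edges.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dists, axis=1)[:, :k]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, nearest[i]] = 1.0
    W = np.maximum(W, W.T)

    # Step 2: update the graph with marker information.
    marked = [i for i in range(n) if labels[i] is not None]
    for a in marked:
        for b in marked:
            if a != b:
                # Same species: add an edge. Different species: remove any edge.
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0

    # Step 3: bisect using the spectral relaxation of normalized cut.
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    laplacian = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler > 0  # boolean bin membership; which side is True is arbitrary
```

Splitting at zero is only one common choice of threshold; searching for the split point that minimizes the normalized cut value is another option and may behave better on unbalanced data.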

Generalization to multiple bins
- Apply the algorithm recursively.
- Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Testing
- Simulate metagenomic sequencing
  - Sanger reads
  - Variables
    - Number of species
    - Relative abundance
    - GC content
    - Phylogenetic diversity
- Test on a "real" dataset where the answer is well-established.
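A simulated dataset of the kind listed above could be generated along the following lines. This is a simplified sketch and not the simulator used for the talk: the function name `simulate_reads`, the fixed read length, the error-free reads, and the rule that a species is sampled in proportion to abundance times genome length are all assumptions made for illustration.

```python
import random


def simulate_reads(genomes, abundances, n_reads, read_len, seed=0):
    """Sample fixed-length, error-free reads; returns a list of (species, read)."""
    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: the chance of drawing from a species scales with
    # its relative abundance times its genome length.
    weights = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((name, genome[start:start + read_len]))
    return reads
```

Keeping the true species label with each read makes it possible to score a binning result afterwards, and varying `abundances` (for example 1:1 versus 20:1) reproduces the skewed-abundance setting discussed earlier in the talk.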

Results

Conclusions/Future Directions
- Satisfactory performance
  - No training on existing genomes
  - Sanger reads
  - Low number of species
- Future work
  - Holy Grail: complex communities
  - Semi-supervised projection?
  - Hybrid assembly/binning

Acknowledgements
- UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann
- UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan
- Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff