Published by Harvey Milo Shepherd. Modified over 9 years ago.
1
CompostBin: A DNA composition based metagenomic binning algorithm. Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen, UC Davis. schatterji@ucdavis.edu
2
Overview of Talk: Metagenomics and the binning problem. CompostBin.
3
The Microbial World
4
Exploring the Microbial World. Culturing: majority of microbes currently unculturable; no ecological context. Molecular surveys (e.g. 16S rRNA): “who is out there?” “what are they doing?”
5
Metagenomics
6
Interpreting Metagenomic Data. Nature of metagenomic data: mosaic; intraspecies polymorphism; fragmentary. New sequencing technologies: enormous amount of data; short reads.
7
Metagenomic Binning Classification of sequences by taxa
8
Why Bin at all?
9
Binning in Action Glassy Winged Sharpshooter (Homalodisca coagulata). Feeds on plant xylem (poor in organic nutrients). Microbial Endosymbionts
11
Current Binning Methods: Assembly. Align with reference genome. Database search [MEGAN, BLAST]. Phylogenetic analysis. DNA composition [TETRA, PhyloPythia].
12
Current Binning Methods: Need closely related reference genomes. Poor performance on short fragments. Sanger sequence reads are 500-1000 bp long. Current assembly methods unreliable. Complex communities hard to bin.
13
Overview of Talk: Metagenomics and the binning problem. CompostBin.
14
Genome Signatures. Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? Yes [Karlin et al. 1990s]. What is the minimum length of sequence required to distinguish the genomic sequence of one organism from that of another organism?
15
Imperfect World. Horizontal Gene Transfer: recent estimates [Ge et al. 2005] vary between 0-6% of genes, typically ~2%. But… Amelioration.
16
DNA-composition metrics: the K-mer frequency metric. CompostBin uses hexamers.
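As a rough illustration of the K-mer frequency metric, the sketch below maps a read to a vector of hexamer frequencies. The function name `kmer_frequencies`, the lexicographic ordering of K-mers, and the choice to skip windows containing ambiguous bases are assumptions made for this example, not details taken from the slides. It also treats a K-mer and its reverse complement as separate dimensions, which an actual implementation may choose to merge.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other character (e.g. N) are skipped.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to its position in the feature vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    # Slide a window of width k along the read and count each k-mer.
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

With the default `k=6` each read becomes a point in a 4096-dimensional space, and a 1000 bp read contributes at most 995 hexamer windows, so most entries are zero.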
17
Working with K-mers for Binning. Curse of dimensionality: O(4^K) independent dimensions. Statistical noise increases with decreasing fragment lengths. Project data into a lower dimensional space to decrease noise: Principal Component Analysis.
18
PCA separates species: Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]
19
Effect of Skewed Relative Abundance. B. anthracis and L. monocytogenes. Panels: Abundance 1:1 | Abundance 20:1
20
A Weighting Scheme For each read, find overlap with other sequences
21
A Weighting Scheme. Calculate the redundancy of each position. Weight is inverse of average redundancy.
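The weighting scheme can be sketched as follows, assuming the overlaps between reads have already been found by some alignment step that is not shown. The function name `read_weight`, the interval representation of overlaps, and the convention that a position's redundancy counts the read itself plus every read overlapping that position are assumptions for illustration only.

```python
def read_weight(read_length, overlaps):
    """Weight of one read = 1 / (average per-position redundancy).

    overlaps: list of (start, end) half-open intervals in read coordinates,
    each marking the part of this read covered by one other read.
    Assumption: redundancy counts the read itself, so a read with no
    overlaps has redundancy 1 everywhere and weight 1.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    redundancy = [1] * read_length
    for start, end in overlaps:
        # Clip each overlap to the read before counting it.
        for pos in range(max(0, start), min(read_length, end)):
            redundancy[pos] += 1
    average_redundancy = sum(redundancy) / read_length
    return 1.0 / average_redundancy
```

Under these assumptions, reads from an abundant species overlap many other reads and receive small weights, while reads from a rare species keep weights close to 1.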
22
Weighted PCA. Calculate the weighted mean $\mu_w = \frac{1}{N}\sum_{i=1}^{N} w_i X_i$. Calculate the weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. PCs are eigenvectors of $M_w$. Use first three PCs for further analysis.
23
Weighted PCA. Calculate the weighted mean $\mu_w$. Calculate the weighted covariance matrix $M_w$. PCs are eigenvectors of $M_w$. Use first three PCs for further analysis.
24
Weighted PCA. Calculate the weighted mean $\mu_w = \frac{1}{N}\sum_{i=1}^{N} w_i X_i$. Calculate the weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. Principal Components are eigenvectors of $M_w$. Use first three PCs for further analysis.
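A minimal NumPy sketch of the weighted PCA step is given below. It computes a weighted mean, a weighted covariance matrix, and projects each read onto the eigenvectors with the largest eigenvalues. The function name `weighted_pca` and the rescaling of the weights so that they sum to N (which makes the division by N a true weighted average) are assumptions of this example.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the weighted covariance.

    X: (N, d) array, one feature vector (e.g. hexamer frequencies) per read.
    w: (N,) array of positive read weights.
    Returns (projected data, principal components, weighted mean).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Assumption: rescale the weights so that they sum to N.
    w = w * n / w.sum()
    # Weighted mean: (1/N) * sum_i w_i X_i
    mu_w = (w[:, None] * X).sum(axis=0) / n
    centered = X - mu_w
    # Weighted covariance: sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w
```

With all weights equal, this reduces to ordinary PCA on mean-centered data.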
25
Weighted PCA separates species. B. anthracis and L. monocytogenes, abundance 20:1. Panels: PCA | Weighted PCA
26
Un-supervised Classification ?
27
Semi-Supervised Classification. 31 marker genes [courtesy Martin Wu]: omni-present; relatively immune to lateral gene transfer. Reads containing these marker genes can be classified with high reliability.
28
Semi-supervised Classification Use a semi-supervised version of the normalized cut algorithm
29
The Semi-supervised Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph from the point set.
2. Update graph with marker information.
   - If two nodes are from the same species, add an edge between them.
   - If two nodes are from different species, remove any edge between them.
3. Bisect the graph using the normalized-cut algorithm.
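The three steps above can be sketched in NumPy as shown below. Several choices here are assumptions and not taken from the slides: unit edge weights, a symmetrized K-nearest-neighbor graph, the standard spectral relaxation of the normalized cut (second-smallest eigenvector of the normalized graph Laplacian), and a split at zero. The function name `semi_supervised_ncut` and the `labels` convention (`None` for reads without a marker gene) are hypothetical.

```python
import numpy as np


def semi_supervised_ncut(points, labels, k=5):
    """Bisect points with a marker-informed spectral normalized cut.

    points: (N, d) array, e.g. the first three principal components.
    labels: length-N sequence; a species id for reads carrying a marker
            gene, None for all other reads.
    Returns a boolean array giving the side of the cut for each point.
    Requires k < N and works best when the graph stays connected.
    """
    P = np.asarray(points, dtype=float)
    n = P.shape[0]
    # 1. Symmetric K-nearest-neighbor graph with unit edge weights.
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)
    # 2. Update the graph with marker information.
    marked = [i for i, lab in enumerate(labels) if lab is not None]
    for a in marked:
        for b in marked:
            if a != b:
                # Same species: add an edge. Different species: remove it.
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    # 3. Spectral relaxation of the normalized cut.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler > 0
```

The pairwise distance matrix makes this sketch quadratic in the number of reads, so it is only suitable for small examples.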
30
Generalization to multiple bins. Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]. Apply algorithm recursively.
31
Generalization to multiple bins. Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
32
Testing. Simulate metagenomic sequencing (Sanger reads). Variables: number of species; relative abundance; GC content; phylogenetic diversity. Test on a “real” dataset where answer is well-established.
33
Results
35
Conclusions/Future Directions. Satisfactory performance: no training on existing genomes; Sanger reads; low number of species. Future work: Holy Grail: complex communities; semi-supervised projection?; hybrid assembly/binning.
36
Acknowledgements. UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann. UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan. Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff.