Published by Dustin Green. Modified over 6 years ago.
1
Comparative metagenomics: quantifying similarities between environments
Bas E. Dutilh
2
Taxonomic or functional profiles
Kip et al. Env. Microbiol. Rep. 2011
Boleij et al. Mol. Cell. Proteomics 2012
Trindade-Silva et al. PLoS ONE 2012
3
Clustering profiles
Calculate pairwise distances
Create cladogram
4
Clustering profiles
Calculate pairwise distances
Create cladogram (BioNJ)
Gascuel Mol. Biol. Evol. 1997
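The two steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the BioNJ algorithm itself: it builds a pairwise distance matrix and then picks the first pair to join using the classic neighbor-joining criterion, of which BioNJ is a refinement. The Manhattan distance is only a placeholder metric, and all function names are hypothetical.

```python
from itertools import combinations


def manhattan(p, q):
    """Placeholder metric: sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def distance_matrix(profiles, dist=manhattan):
    """Symmetric matrix of pairwise distances between profiles."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dist(profiles[i], profiles[j])
    return d


def first_nj_join(d):
    """Return the pair (i, j) minimising the neighbor-joining Q criterion.

    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]
    This is plain neighbor joining; BioNJ additionally weights the
    distance updates by their variance, which is not shown here.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i, j in combinations(range(n), 2):
        q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
        if q < best_q:
            best_pair, best_q = (i, j), q
    return best_pair
```

Repeating the join step on a reduced matrix until two nodes remain would give the full tree; in practice an existing BioNJ implementation would be used for that.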
5
Clustering profiles
Calculate pairwise distances
  Manhattan distance
  Correlation between profiles
    High correlation ↔ similar environment
    Low correlation ↔ dissimilar environment
  Angle between vectors in n-dimensional space
    Small angle ↔ similar environment
    Large angle ↔ dissimilar environment
  Wootters distance between profiles
Create cladogram (BioNJ)
[Figures: frequency per taxon / function; profiles as vectors with axes freq taxon 1, freq taxon 2, freq taxon 3]
Wootters Phys. Rev. D 1981
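The four measures listed on this slide can be written down compactly. The sketch below assumes each profile is a list of non-negative frequencies over the same taxa or functions. For the Wootters distance it uses the statistical-distance form arccos(Σ √(p_i q_i)) on profiles that sum to 1; that this is exactly the variant used in the cited work is an assumption. Function names are illustrative.

```python
import math


def normalize(counts):
    """Turn raw counts into frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def manhattan(p, q):
    """Sum of absolute differences between the two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def correlation_distance(p, q):
    """1 minus the Pearson correlation (undefined for constant profiles)."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return 1.0 - cov / (sp * sq)


def angle(p, q):
    """Angle in radians between the profiles seen as vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def wootters(p, q):
    """Assumed form: arccos of the summed sqrt(p_i * q_i), for normalized p, q."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, overlap)))
```

For example, `wootters(normalize([8, 1, 1]), normalize([1, 1, 8]))` compares two three-taxon profiles; identical profiles give a distance near 0 and profiles with no shared taxa give π/2.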
6
Metagenomes of water and water animals
BlastN reads against GenBank, E-value ≤ 10⁻⁵
Taxonomic profiles including parent clades
Wootters distance formula
BioNJ cladogram
Trindade-Silva et al. PLoS ONE 2012
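One way to read "taxonomic profiles including parent clades" is that every read with an accepted hit counts once for the hit taxon and once for each of its ancestors. The sketch below illustrates that reading. The lineage table, the `(taxon, evalue)` input format, and the normalization by the number of used reads are all assumptions made for the example, not details taken from the cited paper.

```python
from collections import Counter

# Hypothetical lineage table: taxon -> clades from root down to the taxon.
LINEAGES = {
    "Escherichia coli": ["Bacteria", "Proteobacteria", "Escherichia", "Escherichia coli"],
    "Vibrio cholerae": ["Bacteria", "Proteobacteria", "Vibrio", "Vibrio cholerae"],
    "Bacillus subtilis": ["Bacteria", "Firmicutes", "Bacillus", "Bacillus subtilis"],
}


def taxonomic_profile(best_hits, lineages, max_evalue=1e-5):
    """Fraction of used reads assigned to each clade, parents included.

    best_hits: iterable of (taxon, evalue), one best hit per read.
    Reads above the E-value cutoff or without a known lineage are skipped.
    """
    counts = Counter()
    used = 0
    for taxon, evalue in best_hits:
        if evalue > max_evalue or taxon not in lineages:
            continue
        used += 1
        counts.update(lineages[taxon])  # the taxon plus all parent clades
    if used == 0:
        return {}
    return {clade: n / used for clade, n in counts.items()}
```

With this counting scheme, two samples that share no species can still look similar at the phylum level, which is presumably why parent clades are included.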
7
Viral metagenomic samples
BlastN reads against GenBank, E-value ≤ 10⁻³
Taxonomic profiles including parent clades
Distance = 1 minus correlation
BioNJ cladogram
[Figure: % reads used (BlastN mapping to GenBank) for human and water samples]
Dutilh et al. Bioinformatics 2012
8
Many unknowns in viral metagenomes
Mokili et al. Curr. Opin. Virology 2012
9
K-mer (k=2) clustering
Willner et al. Env. Microbiol. 2009
10
A sequence of 3 million nucleotides
2-mer profiles
  4 * 4 = 16 dimensions (2-mers)
  ~3 million steps in each dimension
  16 * … = … possibilities
4-mer profiles
  4 * 4 * 4 * 4 = 256 dimensions (4-mers)
  256 * … = … possibilities
Sequencing reads or contigs (~200 nt)
  4^200 ≈ 2.6 * 10^120 possibilities
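The dimension counts on this slide follow from 4^k possible k-mers over the alphabet A, C, G, T. A minimal sketch of a k-mer frequency profile is given below; the function name is illustrative, and k-mers containing other characters (such as N) are simply ignored, which is a choice made for this example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Frequencies of all 4**k k-mers in seq, in a fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made purely of A, C, G, T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]
```

`len(kmer_profile(seq, 2))` is 16 and `len(kmer_profile(seq, 4))` is 256, matching the slide, whereas the number of distinct 200-nt sequences, `4 ** 200`, is a 121-digit integer.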
11
Cross-assembly: interpret viromes in terms of one another
Combine sequencing reads from different metagenomes in a single assembly
Use your favorite assembly tool
Cross-contigs contain reads from more than 1 sample
Cross-contigs directly represent the overlap between samples
We interpret contigs as “metagenomic entities”
Ready for e.g. BLAST searches
Dutilh et al. Bioinformatics 2012
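The definition of a cross-contig (a contig containing reads from more than one sample) can be illustrated with a few lines of code. The input format here, a mapping from contig name to a list of `(sample, read_id)` pairs, is a hypothetical stand-in for whatever the assembler reports; parsing real assembler output is not shown.

```python
def find_cross_contigs(contig_reads):
    """Return {contig: set of samples} for contigs with reads from >1 sample.

    contig_reads: dict mapping contig name -> list of (sample, read_id).
    """
    cross = {}
    for contig, reads in contig_reads.items():
        samples = {sample for sample, _ in reads}
        if len(samples) > 1:
            cross[contig] = samples
    return cross


# Hypothetical assembly result for two samples.
example = {
    "ctg1": [("human", "r1"), ("human", "r2")],
    "ctg2": [("human", "r3"), ("water", "r7")],
    "ctg3": [("water", "r8")],
}
```

Here `find_cross_contigs(example)` returns only `ctg2`, the one contig that joins reads from both samples.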
12
Distance formulas
Similarity based on number of cross-contigs
[Plot: similarity (# cross-contigs) against sample size (# contigs)]
13
Distance formulas
Large metagenomes may share more cross-contigs with unrelated large samples than with closely related small samples
Size correction necessary
[Plot: similarity (# cross-contigs) against sample size (# contigs); marked: minimum of the two sample sizes (# contigs)]
14
Distance formulas
Minimum metagenome size
Weighted average metagenome size (SHOT)
[Plot: similarity (# cross-contigs) against sample size (# contigs)]
Korbel et al. Trends Genet. 2002
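A rough sketch of the two size corrections named on these slides follows. Dividing by the minimum sample size is taken directly from the slide. The weighted average is written as √2·a·b / √(a² + b²), which is how the SHOT normalization is commonly stated; treat that exact form as an assumption to be checked against Korbel et al. Function names are illustrative.

```python
import math


def similarity_min(shared, size_a, size_b):
    """Cross-contigs divided by the smaller of the two sample sizes."""
    return shared / min(size_a, size_b)


def weighted_average_size(size_a, size_b):
    """Assumed SHOT-style weighted average of two sample sizes."""
    return math.sqrt(2) * size_a * size_b / math.sqrt(size_a ** 2 + size_b ** 2)


def similarity_shot(shared, size_a, size_b):
    """Cross-contigs divided by the weighted average sample size."""
    return shared / weighted_average_size(size_a, size_b)
```

As an illustration with made-up numbers: two unrelated large samples (5000 and 6000 contigs) sharing 300 cross-contigs score 0.06 under the minimum correction, while a small sample (100 contigs) sharing 80 cross-contigs with a large one scores 0.8, even though the raw count of 300 is higher than 80.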
15
Distance formulas
Contig content gives qualitative distances: what metagenomic entities are there?
Quantitative distances can be calculated by taking the number of incorporated reads into account
[Figure: samples as vectors with axes reads in ctg1, reads in ctg2, reads in ctg3]
Dutilh et al. Bioinformatics 2012
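To make the qualitative versus quantitative distinction concrete, the sketch below turns per-contig read counts into a frequency vector per sample and compares two samples with a Wootters-style distance, arccos(Σ √(p_i q_i)). Using that particular distance here is an assumption based on the vector picture on the slide; the data layout and names are hypothetical.

```python
import math


def read_fractions(contig_counts, sample):
    """Fraction of a sample's assembled reads that fall in each contig.

    contig_counts: dict contig -> dict sample -> number of reads.
    Raises ZeroDivisionError if the sample has no reads in any contig.
    """
    counts = [per_sample.get(sample, 0) for per_sample in contig_counts.values()]
    total = sum(counts)
    return [c / total for c in counts]


def wootters(p, q):
    """Assumed form: arccos of the summed sqrt(p_i * q_i), for normalized p, q."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, overlap)))


# Hypothetical counts: both samples are present in every contig, so they
# are qualitatively identical, but their read distributions differ.
example_counts = {
    "ctg1": {"A": 90, "B": 10},
    "ctg2": {"A": 10, "B": 90},
    "ctg3": {"A": 50, "B": 50},
}
```

In this example a presence/absence comparison cannot separate samples A and B, whereas `wootters(read_fractions(example_counts, "A"), read_fractions(example_counts, "B"))` is clearly greater than zero.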
16
Simulated metagenomes: 30 species each, with decreasing overlap
Firmicutes, Proteobacteria
Dutilh et al. Bioinformatics 2012
17
Six simulated metagenomes of ~25 Firmicutes each
Three simulated metagenomes of ~25 Actinobacteria each
Increasing noise (0-100% Proteobacteria)
[Figure: panels at 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% noise; axis: distance]
Dutilh et al. Bioinformatics 2012
18
Six simulated metagenomes of ~25 Firmicutes each
Three simulated metagenomes of ~25 Actinobacteria each
Increasing noise (0-100% Proteobacteria)
Dutilh et al. Bioinformatics 2012
19
Cladogram
[Figure labels: water, human; BlastN, crAss]
Dutilh et al. Bioinformatics 2012
20
Similar numbers of utilized reads
[Plot: percentage of reads used per metagenome, human and water samples]
Dutilh et al. Bioinformatics 2012
21
http://edwards.sdsu.edu/crass/
Stand-alone on SourceForge
Dutilh et al. Bioinformatics 2012
22
2 or 3 samples: no cladogram
Dutilh et al. Bioinformatics 2012
23
Experiment
24
Cross-assembly
Advantages:
  Fast programs available (Newbler)
  No full-length homology necessary
  Independent of reference database
  Sequence of shared entities for further analysis
25
Cross-assembly
Dutilh et al. In preparation
26
Acknowledgements
Robert Schmieder, Jim Nulton, Ben Felts, Peter Salamon, Robert A. Edwards, John L. Mokili