Comparative metagenomics quantifying similarities between environments


1 Comparative metagenomics quantifying similarities between environments
Bas E. Dutilh

2 Taxonomic or functional profiles
Kip et al. Env. Microbiol. Rep. 2011
Boleij et al. Mol. Cell. Proteomics 2012
Trindade-Silva et al. PLoS ONE 2012

3 Clustering profiles
Calculate pairwise distances
Create cladogram

4 Clustering profiles
Calculate pairwise distances
Create cladogram (BioNJ)
Gascuel Mol. Biol. Evol. 1997

5 Clustering profiles
Calculate pairwise distances:
Manhattan distance
Correlation between profiles: high correlation ↔ similar environment, low correlation ↔ dissimilar environment
Angle between vectors in n-dimensional space: small angle ↔ similar environment, large angle ↔ dissimilar environment
Wootters distance between profiles
Create cladogram (BioNJ)
[Figure: a profile shown as frequency per taxon or function, and profiles shown as vectors with axes freq taxon 1, freq taxon 2, freq taxon 3, ...]
Wootters Phys. Rev. D 1981
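The four measures listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the slides: the function names are made up here, profiles are assumed to be equal-length lists of frequencies that sum to 1, and the Wootters distance is taken to be the arccosine of the summed square roots of the paired frequencies.

```python
import math

def normalize(counts):
    """Turn raw counts into frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def manhattan(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def correlation_distance(p, q):
    """1 minus the Pearson correlation (undefined for constant profiles)."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return 1.0 - cov / math.sqrt(var_p * var_q)

def angle(p, q):
    """Angle in radians between the two profile vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def wootters(p, q):
    """Assumed form of the Wootters distance: arccos(sum(sqrt(p_i * q_i)))."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, overlap)))

# Two hypothetical three-taxon profiles.
p = normalize([50, 30, 20])
q = normalize([10, 30, 60])
print(manhattan(p, q), correlation_distance(p, q), angle(p, q), wootters(p, q))
```

Any of these values can fill a pairwise distance matrix that is then handed to a tree-building method such as BioNJ.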

6 Metagenomes of water and water animals
BlastN reads against GenBank, E-value ≤ 10^-5
Taxonomic profiles including parent clades
Wootters distance formula
BioNJ cladogram
Trindade-Silva et al. PLoS ONE 2012
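One way to read "taxonomic profiles including parent clades" is that every read counts for its assigned taxon and for each clade above it. The sketch below assumes exactly that, and assumes each read has already been given a lineage (for example from its best BlastN hit). The function name and the input format are hypothetical.

```python
from collections import Counter

def profile_with_parents(lineages):
    """Fraction of reads assigned to each clade, counting parent clades too.

    `lineages` holds one tuple per read, ordered from root to leaf,
    e.g. ("Bacteria", "Firmicutes", "Bacilli").
    """
    if not lineages:
        return {}
    counts = Counter()
    for lineage in lineages:
        # A read adds one count to every prefix of its lineage.
        for depth in range(1, len(lineage) + 1):
            counts[";".join(lineage[:depth])] += 1
    total = len(lineages)
    return {clade: n / total for clade, n in counts.items()}

reads = [
    ("Bacteria", "Firmicutes"),
    ("Bacteria", "Proteobacteria"),
    ("Viruses",),
]
print(profile_with_parents(reads))
```

Because parent clades are counted as well, the fractions in such a profile add up to more than 1.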

7 Viral metagenomic samples
[Figure: % reads used (BlastN mapping to GenBank) per metagenome, human and water samples]
BlastN reads against GenBank, E-value ≤ 10^-3
Taxonomic profiles including parent clades
Distance = 1 minus correlation
BioNJ cladogram
Dutilh et al. Bioinformatics 2012

8 Many unknowns in viral metagenomes
Mokili et al. Curr. Opin. Virology 2012

9 K-mer (k=2) clustering Willner et al. Env. Microbiol. 2009

10 A sequence of 3 million nucleotides
2-mer profiles: 4 * 4 = 16 dimensions (2-mers)
~3 million steps in each dimension
16 * = possibilities
4-mer profiles: 4 * 4 * 4 * 4 = 256 dimensions (4-mers)
256 * = possibilities
Sequencing reads or contigs (~200 nt): 4^200 ≈ 2.6 * 10^120 possibilities
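The dimension counts above (16 for 2-mers, 256 for 4-mers) come from enumerating every word of length k over the four nucleotides. A minimal sketch of a k-mer frequency profile follows. The function name is illustrative, and windows containing characters other than A, C, G, T are simply ignored.

```python
from collections import Counter
from itertools import product

def kmer_profile(sequence, k):
    """Frequency of each of the 4**k possible k-mers in `sequence`."""
    sequence = sequence.upper()
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}

profile = kmer_profile("ACGTACGT", 2)
print(len(profile))   # number of dimensions for k = 2
print(profile["AC"])  # frequency of the 2-mer AC
```

Profiles built this way for several samples can be compared with any of the distance measures from the clustering slides.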

11 Cross-assembly Interpret viromes in terms of one another
Combine sequencing reads from different metagenomes in a single assembly
Use your favorite assembly tool
Cross-contigs contain reads from more than 1 sample
Cross-contigs directly represent the overlap between samples
We interpret contigs as “metagenomic entities”
Ready for e.g. BLAST searches
Dutilh et al. Bioinformatics 2012
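Once an assembler has reported which reads went into which contig, finding the cross-contigs is a matter of checking how many samples each contig draws reads from. The sketch below assumes two plain dictionaries as input: contig to read IDs, and read ID to sample name. Parsing a real assembler output file is left out, and all names are hypothetical.

```python
def cross_contigs(contig_reads, read_sample):
    """Return contigs whose reads come from more than one sample.

    contig_reads: dict mapping contig ID -> list of read IDs
    read_sample:  dict mapping read ID -> sample name
    """
    result = {}
    for contig, reads in contig_reads.items():
        samples = {read_sample[read] for read in reads}
        if len(samples) > 1:
            result[contig] = samples
    return result

contig_reads = {
    "ctg1": ["r1", "r2", "r3"],
    "ctg2": ["r4", "r5"],
}
read_sample = {"r1": "human", "r2": "water", "r3": "human",
               "r4": "water", "r5": "water"}
print(cross_contigs(contig_reads, read_sample))
```

Here only ctg1 is a cross-contig, since ctg2 is built from water reads alone.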

12 Distance formulas
Similarity based on number of cross-contigs
[Figure: similarity (# cross-contigs) against sample size (# contigs)]

13 Distance formulas
Large metagenomes may share more cross-contigs with unrelated large samples than with closely related small samples
Size correction necessary
[Figure: similarity (# cross-contigs) against sample size (# contigs), with the minimum of the two sample sizes (# contigs) marked]

14 Distance formulas
Minimum metagenome size
Weighted average metagenome size (SHOT)
[Figure: similarity (# cross-contigs) against sample size (# contigs)]
Korbel et al. Trends Genet. 2002
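The two size corrections named on this slide can be illustrated as follows. The slides do not give the formulas, so both functions are a sketch under assumptions. The similarity is taken to be the number of cross-contigs divided by a reference size. The weighted average used here, sqrt(2) * a * b / sqrt(a^2 + b^2), is an assumed form of the SHOT correction and should be checked against Korbel et al. before use.

```python
import math

def similarity_min(shared, size_a, size_b):
    """Cross-contigs shared, divided by the smaller sample size."""
    return shared / min(size_a, size_b)

def similarity_weighted(shared, size_a, size_b):
    """Cross-contigs shared, divided by a weighted average sample size.

    The weighted average below is an ASSUMED form of the SHOT
    correction; it equals the common size when both samples are equal.
    """
    weighted = math.sqrt(2) * size_a * size_b / math.sqrt(size_a ** 2 + size_b ** 2)
    return shared / weighted

# Hypothetical numbers: 50 cross-contigs between samples of 100 and 1000 contigs.
print(similarity_min(50, 100, 1000))
print(similarity_weighted(50, 100, 1000))
```

With these definitions a small sample is not penalized for sharing few contigs with a much larger one, which is the point of the correction.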

15 Distance formulas
Contig content gives qualitative distances: what metagenomic entities are there?
Quantitative distances can be calculated by taking the number of incorporated reads into account
[Figure: samples shown as vectors with axes reads in ctg1, reads in ctg2, reads in ctg3]
Dutilh et al. Bioinformatics 2012
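The quantitative view on this slide treats each sample as a vector with one axis per contig, where the coordinate is the number of that sample's reads incorporated in the contig. A minimal sketch, assuming the same kind of input as an assembler's read-to-contig listing (contig to read IDs, read ID to sample). The names are hypothetical, and dividing by the sample's total read count is one possible choice of scaling.

```python
from collections import Counter

def read_count_profiles(contig_reads, read_sample):
    """Per-sample profile: fraction of the sample's assembled reads in each contig."""
    counts = {}  # sample name -> Counter(contig ID -> number of reads)
    for contig, reads in contig_reads.items():
        for read in reads:
            counts.setdefault(read_sample[read], Counter())[contig] += 1
    contigs = sorted(contig_reads)
    profiles = {}
    for sample, per_contig in counts.items():
        total = sum(per_contig.values())
        profiles[sample] = [per_contig[contig] / total for contig in contigs]
    return contigs, profiles

contig_reads = {"ctg1": ["r1", "r2", "r3"], "ctg2": ["r4"]}
read_sample = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
contigs, profiles = read_count_profiles(contig_reads, read_sample)
print(contigs, profiles)
```

The resulting vectors can be compared with a profile distance such as the Wootters distance, in the same way as taxonomic profiles.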

16 Simulated metagenomes 30 species each with decreasing overlap
[Figure panels: Firmicutes, Proteobacteria]
Dutilh et al. Bioinformatics 2012

17 Six simulated metagenomes of ~25 Firmicutes each
Three simulated metagenomes of ~25 Actinobacteria each
Increasing noise (0-100% Proteobacteria)
[Figure: one panel per noise level, 0% to 100% in steps of 10%; axis: distance]
Dutilh et al. Bioinformatics 2012

18 Six simulated metagenomes of ~25 Firmicutes each
Three simulated metagenomes of ~25 Actinobacteria each
Increasing noise (0-100% Proteobacteria)
Dutilh et al. Bioinformatics 2012

19 Cladogram
[Figure: cladograms of water and human samples, BlastN and crAss]
Dutilh et al. Bioinformatics 2012

20 Similar numbers of utilized reads
[Figure: percentage of reads used per metagenome, human and water samples]
Dutilh et al. Bioinformatics 2012

21 http://edwards.sdsu.edu/crass/ Stand-alone on SourceForge
Dutilh et al. Bioinformatics 2012

22 2 or 3 samples: no cladogram
Dutilh et al. Bioinformatics 2012

23 Experiment

24 Cross-assembly
Advantages:
Fast programs available (Newbler)
No full-length homology necessary
Independent of reference database
Sequence of shared entities for further analysis

25 Cross-assembly Dutilh et al. In preparation

26 Acknowledgements
Robert Schmieder, Jim Nulton, Ben Felts, Peter Salamon, Robert A. Edwards, John L. Mokili

