Comparative metagenomics: quantifying similarities between environments
Bas E. Dutilh
Taxonomic or functional profiles (Kip et al., Env. Microbiol. Rep. 2011; Boleij et al., Mol. Cell. Proteomics 2012; Trindade-Silva et al., PLoS ONE 2012)
Clustering profiles: calculate pairwise distances, then create a cladogram with BioNJ (Gascuel, Mol. Biol. Evol. 1997)
Clustering profiles
Calculate pairwise distances between profiles. Each profile is a vector of taxon or function frequencies, with one axis per taxon (freq taxon 1, freq taxon 2, freq taxon 3, ...):
Manhattan distance
Correlation between profiles: high correlation ↔ similar environment; low correlation ↔ dissimilar environment
Angle between vectors in n-dimensional space: small angle ↔ similar environment; large angle ↔ dissimilar environment
Wootters distance between profiles (Wootters, Phys. Rev. D 1981)
Create cladogram (BioNJ)
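
The four measures listed on this slide can be sketched in a few lines of Python. This is a minimal illustration and not code from any of the cited tools. The function names and the example profiles (lake, pond, gut) are hypothetical. The Wootters distance is written here as the arccosine of the summed square roots of the products of corresponding frequencies, which assumes both profiles sum to 1.

```python
import math

def normalize(counts):
    """Scale raw counts so they sum to 1 (a frequency profile)."""
    total = sum(counts)
    return [c / total for c in counts]

def manhattan(p, q):
    """Sum of absolute differences between corresponding frequencies."""
    return sum(abs(a - b) for a, b in zip(p, q))

def correlation_distance(p, q):
    """1 minus the Pearson correlation between two profiles."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return 1.0 - cov / math.sqrt(var_p * var_q)

def angle(p, q):
    """Angle (radians) between two profile vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def wootters(p, q):
    """Wootters (statistical) distance; p and q must each sum to 1."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, overlap)))

# Hypothetical three-taxon profiles, for illustration only.
lake = normalize([30, 10, 0])
pond = normalize([28, 12, 0])
gut = normalize([0, 5, 35])

for name, dist in [("Manhattan", manhattan), ("1 - correlation", correlation_distance),
                   ("angle", angle), ("Wootters", wootters)]:
    print(name, round(dist(lake, pond), 3), round(dist(lake, gut), 3))
```

With each measure the lake and pond profiles come out closer to each other than lake and gut. The resulting pairwise distance matrix would then be passed to a BioNJ implementation to build the cladogram.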
Metagenomes of water and water animals: BlastN reads against GenBank (E-value ≤ 10^-5); taxonomic profiles including parent clades; Wootters distance formula; BioNJ cladogram (Trindade-Silva et al., PLoS ONE 2012)
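
A taxonomic profile "including parent clades" can be read as follows: each read assigned to a taxon also counts towards every ancestor of that taxon. The sketch below assumes the BlastN hits have already been converted to one lineage per read. The input structure, the function name, and the choice to express each clade as a fraction of the assigned reads are assumptions for illustration, since the slide does not state how the profiles were normalized.

```python
from collections import Counter

def taxonomic_profile(read_lineages):
    """Fraction of reads falling in each clade, parent clades included.

    read_lineages: one list per read, from root to the assigned taxon
    (hypothetical input, e.g. derived from the best BlastN hit).
    """
    counts = Counter()
    for lineage in read_lineages:
        # A read counts once for every clade on its lineage.
        counts.update(set(lineage))
    n_reads = len(read_lineages)
    if n_reads == 0:
        return {}
    return {clade: c / n_reads for clade, c in counts.items()}

# Hypothetical lineages for three reads.
reads = [
    ["Bacteria", "Firmicutes", "Bacilli"],
    ["Bacteria", "Firmicutes", "Clostridia"],
    ["Bacteria", "Proteobacteria"],
]
profile = taxonomic_profile(reads)
print(sorted(profile.items()))
```

Including parent clades means that two samples with different species from the same phylum still share signal at the phylum level.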
Viral metagenomic samples (human and water): BlastN reads against GenBank (E-value ≤ 10^-3); taxonomic profiles including parent clades; distance = 1 minus correlation; BioNJ cladogram. [Figure: % reads used (BlastN mapping to GenBank) per sample.] (Dutilh et al., Bioinformatics 2012)
Many unknowns in viral metagenomes (Mokili et al., Curr. Opin. Virology 2012)
K-mer (k=2) clustering (Willner et al., Env. Microbiol. 2009)
A sequence of 3 million nucleotides
2-mer profiles: 4 × 4 = 16 dimensions (2-mers), ~3 million steps in each dimension; 16 × 3,000,000 = 48,000,000 possibilities
4-mer profiles: 4 × 4 × 4 × 4 = 256 dimensions (4-mers); 256 × 3,000,000 = 768,000,000 possibilities
Sequencing reads or contigs (~200 nt): 4^200 ≈ 2.6 × 10^120 possibilities
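
To make the dimension counts concrete, here is a minimal sketch of computing a k-mer frequency profile from a nucleotide sequence. The function name is hypothetical, overlapping k-mers are counted, and windows containing characters other than A, C, G, T are skipped. These are illustrative choices and not necessarily those of the cited k-mer study.

```python
from itertools import product

def kmer_profile(seq, k=2):
    """Frequencies of all 4**k k-mers, in fixed alphabetical order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]

print(len(kmer_profile("ACGTACGT", k=2)))  # 16 dimensions
print(len(kmer_profile("ACGTACGT", k=4)))  # 256 dimensions
```

Profiles computed this way have a fixed length regardless of sequence length, so they can be compared with the same distance measures used for taxonomic profiles.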
Cross-assembly
Interpret viromes in terms of one another.
Combine sequencing reads from different metagenomes in a single assembly, using your favorite assembly tool.
Cross-contigs contain reads from more than one sample and directly represent the overlap between samples.
We interpret contigs as “metagenomic entities”, ready for e.g. BLAST searches.
(Dutilh et al., Bioinformatics 2012)
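
The definition of a cross-contig can be illustrated with a small sketch. It assumes the assembly output has already been parsed into a mapping from contig name to the sample of origin of each incorporated read. That mapping, the sample names, and the function names are hypothetical and do not reflect the input format of any particular tool.

```python
def find_cross_contigs(contig_reads):
    """Return contigs whose reads come from more than one sample.

    contig_reads: dict mapping contig name -> list with the sample of
    origin of every read assembled into that contig (hypothetical input).
    """
    cross = {}
    for contig, samples in contig_reads.items():
        distinct = set(samples)
        if len(distinct) > 1:
            cross[contig] = distinct
    return cross

def shared_contig_count(cross, sample_a, sample_b):
    """Number of cross-contigs containing reads from both samples."""
    return sum(1 for s in cross.values() if sample_a in s and sample_b in s)

# Hypothetical cross-assembly of three samples.
assembly = {
    "ctg1": ["gut", "gut", "lake"],
    "ctg2": ["lake", "lake"],
    "ctg3": ["gut", "lake", "pond"],
}
cross = find_cross_contigs(assembly)
print(sorted(cross))                              # contigs shared between samples
print(shared_contig_count(cross, "gut", "lake"))  # overlap between gut and lake
```

The pairwise counts of shared cross-contigs are the raw similarities that the distance formulas on the following slides build on.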
Distance formulas: similarity based on the number of cross-contigs. [Figure axes: sample size (# contigs) for each of the two samples; similarity (# cross-contigs).]
Distance formulas: large metagenomes may share more cross-contigs with unrelated large samples than with closely related small samples, so a size correction is necessary. [Figure axes: sample size (# contigs) for each of the two samples; similarity (# cross-contigs); annotation: minimum of the two sample sizes (# contigs).]
Distance formulas, size corrections: minimum metagenome size; weighted average metagenome size (SHOT; Korbel et al., Trends Genet. 2002). [Figure axes: sample size (# contigs) for each of the two samples; similarity (# cross-contigs).]
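
The effect of the two size corrections can be shown with a short numeric sketch. Two points are assumptions and should be checked against the cited papers: turning a corrected similarity into a distance as 1 minus similarity, and the form of the weighted average used here, sqrt(2)·a·b / sqrt(a² + b²), which is how the SHOT normalization is recalled. The function names and the contig counts are hypothetical.

```python
import math

def min_size_distance(shared, size_a, size_b):
    """1 minus shared cross-contigs divided by the smaller sample size."""
    return 1.0 - shared / min(size_a, size_b)

def weighted_average_size(size_a, size_b):
    """Weighted average of two sample sizes (assumed SHOT-style form)."""
    return math.sqrt(2) * size_a * size_b / math.sqrt(size_a ** 2 + size_b ** 2)

def shot_distance(shared, size_a, size_b):
    """1 minus shared cross-contigs divided by the weighted average size."""
    return 1.0 - shared / weighted_average_size(size_a, size_b)

# Hypothetical counts: a small sample (100 contigs) shares 80 cross-contigs
# with a related large sample; two unrelated large samples share 200.
related = (80, 100, 10000)
unrelated = (200, 10000, 10000)

print(min_size_distance(*related), min_size_distance(*unrelated))
print(shot_distance(*related), shot_distance(*unrelated))
```

By raw count the unrelated pair looks more similar (200 versus 80 cross-contigs), but after either correction the related pair is the closer one.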
Distance formulas: contig content gives qualitative distances (what metagenomic entities are there?). Quantitative distances can be calculated by taking the number of incorporated reads into account. [Figure axes: reads in ctg1, reads in ctg2, reads in ctg3.] (Dutilh et al., Bioinformatics 2012)
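
One way to picture a quantitative distance is to treat each sample as a vector of read counts per contig and compare the normalized vectors. The sketch below uses the Wootters distance for that comparison. Applying it here is an illustrative choice, since this slide does not name a formula. The function names and read counts are hypothetical.

```python
import math

def read_profile(reads_per_contig, contigs):
    """Fraction of a sample's reads incorporated in each contig."""
    counts = [reads_per_contig.get(c, 0) for c in contigs]
    total = sum(counts)
    return [n / total for n in counts]

def wootters(p, q):
    """Wootters distance between two profiles that each sum to 1."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, overlap)))

def quantitative_distance(sample_a, sample_b):
    """Distance between two samples given their reads per contig."""
    contigs = sorted(set(sample_a) | set(sample_b))
    return wootters(read_profile(sample_a, contigs),
                    read_profile(sample_b, contigs))

# Hypothetical samples: all three contain the same two contigs, so they
# are qualitatively identical, but their read counts differ.
a = {"ctg1": 90, "ctg2": 10}
b = {"ctg1": 10, "ctg2": 90}
c = {"ctg1": 85, "ctg2": 15}

print(round(quantitative_distance(a, c), 3))
print(round(quantitative_distance(a, b), 3))
```

Samples a and c come out much closer than a and b, a difference that presence or absence of contigs alone cannot show.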
Simulated metagenomes: 30 species each, with decreasing overlap. [Figure labels: Firmicutes, Proteobacteria.] (Dutilh et al., Bioinformatics 2012)
Six simulated metagenomes of ~25 Firmicutes each and three simulated metagenomes of ~25 Actinobacteria each, with increasing noise (0-100% Proteobacteria). [Figure: one panel per noise level, 0% to 100% in steps of 10%; axis: distance.] (Dutilh et al., Bioinformatics 2012)
Cladogram. [Figure labels: water, human; BlastN, crAss.] (Dutilh et al., Bioinformatics 2012)
Similar numbers of utilized reads. [Figure: percentage of reads used per metagenome; labels: human, water.] (Dutilh et al., Bioinformatics 2012)
http://edwards.sdsu.edu/crass/ (stand-alone version on SourceForge) (Dutilh et al., Bioinformatics 2012)
2 or 3 samples: no cladogram (Dutilh et al., Bioinformatics 2012)
Experiment
Cross-assembly advantages: fast programs available (Newbler); no full-length homology necessary; independent of reference database; sequence of shared entities available for further analysis
Cross-assembly (Dutilh et al., in preparation)
Acknowledgements: Robert Schmieder, Jim Nulton, Ben Felts, Peter Salamon, Robert A. Edwards, John L. Mokili
http://edwards.sdsu.edu/crass/