Strain profiling with StrainPhlAn and PanPhlAn. Nicola Segata, Curtis Huttenhower (chuttenh@hsph.harvard.edu), Galeb Abu-Ali (gabuali@hsph.harvard.edu), Ali Rahnavard (rah@broadinstitute.org). STAMPS 2017, 08-08-17. Harvard T.H. Chan School of Public Health, Department of Biostatistics.
Efficient assembly-free meta’omics by leveraging isolates. NCBI isolate genomes: Archaea 300, Bacteria 12,926, Viruses 3,565, Eukaryota 112. Open reading frames: 49.0 million total genes. Species pan-genomes: 7,677, containing 18.6 million gene clusters. Figure labels: core genes, marker genes, RepoPhlAn, ChocoPhlAn. http://www.metaref.org
StrainPhlAn: metagenomic strain identification and tracking http://segatalab.cibio.unitn.it/tools/strainphlan
A tool for strain-level population genomics. P. copri as an example species. Alignment length: 66k nt. Median SNPs: 830 [3.6%]. # pos. samples: 123. Countries in figure: China, Denmark, Estonia, Finland, Peru, Hungary, Italy, Norway, France, Spain, Sweden, USA, Germany.
A tool for strain-level population genomics. Alignment length: 62k nt. Median SNPs: 830 [1.3%]. # pos. samples: 123.
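The bracketed percentage on this slide appears consistent with median SNPs divided by alignment length (830 / 62,000 ≈ 1.3%). The following is a minimal sketch of how such a pairwise SNP rate could be computed from two aligned marker sequences. It assumes gap and N positions are skipped; the function name `snp_rate` and the toy sequences are hypothetical illustrations, not StrainPhlAn's implementation.

```python
from typing import Tuple


def snp_rate(seq_a: str, seq_b: str) -> Tuple[int, int, float]:
    """Return (SNP count, compared positions, SNP rate) for two aligned sequences.

    Assumption: positions where either sequence has a gap ('-') or 'N'
    are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    skip = {"-", "N"}
    compared = 0
    snps = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            snps += 1
    rate = snps / compared if compared else 0.0
    return snps, compared, rate


# Toy alignment (hypothetical data): 2 mismatches over 8 comparable positions.
toy_result = snp_rate("ACGT-ACGTN", "ACGA-ACCTA")
print(toy_result)

# Arithmetic from the slide: 830 SNPs over a 62k nt alignment, in percent.
slide_percent = round(100 * 830 / 62_000, 1)
print(slide_percent)
```

Skipping gap and N positions keeps missing data from being counted as variation; a real pipeline would also need to decide how to treat low-quality or ambiguous base calls.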
Most bugs (in the gut) are dominated by one stable strain
There’s a lot of strain-level variation left to discover. Figure axis: median divergence from reference markers.
PanPhlAn: the approach. http://bitbucket.org/CibioCM/panphlan Reads from a metagenomic sample are mapped to microbial pangenomes to obtain gene coverage. Genes are clustered into gene families, giving pan-gene family coverage. Plotting abundance-sorted pan-gene families against coverage shows three regions: multi-copy genes, a plateau of genes from one metagenome’s strain, and absent genes.
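The three regions of the abundance-sorted coverage curve can be illustrated with a small sketch that labels each gene family relative to the plateau. This is not PanPhlAn's code: the function name, the use of the median nonzero coverage as the plateau estimate, and the two ratio thresholds are all assumptions chosen for illustration.

```python
from statistics import median
from typing import Dict


def classify_gene_families(
    coverage: Dict[str, float],
    absent_below: float = 0.3,     # hypothetical threshold, fraction of plateau
    multicopy_above: float = 1.7,  # hypothetical threshold, multiple of plateau
) -> Dict[str, str]:
    """Label gene families as 'multi-copy', 'present', or 'absent'.

    Assumption: the plateau (single-copy coverage of the strain) is
    approximated by the median coverage over families with nonzero coverage.
    """
    nonzero = [c for c in coverage.values() if c > 0]
    if not nonzero:
        return {fam: "absent" for fam in coverage}
    plateau = median(nonzero)
    labels = {}
    for fam, cov in coverage.items():
        ratio = cov / plateau
        if ratio < absent_below:
            labels[fam] = "absent"
        elif ratio > multicopy_above:
            labels[fam] = "multi-copy"
        else:
            labels[fam] = "present"
    return labels


# Toy per-family coverage values (hypothetical data).
toy = {"famA": 31.0, "famB": 10.2, "famC": 9.8, "famD": 10.0, "famE": 0.4, "famF": 0.0}
labels = classify_gene_families(toy)

# Print families in abundance-sorted order, as on the slide's curve.
for fam in sorted(toy, key=toy.get, reverse=True):
    print(fam, toy[fam], labels[fam])
```

The resulting presence/absence profile per sample is what allows strains to be compared across metagenomes; the exact cutoffs would need tuning against real data.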
PanPhlAn for “meta-epidemiology” http://bitbucket.org/CibioCM/panphlan Metagenomes from [Loman et al., 2013]
Strain-level epidemiology of human-associated E. coli with PanPhlAn (Scholz et al., Nature Methods, 2016). ~5,000 metagenomes (and counting), from all continents and many EU countries. Figure labels: reference genomes; groups A, B1, B2, D; STEC; German outbreak. Datasets: T2D (China), Liver Cirr. (China), Infants (Italy), CRC (Europe), HMP (USA), Obesity (Europe), Nielsen (Europe), T2D (Finland), Rampelli (Africa), Liu (Mongolia), Tito (Peru), Segre (Skin).
Multiple options for strain tracking in metagenomes StrainPhlAn: Map reads to core markers and call SNPs. Requires ~10x coverage, ~0.1% error rate. PanPhlAn: Map reads to pan-genomes and identify absent genes. Requires ~1x coverage, ~1% error rate. Both work uniquely well for meta-analysis. Not sensitive to typical batch effects. http://segatalab.cibio.unitn.it/tools/strainphlan http://segatalab.cibio.unitn.it/tools/panphlan
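The coverage requirements above suggest a quick feasibility check before running either tool. The sketch below estimates mean depth as mapped reads times read length divided by target length, then compares it with the approximate ~1x and ~10x figures from this slide. The function names, the toy numbers, and the choice of target length (marker set or pangenome) are assumptions; the error-rate figures are not modeled here.

```python
from typing import List


def mean_coverage(n_mapped_reads: int, read_length: int, target_length: int) -> float:
    """Estimate mean depth of coverage over a target of the given length (nt)."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return n_mapped_reads * read_length / target_length


def candidate_methods(coverage: float) -> List[str]:
    """List the methods whose approximate coverage requirement is met.

    Thresholds are the rough figures quoted on the slide
    (~1x for PanPhlAn, ~10x for StrainPhlAn), not hard limits.
    """
    methods = []
    if coverage >= 1.0:
        methods.append("PanPhlAn")
    if coverage >= 10.0:
        methods.append("StrainPhlAn")
    return methods


# Hypothetical sample: 20,000 mapped reads of 100 nt over a 500,000 nt target.
cov = mean_coverage(20_000, 100, 500_000)
print(cov, candidate_methods(cov))
```

A species at intermediate depth may therefore be profiled by gene presence/absence even when SNP calling on markers is not yet reliable.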
StrainPhlAn tutorial: https://bitbucket.org/biobakery/biobakery/wiki/strainphlan
There’s a lot of strain-level variation left to discover. Figure axis: phylogenetic branch % spanned by reference vs. “wild” bugs.
Gene-family distribution curves (E. coli gene families; axis: base coverage). Select samples with a “step” distribution (colored curves), which indicates that a strain of the species is present. Reject non-step (gray) curves.
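One way to picture the "step" criterion is a simple heuristic on the abundance-sorted curve: the top-ranked families should form a roughly flat plateau, and coverage should drop sharply beyond them. The sketch below is an illustrative heuristic only, not PanPhlAn's actual test; `n_expected` (a rough number of gene families per strain), `min_plateau`, and `tolerance` are hypothetical parameters.

```python
from statistics import median
from typing import Sequence


def looks_like_step(
    coverages: Sequence[float],
    n_expected: int,            # hypothetical: approximate gene families per strain
    min_plateau: float = 1.0,   # hypothetical: minimum plateau coverage
    tolerance: float = 0.25,    # hypothetical: allowed deviation from the plateau
) -> bool:
    """Return True if the sorted coverage curve has a flat plateau followed by a drop."""
    ranked = sorted(coverages, reverse=True)
    if n_expected < 4 or len(ranked) <= n_expected:
        return False
    plateau = median(ranked[:n_expected])
    if plateau < min_plateau:
        return False
    # Sample the curve at 30% and 70% of the expected plateau width.
    left = ranked[(3 * n_expected) // 10]
    right = ranked[(7 * n_expected) // 10]
    flat = left <= (1 + tolerance) * plateau and right >= (1 - tolerance) * plateau
    # Families beyond the expected count should have much lower coverage.
    dropped = median(ranked[n_expected:]) < 0.5 * plateau
    return flat and dropped


# Toy curves (hypothetical data).
step_curve = [30, 10.5, 10.2, 10, 10, 9.9, 9.8, 9.5, 9, 8.5] + [0.2] * 10
smooth_curve = [20, 18, 16, 14, 12, 10, 8, 6, 4, 2,
                1.8, 1.6, 1.4, 1.2, 1, 0.8, 0.6, 0.4, 0.2, 0.1]
low_curve = [0.5] * 10 + [0.0] * 10

print(looks_like_step(step_curve, n_expected=10))
print(looks_like_step(smooth_curve, n_expected=10))
print(looks_like_step(low_curve, n_expected=10))
```

A smoothly declining curve fails the flatness check, and a curve with too little coverage fails the plateau check, which mirrors the colored versus gray curves on the slide.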
Synthetic and semi-synthetic validation (figure panels; axis: coverage).
PanPhlAn on Eubacterium rectale. Only one Eubacterium rectale genome used here.