Evidence of Selection on Genomic GC Content in Bacteria Falk Hildebrand Adam Eyre-Walker
Genomic G+C content
Genomic GC content
Codons Non-synonymous: ATA → CTA Synonymous: CCC → CCT 2-fold: TTT, TTC 4-fold: CCT, CCC, CCA, CCG
Genomic GC content
Variation
Correlations
Explanations Mutation bias: Sueoka (1961) & Freese (1962); intrinsic and/or extrinsic. Selection: many authors. Biased gene conversion: anonymous referees.
Correlates Genome size positive correlation Lifestyle higher GC in free living Aerobiosis higher GC in aerobic Nitrogen utilization higher amongst N fixers Temperature higher amongst thermophiles?
Evidence of selection I Escherichia coli Mutation pattern: 273 GC→AT versus 131 AT→GC Predicted GC content = 0.32 Observed GC content = 0.50 Observed GC at neutral sites = 0.58 Lynch (2007) The Origins of Genome Architecture
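The predicted value on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the mutations were sampled from sequence with GC content 0.50 (the observed E. coli value above), so that the raw counts are proportional to the per-site rates; the function name `equilibrium_gc` is illustrative only.

```python
def equilibrium_gc(gc_to_at: int, at_to_gc: int, gc_content: float = 0.5) -> float:
    """Equilibrium GC content expected under mutation bias alone.

    gc_content is the GC content of the sequence in which the mutations
    were observed (assumed 0.5 here); it converts counts to per-site rates.
    """
    if not 0.0 < gc_content < 1.0:
        raise ValueError("gc_content must lie strictly between 0 and 1")
    u = gc_to_at / gc_content          # relative GC->AT rate per GC site
    v = at_to_gc / (1.0 - gc_content)  # relative AT->GC rate per AT site
    return v / (u + v)


predicted = equilibrium_gc(273, 131)
print(f"{predicted:.2f}")  # prints 0.32
```

The gap between this prediction (0.32) and the observed values (0.50 overall, 0.58 at neutral sites) is what the slide offers as evidence against mutation bias alone.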
Evidence of selection II Phylogenetic analyses Mycobacterium leprae (Lynch 2007) Escherichia coli (Balbi et al. 2009) 5 pathogenic bacteria (Hershberg and Petrov 2010)
Phylogenetic analysis GAAGGG
Evidence of selection II Phylogenetic analyses Mycobacterium leprae (Lynch 2007) Escherichia coli (Balbi et al. 2009) 5 pathogenic bacteria (Hershberg and Petrov 2010) Excess of GC→AT
Test of mutation bias If GC content is due to mutation bias alone, is stationary, and the infinite sites assumption holds, then # GC→AT mutations = # AT→GC mutations
Why? If GC content is stationary: #GC→AT subs = #AT→GC subs. All neutral mutations have the same chance of fixation, so #GC→AT muts = #AT→GC muts
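The argument can be written out in one line. This is a sketch under stated assumptions: $u$ is taken to be the GC→AT mutation rate per site and $v$ the AT→GC rate (the assignment of the two symbols is an assumption here), and $f$ is the GC content at the sites considered.

```latex
% Mutations arising per generation are proportional to (sites available) x (rate):
%   GC->AT mutations  \propto  f u
%   AT->GC mutations  \propto  (1 - f) v
% Under mutation bias alone, stationarity requires
\[
  f = \frac{v}{u + v}
  \quad\Longrightarrow\quad
  f\,u = \frac{u v}{u + v} = (1 - f)\,v ,
\]
% so the two classes of mutation arise in equal numbers and the expected
% proportion of GC->AT mutations is 1/2.
```

Any departure from equal numbers then points to a failure of one of the three assumptions: mutation bias alone, stationarity, or infinite sites.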
Identifying mutations
Strain 1 ACT GCT TTG GCT TTA TGG
Strain 2 ACT GCT TTG GCT TTA TGA
Strain 3 ACT GCT TTG GCT TTA TGG
Strain 4 ACT GCT TTC GCT TTA TGA
Strain 5 ACC GCT TTC GCT TTA TGG
Strain 6 ACT GCT TTG GCT TTA TGG
Polymorphic sites: T/C, C/G, G/A
Orienting mutations
Outgroup ACT GCT TTC GCT TTA TGG
Strain 1 ACT GCT TTG GCT TTA TGG
Strain 2 ACT GCT TTG GCT TTA TGA
Strain 3 ACT GCT TTG GCT TTA TGG
Strain 4 ACT GCT TTC GCT TTA TGA
Strain 5 ACC GCT TTC GCT TTA TGG
Strain 6 ACT GCT TTG GCT TTA TGG
Mutations: T→C, C→G, G→A
GC→AT = 1 AT→GC = 1
Orienting mutations
Strain 1 ACT GCT TTG GCT TTA TGG
Strain 2 ACT GCT TTG GCT TTA TGA
Strain 3 ACT GCT TTG GCT TTA TGG
Strain 4 ACT GCT TTC GCT TTA TGA
Strain 5 ACC GCT TTC GCT TTA TGG
Strain 6 ACT GCT TTG GCT TTA TGG
Mutations: T→C, G→C, G→A
GC→AT = 1 AT→GC = 1
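The two orientation slides can be reproduced with a short script. This is a hedged sketch rather than the authors' pipeline: with an outgroup, the outgroup base is treated as ancestral; without one, the more common allele is assumed to be ancestral (an assumption made here for illustration). The names `classify` and `count_mutations` are hypothetical.

```python
from collections import Counter

GC, AT = set("GC"), set("AT")


def classify(ancestral: str, derived: str) -> str:
    """Label a mutation by its effect on GC content."""
    if ancestral in GC and derived in AT:
        return "GC->AT"
    if ancestral in AT and derived in GC:
        return "AT->GC"
    return "other"  # e.g. C->G or A->T: no change in GC content


def count_mutations(strains, outgroup=None):
    """Count oriented biallelic polymorphisms in an aligned sample."""
    seqs = [s.replace(" ", "") for s in strains]
    out = outgroup.replace(" ", "") if outgroup is not None else None
    counts = Counter()
    for i, column in enumerate(zip(*seqs)):
        alleles = Counter(column)
        if len(alleles) != 2:
            continue  # monomorphic, or more than two alleles
        if out is not None:
            ancestral = out[i]
            if ancestral not in alleles:
                continue  # outgroup base matches neither allele
        else:
            (ancestral, n1), (_, n2) = alleles.most_common(2)
            if n1 == n2:
                continue  # a tie cannot be oriented by frequency
        derived = next(a for a in alleles if a != ancestral)
        counts[classify(ancestral, derived)] += 1
    return counts


strains = [
    "ACT GCT TTG GCT TTA TGG",
    "ACT GCT TTG GCT TTA TGA",
    "ACT GCT TTG GCT TTA TGG",
    "ACT GCT TTC GCT TTA TGA",
    "ACC GCT TTC GCT TTA TGG",
    "ACT GCT TTG GCT TTA TGG",
]
outgroup = "ACT GCT TTC GCT TTA TGG"

with_out = count_mutations(strains, outgroup)
by_freq = count_mutations(strains)
print(with_out["GC->AT"], with_out["AT->GC"])  # prints 1 1
print(by_freq["GC->AT"], by_freq["AT->GC"])    # prints 1 1
```

In both cases the third polymorphism (C/G) does not change GC content and is counted as "other", which is why the slide totals are 1 and 1.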
Test of mutation bias If GC content is due to mutation bias alone, is stationary, and the infinite sites assumption holds, then # GC→AT = # AT→GC
Four-fold synonymous sites
Codons Non-synonymous: ATA → CTA Synonymous: CCC → CCT 2-fold: TTT, TTC 4-fold: CCT, CCC, CCA, CCG
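GC4, the GC content at the third position of four-fold degenerate codons, is used throughout the rest of the talk. A minimal sketch of how it might be computed from an in-frame coding sequence is given below; the function name `gc4` and the set `FOURFOLD_PREFIXES` are illustrative, and the eight codon families listed are the four-fold degenerate families of the standard genetic code.

```python
# First two bases of the eight four-fold degenerate codon families:
# Leu (CTN), Val (GTN), Ser (TCN), Pro (CCN), Thr (ACN), Ala (GCN),
# Arg (CGN), Gly (GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def gc4(cds: str) -> float:
    """GC content at third positions of four-fold degenerate codons.

    Assumes cds is in frame; a trailing partial codon is ignored.
    """
    seq = cds.upper().replace(" ", "")
    gc = total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            total += 1
            gc += codon[2] in "GC"
    if total == 0:
        raise ValueError("no four-fold degenerate codons found")
    return gc / total


# Strain 5 from the alignment slides: ACC, GCT and GCT are four-fold codons,
# and only ACC ends in G or C.
print(round(gc4("ACC GCT TTC GCT TTA TGG"), 3))  # prints 0.333
```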
Data Popset Keyword “bacteria” 8 or more sequences from same species 149 bacterial species 8 phyla, 15 classes and 77 genera 1 or more genes 10 or more synonymous polymorphisms 4-fold diversity < 0.1
Overall result No. of SNPs: GC→AT 11045, AT→GC 8309, P<
Bias versus GC4
Z = #(GC→AT) / [#(GC→AT) + #(AT→GC)]
          No. of species   Z > 0.5   P-value
GC-rich   82               69        <
GC-poor
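As an illustration of the statistic, the sketch below computes Z from the pooled SNP counts (11045 GC→AT and 8309 AT→GC) and applies a two-sided sign test to the 69 of 82 GC-rich species with Z > 0.5. Z is taken here to be the proportion of GC→AT changes among GC-changing polymorphisms, and the sign test is an assumption about how such a comparison could be made; the slides do not state which test was used.

```python
from math import comb


def z_statistic(gc_to_at: int, at_to_gc: int) -> float:
    """Proportion of GC-changing polymorphisms that are GC->AT."""
    return gc_to_at / (gc_to_at + at_to_gc)


def sign_test_two_sided(successes: int, n: int) -> float:
    """Exact two-sided sign test against a 50:50 expectation."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


z_all = z_statistic(11045, 8309)
p_gc_rich = sign_test_two_sided(69, 82)
print(f"{z_all:.3f}")      # prints 0.571
print(p_gc_rich < 0.0001)  # prints True
```

Under the null hypothesis of the preceding slides (mutation bias alone, stationarity, infinite sites) the expected value of Z is 0.5.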
Phylogenetic distribution
Columns: Phylum, Class, No. of species, GC4 range, Mean Z (GC4<0.34), Mean Z (GC4>0.34)
Actinobacteria: no species, 0.64
Bacteroidetes
Chlamydiae + Chlamydiae: no species
Cyanobacteria, Chroococcales: no species, 0.53
Cyanobacteria, Nostocales: no species
Cyanobacteria, Oscillatoriales: 2, 0.41, no species, 0.38
Cyanobacteria, Stigonemales: 1, 0.40, no species, 0.59
Firmicutes, Bacilli
Firmicutes, Clostridia: no species
Proteobacteria, Alphaproteobacteria
Proteobacteria, Betaproteobacteria: no species, 0.67
Proteobacteria, delta/epsilon
Proteobacteria, Gammaproteobacteria
Spirochaetes
Tenericutes, Mollicutes: no species
Potential problems Infinite sites assumption Sequencing error
Infinite sites assumption Each mutation occurs at a site which is not polymorphic
Infinite sites assumption If GC content is stationary: #GC→AT subs = #AT→GC subs. All neutral mutations have the same chance of fixation, so #GC→AT muts = #AT→GC muts
Finite sites assumption If GC content is stationary: #GC→AT subs = #AT→GC subs. All neutral mutations have the same chance of fixation, so #GC→AT muts = #AT→GC muts. But some mutations are not evident as polymorphisms
Finite sites GC-rich sequence: implies rate of AT→GC > rate of GC→AT. Mutation rate low: #AT→GC poly = #GC→AT poly. Mutation rate high: #AT→GC poly < #GC→AT poly
Finite sites theory GC ⇄ AT with mutation rates uμ and vμ Assume: stationary population, stationary GC
Finite sites theory
Predicting Z Assume finite sites neutrality Use GC4 to get f Use observed diversity to estimate μ Predict Z
Z pred
Z − Z pred
          No. of species   Z − Z pred > 0   P-value
GC-rich   82               61               <
GC-poor
Mutation rate variation
Z − Z pred (exponential rates)
          No. of species   Z − Z pred > 0   P-value
GC-rich
GC-poor
Sequencing error
          No. of species   Z > 0.5   P-value
GC-rich   82               60        <0.0001
Explanations Non-stationary base composition Selection for translational efficiency Biased gene conversion Selection upon base composition
Non-stationary GC content
Non-stationary base composition
Explanations Non-stationary base composition Selection for translational efficiency Biased gene conversion Selection upon base composition
Selection on codon usage
Columns: Amino acid, Codon, High usage, Low usage
Phenylalanine: UUU, UUC
Valine: GUU, GUC, GUA, GUG
Translational efficiency
          No. of species   Z > 0.5   P-value
GC-rich   31               29        <0.0001
Explanations Non-stationary base composition Selection for translational efficiency Biased gene conversion Selection upon base composition
Biased gene conversion [Diagram: A:T and C:G alleles pair as A:G and C:T heteroduplex mismatches, which are repaired to C:G and C:G]
Four gamete test No recombination: haplotypes GA, GT, CA (three gametes). Recombination: haplotypes GA, GT, CA, CT (all four gametes).
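The test on this slide is simple to express in code. The following is a minimal sketch, assuming each haplotype is a string and both sites are biallelic; the function name `four_gamete_test` is illustrative only.

```python
def four_gamete_test(haplotypes, i: int, j: int) -> bool:
    """Return True if sites i and j show all four gametes.

    Under the infinite sites assumption, four gametes at a pair of
    biallelic sites can only arise through recombination.
    """
    alleles_i = {h[i] for h in haplotypes}
    alleles_j = {h[j] for h in haplotypes}
    if len(alleles_i) != 2 or len(alleles_j) != 2:
        raise ValueError("both sites must be biallelic")
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


# The two cases shown on the slide.
print(four_gamete_test(["GA", "GT", "CA"], 0, 1))        # prints False
print(four_gamete_test(["GA", "GT", "CA", "CT"], 0, 1))  # prints True
```

Restricting an analysis to data sets that pass or fail this test is one way to separate species with and without detectable recombination.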
Biased gene conversion
              No. of species   Z > 0.5   P-value
GC-rich
              GC→AT   AT→GC   P-value
No. of SNPs                   <0.0001
Biased gene conversion GC ⇄ AT with conversion bias −w, w. If N_e w >> 1, BGC is effective; if N_e w << 1, BGC is ineffective
Biased gene conversion
             r/m   p-value
GC
Z
Z − Z pred
GC4 pred
species with estimate of r/m
Vos & Didelot (2009) ISME J.
Biased gene conversion
             θ r/m   p-value
GC
Z
Z − Z pred
GC4 pred
Explanations Non-stationary base composition Selection for translational efficiency Biased gene conversion Selection upon base composition
Selection on GC content GC ⇄ AT with mutation rates uμ and vμ and selection coefficients +s and −s
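For orientation, the standard mutation-selection-drift equilibrium for a two-state site is sketched below. It is offered as background, not as the exact model used in the talk. Assumptions: $u$ is the GC→AT rate and $v$ the AT→GC rate, GC is the favoured state, and $S$ is the selection coefficient $s$ scaled by effective population size (the scaling constant depends on ploidy).

```latex
% Equilibrium GC content at the selected sites:
\[
  \mathrm{GC} = \frac{v\,e^{S}}{u + v\,e^{S}} ,
\]
% which reduces to the mutational equilibrium f = v / (u + v) when S = 0.
% Rearranging, with (1 - f)/f = u/v:
\[
  S = \ln\!\left( \frac{\mathrm{GC}}{1 - \mathrm{GC}} \cdot \frac{1 - f}{f} \right).
\]
```

Read this way, an observed GC4 above the mutational equilibrium $f$ corresponds to $S > 0$, that is, selection (or a selection-like force) favouring G and C.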
Selection on GC content
Selection on GC4
f = α + β GC4 f = GC4
Selection on GC4 f = α + β GC4 f = GC4
Summary Large excess of GC→AT mutations at 4-fold sites, particularly in GC-rich species. Not due to: violation of the infinite sites assumption, sequencing error, translational selection, or biased gene conversion. Therefore: selection on GC4
Selection on genomic GC Genomic GC GC4
Environmental meta-genomics Foerstner et al. (2005) EMBO Reports
Environmental meta-genomics
Correlates Genome size positive correlation Lifestyle higher GC in free living Aerobiosis higher GC in aerobic Nitrogen utilization higher GC amongst N fixers Temperature higher amongst thermophiles?
Thanks Falk Hildebrand Axel Meyer
Further reading Hildebrand et al. (2010) PLoS Genetics Hershberg and Petrov (2010) PLoS Genetics Rocha and Feil (2010) PLoS Genetics
Protein coding sites