Cryptic Variation in the Human Mutation Rate. Alan Hodgkinson, Adam Eyre-Walker, Manolis Ladoukakis
Variation in the mutation rate: between different chromosomes; between regions on chromosomes; depending on neighbouring nucleotides.
Simple context effects: Hwang and Green (2004) PNAS 101:
Cryptic Variation: Remote context: AGTCGGTTACCGTGACGTTGAACGTGT. Degenerate context: AGTCGGTTACCGTGYSRGYGAACGTGT. No context / complex context.
Our approach to the problem: Search for SNPs in human sequences that also have a SNP at the orthologous position in chimp. [Figure: Human, Chimp] Do we see more coincident SNPs than expected by chance?
The method: Extract all human SNPs from dbSNP and construct a BLAST database on a chromosome-by-chromosome basis. Extract all chimp SNPs from dbSNP with 50 bp either side of the SNP. BLAST the chimp SNPs against the human database. Extract results above a certain level of homology where there is a SNP on both sequences, and reduce to 40 bp either side of the central position. Repeat the analysis both including and excluding CpG effects.
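The final extraction step could be sketched roughly as below. This is an illustration only, not the pipeline used in the talk: the `Hit` record, its fields and the 0.95 identity threshold are hypothetical, and it assumes gap-free hits with the chimp SNP at index 50 of a 101 bp window.

```python
from dataclasses import dataclass

FLANK_IN = 50    # bp extracted either side of the chimp SNP
FLANK_OUT = 40   # bp kept either side of the central position

@dataclass
class Hit:
    """One gap-free BLAST hit (hypothetical record layout)."""
    identity: float            # fraction of matching bases in the hit
    human_snp_positions: list  # 0-based indices of human SNPs in the 101 bp window

def trim_and_classify(hit, min_identity=0.95):
    """Return (kept_human_snps, is_coincident), or None if the hit is discarded.

    min_identity is an assumed threshold; the source only says
    'above a certain level of homology'.
    """
    if hit.identity < min_identity:
        return None
    lo, hi = FLANK_IN - FLANK_OUT, FLANK_IN + FLANK_OUT  # keep indices 10..90
    # Re-index human SNPs relative to the trimmed 81 bp window.
    kept = [p - lo for p in hit.human_snp_positions if lo <= p <= hi]
    if not kept:
        return None  # no human SNP alongside the central chimp SNP
    # Coincident if a human SNP sits on the central (chimp SNP) position.
    return kept, FLANK_OUT in kept
```

A hit with human SNPs at window indices 50 and 12 is kept and flagged as coincident; a hit whose only human SNP lies outside the trimmed 81 bp is dropped.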
Results: ~1.5 million chimp SNPs. ~310,000 alignments of 81 bp containing both a human and a chimp SNP. Count the observed number of coincident SNPs. Calculate the expected number, taking into account the effects of neighbouring nucleotides.
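One way an expectation that accounts for neighbouring nucleotides might be computed is to weight each site by a relative mutability for its triplet context. The sketch below is an assumption for illustration: the weighting scheme, the `mutability` table and the function name are hypothetical, and the talk does not spell out its own calculation.

```python
def expected_coincident(alignments, mutability, default=1.0):
    """Expected number of coincident SNPs across alignments.

    alignments: iterable of (sequence, n_human_snps) pairs, where the chimp
        SNP is at the central position of `sequence`.
    mutability: dict mapping a triplet (e.g. "ACG") to a relative mutation
        rate for its middle base (hypothetical values).
    """
    total = 0.0
    for seq, n_human in alignments:
        centre = len(seq) // 2
        # Weight every site that has both neighbours by its triplet context.
        weights = {
            i: mutability.get(seq[i - 1:i + 2], default)
            for i in range(1, len(seq) - 1)
        }
        # Chance that one human SNP lands on the centre, given its context,
        # times the number of human SNPs (an approximation for small counts).
        total += n_human * weights[centre] / sum(weights.values())
    return total
```

With all weights equal, an 81 bp alignment carrying one human SNP contributes 1/79, since the two end positions have no complete triplet.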
Results (columns: Obs, Exp, Ratio): All (1.72, 1.79); No-CpG (1.93, 2.04).
Results: [Table: rows and columns C/T, G/A, C/A, G/T, C/G, A/T]
Alternative Explanations: bias in the method; selection; ancestral polymorphism; paralogous SNPs.
Methodological Bias: Simulated data with the same density of human and chimp SNPs as dbSNP, under different levels of divergence and different mutation patterns. The method worked well under realistic conditions.
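The idea of checking the method on simulated data can be illustrated with a far simpler null than the one described here. The sketch below assumes a uniform mutation rate, no divergence and exactly one human SNP per alignment; the function name and parameters are hypothetical. Under those assumptions the observed/expected ratio should sit close to 1.

```python
import random

def simulate_ratio(n_alignments=200_000, length=81, seed=1):
    """Observed/expected coincident SNPs when the mutation rate is uniform."""
    rng = random.Random(seed)
    centre = length // 2
    # The chimp SNP is fixed at the centre; the human SNP falls uniformly
    # at random on any of the `length` positions.
    observed = sum(rng.randrange(length) == centre for _ in range(n_alignments))
    expected = n_alignments / length
    return observed / expected
```

A ratio well above 1 on data simulated this way would point to a bias in the counting rather than to variation in the mutation rate.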
Methodological Bias (columns: Div, Obs, Exp, Ratio, 95% CI). All sites (H&G): 95% CI (0.963, 1.103); (1.003, 1.086); (0.920, 1.069). Non-CpG sites (H&G): 95% CI (0.844, 1.028); (0.908, 1.018); (0.840, 1.030).
Selection: Areas of low SNP density result in clustering. [Figure: Human, Chimp] Apparent excess of coincident SNPs.
Selection: No clustering. [Figure]
Ancestral Polymorphism: SNP inherited from the common ancestor of chimp and human. [Figure: T/A polymorphism in the common ancestor, human and chimp] Increase in coincident SNPs.
Ancestral Polymorphism: Expect the observed/expected ratio to be the same for all transitions. [Table: rows and columns C/T, G/A, C/A, G/T, C/G, A/T]
Ancestral Polymorphism: Repeated the initial analysis with macaque data. Humans and macaques split ~23-24 million years ago, so we expect there to be no shared polymorphisms. Results (columns: Obs, Exp, Ratio): All (1.27, 2.00); No-CpG (1.001, 2.02).
Paralogous SNPs: The excess of coincident SNPs could be a consequence of artifactual SNPs called as a result of substitutions in paralogous regions. Musumeci et al. (2010): 8.32% of human variation in dbSNP may be due to paralogy. AGCTGCACGT Y CGGCATCCAA (SNP); AGCTGCACGT T CGGCATCCAA (chromosome 1); AGCTGCACGT A CGGCATCCAA (chromosome 7): artifactual SNP.
Paralogous SNPs: AGCTGCACGT (T/A) CGGCATCCAA; AGCTGCACGT T CGGCATCCAA; AGCTGCACGT (T/A) CGGCATCCAA; AGCTGCACGT T CGGCATCCAA; AGCTGCACGT A CGGCATCCAA. 3.6% of coincident SNPs are possibly a consequence of paralogous sequences.
Alternative Explanations: bias in the method; selection; ancestral polymorphism; paralogous SNPs; cryptic variation in the mutation rate.
Context Analysis: 4517 sequences containing non-CpG coincident SNPs, flanked by 200 bp. Tabulate triplet frequencies at each position in the surrounding sequences. Test whether the proportions of triplets observed at each position are significantly different from the proportions in the sequences as a whole.
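The tabulation and test described here could look roughly like the sketch below. It assumes equal-length sequences and uses a plain Pearson chi-square against the pooled triplet proportions; the function names are hypothetical, and the exact test used in the talk is not stated.

```python
from collections import Counter

def triplet_counts_by_position(seqs):
    """Count the triplet starting at each position across equal-length sequences."""
    n_pos = len(seqs[0]) - 2
    return [Counter(s[i:i + 3] for s in seqs) for i in range(n_pos)]

def chi_square_vs_background(position_counts, i):
    """Pearson chi-square of triplet counts at position i against the pooled
    proportions over all positions. Returns (statistic, degrees_of_freedom)."""
    pooled = Counter()
    for c in position_counts:
        pooled.update(c)
    total = sum(pooled.values())
    n_i = sum(position_counts[i].values())
    stat = 0.0
    for triplet, pooled_n in pooled.items():
        expected = n_i * pooled_n / total
        # A Counter returns 0 for triplets never seen at this position.
        stat += (position_counts[i][triplet] - expected) ** 2 / expected
    return stat, len(pooled) - 1
```

Because one test is run per position, some correction for multiple testing would be needed before calling any single position significant.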
Context Analysis: Coincident SNP in central position. [Figure] No obvious context surrounding coincident SNPs.
Genomic Distribution: Tallied the number of coincident SNPs and of non-CpG coincident SNPs per Mb. If randomly distributed, expect a Poisson distribution with μ = σ² = 3.91. The observed σ² is significantly larger (p<0.001), and so sampling variance explains approximately 30% of the total variance.
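The share of variance due to sampling can be sketched as the ratio of the mean to the observed variance, since a Poisson distribution has variance equal to its mean. This assumes that is how the 30% figure was obtained, which the slide does not state; the function name and the input list are hypothetical.

```python
from statistics import mean, pvariance

def sampling_variance_fraction(counts_per_mb):
    """Fraction of the variance in per-Mb counts attributable to Poisson
    sampling, i.e. mean / variance."""
    m = mean(counts_per_mb)
    v = pvariance(counts_per_mb)
    return m / v
```

On that reading, a mean of 3.91 and a sampling share of about 30% would correspond to an observed variance of roughly 13.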
Genomic Distribution (columns: Feature, r, r², p): SNP density, p<0.001**; Distance to Telomere; Distance to Centromere; Recombination Rate, p<0.001**; Nucleosome Association; Gene Density; GC content.
Genomic Distribution: SNP densities must drive coincident SNP densities to a certain extent, as approximately half of coincident SNPs are created by chance alone. Recombination rate is positively correlated with SNP density (r = 0.242, p<0.001). Partial correlation controlling for SNP density: r = 0.048, p=0.011**. SNP density explains 6.5% and recombination rate 0.2% of the variance in coincident SNP density.
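For reference, a first-order partial correlation can be computed from the three pairwise correlations, as in the sketch below. Here x would be coincident SNP density, y recombination rate and z SNP density; only the y-z correlation (0.242) is given on the slide, so any other inputs are hypothetical.

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Squaring the reported partial correlation gives 0.048² ≈ 0.0023, which agrees with the 0.2% of variance attributed to recombination rate.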
Genomic Distribution (columns: Feature, r, r², p): Coincident SNP Density, p<0.001**; Distance to Telomere, p<0.001**; Distance to Centromere, **; Recombination Rate, p<0.001**; Nucleosome Association, p<0.001**; Gene Density, **; GC content, p<0.001**.
Quantification: Use a log-normal distribution of relative mutation rates due to cryptic variation. Model the number of coincident SNPs under the effects of cryptic variation. Incorporate the effects of divergence. What level of variation in the log-normal distribution explains our results?
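A rough way to see how the variance of the log-normal relates to the excess of coincident SNPs: if the relative rate r at a site is log-normal with mean 1 and log-scale standard deviation σ, and the same r applies in both species, the chance of a coincident SNP scales with E[r²] rather than E[r]². This ignores divergence and neighbouring-nucleotide effects, so it is only an approximation and not the fitted model from the talk.

```latex
\ln r \sim \mathcal{N}\!\left(-\tfrac{\sigma^2}{2},\, \sigma^2\right)
\;\Rightarrow\; \mathbb{E}[r] = 1, \qquad
\frac{\text{observed}}{\text{expected}} \approx
\frac{\mathbb{E}[r^2]}{\mathbb{E}[r]^2} = e^{\sigma^2}
```

Under this approximation a twofold excess would correspond to σ² = ln 2 ≈ 0.69.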
Log-normal model: The fastest 5% of sites mutate ~16.4 times faster than the slowest 5% of sites.
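To relate a statement like this to the spread of the log-normal, the sketch below reads it as the ratio of the 95th to the 5th percentile of relative rates. That reading is an assumption: the slide may instead compare average rates within each tail, which would give a different value. The function names are hypothetical.

```python
from math import exp, log
from statistics import NormalDist

def sigma_from_tail_ratio(ratio, tail=0.05):
    """Log-scale standard deviation of a log-normal whose (1 - tail) quantile
    is `ratio` times its `tail` quantile."""
    z = NormalDist().inv_cdf(1 - tail)
    return log(ratio) / (2 * z)

def tail_ratio(sigma, tail=0.05):
    """Inverse of sigma_from_tail_ratio: quantile ratio for a given sigma."""
    z = NormalDist().inv_cdf(1 - tail)
    return exp(2 * z * sigma)
```

Under the percentile reading, `sigma_from_tail_ratio(16.4)` comes out at about 0.85.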
Summary: Cryptic variation in the mutation rate. No obvious context surrounding coincident SNPs. Variation is truly cryptic. Genomic distribution of coincident SNPs is over-dispersed. Variation in mutation rate is substantial.
Acknowledgments: People: Adam Eyre-Walker, Manolis Ladoukakis. BBSRC.