Cryptic Variation in the Human mutation rate
Alan Hodgkinson, Adam Eyre-Walker, Manolis Ladoukakis

Variation in the mutation rate:
Between different chromosomes
Between regions on chromosomes
Neighbouring nucleotides

Simple context effects: Hwang and Green (2004) PNAS 101:

Cryptic Variation:
Remote context: AGTCGGTTACCGTGACGTTGAACGTGT
Degenerate context: AGTCGGTTACCGTGYSRGYGAACGTGT
No context / Complex context
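A degenerate context such as YSRGYG can be read with the standard IUPAC ambiguity codes (Y = C or T, S = C or G, R = A or G). The sketch below is only an illustration of that idea, not the authors' tooling: it turns an IUPAC motif into a regular expression and reports every start position where the motif fits. The helper names `motif_to_regex` and `find_motif` are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif (hypothetical helper) into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(motif, seq):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_motif("YSR", "ACGATCG"))
```

A truly cryptic effect, in the sense of the slide, would be one that no such motif captures.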

Our approach to the problem
Search for SNPs in human sequences that also have a SNP in the orthologous position in chimp.
[Figure: human and chimp sequences]
Do we see more coincident SNPs than expected by chance?

The method
Extract all human SNPs from dbSNP and construct a BLAST database on a chromosome-by-chromosome basis.
Extract all chimp SNPs from dbSNP with 50bp either side of the SNP.
BLAST chimp SNPs against the human database.
Extract results above a certain level of homology where there is a SNP on both sequences, and reduce to 40bp either side of the central position.
Repeat the analysis both including and excluding CpG effects.
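The trimming step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes the "central position" is the chimp SNP, that a BLAST hit has already been converted into two equal-length aligned strings, and that SNP positions are known as 0-based alignment columns. The `Hit` record and the function names are hypothetical.

```python
from dataclasses import dataclass

FLANK = 40  # bases kept either side of the central position (from the slide)


@dataclass
class Hit:
    """One hypothetical BLAST hit, already expressed in alignment columns."""
    human: str        # aligned human sequence
    chimp: str        # aligned chimp sequence, same length as human
    chimp_snp: int    # 0-based column of the chimp SNP (assumed central)
    human_snps: set   # 0-based columns that carry a human SNP


def trim(hit, flank=FLANK):
    """Cut the hit down to flank bases either side of the chimp SNP.

    Returns (human_window, chimp_window, human_snp_offsets), or None when
    the flanks are too short or no human SNP falls inside the window.
    """
    start = hit.chimp_snp - flank
    end = hit.chimp_snp + flank + 1
    if start < 0 or end > len(hit.chimp):
        return None  # not enough flanking sequence
    offsets = {col - start for col in hit.human_snps if start <= col < end}
    if not offsets:
        return None  # the slide requires a SNP on both sequences
    return hit.human[start:end], hit.chimp[start:end], offsets


def is_coincident(offsets, flank=FLANK):
    """True when a human SNP sits at the central (chimp SNP) position."""
    return flank in offsets
```

With `flank=40` each retained window is 81bp long, matching the alignment length quoted on the Results slide.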

Results
~1.5 million chimp SNPs.
~310,000 81bp alignments containing a human and chimp SNP.
Observe the number of coincident SNPs.
Calculate the expected number, taking into account the effects of neighbouring nucleotides.
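The slides do not give the formula for the expected number, so the following is only one plausible sketch. It assumes each alignment carries exactly one human and one chimp SNP, that the two are placed independently, and that the chance of a SNP at a site is proportional to a relative rate for the triplet centred on that site. The `rate` table is hypothetical input (for example, estimates in the spirit of the neighbouring-nucleotide effects cited earlier), and end positions without a full triplet are skipped for simplicity.

```python
def expected_coincidence(seq, rate, default=1.0):
    """Probability that two independently placed SNPs land on the same site.

    seq   : one aligned sequence (string of A/C/G/T)
    rate  : dict mapping a triplet to a relative mutability (assumed input)
    """
    weights = []
    for i in range(1, len(seq) - 1):
        triplet = seq[i - 1:i + 2]
        weights.append(rate.get(triplet, default))
    total = sum(weights)
    # Each SNP hits site i with probability w_i / total, so both hit the
    # same site with probability sum_i (w_i / total) ** 2.
    return sum((w / total) ** 2 for w in weights)


def expected_total(seqs, rate):
    """Expected number of coincident SNPs across a set of alignments."""
    return sum(expected_coincidence(s, rate) for s in seqs)


def obs_exp_ratio(observed, seqs, rate):
    """Observed number of coincident SNPs divided by the expected number."""
    return observed / expected_total(seqs, rate)
```

With equal rates everywhere the per-alignment probability reduces to one over the number of sites considered; a hypermutable triplet raises it, which is why context has to be accounted for before calling an excess.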

Results
Columns: Obs, Exp, Ratio (only the bracketed intervals survive)
All: (1.72, 1.79)
No-CpG: (1.93, 2.04)

Results
[Table: rows and columns are the SNP types C/T, G/A, C/A, G/T, C/G, A/T; cell values not recoverable]

Alternative Explanations
Bias in the Method
Selection
Ancestral Polymorphism
Paralogous SNPs

Methodological Bias
Simulated data with the same density of human and chimp SNPs as dbSNP, under different divergence and mutation patterns.
The method worked well under realistic conditions.

Methodological Bias
Columns: Div, Obs, Exp, Ratio, 95% CI (only the 95% CIs survive)
All sites (H&G): (0.963, 1.103); (1.003, 1.086); (0.920, 1.069)
Non CpG sites (H&G): (0.844, 1.028); (0.908, 1.018); (0.840, 1.030)

Selection
Areas of low SNP density result in clustering:
[Figure: human and chimp sequences]
Apparent excess of coincident SNPs

Selection No clustering:

Ancestral Polymorphism
SNP inherited from common ancestor of chimp and human:
[Figure: T and A alleles shown in the common ancestor, human and chimp]
Increase in coincident SNPs

Ancestral Polymorphism
Expect the observed/expected ratio to be the same for all transitions:
[Table: rows and columns are the SNP types C/T, G/A, C/A, G/T, C/G, A/T; cell values not recoverable]

Ancestral Polymorphism
Repeated the initial analysis with macaque data.
Humans and macaques split ~23-24 million years ago, so we expect there to be no shared polymorphisms.
Columns: Obs, Exp, Ratio (only the bracketed intervals survive)
All: (1.27, 2.00)
No-CpG: (1.001, 2.02)

Paralogous SNPs
Excess of coincident SNPs as a consequence of artifactual SNPs called as a result of substitutions in paralogous regions.
Musumeci et al (2010): 8.32% of human variation in dbSNP may be due to paralogy.
AGCTGCACGT T CGGCATCCAA  Chromosome 1
AGCTGCACGT A CGGCATCCAA  Chromosome 7
AGCTGCACGT (T/A) CGGCATCCAA  Artifactual SNP

Paralogous SNPs
AGCTGCACGT (T/A) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT (T/A) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT A CGGCATCCAA
3.6% of coincident SNPs are possibly a consequence of paralogous sequences

Alternative Explanations
Bias in the method
Selection
Ancestral Polymorphism
Paralogous SNPs
Cryptic variation in the mutation rate

Context Analysis
4517 sequences containing non-CpG coincident SNPs flanked by 200bp.
Tabulate triplet frequencies at each position in the surrounding sequences.
Test whether the proportions of triplets we observe at each position are significantly different from the proportions in the sequences as a whole.
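The tabulation step can be sketched like this. It is a minimal illustration that assumes all sequences have the same length and are aligned on the coincident SNP; the slide does not say which statistical test was applied to the counts, so only the counting is shown. The function name is hypothetical.

```python
from collections import Counter


def triplet_counts_by_position(seqs):
    """Count triplets starting at each position, plus the overall totals.

    Returns (per_position, overall): per_position[i] is a Counter of the
    triplets that start at offset i, and overall pools every position.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    per_position = [Counter() for _ in range(length - 2)]
    overall = Counter()
    for seq in seqs:
        if len(seq) != length:
            raise ValueError("sequences must all have the same length")
        for i in range(length - 2):
            triplet = seq[i:i + 3]
            per_position[i][triplet] += 1
            overall[triplet] += 1
    return per_position, overall
```

The per-position counters can then be compared against `overall` with whatever goodness-of-fit test is preferred, one position at a time.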

Context Analysis
Coincident SNP in central position:
No obvious context surrounding coincident SNPs

Genomic Distribution
Tallied the number of coincident SNPs per MB: [value not recoverable] coincident SNPs per MB, [value not recoverable] non-CpG coincident SNPs per MB.
If randomly distributed, expect a Poisson distribution with μ = σ² = 3.91.
Observed σ² = [value not recoverable] (p<0.001), and so sampling variance explains approximately 30% of total variance.
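The reasoning behind the "sampling variance" statement can be sketched as follows. Under a Poisson model the variance of the counts equals their mean, so the share of the observed variance attributable to sampling alone is the mean divided by the observed variance. The function names are hypothetical, and the per-MB counts themselves are not available from the slides.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson, above 1 if over-dispersed."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero")
    return pvariance(counts) / m


def sampling_fraction(counts):
    """Share of total variance expected from Poisson sampling alone."""
    v = pvariance(counts)
    if v == 0:
        raise ValueError("counts have zero variance")
    # Poisson sampling variance equals the mean of the counts.
    return mean(counts) / v


if __name__ == "__main__":
    toy_counts = [0, 2, 4, 6, 8]  # made-up window counts, not the authors' data
    print(dispersion_index(toy_counts), sampling_fraction(toy_counts))
```

A sampling fraction well below 1 is what "over-dispersed" means in the summary: the windows differ from each other more than chance placement would produce.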

Genomic Distribution
Columns: Feature, r, r², p (most values not recoverable)
SNP density: p <0.001**
Distance to Telomere
Distance to Centromere
Recombination Rate: p <0.001**
Nucleosome Association
Gene Density
GC content

Genomic Distribution
SNP densities must drive coincident SNP densities to a certain extent, as approximately half of coincident SNPs are created by chance alone.
Recombination rate positively correlated with SNP density (r = 0.242, p<0.001).
Partial correlation controlling for SNP density: r = 0.048, p=0.011**.
SNP densities explain 6.5% of the variance, recombination rate explains 0.2% of the variance of coincident SNPs.
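A first-order partial correlation can be computed from three pairwise correlations with the standard textbook formula, sketched below. Whether the authors used exactly this route is an assumption; here x would be coincident SNP density, y recombination rate, and z SNP density. The slides give only some of the pairwise values, so the reported r = 0.048 cannot be reproduced from them.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    denom = sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


def variance_explained(r):
    """Share of variance explained by a (partial) correlation r."""
    return r ** 2
```

Squaring the partial correlation gives the share of variance explained, which is how a small r such as 0.048 translates into roughly 0.2% of the variance.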

Genomic Distribution
Columns: Feature, r, r², p (most values not recoverable)
Coincident SNP Density: p <0.001**
Distance to Telomere: p <0.001**
Distance to Centromere: **
Recombination Rate: p <0.001**
Nucleosome Association: p <0.001**
Gene Density: **
GC content: p <0.001**

Quantification
Use a log-normal distribution of relative mutation rates due to cryptic variation.
Model the number of coincident SNPs under the effects of cryptic variation.
Incorporate effects of divergence.
What level of variation in the log-normal distribution explains our results?

Log-normal model Fastest 5% of sites mutate ~16.4 times faster than slowest 5% of sites.
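To see how a fold difference of this kind relates to the spread of a log-normal distribution, the sketch below uses one possible reading of the slide: the ratio of the 95th to the 5th percentile of the relative rate. The slide could equally mean the ratio of average rates within the top and bottom 5% of sites, which would give a different number, so treat this purely as an illustration. `sigma` is the standard deviation of the log rate, and both function names are hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def quantile_ratio(sigma, tail=0.05):
    """Ratio of the (1 - tail) quantile to the tail quantile of a log-normal."""
    z = NormalDist().inv_cdf(1.0 - tail)
    # The two quantiles are exp(mu + z * sigma) and exp(mu - z * sigma),
    # so the location parameter mu cancels in the ratio.
    return exp(2.0 * z * sigma)


def sigma_for_ratio(ratio, tail=0.05):
    """Inverse of quantile_ratio: the sigma that yields a given fold difference."""
    z = NormalDist().inv_cdf(1.0 - tail)
    return log(ratio) / (2.0 * z)


if __name__ == "__main__":
    print(sigma_for_ratio(16.4))
```

Under this quantile reading, a 16.4-fold difference corresponds to a sigma of about 0.85 on the log scale.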

Summary
Cryptic variation in the mutation rate.
No obvious context surrounding coincident SNPs.
Variation is truly cryptic.
Genomic distribution of coincident SNPs is over-dispersed.
Variation in mutation rate is substantial.

Acknowledgments Manolis Ladoukakis BBSRC People: Adam Eyre-Walker