Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics


1 Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics 2015 @mgymrek

2 Genetic variation comes in many forms
SNP: ACGACTCGAGCG vs. ACGACACGAGCG. μ_SNP: 1.20 × 10^-8 /loc/gen
Short indel (1-20bp): ACGACTCGAGCG vs. ACGAC-CGAGCG. μ_INDEL: 0.68 × 10^-9 /loc/gen
Short tandem repeat: CAGCAG---CAGCAGCA vs. CAGCAGCAGCAGCAGCA. μ_STR: 10^-2 to 10^-5 /loc/gen
Alu retrotransposition
Struct. Var/CNV (>20bp)
[Bar chart, # de novo/gen: STR 500, Alu 0.05, SV 0.2, Indel 3, SNP 50]
Intro. STR catalog PSMC Mutation process Conclusion 10/29/15 Melissa Gymrek Genome Informatics 2015

3 eSTRs contribute to gene expression variability
[Figure: observed p-value (-log10) vs. expected p-value under the null (-log10); schematic labels: Gene, (TG) STR, Expression]

4 Why study the STR mutation process?
1. Identify rapidly mutating STRs
2. Understand biological processes driving mutation patterns
3. Identify STRs under selective pressure (Haasl and Payseur 2013)
   H_0: Locus evolves under neutral model
   H_1: Locus is under selection

5 STRs and SNPs provide orthogonal molecular clocks
Clock 1: SNPs. # mismatches ~ f(μ_SNP, t, …)
   ACCCATCCTAGCTACCGACTACAACG
   ACCGATCCTAGCTTCCGACTACCACG
Clock 2: STRs. (m-n)^2 ~ f(μ_STR, t, …)
   ACACTCATCTG(CAG)_m ACACACTGA
   ACACTCATCTG(CAG)_n ACACACTGA
Use known value of μ_SNP to calibrate the STR molecular clock
μ_STR: STR mutation rate (/loc/gen)
μ_SNP: SNP mutation rate (/loc/gen)
t: Time to the most recent common ancestor (TMRCA)
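The slide leaves both clock functions as f(...). As a rough illustration only, the sketch below assumes the simplest textbook forms: expected SNP mismatches between two haplotypes of about 2 · μ_SNP · t per site, and, under a stepwise model with ±1 steps and no length constraint, an expected (m-n)^2 of about 2 · μ_STR · t. The function names and the example window size are hypothetical, and this is not the estimator used in the talk.

```python
MU_SNP = 1.2e-8  # SNP mutation rate per locus per generation (value from slide 2)


def count_mismatches(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def tmrca_from_snps(n_mismatches, n_sites, mu_snp=MU_SNP):
    """Clock 1 (assumed form): mismatches ~ 2 * mu_snp * t * n_sites, solved for t."""
    return n_mismatches / (2.0 * mu_snp * n_sites)


def mu_str_from_pair(m, n, t):
    """Clock 2 (assumed form): (m - n)^2 ~ 2 * mu_str * t, solved for mu_str.

    A single pair of alleles gives a very noisy estimate; this only shows
    how a TMRCA from clock 1 can calibrate clock 2.
    """
    return (m - n) ** 2 / (2.0 * t)


# The two example haplotypes from the slide differ at three positions.
n_diff = count_mismatches("ACCCATCCTAGCTACCGACTACAACG",
                          "ACCGATCCTAGCTTCCGACTACCACG")

# Hypothetical window: 24 mismatches observed across 1,000,000 sites.
t_hat = tmrca_from_snps(24, 1_000_000)
mu_hat = mu_str_from_pair(5, 8, t_hat)
```

In practice the estimate would pool many individuals per locus rather than rely on one allele pair.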

6 Estimating STR mutation parameters from WGS
Input: 300 high coverage SGDP whole genomes
STR calls: alleles CAG_m and CAG_n at a diploid locus, giving (m-n)^2
SNPs -> PSMC (Li and Durbin 2011) -> TMRCA
TMRCA and (m-n)^2 -> infer locus-specific mutation params. (L_k; step size frequency)
Learn model to predict mutation parameters from sequence features

7 We are now armed with deep WGS amenable to STR profiling
SGDP: 300 deeply sequenced, PCR-free genomes with diverse origins

8 Generating high quality STR genotypes (lobSTR)
Per sample (Sample 1, Sample 2, ..., Sample n): FASTQ -> alignment (BWA-MEM) -> BAM
Across samples: allelotype (multi-sample, lobSTR) -> VCF -> filtering

9 High coverage genomes provide accurate STR genotypes
93% concordance with capillary data
Homopolymers (e.g. AAAAAA) (n=50,398): R^2=0.92
Accurately recover population structure
http://strcat.teamerlich.org/

10 Estimating STR mutation parameters from WGS
[Workflow diagram repeated from slide 6, here with "300 high coverage SGDP samples"; current section: PSMC]

11 Measuring TMRCA using PSMC
[Figure: discretized TMRCA along the maternal and paternal chromosomes (Li and Durbin, Nature 2011)]
Measure local TMRCA at the STR locus (alleles CAG_m and CAG_n)

12 Estimating STR mutation parameters from WGS
[Workflow diagram repeated from slide 6, here with "300 high coverage SGDP samples"; current section: Mutation process]

13 What we know about STR mutations
1. Mutate in “unit” lengths. Example alignment:
   CAGCAGCAGCAGCAGCAGCAGCAG
   CAGCAGCAG---CAGCAGCAGCAG
   CAGCAG------CAGCAGCAGCAG
   CAGCAG---CAGCAGCAGCAGCAG
   CAGCAGCA-CAGCAGCAGCAGCAG
2. Step size distribution ~Geometric (P: probability of mutating a single step)
3. Length constraint biases mutation direction: short alleles tend to get longer, long alleles shorter (Sun et al. 2012)
4. Other important factors not modeled here:
   Length-dependent mutation rate
   Motif sequence interruptions
   Large expansions behave differently (e.g. Huntington’s)
   Biased gene conversion?
   Interaction between alleles? (3, 6; 4, 4)

14 Modeling STR mutation as a mean-centered random walk
Simple Stepwise Model (SMM): mutate by +/- 1 copy of the repeat unit with probability μ
[Diagram: MRCA allele (CAG) diverging over time t into alleles CAG_m and CAG_n]
Observed (Sun et al. 2012): mean-centered random walk (Ornstein-Uhlenbeck)
μ_STR: Mutation rate (per generation)
β: Length constraint (0 ≤ β ≤ 1)
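To make the mean-centered random walk concrete, here is a minimal simulation sketch. The slides do not give the exact functional form, so the direction rule used here, P(step up) = (1 - β·x)/2 clipped to [0, 1] with x the allele length relative to the mean, is an assumption chosen only so that β = 0 gives an unbiased walk and larger β pulls alleles back toward the mean. The geometric step size with parameter p follows the description on the neighboring slides; all function names are hypothetical.

```python
import random


def step_size(p, rng):
    """Draw a geometric step size k >= 1 with P(k = 1) = p (requires p > 0)."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def prob_increase(x, beta):
    """Assumed direction rule: (1 - beta * x) / 2, clipped to [0, 1].

    x is the allele length in repeat units relative to the mean length.
    """
    return min(1.0, max(0.0, (1.0 - beta * x) / 2.0))


def simulate_lineage(start, generations, mu, beta, p, rng):
    """Evolve one allele for a number of generations and return its final length."""
    x = start
    for _ in range(generations):
        if rng.random() < mu:  # a mutation occurs in this generation
            k = step_size(p, rng)
            if rng.random() < prob_increase(x, beta):
                x += k  # expansion by k repeat units
            else:
                x -= k  # contraction by k repeat units
    return x


# Two lineages descending from the same MRCA allele (relative length 0).
rng = random.Random(0)
m = simulate_lineage(0, 1000, 1e-3, 0.3, 0.95, rng)
n = simulate_lineage(0, 1000, 1e-3, 0.3, 0.95, rng)
squared_difference = (m - n) ** 2
```

Setting β = 0 and p = 1 recovers the plain stepwise model with ±1 steps.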

15 Estimating the step size distribution
[Figure: histograms of frequency vs. step size (# units, -4 to +4); panels labeled 0, +5, -5 (mean allele length)]
p: Probability that the step size is a single unit.
Tetranucleotides: p = ~0.95
Dinucleotides: p = ~0.7
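A short sketch of what these p values imply, assuming the step size magnitude k follows a standard geometric distribution on k = 1, 2, 3, ..., so that P(k) = p·(1-p)^(k-1). The estimator below is the usual maximum likelihood estimate for that distribution (one over the mean step size); the function names and the toy step sizes are hypothetical, not data from the talk.

```python
def geometric_pmf(k, p):
    """P(step size = k) for a geometric distribution on k = 1, 2, 3, ..."""
    if k < 1:
        raise ValueError("step size must be at least 1")
    return p * (1.0 - p) ** (k - 1)


def estimate_p(step_sizes):
    """Maximum likelihood estimate of p from observed absolute step sizes."""
    if not step_sizes or min(step_sizes) < 1:
        raise ValueError("need at least one step size, all >= 1")
    return len(step_sizes) / sum(step_sizes)


# Probability of a two-unit step for the two motif classes on the slide.
dinucleotide_two_units = geometric_pmf(2, 0.7)       # 0.7 * 0.3
tetranucleotide_two_units = geometric_pmf(2, 0.95)   # 0.95 * 0.05

# Toy example: three single-unit steps and one two-unit step.
p_hat = estimate_p([1, 1, 1, 2])
```

With p = ~0.95, multi-unit steps are rare for tetranucleotides, while with p = ~0.7 roughly three in ten dinucleotide mutations span more than one unit.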

16 Model validation using Y-STRs (Thomas Willems)
Find maximum likelihood mutation parameters (1000 Genomes Project): P(STR data | Y phylogeny, μ, β, σ)
Validation set: Ballantyne et al. (~2,000 father-son pairs)
[Figure: Ballantyne et al. vs. lobSTR, r=0.831, N=64]

17 Estimating mutation parameters at autosomal loci
[Figure: ASD (axis ticks 0, 4, 9, 16) vs. TMRCA, one point per individual: Individual 1, Individual 2, ..., Individual k, each with alleles CAG_m and CAG_n (examples shown include CAG_5 and CAG_8)]
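The figure pairs each individual's ASD with the local TMRCA at the locus. As a simplified stand-in for the model fitting described in the talk, the sketch below takes ASD to be the squared allele-length difference (m-n)^2, which is consistent with the axis ticks 0, 4, 9, 16, and assumes a stepwise model with no length constraint, under which the expected ASD is about 2·μ·t. It then fits a least-squares line through the origin. The function names and the toy data are hypothetical.

```python
def squared_difference(m, n):
    """ASD for one diploid individual with allele lengths m and n (repeat units)."""
    return (m - n) ** 2


def estimate_mu(genotypes, tmrcas):
    """Estimate a per-generation mutation rate from ASD vs. TMRCA.

    Assumes E[ASD] = 2 * mu * t, so mu is half the slope of a
    least-squares line through the origin.
    """
    if len(genotypes) != len(tmrcas) or not genotypes:
        raise ValueError("need one TMRCA per genotype")
    asds = [squared_difference(m, n) for m, n in genotypes]
    slope = sum(t * a for t, a in zip(tmrcas, asds)) / sum(t * t for t in tmrcas)
    return slope / 2.0


# Toy data: (m, n) allele lengths and local TMRCA in generations.
toy_genotypes = [(5, 5), (5, 7), (5, 8)]
toy_tmrcas = [500.0, 2000.0, 4500.0]
mu_hat = estimate_mu(toy_genotypes, toy_tmrcas)
```

A fit of this kind ignores the length constraint β and the step size distribution, both of which the model on the earlier slides includes.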

18 Per-locus estimation of STR mutation parameters
Estimates for 120K multi-allelic STRs

19 STR mutation trends by motif length

20 Future directions: a genome-wide scan for STR selection
Features (motif length, recomb. rate, total length, GC content) -> linear model -> predict μ, β
Explain: 46% of variation in μ, 4.6% of variation in β
Develop genome-wide STR selection scan (expected vs. observed)

21 Conclusion
The first genome-wide characterization of STR mutation:
1. STR mutation model
2. Validation against published de novo mutation rates
3. Strong effect of local sequence features
4. Future work: improve estimation, genome-wide selection scan
An unexplored, important source of genetic variation

22 Acknowledgements: Yaniv Erlich, David Reich, Mark Daly, Nick Patterson, Swapan Mallick, Thomas Willems, Alon Goren

