Published by Piers Hopkins. Modified over 8 years ago.
1
Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics 2015 @mgymrek
2
Genetic variation comes in many forms
SNP: ACGACTCGAGCG / ACGACACGAGCG; μ_SNP: 1.20 × 10^-8 /loc/gen
Short indel (1-20bp): ACGACTCGAGCG / ACGAC-CGAGCG; μ_INDEL: 0.68 × 10^-9 /loc/gen
Short tandem repeat: alleles differing in the number of CAG copies; μ_STR: 10^-2 to 10^-5 /loc/gen
Alu retrotransposition
Struct. Var/CNV (>20bp)
# de novo/gen: STR 500, Alu 0.05, SV 0.2, Indel 3, SNP 50
Intro. STR catalog PSMC Mutation process Conclusion 10/29/15 Melissa Gymrek Genome Informatics 2015
3
eSTRs contribute to gene expression variability
[Plot: observed p-value (-log10) vs. expected p-value under the null (-log10). Schematic: Gene, (TG) STR, Expression]
4
Why study the STR mutation process?
1. Identify rapidly mutating STRs
2. Understand biological processes driving mutation patterns
3. Identify STRs under selective pressure (Haasl and Payseur 2013)
H_0: Locus evolves under neutral model
H_1: Locus is under selection
5
STRs and SNPs provide orthogonal molecular clocks
Clock 1: SNPs. # mismatches ~ f(μ_SNP, t, …)
ACCCATCCTAGCTACCGACTACAACG
ACCGATCCTAGCTTCCGACTACCACG
Clock 2: STRs. (m-n)^2 ~ f(μ_STR, t, …)
ACACTCATCTG(CAG)m ACACACTGA
ACACTCATCTG(CAG)n ACACACTGA
Use known value of μ_SNP to calibrate the STR molecular clock
μ_STR: STR mutation rate (/loc/gen)
μ_SNP: SNP mutation rate (/loc/gen)
t: Time to the most recent common ancestor (TMRCA)
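The calibration idea can be sketched in a few lines. The slide does not spell out either f(...), so this sketch assumes the simplest forms: SNP mismatches between two haplotypes accumulate at 2·μ_SNP·t per site, and under a strict single-step model with no length constraint the expected squared repeat difference is 2·μ_STR·t. The function names and the inputs in the usage note are hypothetical.

```python
def tmrca_from_snps(mismatches, sites, mu_snp=1.2e-8):
    """TMRCA in generations from the SNP clock.

    Assumes mismatches per site ~ 2 * mu_snp * t (two lineages, each of
    length t). mu_snp defaults to the per-site rate quoted on slide 2.
    """
    return mismatches / (2.0 * mu_snp * sites)


def mu_str_from_squared_diffs(squared_diffs, tmrcas):
    """Moment estimate of mu_STR from the STR clock.

    Assumes E[(m - n)^2] = 2 * mu_str * t, i.e. a strict single-step
    model without length constraint (an assumption, not the talk's model).
    """
    return sum(squared_diffs) / (2.0 * sum(tmrcas))
```

For example, `tmrca_from_snps(24, 100_000)` gives roughly 10,000 generations, and feeding such TMRCAs together with the observed (m-n)^2 values into `mu_str_from_squared_diffs` yields a per-locus rate estimate under these assumptions.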
6
Estimating STR mutation parameters from WGS
Input: 300 high coverage SGDP whole genomes
STR calls: (m-n)^2 at each diploid locus (CAG m, CAG n)
SNPs: TMRCA via PSMC (Li and Durbin 2011)
Infer locus specific mutation params. [Plots: (m-n)^2 vs. TMRCA at locus L_k; frequency vs. step size]
Learn model to predict mutation parameters from sequence features
7
We are now armed with deep WGS amenable to STR profiling
SGDP: 300 deeply sequenced, PCR-free genomes with diverse origins
8
Generating high quality STR genotypes
Per sample (Sample 1, Sample 2, ..., Sample n): FASTQ → Alignment (BWA-MEM) → BAM
All samples: Allelotype (multi-sample, lobSTR) → VCF → Filtering
9
High coverage genomes provide accurate STR genotypes
Homopolymers (e.g. AAAAAA) (n=50,398): R^2=0.92
93% concordance with capillary data
Accurately recover population structure
http://strcat.teamerlich.org/
10
Estimating STR mutation parameters from WGS
[Workflow diagram repeated from slide 6; current section: PSMC]
11
Measuring TMRCA using PSMC
Measure local TMRCA between the maternal chromosome (CAG m) and the paternal chromosome (CAG n)
[Figure: discretized TMRCA; Li and Durbin, Nature 2011]
12
Estimating STR mutation parameters from WGS
[Workflow diagram repeated from slide 6; current section: Mutation process]
13
What we know about STR mutations
1. Mutate in “unit” lengths
CAGCAGCAGCAGCAGCAGCAGCAG
CAGCAGCAG---CAGCAGCAGCAG
CAGCAG------CAGCAGCAGCAG
CAGCAG---CAGCAGCAGCAGCAG
CAGCAGCA-CAGCAGCAGCAGCAG
2. Step size distribution ~Geometric (p: probability of mutating a single step)
3. Length constraint biases mutation direction [Figure, Sun et al. 2012; labels: short alleles, longer, shorter, longer]
4. Other important factors not modeled here
Length-dependent mutation rate
Motif sequence interruptions
Large expansions behave differently (e.g. Huntington’s)
Biased gene conversion?
Interaction between alleles? [figure labels: 3, 6; 4, 4]
14
Modeling STR mutation as a mean-centered random walk
Simple Stepwise Model (SMM): mutate by +/- 1 copy of the repeat unit with probability μ
[Schematic: CAG MRCA diverging over time t into CAG m and CAG n]
Mean-centered random walk (Ornstein-Uhlenbeck) [Figure: model compared with observed data, Sun et al. 2012]
μ_STR: Mutation rate (per generation)
β: Length constraint (0 ≤ β ≤ 1)
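A small simulation makes the role of β concrete. The exact update rule used in the talk is not reproduced here, so this is a sketch under an assumed parameterization: a mutation occurs with probability μ per generation, always by one unit, and the probability of stepping up is 0.5·(1 − β·x), clipped to [0, 1], where x is the allele length relative to the mean. With β = 0 this reduces to the plain stepwise model. All names are hypothetical.

```python
import random


def simulate_allele(generations, mu, beta, start=0, rng=None):
    """Evolve one allele; returns its length relative to the mean allele.

    Assumed rule (illustrative only): mutate with probability mu per
    generation, one unit at a time, with the upward probability shrinking
    linearly as the allele gets longer than the mean.
    """
    rng = rng or random.Random()
    x = start
    for _ in range(generations):
        if rng.random() < mu:
            p_up = min(1.0, max(0.0, 0.5 * (1.0 - beta * x)))
            x += 1 if rng.random() < p_up else -1
    return x


def mean_squared_difference(pairs, generations, mu, beta, seed=0):
    """Average (m - n)^2 over simulated pairs of lineages.

    Both lineages of a pair start from the same ancestor (x = 0) and
    evolve independently for `generations` generations.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(pairs):
        m = simulate_allele(generations, mu, beta, rng=rng)
        n = simulate_allele(generations, mu, beta, rng=rng)
        total += (m - n) ** 2
    return total / pairs
```

Comparing `mean_squared_difference(100, 1000, 0.02, 0.0)` with the same call at `beta=0.5` shows the qualitative effect: without constraint the squared difference keeps growing with time, while with constraint it saturates.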
15
Estimating the step size distribution
[Figure: histograms of frequency vs. step size (# units, -4 to +4), shown relative to mean allele length (-5, 0, +5)]
p: Probability that the step size is a single unit.
Tetranucleotides: p = ~0.95
Dinucleotides: p = ~0.7
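To see what these values of p imply, here is a short sketch assuming the step magnitude follows a geometric distribution on 1, 2, 3, ... with P(|step| = k) = p·(1 − p)^(k − 1), which is consistent with p being the probability of a single-unit step. The function names are hypothetical.

```python
def step_size_pmf(p, max_step=4):
    """P(|step| = k) for k = 1..max_step under a geometric distribution
    in which p is the probability of a single-unit step."""
    return {k: p * (1.0 - p) ** (k - 1) for k in range(1, max_step + 1)}


def mean_step_size(p):
    """Expected step magnitude (in repeat units) of that geometric law."""
    return 1.0 / p


tetra = step_size_pmf(0.95)  # tetranucleotides: almost always one unit
di = step_size_pmf(0.7)      # dinucleotides: multi-unit steps are common
```

Under this assumption about 30% of dinucleotide mutations change the allele by two or more units, against about 5% for tetranucleotides.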
16
Model validation using Y-STRs (Thomas Willems)
Find maximum likelihood mutation parameters (1000 Genomes Project): P(STR data | Y phylogeny, μ, β, σ)
Validation set: Ballantyne et al. (~2,000 father-son pairs)
[Plot: Ballantyne, et al. vs. lobSTR; r=0.831, N=64]
17
Estimating mutation parameters at autosomal loci
[Figure: ASD (0, 4, 9, 16) vs. TMRCA, one point per individual; example alleles CAG 5, CAG 5, CAG 8 for Individual 1 and Individual 2; CAG m, CAG n for Individual k]
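The quantity plotted here can be computed directly from diploid genotypes. This sketch takes ASD to be the squared difference in repeat count between an individual's two alleles, (m − n)^2 as on the earlier slides, and averages it within TMRCA bins. The function names and the genotypes in the usage note are hypothetical, not data from the talk.

```python
from collections import defaultdict


def allele_squared_difference(m, n):
    """Squared difference in repeat copy number between two alleles."""
    return (m - n) ** 2


def mean_asd_by_tmrca(observations):
    """Average ASD per TMRCA bin at one locus.

    observations: iterable of (tmrca_bin, m, n), one entry per individual,
    where tmrca_bin is that individual's local (discretized) TMRCA.
    """
    sums = defaultdict(lambda: [0, 0])  # bin -> [sum of ASD, count]
    for tmrca_bin, m, n in observations:
        sums[tmrca_bin][0] += allele_squared_difference(m, n)
        sums[tmrca_bin][1] += 1
    return {b: s / c for b, (s, c) in sorted(sums.items())}
```

With made-up observations `[(1, 5, 5), (1, 5, 7), (2, 5, 8)]` the result is `{1: 2.0, 2: 9.0}`; the trend of mean ASD against TMRCA is what the per-locus fit of μ and β would use.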
18
Per-locus estimation of STR mutation parameters
Estimates for 120K multi-allelic STRs
19
STR mutation trends by motif length
20
Future directions: a genome-wide scan for STR selection
Features: motif length, recomb. rate, total length, GC content
Linear model: predict μ, β from these features
Explains: 46% of variation in μ, 4.6% of variation in β
Develop genome-wide STR selection scan [Figure: expected vs. observed]
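"Variation explained" for a linear model is usually reported as R^2. The sketch below shows that computation for a single feature using ordinary least squares. It is a simplification: the model on the slide combines several features, and the function names and numbers in the usage note are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y, y_hat):
    """Fraction of the variance in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```

For instance, fitting `x = [1, 2, 3, 4]` against `y = [1, 3, 2, 4]` gives slope 0.8 and intercept 0.5, and the fitted values explain 64% of the variation in `y`. A locus whose observed μ departs strongly from the model's prediction would be a candidate for the selection scan.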
21
Conclusion
The first genome-wide characterization of STR mutation
1. STR mutation model
2. Validation against published de novo mutation rates
3. Strong effect of local sequence features
4. Future work: improve estimation, genome-wide selection scan
An unexplored, important source of genetic variation
22
Acknowledgements: Yaniv Erlich, David Reich, Mark Daly, Nick Patterson, Swapan Mallick, Thomas Willems, Alon Goren