Complex Trait Genetics in Animal Models Will Valdar Oxford University
Mapping Genes for Quantitative Traits in Outbred Mice Will Valdar Oxford University
What’s so great about mice?
Share ~99% of genes with humans
~90% of the two genomes can be partitioned into regions of conserved synteny
Shorter lifespans
You can do invasive experiments
You can breed them as you like, which lets you control the genetics
What is an inbred strain? By the F20 generation, the inbreeding coefficient F is about 99%. Example strain: BALB/c
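The figure of roughly 99% at F20 can be checked with a short calculation. The sketch below assumes that each generation is produced by brother-sister (full-sib) mating, which the slide does not state explicitly, and uses the standard recurrence for heterozygosity under that scheme. The function name is hypothetical.

```python
def inbreeding_after_sib_mating(generations):
    """Inbreeding coefficient F after a number of generations of full-sib mating.

    Assumes brother-sister mating throughout. With H = 1 - F, the
    standard recurrence is H[t] = H[t-1] / 2 + H[t-2] / 4.
    """
    h_prev2, h_prev1 = 1.0, 1.0  # unrelated founders and their non-inbred offspring
    for _ in range(generations):
        h_prev2, h_prev1 = h_prev1, h_prev1 / 2 + h_prev2 / 4
    return 1.0 - h_prev1


for g in (1, 5, 10, 20):
    print(g, round(inbreeding_after_sib_mating(g), 3))
```

Under this assumption F rises quickly at first and then creeps towards 1, which is why 20 generations is a convenient cut-off for calling a strain inbred.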
http://www.informatics.jax.org/mgihome/genealogy/
one inbred strain
two inbred strains
Mouse model of anxiety
Mice are not natural predators; they are prey to larger animals, and as such they like small, dark places. This is an open field arena: a brightly lit open box, which we would expect to be anxiety-producing for a mouse. We place the mouse into this arena and use a video tracking device to monitor its behaviour.
Mouse model of anxiety
Panel labels: Anxious mouse, Non-anxious mouse
On the right, I show the movements of the mouse. The red lines indicate movement and the green spots indicate freezing positions. You can see that the mouse in the top panel is thoroughly exploring its new territory, moving all about the enclosure, including the central area away from the relatively “safe” walls of the open field arena. In the bottom panel, you can see a completely contrasting behaviour. The mouse has entered through the door on the right-hand side and has frozen as soon as it entered the arena; it is not exploring its new territory at all. An additional measure of anxiety is defecation.
F2 cross Generation F0 F1 F2
Quantitative Trait Locus
[Figure: histogram of the phenotype; the axis runs from anxious, through about average, to non-anxious]
Suppose we have a quantitative phenotype that is influenced by a genetic variant. For example, say our normally distributed phenotype is affected by a QTL accounting for some small fraction of the variance, 10% or so. If we plotted a histogram of the phenotype, it would look something like this. However, assuming the QTL had two alleles and acted in an additive manner, we would in fact be looking at three populations:
Those individuals who are homozygous for the low allele.
Those who are homozygous for the high allele.
And those who are heterozygous, and who tend to have an intermediate phenotype.
Here are the three genotypes at the QTL snp. The first step is to encode these in some way so that they can be correlated with the phenotype.
Linear models
Also known as: ANOVA, ANCOVA, regression, multiple regression, linear regression
To keep it general, we’ll be using linear models. ANOVA, ANCOVA, regression, multiple regression and linear regression are different names that people use for linear models, but of course they’re all the same thing.
[Figure: the three genotypes at the QTL snp, coded -1, 0 and +1]
A simple way is to represent the snp by x and let x equal -1, 0 and +1 for the three genotypes in turn.
Having done this, it’s natural to model the effect of the genotype on the phenotype as a linear model, $y = \mu + a x + \varepsilon$, where y represents the phenotype.
$\mu$ defines the mean.
a then describes the average effect of adding or taking away a copy of the high allele, and $\varepsilon$ represents the error, which accounts for the fact that animals with the same genotype may still have different phenotypes.
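As an illustration of the encoding and the model above, here is a minimal sketch that simulates a phenotype from $y = \mu + a x + \varepsilon$ and recovers $\mu$ and $a$ by least squares. The genotype labels, the 1:2:1 genotype ratio, and the values of mu, a and the error standard deviation are all assumptions chosen for the example, not values from the talk.

```python
import random

random.seed(1)

# Hypothetical genotype labels: low/low, heterozygote, high/high.
CODE = {"LL": -1, "LH": 0, "HH": +1}


def simulate(n, mu=10.0, a=0.5, sd=1.0):
    """Simulate n animals with genotypes in an assumed 1:2:1 ratio."""
    genotypes = random.choices(["LL", "LH", "HH"], weights=[1, 2, 1], k=n)
    x = [CODE[g] for g in genotypes]
    y = [mu + a * xi + random.gauss(0.0, sd) for xi in x]
    return x, y


def fit_line(x, y):
    """Least-squares estimates of mu and a in y = mu + a*x + epsilon."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a_hat = sxy / sxx
    mu_hat = y_bar - a_hat * x_bar
    return mu_hat, a_hat


x, y = simulate(2000)
mu_hat, a_hat = fit_line(x, y)
print(f"mu_hat = {mu_hat:.2f}, a_hat = {a_hat:.2f}")
```

With a few thousand animals the estimates should land close to the simulated values, while individual animals with the same genotype still differ because of the error term.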
Hypothesis testing
H0: $y = \mu + \varepsilon$
H1: $y = \mu + a x + \varepsilon$
Of course, the way we go about testing whether there is a significant effect of genotype is by comparing two models, one representing a null hypothesis and the other an alternative hypothesis: the phenotype is influenced by the mean plus some error, versus the phenotype is influenced by the mean plus the genotype plus error.
Hypothesis testing
H0: y ~ 1
H1: y ~ 1 + x
And the way we write those formulas in R is like this: “y twiddles 1” means y is influenced by the mean; “y twiddles 1 plus x” means y is influenced by the mean plus x.
H1 vs H0: Does x explain a significant amount of the variation?
Our model comparison amounts to asking the question “does x explain a significant amount of the variation?”, or equivalently, “does the larger model fit the data significantly better than the smaller model, given the number of parameters required for that improvement?” The comparison of the two models yields a likelihood ratio, which can be written as a LOD score if you like.
Or, more meaningfully, the likelihood ratio can be used in a chi-square test, yielding a p-value, which we often find convenient to express on the log scale as a logP (the negative base-10 logarithm of the p-value).
In the special case of linear models we could do a likelihood ratio test, but it turns out to be more accurate to use the explained and unexplained sums of squares, leading to an F-test (or t-test), and to get our p-value and logP that way.
Summary:
likelihood ratio -> LOD score
likelihood ratio -> chi-square test -> p-value -> logP
linear models only: SS explained / SS unexplained -> F-test (or t-test) -> p-value -> logP
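The quantities named on this slide can be computed directly for a single marker. The sketch below is an illustration only: it assumes normally distributed errors, uses a small made-up data set, and relies on the Python standard library. It gives the likelihood ratio statistic, the LOD score, the 1-df chi-square p-value with its logP, and the F statistic. The standard library has no F distribution, so the F-test p-value is not computed here.

```python
import math


def sums(x, y):
    """Centred sums of squares and cross-products for x and y."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def compare_models(x, y):
    """Compare H0: y ~ 1 with H1: y ~ 1 + x for one coded marker x."""
    n = len(y)
    sxx, syy, sxy = sums(x, y)
    rss0 = syy                       # unexplained SS under H0
    rss1 = syy - sxy ** 2 / sxx      # unexplained SS under H1
    lr = n * math.log(rss0 / rss1)   # 2 * (max log-likelihood of H1 minus H0)
    lod = lr / (2 * math.log(10))    # log10 of the likelihood ratio
    p_chisq = math.erfc(math.sqrt(lr / 2))     # chi-square tail, 1 df
    f_stat = (rss0 - rss1) / (rss1 / (n - 2))  # F on 1 and n - 2 df
    return {"LR": lr, "LOD": lod, "p_chisq": p_chisq,
            "logP_chisq": -math.log10(p_chisq), "F": f_stat}


# Made-up example data: genotype codes and phenotypes for 8 animals.
x = [-1, -1, 0, 0, 0, 0, 1, 1]
y = [8.9, 9.6, 10.2, 9.7, 10.4, 9.9, 10.3, 11.0]
result = compare_models(x, y)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The likelihood ratio statistic and F are functions of the same two sums of squares, since RSS0 / RSS1 = 1 + F / (n - 2); the difference between the tests lies in the reference distribution used to turn the statistic into a p-value.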
Hypothesis testing
H0: y ~ 1 + x1
H1: y ~ 1 + x1 + x2
H1 vs H0: Does x2 explain a significant amount of the variation after accounting for x1? Or, equivalently, is x2 significant conditional on x1?
As before, the comparison asks whether the larger model fits the data significantly better than the smaller model, given the number of parameters required for that improvement, and it yields a likelihood ratio, which can be written as a LOD score if you like.
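A conditional test of this kind can be sketched without a matrix library by first regressing both y and x2 on x1 and then testing the residuals against each other; this gives the same sums of squares as fitting the two nested models. The data and function names below are made up for illustration.

```python
def residualize(v, x):
    """Residuals of v after least-squares regression on an intercept and x."""
    n = len(v)
    v_bar, x_bar = sum(v) / n, sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (vi - v_bar) for xi, vi in zip(x, v)) / sxx
    return [(vi - v_bar) - slope * (xi - x_bar) for xi, vi in zip(x, v)]


def conditional_f(y, x1, x2):
    """F statistic for H1: y ~ 1 + x1 + x2 against H0: y ~ 1 + x1."""
    n = len(y)
    ry = residualize(y, x1)     # what x1 leaves unexplained in y
    rx2 = residualize(x2, x1)   # the part of x2 not predictable from x1
    rss0 = sum(r * r for r in ry)
    s22 = sum(r * r for r in rx2)
    s2y = sum(a * b for a, b in zip(rx2, ry))
    rss1 = rss0 - s2y ** 2 / s22
    return (rss0 - rss1) / (rss1 / (n - 3))  # F on 1 and n - 3 df


# Made-up genotype codes at two markers and a fixed error vector.
x1 = [-1, -1, -1, 0, 0, 0, 0, 1, 1, 1]
x2 = [-1, 0, 1, -1, 0, 0, 1, -1, 0, 1]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, -0.05, 0.2, -0.15, 0.0]
y_x1_only = [2 + a + e for a, e in zip(x1, noise)]
y_both = [2 + a + 3 * b + e for a, b, e in zip(x1, x2, noise)]

f_x1_only = conditional_f(y_x1_only, x1, x2)
f_both = conditional_f(y_both, x1, x2)
print(f"F for x2 given x1, no x2 effect:   {f_x1_only:.2f}")
print(f"F for x2 given x1, real x2 effect: {f_both:.2f}")
```

When x2 has no effect of its own, the conditional F stays small even though x1 matters; when x2 carries a real effect, the conditional F is large.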
F2 cross Generation F0 F1 F2
QTL
This is the chromosome where the QTL is, and all those lines represent every snp on that chromosome. The marked snp is the QTL.
[Figure: logP plotted at each snp along the chromosome, with the QTL marked; labels: highly significant snp, non significant snp]
If we perform our hypothesis test at every snp and plot the results on a graph, we get something like this. Snps that have no influence on the trait have low scores and the QTL snp has the highest score. But what you’ll also notice is that the snps around the QTL also have high scores.
QTL logP Genotype every animal at snp 1 and compare the
Chromosome scan for F2
[Figure: typical chromosome, 200Mb. x-axis: position along whole chromosome (Mb); y-axis: goodness of fit (logP); marked: QTL, significance threshold]
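A chromosome scan of this kind can be imitated with a small simulation. The sketch below is not the analysis behind the figure: the number of markers, the recombination fraction between adjacent markers, the QTL position and effect, and the sample size are all assumptions. It also uses the 1-df chi-square version of the test rather than the F-test, because the Python standard library has no F distribution.

```python
import math
import random

random.seed(7)

N_MARKERS, QTL_INDEX, RECOMB = 40, 20, 0.05  # illustrative assumptions


def gamete(n_markers, r):
    """One gamete from an F1 parent; 0/1 records the founder strain of each allele."""
    strand = random.randint(0, 1)
    alleles = []
    for _ in range(n_markers):
        alleles.append(strand)
        if random.random() < r:   # crossover before the next marker
            strand = 1 - strand
    return alleles


def f2_genotypes(n_markers, r):
    """Genotypes of one F2 animal, coded -1, 0, +1 at every marker."""
    return [a + b - 1 for a, b in zip(gamete(n_markers, r), gamete(n_markers, r))]


def log_p(x, y):
    """logP from a 1-df likelihood ratio test of y ~ 1 + x against y ~ 1."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0:
        return 0.0                # marker does not vary in this sample
    lr = n * math.log(syy / (syy - sxy ** 2 / sxx))
    p = math.erfc(math.sqrt(lr / 2))
    return -math.log10(max(p, 1e-300))


animals = [f2_genotypes(N_MARKERS, RECOMB) for _ in range(1000)]
y = [0.5 * g[QTL_INDEX] + random.gauss(0.0, 1.0) for g in animals]
scores = [log_p([g[j] for g in animals], y) for j in range(N_MARKERS)]
peak = max(range(N_MARKERS), key=lambda j: scores[j])
print("peak at marker", peak, "with logP", round(scores[peak], 1))
```

Because an F2 has had very few generations in which to recombine, markers near the simulated QTL are strongly correlated with it and also score highly, which is the broad peak described in the notes.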
Advanced intercross lines (AILs) F0 F1 F2 F3 F4 Darvasi & Soller (1995) Genetics
F12 cross F0 F1 F2 F12
Chromosome scan for F12
[Figure: typical chromosome, 200Mb. x-axis: position along whole chromosome (Mb); y-axis: goodness of fit (logP); marked: QTL, significance threshold]
Practical
Fitting a linear model to test a marker-phenotype association
Single marker association on an F2
Permutation test
Single marker association on an AIL (F12)
Conditional modelling of loci
Start Firefox, choose File->Open and go to F:\valdar\ThursdayAfternoonAnimals\practical.R
Start R
Practical: F2 cross
Significance thresholds (logP): Bonferroni = 2.6, permutation ~ 2.1, uncorrected = 1.3
generalized extreme value (GEV) distribution
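The three kinds of threshold can be sketched as follows. The uncorrected value 1.3 is -log10(0.05); a Bonferroni value of 2.6 would correspond to 20 tests at the 0.05 level, though the number of markers in the practical is not stated on the slide, so 20 is an assumption here. The permutation threshold below is taken as the empirical 95th percentile of the maximum logP over shuffled phenotypes, rather than from a fitted GEV distribution, and the simulated cross (200 animals, recombination fraction 0.05 between adjacent markers) is made up for illustration.

```python
import math
import random

random.seed(11)


def bonferroni_log_p(alpha, n_tests):
    """logP threshold for a Bonferroni correction over n_tests tests."""
    return -math.log10(alpha / n_tests)


def gamete(n_markers, r):
    """One gamete from an F1 parent, switching strand with probability r per interval."""
    strand = random.randint(0, 1)
    alleles = []
    for _ in range(n_markers):
        alleles.append(strand)
        if random.random() < r:
            strand = 1 - strand
    return alleles


def log_p(x, y):
    """logP from a 1-df likelihood ratio test of y ~ 1 + x against y ~ 1."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0:
        return 0.0
    lr = n * math.log(syy / (syy - sxy ** 2 / sxx))
    return -math.log10(max(math.erfc(math.sqrt(lr / 2)), 1e-300))


def permutation_threshold(markers, y, n_perm=200, alpha=0.05):
    """Empirical (1 - alpha) quantile of the maximum logP over permuted phenotypes."""
    y_perm = list(y)
    maxima = []
    for _ in range(n_perm):
        random.shuffle(y_perm)    # break any marker-phenotype association
        maxima.append(max(log_p(x, y_perm) for x in markers))
    maxima.sort()
    return maxima[math.ceil((1 - alpha) * n_perm) - 1]


n_animals, n_markers = 200, 20
animals = [[a + b - 1 for a, b in zip(gamete(n_markers, 0.05), gamete(n_markers, 0.05))]
           for _ in range(n_animals)]
markers = [[g[j] for g in animals] for j in range(n_markers)]
y = [random.gauss(0.0, 1.0) for _ in range(n_animals)]

print("uncorrected:", round(bonferroni_log_p(0.05, 1), 2))
print("Bonferroni: ", round(bonferroni_log_p(0.05, n_markers), 2))
print("permutation:", round(permutation_threshold(markers, y), 2))
```

Because neighbouring markers in an F2 are correlated, the tests are not independent, so a permutation threshold typically falls between the uncorrected and Bonferroni values, as in the practical.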
Practical: F12
phenotype ~ MARKER
phenotype ~ m37 + MARKER
phenotype ~ m37 + m29 + MARKER
[Figure: F2 compared with F18]
F18: breeding in a small population
population structure: gross genetic differences between groups, where groups = families
multilocus approach
Heterogeneous Stocks
The title of my talk refers to the mouse version of human association studies. In fact, the project I’m going to describe could be seen as something in between an association study and an inbred line cross. The population we’ve used is the heterogeneous stock mice pictured here, also known as the HS.
Heterogeneous Stock and F2 Intercross
Heterogeneous Stock: pseudo-random mating for 50 generations
HS is an outbred population: large number of recombinants
HS has 8 progenitors: greater haplotypic diversity, more likely QTLs will segregate
Avg. distance between recombinations: HS ~2 cM; F2 intercross ~30 cM
124 Phenotypes
Anxiety [24]
Asthma [13]
Biochemistry [15]
Bone Morphology [23]
Diabetes [16]
Haematology [15]
Immunology [9]
Weight/size related [8]
Wound Healing [1]
Intraperitoneal Glucose Tolerance Test
How to select peaks: a simulated example
Realistic example
Simulate 7 x 5% QTLs (i.e., 35% genetic effect) + 20% shared environment effect + 45% noise = 100% variance
We’ve estimated 20% cage effects.
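The variance budget above can be turned into a small simulation. This is a sketch of the idea rather than the simulation used in the talk: it assumes unlinked QTLs with genotypes coded -1, 0, +1 in a 1:2:1 ratio (which is not how HS genotypes are actually distributed), a cage size of 5, and 5000 animals, all chosen for illustration.

```python
import math
import random
import statistics

random.seed(3)

N_QTL, QTL_VAR, CAGE_VAR, NOISE_VAR = 7, 0.05, 0.20, 0.45  # from the slide
CAGE_SIZE = 5        # assumption
N_ANIMALS = 5000     # assumption

# A genotype coded -1, 0, +1 in a 1:2:1 ratio has variance 0.5, so this
# per-allele effect makes each QTL contribute QTL_VAR to the phenotype.
effect = math.sqrt(QTL_VAR / 0.5)

genetic, cage, noise = [], [], []
cage_effect = 0.0
for i in range(N_ANIMALS):
    if i % CAGE_SIZE == 0:   # start a new cage with its own shared effect
        cage_effect = random.gauss(0.0, math.sqrt(CAGE_VAR))
    genotypes = random.choices([-1, 0, 1], weights=[1, 2, 1], k=N_QTL)
    genetic.append(effect * sum(genotypes))
    cage.append(cage_effect)
    noise.append(random.gauss(0.0, math.sqrt(NOISE_VAR)))

phenotype = [g + c + e for g, c, e in zip(genetic, cage, noise)]
total = statistics.pvariance(phenotype)
print(f"total variance: {total:.2f}")
for name, part in (("QTLs", genetic), ("cage", cage), ("noise", noise)):
    print(f"{name}: {statistics.pvariance(part) / total:.2f} of variance")
```

Animals in the same cage share one draw of the cage effect, which is what makes the shared environment a source of correlation between animals rather than just extra noise.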
Simulated example
Forward selection phenotype ~ ?
condition on 1 peak phenotype ~ peak 1 + ?
condition on 2 peaks phenotype ~ peak 1 + peak 2 + ?
condition on 3 peaks phenotype ~ peak 1 + peak 2 + peak 3 + ?
condition on 4 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + ?
condition on 5 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + ?
condition on 6 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + ?
condition on 7 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + ?
condition on 8 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + ?
condition on 9 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + peak 9 + ?
condition on 10 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + peak 9 + peak 10 + ?
condition on 11 peaks phenotype ~ peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + peak 9 + peak 10 + peak 11 + ?
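The sequence of conditional scans above can be sketched as a loop: at each step, test every remaining marker conditional on the peaks already chosen, add the best one, and repeat. The slides do not say what stopping rule was used, so the F threshold and the maximum number of peaks below are assumptions, as are the simulated markers. The code conditions on chosen peaks by removing their component from the phenotype and from every other marker, which gives the same F statistics as refitting the growing model.

```python
import random

random.seed(5)


def centre(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def remove_component(v, b, sbb):
    """Residual of v after regression on the (already centred) vector b."""
    coef = dot(v, b) / sbb
    return [vi - coef * bi for vi, bi in zip(v, b)]


def forward_select(y, markers, f_threshold=20.0, max_peaks=5):
    """Indices of markers chosen by forward selection, in order of entry."""
    n = len(y)
    ry = centre(y)                      # phenotype residual given chosen peaks
    rm = [centre(m) for m in markers]   # marker residuals given chosen peaks
    selected = []
    while len(selected) < max_peaks:
        rss0 = dot(ry, ry)
        df_resid = n - len(selected) - 2
        best, best_f = None, 0.0
        for j, m in enumerate(rm):
            smm = dot(m, m)
            if j in selected or smm < 1e-9:   # skip chosen or collinear markers
                continue
            explained = dot(m, ry) ** 2 / smm
            f_stat = explained / ((rss0 - explained) / df_resid)
            if f_stat > best_f:
                best, best_f = j, f_stat
        if best is None or best_f < f_threshold:
            break
        selected.append(best)
        b = rm[best]
        sbb = dot(b, b)
        ry = remove_component(ry, b, sbb)
        rm = [remove_component(m, b, sbb) for m in rm]
    return selected


# Made-up data: 10 unlinked markers, two of which (3 and 7) affect the phenotype.
n = 400
markers = [random.choices([-1, 0, 1], weights=[1, 2, 1], k=n) for _ in range(10)]
y = [markers[3][i] + markers[7][i] + random.gauss(0.0, 1.0) for i in range(n)]
selected = forward_select(y, markers)
print("selected markers:", selected)
```

In this easy setting the two simulated loci are picked up and the loop stops; with linked markers and weaker effects, as in the simulated example on the slides, the procedure can also admit false positives.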
Peaks chosen by forward selection
We have recovered all the QTLs in this simulation, but also some false positives. In other simulations we also missed some. How can we tell which peaks are genuine? One way to think about this is to suppose we had chosen a slightly different set of HS mice: the peak heights would have been different, and our forward selection would have chosen a slightly different set of peaks.
Bootstrap sampling
[Figure: 10 subjects, numbered 1 to 10; a bootstrap sample is drawn from the 10 subjects by sampling with replacement]
To deal with this we’ve taken a bootstrapping approach. Let me explain what this is.
So what I will do is create a bootstrap sample of the 2000 mice and repeat the forward selection.
Forward selection on a bootstrap sample
Bootstrap evidence mounts up…
In 1000 bootstraps… Model Inclusion Probability
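The bootstrap procedure can be sketched as follows: resample the animals with replacement, rerun the selection on each resample, and report for each locus the fraction of resamples in which it was selected. To keep the example short, the selector here is a stand-in that picks markers by squared correlation with the phenotype, not the forward selection used in the talk; the data, the threshold, and the use of 200 rather than 1000 bootstraps are likewise assumptions.

```python
import random
from collections import Counter


def select_by_r2(y, markers, r2_threshold=0.05):
    """Stand-in selector: markers whose squared correlation with y exceeds a threshold."""
    n = len(y)
    y_bar = sum(y) / n
    syy = sum((yi - y_bar) ** 2 for yi in y)
    chosen = []
    for j, x in enumerate(markers):
        x_bar = sum(x) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        if sxx == 0:
            continue              # marker does not vary in this resample
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        if sxy ** 2 / (sxx * syy) > r2_threshold:
            chosen.append(j)
    return chosen


def inclusion_probabilities(y, markers, select, n_boot, rng):
    """Fraction of bootstrap samples in which each marker is selected."""
    n = len(y)
    counts = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample animals with replacement
        y_b = [y[i] for i in idx]
        markers_b = [[m[i] for i in idx] for m in markers]
        counts.update(set(select(y_b, markers_b)))
    return {j: counts[j] / n_boot for j in range(len(markers))}


# Made-up data: 5 unlinked markers, of which only marker 2 affects the phenotype.
rng = random.Random(9)
n = 300
markers = [rng.choices([-1, 0, 1], weights=[1, 2, 1], k=n) for _ in range(5)]
y = [markers[2][i] + rng.gauss(0.0, 1.0) for i in range(n)]

probs = inclusion_probabilities(y, markers, select_by_r2, n_boot=200, rng=rng)
for j, p in probs.items():
    print(f"marker {j}: inclusion probability {p:.2f}")
```

A locus with a real effect is selected in nearly every resample, whereas a peak that owes its height to a particular set of animals comes and goes, so its inclusion probability is low.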
854 loci in all phenotypes, 84 diabetes loci
Servin B, Stephens M (2007)
Bayesian Multiple QTL modelling
Kilpikari R, Sillanpaa MJ (2003) Bayesian analysis of multilocus association in quantitative and qualitative traits. Genet Epidemiol 25: 122-135
Yi N (2004) A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167: 967-975
Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3: e114
Fridley BL (2008) Bayesian variable and model selection methods for genetic association studies. Genet Epidemiol.
The Collaborative Cross
Heterogeneous Stocks (HS): outbreed -> outbred population
Collaborative Cross: recombinant inbred lines
Churchill et al 2004; Broman 2005; Valdar et al 2006