
1 Genetic Association and Generalised Linear Models Gil McVean, WTCHG, Weds 2nd November 2011

2 IMSGC, WTCCC2 (2011)

3 Questions to ask…
– What is a linear model?
– What is a generalised linear model?
– How do you estimate parameters and test hypotheses with GLMs?
– How are GLMs used in the study of genetic association?

4 What is a covariate?
A covariate is a quantity that may influence the outcome of interest:
– Genotype at a SNP
– Age of mice when measurement was taken
– Batch of chips from which gene expression was measured
Previously, you have looked at using likelihood to estimate parameters of underlying distributions. We want to generalise this idea to ask how covariates might influence the underlying parameters. Much statistical modelling is concerned with considering linear effects of covariates on underlying parameters.

5 What is a linear model?
In a linear model, the expectation of the response variable is defined as a linear combination of explanatory variables. Explanatory variables can include any function of the original data, but the link between E(Y) and X (or some function of X) is ALWAYS linear and the error is ALWAYS Gaussian. For example, with two explanatory variables and an interaction term:

Y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε,   ε ~ N(0, σ²)

where Y is the response variable, β0 the intercept, β1 and β2 the linear relationships with the explanatory variables, β12 the interaction term, and ε the Gaussian error.
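
A minimal sketch in R of fitting such a model; the variables y, x1 and x2 here are simulated for illustration, not data from the lecture:

set.seed(42)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 1 + 2 * x1 - 0.5 * x2 + 0.8 * x1 * x2 + rnorm(100)   # linear predictor plus Gaussian error
fit <- lm(y ~ x1 * x2)        # x1 * x2 expands to x1 + x2 + x1:x2
summary(fit)$coefficients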

6 Quick test: which of these is NOT a linear model?

7 What is a GLM?
There are many settings where the error is non-Gaussian and/or the link between E(Y) and X is not necessarily linear:
– Discrete data (e.g. counts in multinomial or Poisson experiments)
– Categorical data (e.g. disease status)
– Highly skewed data (e.g. income, ratios)
Generalised linear models keep the notion of linearity, but enable the use of non-Gaussian error models:

g(E(Y)) = β0 + β1 x1 + β2 x2 + …

g is called the link function; in linear models, the link function is the identity. The response variable can be drawn from any distribution of interest (the distribution function); in linear models this is Gaussian.

8 Poisson regression
In Poisson regression the expected value of the response variable is given by the exponent of the linear term:

E(Y) = exp(β0 + β1 x1 + β2 x2 + …)

The link function is the log:

log E(Y) = β0 + β1 x1 + β2 x2 + …

Note that several distribution functions are possible (normal, Poisson, binomial counts), though in practice Poisson regression is typically used to model count data (particularly when counts are low).
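
A short sketch of a Poisson regression in R on simulated data (the variables here are hypothetical, not the hospital data used below):

set.seed(1)
x   <- runif(200)
y   <- rpois(200, lambda = exp(0.5 + 1.2 * x))   # counts generated under a log link
fit <- glm(y ~ x, family = poisson)
summary(fit)$coefficients                        # estimates should be near 0.5 and 1.2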

9 Example: Caesarean sections in public and private hospitals

10 Boxplots of rates of C-sections

11 Fitting a model without covariates

> analysis <- glm(d$Caes ~ d$Births, family = "poisson")
> summary(analysis)

Call:
glm(formula = d$Caes ~ d$Births, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.81481  -0.73305  -0.08718   0.74444   2.19103

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.132e+00  1.018e-01  20.949  < 2e-16 ***
d$Births    4.406e-04  5.395e-05   8.165 3.21e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 99.990  on 19  degrees of freedom
Residual deviance: 36.415  on 18  degrees of freedom
AIC: 127.18

Number of Fisher Scoring iterations: 4

Implies an average of 12.7 per 1000 births
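
A hedged note on where a figure like "12.7 per 1000 births" could come from: the fitted Poisson model predicts a count for each hospital, and a per-1000 rate can be read off the fitted values. A sketch, assuming the data frame d used above has columns Caes and Births:

rate_per_1000 <- 1000 * fitted(analysis) / d$Births   # fitted C-section rate for each hospital
mean(rate_per_1000)                                   # an average rate across hospitals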

12 Fitting a model with covariates
Unexpectedly, this indicates that public hospitals actually have a higher rate of Caesarean sections than private ones.

glm(formula = d$Caes ~ d$Births + d$Hospital, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3270  -0.6121  -0.0899   0.5398   1.6626

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.351e+00  2.501e-01   5.402 6.58e-08 ***
d$Births     3.261e-04  6.032e-05   5.406 6.45e-08 ***
d$Hospital   1.045e+00  2.729e-01   3.830 0.000128 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Implies an average of 15.2 per 1000 births in public hospitals and 5.4 per 1000 births in private ones
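
A hedged aside, assuming d$Hospital is coded 1 for public and 0 for private (the coding is not shown in the transcript): with a log link the hospital coefficient is a log rate ratio, so the implied public/private rate ratio is

exp(1.045)   # ≈ 2.84, consistent with the quoted 15.2 vs 5.4 per 1000 births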

13 Checking model fit
Look at the distribution of residuals and at how well the observed values are predicted.
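
A minimal sketch of such checks in R, assuming the fitted object analysis from above:

hist(residuals(analysis, type = "deviance"))    # distribution of deviance residuals
plot(fitted(analysis), d$Caes); abline(0, 1)    # observed counts against fitted counts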

14 What’s going on?
The initial summary suggested the opposite result to the GLM analysis. Why?
– The relationship between the number of births and the number of C-sections does not appear to be linear
– Accounting for this removes most (but not all) of the apparent differences between hospital types
– There is also one quite influential outlier

            Normal   Caesarean
Private        412          16
Public       21121         285

Relative risk for private (compared to public) = 2.8

15 Hospitals with fewer births tend to have more Caesarean sections

16 Simpson’s paradox
Be careful about adding together observations – this can be misleading. E.g. the Berkeley sex-bias case:

          Applicants   % admitted
Men             8442          44%
Women           4321          35%

Major    Men: applicants   % admitted    Women: applicants   % admitted
A                    825          62%                  108          82%
B                    560          63%                   25          68%
C                    325          37%                  593          34%
D                    417          33%                  375          35%
E                    191          28%                  393          24%
F                    272           6%                  341           7%

Overall it appears that women have lower success, but women are actually typically more successful at the departmental level; they just apply to more competitive subjects.

17 Finding MLEs in GLMs
In linear modelling we can use the beautiful compactness of linear algebra to find MLEs and estimates of the variance for parameters. Consider an n by (k+1) data matrix X, where n is the number of observations and k is the number of explanatory variables (the first column is all 1s for the intercept term), and a response vector Y. The MLEs for the coefficients (β) can be estimated using

β̂ = (XᵀX)⁻¹ XᵀY

In GLMs, there is usually no such compact analytical expression for the MLEs:
– Use numerical methods to maximise the likelihood
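
As an illustration of "numerical methods to maximise the likelihood", here is a minimal sketch (not the lecture's code) that fits a logistic regression by direct optimisation; a design matrix X (first column all 1s) and a 0/1 response vector y are assumed to exist:

negloglik <- function(beta, X, y) {
  eta <- X %*% beta
  -sum(y * eta - log(1 + exp(eta)))          # negative Bernoulli log-likelihood with logit link
}
fit <- optim(rep(0, ncol(X)), negloglik, X = X, y = y, hessian = TRUE)
fit$par               # MLEs for the coefficients
solve(fit$hessian)    # approximate covariance matrix (inverse observed information)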

18 Testing hypotheses in GLMs
For the parameters we are interested in, we typically want to ask how much evidence there is that they are different from zero. For this we need to construct confidence intervals for the parameter estimates.
– We could estimate a confidence interval by finding all parameter values with log-likelihood no more than about 1.92 units worse than the MLE (half the 95% point of the chi-squared distribution with one degree of freedom)
– Alternatively, we might use bootstrap resampling techniques to estimate the distribution of parameter estimates
– However, we can also appeal to theoretical considerations of likelihood (based on the CLT), which show that parameter estimates are asymptotically normal with variance described by the Fisher information matrix
Informally, the information matrix describes the sharpness of the likelihood curve around the MLE and the extent to which parameter estimates are correlated.
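
A sketch of how these intervals are obtained in practice, assuming a fitted glm object called fit:

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)   # Wald 95% intervals from the information matrix
# confint(fit) gives profile-likelihood intervals instead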

19 Logistic regression
When only two types of outcome are possible (e.g. disease/not disease) we can model counts by the binomial. If we want to perform inference about the factors that influence the probability of ‘success’, it is usual to use the logistic model:

logit(p) = log( p / (1 − p) ) = β0 + β1 x1 + β2 x2 + …

The link function here is the logit.

20 Example: testing for genotype association
In a cohort study, we observe the number of individuals in a population who get a particular disease. We want to ask whether a particular genotype is associated with increased risk. The simplest test is one in which we consider a single coefficient for the genotypic value:

Genotype                        AA        Aa        aa
Genotypic value (g)              0         1         2
Frequency in population     (1−q)²   2q(1−q)        q²
Probability of disease          p0        p1        p2

(q is the population frequency of the risk allele, under Hardy–Weinberg equilibrium.) The model is logit(p_g) = β0 + β1 g; the slide's illustration used β0 = −4, β1 = 2.
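
A small sketch of what those illustrative coefficients imply, using R's built-in inverse logit (plogis); β0 = −4 and β1 = 2 are the example values above, not estimates from data:

plogis(-4 + 2 * (0:2))   # P(disease) for g = 0, 1, 2: ~0.018, 0.119, 0.500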

21 A note on the model
Note that each copy of the risk allele contributes in an additive way to the exponent. This does not mean that each allele ‘adds’ a fixed amount to the probability of disease; rather, each allele contributes a fixed amount to the log-odds. This has the effect of maintaining Hardy–Weinberg equilibrium within both the cases and the controls.

22 Concepts in disease genetics
The relative risk describes the risk to a person in an exposed group compared to the unexposed group. The odds ratio compares the odds of disease occurring in one group relative to another. If the absolute risk of disease is low, the two will be very similar.
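
For reference, writing p1 for the disease risk in the exposed group and p0 in the unexposed group (notation introduced here, not on the slide):

RR = p1 / p0
OR = [ p1 / (1 − p1) ] / [ p0 / (1 − p0) ]

When p0 and p1 are both small, 1 − p ≈ 1 and so OR ≈ RR.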

23 Cont.
Suppose in a given study we observe the following counts:

Genotype                    0      1      2
Counts with disease        26     39     21
Counts without disease   1298    567     49

We can fit a GLM using the logit link function and binomial probabilities. We have genotype data stored in the vector gt and disease status in the vector status. Using R, this is specified by the command:
– glm(formula = status ~ gt, family = binomial)
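
A hedged sketch of how gt and status could be built from grouped counts like those above (an illustration of the idea, not the lecture's code, and assuming the counts are read as shown in the reconstructed table):

n      <- c(26, 39, 21, 1298, 567, 49)           # cases for g = 0,1,2 then controls for g = 0,1,2
gt     <- rep(c(0, 1, 2, 0, 1, 2), times = n)
status <- rep(c(1, 1, 1, 0, 0, 0), times = n)
fit    <- glm(status ~ gt, family = binomial)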

24 An important note
In case-control designs, it is actually the genotype that is the random variable, not the outcome.
– There is theory showing that estimates of the coefficients are equivalent under prospective and retrospective approaches in case-control designs (Prentice & Pyke (1979) Biometrika 66:403)
– However, the two approaches are not fully equivalent, as the case-control design creates artificial association between causal factors that are independent in the population at large

25 Interpreting results

Call:
glm(formula = status ~ gt, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8554  -0.4806  -0.2583  -0.2583   2.6141

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -4.6667     0.2652 -17.598   <2e-16 ***
gt            1.2833     0.1407   9.123   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 967.36  on 1999  degrees of freedom
Residual deviance: 886.28  on 1998  degrees of freedom
AIC: 890.28

Number of Fisher Scoring iterations: 6

Reading the output:
– Estimate: MLE for the coefficients
– Std. Error: standard error for the estimates
– z value: estimate / std. error
– Pr(>|z|): P-value from the normal distribution
– Deviance residuals: measure of the contribution of individual observations to overall goodness of fit (for the MLE model)
– Null deviance: measure of goodness of fit of the null model (compared to the saturated model)
– Residual deviance: measure of goodness of fit of the fitted model
– AIC: penalised likelihood used in model choice
– Fisher Scoring iterations: number of iterations used to find the MLE
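
One quantity often reported from such a fit is the per-allele odds ratio, the exponent of the gt coefficient. A quick sketch from the printed numbers (a Wald interval; values approximate):

exp(1.2833)                                # ≈ 3.6  per-allele odds ratio
exp(1.2833 + c(-1, 1) * 1.96 * 0.1407)     # ≈ (2.7, 4.8)  95% confidence interval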

26 Adding in extra covariates
Adding additional explanatory variables in a GLM is essentially the same as in linear model analysis. Likewise, we can look at interactions. In the disease study we might want to consider age as a potentially important covariate:

glm(formula = status ~ gt + age, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.00510    0.79209  -8.844   <2e-16 ***
gt           1.46729    0.17257   8.503   <2e-16 ***
age          0.04157    0.01903   2.185   0.0289 *
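
A one-line sketch of how an interaction between genotype and age could be examined (hypothetical; this model is not shown in the lecture output):

glm(status ~ gt * age, family = binomial)   # expands to gt + age + gt:age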

27 Adding model complexity
In the disease status analysis we might want to generalise the fitted model to one in which each genotype is allowed its own risk:

glm(formula = status ~ g1 + g2 + age, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.46870    0.73155  -7.476 7.69e-14 ***
g1TRUE       1.25694    0.26278   4.783 1.72e-06 ***
g2TRUE       3.00816    0.33871   8.881  < 2e-16 ***
age          0.04224    0.01915   2.206   0.0274 *
---
    Null deviance: 684.44  on 1999  degrees of freedom
Residual deviance: 609.09  on 1996  degrees of freedom
AIC: 617.09
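
The g1TRUE / g2TRUE labels suggest g1 and g2 are logical indicator variables for carrying one or two copies of the risk allele. A hedged sketch of how they could be built from gt (an assumption about the coding, which is not shown on the slide):

g1 <- gt == 1    # heterozygote indicator
g2 <- gt == 2    # risk-homozygote indicator
glm(status ~ g1 + g2 + age, family = binomial)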

28 Model choice
It is worth remembering that we cannot simply identify the ‘significant’ parameters and put them in our chosen model.
– Significance for a parameter tests the marginal null hypothesis that the coefficient associated with that parameter is zero
– If two explanatory variables are highly correlated, then marginally neither may be significant, yet a linear model containing only one of them could be highly significant
There are several ways to choose appropriate models from data. These typically involve adding parameters one at a time and applying some penalty to avoid over-fitting.
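
One common recipe, sketched under the assumption that two nested fits fit0 and fit1 (e.g. status ~ gt and status ~ gt + age) already exist:

AIC(fit0, fit1)                       # penalised-likelihood comparison; smaller is better
anova(fit0, fit1, test = "Chisq")     # likelihood-ratio (deviance) test for the added term
# step(fit1) performs automated stepwise selection by AIC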

29 The WTCCC
One of the first large-scale genome-wide association studies, carried out on seven diseases using c. 2,000 cases per disease and a shared set of controls:
– Type 1 diabetes
– Type 2 diabetes
– Rheumatoid arthritis
– Hypertension
– Bipolar disorder
– Crohn’s disease
– Coronary artery disease
Results were published in Nature (2007). This was an important study because it established a series of common statistical principles used in the analysis of GWAS, as well as identifying many novel regions associated with common complex diseases.

30 Checking for differentiation between control groups (figure: Manhattan plots and sampling distributions for Q-Q plots)

31 A rationale for setting a threshold for ‘genome-wide significance’: P < 5 × 10⁻⁷

32 A method for ‘imputing’ data from additional reference resources

33 Coping with population structure
– Genomic control
– PCA
– Mixed models

34 Many open questions
– Given the data collected, where do we think the causal mutation is?
– To what extent can we ever ‘fine-map’ the causal variant in (a) the population we start with, and (b) other populations?
– Can we use resequencing or imputation from reference resources (e.g. the 1000 Genomes Project) to fine-map without additional work?

