Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics.

Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics

Synopsis Logistic Regression Mixed Models

Logistic Regression and Statistical Genetics Applicable to Association Studies Data: – Binary outcomes (eg disease status) – Dependent on genotypes [+ sex, environment] Aim is to identify which factors influence the outcome Rigorous tests of statistical significance Flexible modelling language Generalisation of Chi-Squared Test

Data Data for individual labelled i=1…n: – Binary Response y i – Unphased Genotypes g ij for markers j=1..m

Coding Unphased Genotypes Several possibilities: – AA, AG, GG original genotypes – 12, 21, 22 – 1, 2, 3 – 0, 1, 2 # of G alleles Missing Data – NA default in R

Using R Load genetic logistic regression tools > source(‘logistic.R’) Read data table from file – > t <- read.table(‘geno.dat’, header=TRUE) Column names – names(t) – t$y response (0,1) – t$m1, t$m2, …. Genotypes for each marker

Contigency Tables in R ftable(t$y,t$m31) prints the contingency table > ftable(t$y,t$m31) 11 12 22 0 515 387 75 1 28 11 2 >

Chi-Squared Test in R > chisq.test(t$y,t$m31) Pearson's Chi-squared test data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464 Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >

The Logistic Model Effect of SNP k on phenotype Null model

The Logistic Model Prob(Y i =1) = exp(  i  exp(  i ))  i =  j x ij b j - Linear Predictor x ij – Design Matrix (genotypes etc) b j – Model Parameters (to be estimated) Model is investigated by – estimating the b j ’s by maximum likelihood – testing if the estimates are different from 0

The Logistic Function Prob(Y i =1) = exp(  i  exp(  i ))  Prob(Y=1)

Types of genetic effect at a single locus AAAGGG Recessive001 Dominant110 Additive012 Genotype0 

Additive Genotype Model Code genotypes as – AA x=0, – AG x=1, – GG x=2 Linear Predictor –  = b 0 + xb 1 P(Y=1|x) = exp(b 0 + xb 1 )/(1+exp(b 0 + xb 1 )) P AA = P(Y=1|x=0) = exp(b 0 )/(1+exp(b 0 )) P AG = P(Y=1|x=1) = exp(b 0 + b 1 )/(1+exp(b 0 + b 1 )) P GG = P(Y=1|x=2) = exp(b 0 + 2b 1 )/(1+exp(b 0 + 2b 1 ))

Additive Model: b 0 = -2 b 1 = 2 P AA = 0.12 P AG = 0.50 P GG = 0.88  Prob(Y=1)

Additive Model: b 0 = 0 b 1 = 2 P AA = 0.50 P AG = 0.88 P GG = 0.98  Prob(Y=1)

Recessive Model Code genotypes as – AA x=0, – AG x=0, – GG x=1 Linear Predictor –  = b 0 + xb 1 P(Y=1|x) = exp(b 0 + xb 1 )/(1+exp(b 0 + xb 1 )) P AA = P AG = P(Y=1|x=0) = exp(b 0 )/(1+exp(b 0 )) P GG = P(Y=1|x=1) = exp(b 0 + b 1 )/(1+exp(b 0 + b 1 ))

Recessive Model: b 0 = 0 b 1 = 2 P AA = P AG = 0.50 P GG = 0.88  Prob(Y=1)

Genotype Model Each genotype has an independent effect Code genotypes as (for example) – AA x=0, z=0 – AG x=1, z=0 – GG x=0, z=1 Linear Predictor –  = b 0 + xb 1 +zb 2 two parameters P(Y=1|x) = exp(b 0 + xb 1 +zb 2 )/(1+exp(b 0 + xb 1 +zb 2 )) P AA = P(Y=1|x=0,z=0) = exp(b 0 )/(1+exp(b 0 )) P AG = P(Y=1|x=1,z=0) = exp(b 0 + b 1 )/(1+exp(b 0 + b 1 )) P GG = P(Y=1|x=0,z=1) = exp(b 0 + b 2 )/(1+exp(b 0 + b 2 ))

Genotype Model: b 0 = 0 b 1 = 2 b 2 = -1 P AA = 0.5 P AG = 0.88 P GG = 0.27  Prob(Y=1)

Models in R response y genotype g AAAGGGmodelDF Recessive001y ~ recessive(g)1 Dominant011y ~ dominant(g)1 Additive012y ~ additive(g)1 Genotype0  y ~ genotype(g)2

Data Transformation g <- t$m1 use these functions to treat a genotype vector in a certain way: – a <- additive(g) – r <- recessive(g) – d <- dominant(g) – g <- genotype(g)

Fitting the Model afit <- glm( t$y ~ additive(g),family=‘binomial’) rfit <- glm( t$y ~ recessive(g),family=‘binomial’) dfit <- glm( t$y ~ dominant(g),family=‘binomial’) gfit <- glm( t$y ~ genotype(g),family=‘binomial’) Equivalent models: – genotype = dominant + recessive – genotype = additive + recessive – genotype = additive + dominant – genotype ~ standard chi-squared test of genotype association

Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b 0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 *** b 1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733. b 2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >

Analysis of Deviance Chi-Squared Test > anova(glm( t$y ~ genotype(t$m31), family='binomial'), test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m31) 2 3.96 1015 339.76

Model Comparison Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance 1 1016 38.301 2 1015 38.124 1 0.177 >

Scanning all Markers > logscan(t,model= ‘ additive ’ ) Deviance DF Pval LogPval m1 8.604197e+00 1 3.353893e-03 2.474450800 m2 7.037336e+00 1 7.982767e-03 2.097846522 m3 6.603882e-01 1 4.164229e-01 0.380465360 m4 3.812860e+00 1 5.086054e-02 1.293619014 m5 7.194936e+00 1 7.310960e-03 2.136025588 m6 2.449127e+00 1 1.175903e-01 0.929628598 m7 2.185613e+00 1 1.393056e-01 0.856031566 m8 1.227191e+00 1 2.679539e-01 0.571939852 m9 2.532562e+01 1 4.842353e-07 6.314943565 m10 5.729634e+01 1 3.748518e-14 13.426140380 m11 3.107441e+01 1 2.483233e-08 7.604982503 …

Multilocus Models Can test the effects of fitting two or more markers simultaneously Several multilocus models are possible Interaction Model assumes that each combination of genotypes has a different effect eg t$y ~ t$m10 * t$m15

Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial’, test="Chisq") > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m13) 2 108.68 1015 235.03 genotype(t$m26) 2 1.14 1013 233.89 genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86 > pchisq(6.03,3,lower.tail=F) calculate p-value [1] 0.1101597

Adding the effects of Sex and other Covariates Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age Fit models of the form fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)

Adding the effects of Sex and other Covariates Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age anova(fit1,fit2)

Generalised Linear Models Applies linear modelling framework to data that are not normally distributed Can be used for probability distributions from the exponential family: – f(y|  ) = h(y) exp (  T(y)  –  Expected value of y is a transformed version of a linear predictor E(y) = g -1 (Xb) g() is called the link function Examples include logistic regression, survival analysis In R GLMs are fitted using glm() and models compared using the anova() function

Random Effects Mixed Models So far our models have had fixed effects – each parameter can take any value independently of the others – each parameter estimate uses up a degree of freedom – models with large numbers of parameters have problems saturation - overfitting poor predictive properties numerically unstable and difficult to fit in some cases we can treat parameters as being sampled from a distribution – random effects only estimate the parameters needed to specify that distribution can result in a more robust model

Example of a Mixed Model Testing genetic association across a large number of of families – y i = trait value of i’th individual –  i = genotype of individual at SNP of interest –  i = family identifier (factor) – e i = error, variance  2 – y =  + e – If we treat family effects  as fixed then we must estimate a large number of parameters – Better to think of these effects as having a distribution N(0,  2 ) for some variance  which must be estimated individuals from the same family have the same f and are correlated – cov =  2 individuals from different families are uncorrelated – genotype parameters  still treated as fixed effects mixed model

Mixed Models y =  + e Fixed effects model – E(y) =  – Var(y) = I  2 I is identity matrix Mixed model – E(y) =  – Var(y) = I  2 + F  2 F is a matrix, F ij = 1 if i, j in same family Need to estimate both  2 and  2

Generalised Least Squares y = Xb + e Var(y) = V (symmetric covariance matrix) V = I  2 (uncorrelated errors) V = I  2 + F  2 (single grouping random effect) V = I  2 + G  2 (G = genotype identity by state) GLS solution, if V is known unbiased, efficient (minimum variance) consistent (tends to true value in large samples) asymptotically normally distributed

Multivariate Normal Distribution Joint distribution of a vector of correlated observations Another way of thinking about the data Estimation of parameters in a mixed model is special case of likelihood analysis of a MVN y ~ MVN( , V) – m is vector of expected values – V is variance-covariance matrix – If V = I  2 then elements of y are uncorrelated and equivalent to n independent Normal variables – probability density/Likelihood is

Henderson’s Mixed Model Equations General linear mixed model.  are fixed effect u are random effects X, Z design matrices

Mixed Models Fixed effects models are special cases of mixed models Mixed models sometimes are more powerful (as fewer parameters) – But check that the assumption effects are sampled from a Normal distribution is reasonable – Differences between estimated parameters and statistical significance in fixed vs mixed models Shrinkage: often random effect estimates from a mixed model have smaller magnitudes than from a fixed effects model.

Mixed Models in R Several R packages available, e.g. – lme4general purpose mixed models package – emma for genetic association in structured populations Formulas in lme4, eg test for association between genotype and trait H 0 : E(y) =  vs H 1 : E(y) =  Var(y) = I  2 + F  2 – h0 = lmer(y ~ 1 +(1|family), data=data) – h1 = lmer(y ~ genotype +(1|family), data=data) – anova(h0,h1)

Formulas in lmer() y ~ fixed.effects + (1|Group1) + (1|Group2) etc – random intercept models y ~ fixed.effects +(1|Group1) + (x|Group1) – random slope model

Example: HDL Data > library(lme4) > b=read.delim("Biochemistry.txt”)  cc=complete.cases(b$Biochem.HDL) > f0=lm(Biochem.HDL ~ Family, data=b,subset=cc) > f1=lmer(Biochem.HDL ~ (1|Family), data=b,subset=cc) > anova(f0) Analysis of Variance Table Response: Biochem.HDL Df Sum Sq Mean Sq F value Pr(>F) Family 175 147.11 0.84064 5.2098 < 2.2e-16 *** Residuals 1681 271.24 0.16136 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > f1 Linear mixed model fit by REML Formula: Biochem.HDL ~ (1 | Family) Data: b Subset: cc AIC BIC logLik deviance REMLdev 2161 2178 -1078 2149 2155 Random effects: Groups Name Variance Std.Dev. Family (Intercept) 0.066633 0.25813 Residual 0.161487 0.40185 Number of obs: 1857, groups: Family, 176 Fixed effects: Estimate Std. Error t value (Intercept) 1.60615 0.02263 70.97 > plot(resid(f0),resid(f1))

Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics.

Similar presentations

Presentation on theme: "Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics.

Similar presentations

Presentation on theme: "Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics."— Presentation transcript:

Similar presentations

About project

Feedback