Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics Karen Bandeen-Roche, PhD Department of Biostatistics Johns Hopkins University July 14, 2011 Introduction to Statistical Measurement and Modeling
Data examples
Boxing and neurological injury
- Scientific question: Does amateur boxing lead to decline in neurological performance?
- Some related statistical questions:
  - Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?
  - Is boxing-associated decline independent of initial cognition and age?
  - Is there a threshold of boxing that initiates harm?
Boxing data
Outline
- Topic #1: Confounding
  - Handling this is crucial if we are to draw correct conclusions about risk factors
- Topic #2: Signal / noise decomposition
  - Signal: Regression model predictions
  - Noise: Residual variation
  - Another way of approaching inference, precision of prediction
Topic #1: Confounding
- Confound means to “confuse”
- Arises when the comparison is between groups that are otherwise not similar in ways that affect the outcome
- Example: coffee drinking and smoking in relation to cardiovascular disease (CVD)
- Also known as lurking variables, ...
Confounding Example: Drowning and Eating Ice Cream
[Figure: scatter plot of drowning rate (vertical axis) against ice cream eaten (horizontal axis)]
Confounding
- Epidemiology definition: A characteristic “C” is a confounder if it is associated (related) with both the outcome (Y: drowning) and the risk factor (X: ice cream) and is not causally in between
[Diagram: Ice Cream Consumption, Drowning rate, and an unidentified third characteristic “??”]
(Slide from JHU Intro to Clinical Research, July 2010)
Confounding
- Statistical definition: A characteristic “C” is a confounder if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs with, versus without, adjustment for C
- Note: recall what “adjustment” means (direct vs. indirect effect)
[Diagram: Ice Cream Eaten, Drowning rate, Outdoor Temperature]
Confounding Example: Drowning and Eating Ice Cream
[Figure: scatter plot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points grouped as warm temperature and cool temperature]
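The statistical definition of confounding can be illustrated with a small simulation. The sketch below is a hypothetical setup, not the lecture's data: outdoor temperature drives both ice cream eaten and drowning rate, and ice cream has no direct effect. All numbers, variable names, and the `ols` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data-generating model: temperature (C) affects both X and Y.
temp = rng.uniform(10, 35, n)               # outdoor temperature
ice = 0.5 * temp + rng.normal(0, 2, n)      # X: ice cream eaten
drown = 0.2 * temp + rng.normal(0, 1, n)    # Y: drowning rate (no ice cream term)


def ols(y, *cols):
    """Least squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


crude = ols(drown, ice)[1]            # slope for ice cream, no adjustment
adjusted = ols(drown, ice, temp)[1]   # slope for ice cream, adjusting for temperature

print(f"crude slope:    {crude:.3f}")
print(f"adjusted slope: {adjusted:.3f}")
```

In this setup the crude slope is clearly positive while the adjusted slope is close to zero, so the strength of the relationship differs with versus without adjustment for temperature.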
Effect modification
- A characteristic “E” is an effect modifier if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs across levels of E
- Example: birth control pills and smoking in relation to CVD
[Diagram: Ice Cream Consumption, Drowning rate, Outdoor temperature]
Effect Modification: Drowning and Eating Ice Cream
[Figure: scatter plot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points grouped as warm temperature and cool temperature]
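Effect modification can be sketched in the same spirit. In the hypothetical simulation below, the slope of drowning rate on ice cream eaten is deliberately made steeper in warm weather than in cool weather, and an interaction term recovers the difference. The coefficients, sample size, and variable names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

warm = rng.integers(0, 2, n)     # E: 1 = warm temperature, 0 = cool temperature
ice = rng.uniform(0, 10, n)      # X: ice cream eaten

# Hypothetical model: slope 0.05 when cool, 0.05 + 0.30 when warm.
drown = 1.0 + 0.05 * ice + 0.30 * warm * ice + rng.normal(0, 1, n)

# Regression with an ice-by-warm interaction term.
X = np.column_stack([np.ones(n), ice, warm, ice * warm])
beta, *_ = np.linalg.lstsq(X, drown, rcond=None)

slope_cool = beta[1]             # ice cream slope when warm = 0
slope_warm = beta[1] + beta[3]   # ice cream slope when warm = 1

print(f"slope in cool weather: {slope_cool:.3f}")
print(f"slope in warm weather: {slope_warm:.3f}")
```

A nonzero interaction coefficient (`beta[3]` here) is the regression expression of a relationship that differs by level of E.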
Topic #2: Signal/Noise Decomposition
- Lovely due to geometry of least squares
- Facilitates testing involving multiple parameters at once
- Provides insight into R-squared
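Signal/Noise Decomposition
- First step: decomposition of variance
  - “Regression” part: Variance of the $\hat{Y}_i$s
  - “Error” or “Residual” part: Variance of the $e_i$s
  - Together: These determine “total” variance of the $Y_i$s
- “Sums of Squares” (SS) rather than variance per se
  - Regression SS (SSR): $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
  - Error SS (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$
  - Total SS (SST): $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$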
Signal/Noise Decomposition
Properties
- SST = SSR + SSE
  - SSR/SST = “proportion of variance explained” by regression = R-squared
  - Follows from geometry
- SSR and SSE are independent (assuming A1-A5) and have easily characterized probability distributions
  - Provides convenient testing methods
  - Follows from geometry plus assumptions
- Note: derive the first property from the geometry
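The first property can be checked numerically. The sketch below fits a least squares model with an intercept to simulated data (the data, coefficients, and names are assumptions, not the lecture's example) and confirms that SST = SSR + SSE and that SSR/SST equals the squared correlation between Y and the fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Simulated predictors and outcome (hypothetical values).
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Least squares fit with an intercept.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

ssr = np.sum((fitted - y.mean()) ** 2)   # regression sum of squares
sse = np.sum(resid ** 2)                 # error sum of squares
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = ssr / sst

print(f"SSR + SSE = {ssr + sse:.4f}, SST = {sst:.4f}")
print(f"R-squared = {r2:.4f}")
```

The identity relies on the model containing an intercept, so that the residuals are orthogonal to both the constant vector and the fitted values.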
Signal/Noise Decomposition
SSR and SSE are independent
- Define M = span(X) and take “Y” as centered at $\bar{Y}$
- It is possible to orthogonally rotate the coordinate axes so that the first p axes $\in M$ and the remaining n-p-1 axes $\in M^{\perp}$
  - Gram-Schmidt orthogonalization
- Doing this transforms Y into $TY := Z$, for some orthonormal matrix T whose rows are the new axes $\{e_1, \ldots, e_{n-1}\}$ (equivalently, the columns of $T'$)
- Distribution of Z: $Z \sim N(T\,E[Y|X],\ T\,\mathrm{Var}(Y)\,T') = N(T\,E[Y|X],\ T \sigma^2 I\, T') = N(T\,E[Y|X],\ \sigma^2 I)$, since $TT' = I$
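Signal/Noise Decomposition
SSR and SSE are independent - continued
- $TY = Z$, so (with Y centered) $Y = T'Z = \sum_{j=1}^{n-1} Z_j e_j$
- SSE = squared length of $Y - \hat{Y} = \sum_{j=p+1}^{n-1} Z_j e_j$, which is $\sum_{j=p+1}^{n-1} Z_j^2$
- SSR = squared length of $\hat{Y} - \bar{Y} = \sum_{j=1}^{p} Z_j e_j$, which is $\sum_{j=1}^{p} Z_j^2$
- Claim now follows: SSR and SSE are independent because $(Z_1, \ldots, Z_p)$ and $(Z_{p+1}, \ldots, Z_{n-1})$ are independent
- Note: the SSE expression holds because $\|\sum_j Z_j e_j\|^2 = \sum_j Z_j^2$ (the $e_j$s are orthogonal with length 1)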
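The rotation argument can be made concrete with a QR factorization standing in for Gram-Schmidt. The sketch below is one way to build such a T, under my own choices of simulated data and names: a complete QR of the design matrix [1, X] gives an orthonormal basis whose first column is the constant direction, whose next p columns finish a basis for M, and whose remaining n-p-1 columns span the orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 2

# Simulated predictors and outcome (hypothetical values).
X = rng.normal(size=(n, p))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(size=n)

# Complete QR: Q is n x n orthogonal; column 0 is proportional to the
# constant vector, columns 1..p complete a basis for span(1, X), and the
# last n-p-1 columns span the orthogonal complement.
A = np.column_stack([np.ones(n), X])
Q, _ = np.linalg.qr(A, mode="complete")

T = Q[:, 1:].T   # drop the constant direction; rows of T play the role of e_1..e_{n-1}
Z = T @ y        # rows of T are orthogonal to 1, so centering y would not change Z

ssr_rotated = np.sum(Z[:p] ** 2)   # first p coordinates
sse_rotated = np.sum(Z[p:] ** 2)   # remaining n-p-1 coordinates

# Direct computation from the least squares fit, for comparison.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
ssr_direct = np.sum((fitted - y.mean()) ** 2)
sse_direct = np.sum((y - fitted) ** 2)

print(ssr_rotated, ssr_direct)
print(sse_rotated, sse_direct)
```

Because SSR depends only on the first p rotated coordinates and SSE only on the rest, independence of the two coordinate blocks carries over to the two sums of squares.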
Signal/Noise Decomposition
Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions
- Under A1-A2: $E[Y|X] \in M$, so $E[Z_j|X] = 0$ for all j > p
- Recall $\{Z_1, \ldots, Z_{n-1}\}$ are mutually independent normal with variance $\sigma^2$
- Thus SSE $= \sum_{j=p+1}^{n-1} Z_j^2 = \sigma^2 \sum_{j=p+1}^{n-1} (Z_j/\sigma)^2 \sim \sigma^2 \chi^2_{n-p-1}$ under A1-A5
  - (a sum of k independent squared N(0,1) variables is $\chi^2_k$)
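Signal/Noise Decomposition
Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions
- For j ≤ p, $E[Z_j|X] \neq 0$ in general
  - Exception: $H_0: \beta_1 = \cdots = \beta_p = 0$
  - Then SSR $= \sum_{j=1}^{p} Z_j^2 \sim \sigma^2 \chi^2_p$ under A1-A5, and
    $\frac{SSR/p}{SSE/(n-p-1)} \sim F_{p,\,n-p-1} \sim \frac{\chi^2_p / p}{\chi^2_{n-p-1}/(n-p-1)}$
    with numerator and denominator independent
- Note: pause here to remark on the t distribution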
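These distributional claims can be checked by simulation. The sketch below is a Monte Carlo check under assumptions of my own choosing: a fixed design with n = 30 and p = 3, all slopes equal to zero, and independent N(0,1) errors, which is my reading of what A1-A5 provide. It compares simulated means of SSR, SSE, and F with the means of the chi-squared and F distributions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 30, 3, 5000
d2 = n - p - 1                        # error degrees of freedom

# Fixed design with intercept; hat matrix projects onto span(1, X).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.pinv(X)

# Each column of Y is one simulated data set with H0 true and sigma = 1.
Y = rng.normal(size=(n, reps))
fitted = H @ Y

ssr = ((fitted - Y.mean(axis=0)) ** 2).sum(axis=0)
sse = ((Y - fitted) ** 2).sum(axis=0)
F = (ssr / p) / (sse / d2)

print(f"mean SSR = {ssr.mean():.3f}  (chi-squared mean: {p})")
print(f"mean SSE = {sse.mean():.3f}  (chi-squared mean: {d2})")
print(f"mean F   = {F.mean():.3f}  (F mean: {d2 / (d2 - 2):.3f})")
print(f"corr(SSR, SSE) = {np.corrcoef(ssr, sse)[0, 1]:.3f}")
```

A chi-squared variable with k degrees of freedom has mean k, and an F variable with denominator degrees of freedom d2 > 2 has mean d2/(d2-2), so the simulated means should land near 3, 26, and about 1.08. A sample correlation near zero is consistent with independence, though it does not prove it.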
Signal/Noise Decomposition
An organizational tool: The analysis of variance (ANOVA) table

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         MSR = SSR/p
Error        SSE                   n-p-1                     MSE = SSE/(n-p-1)
Total        SST = SSR + SSE       n-1

F = MSR/MSE
“Global” hypothesis tests
- These involve sets of parameters
- Hypotheses of the form
  H0: βj = 0 for all j in a defined subset of {j=1,...,p} vs. H1: βj ≠ 0 for at least one j in that subset
  - Note the wording of the hypothesis: all = 0 vs. any ≠ 0
- Example 1: H0: β_LATITUDE = 0 and β_LONGITUDE = 0 [no association between geographical location and temperature]
- Example 2: H0: all polynomial or spline coefficients involving a given variable = 0 [longitude association is linear]
- Example 3: H0: all coefficients involving a variable = 0
“Global” hypothesis tests
Testing method: Sequential decomposition of sums of squares
- Hypothesis to be tested is $H_0: \beta_{j_1} = \cdots = \beta_{j_{p_j}} = 0$ in the full model (a set of $p_j$ coefficients)
- Fit the smaller model excluding $x_{j_1}, \ldots, x_{j_{p_j}}$: save SSE = $SSE_S$
- Fit the “full” (or larger) model, adding $x_{j_1}, \ldots, x_{j_{p_j}}$ to the smaller model: save SSE = $SSE_L$ (often = overall SSE)
- Test statistic: $S = \frac{(SSE_S - SSE_L)/p_j}{SSE_L/(n-p-1)}$
- Distribution under the null: $F(p_j,\ n-p-1)$
- Define rejection region based on this distribution
- Compute S
- Reject or not according to whether S is in the rejection region
- Note: draw out the model on the board and box in part of it; draw the F
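The testing method can be sketched in a few lines. The example below uses simulated data loosely modeled on the latitude/longitude example, with coefficients, sample size, and variable names that are assumptions of mine. It fits the smaller and larger models and forms the statistic S with SSE_L divided by its degrees of freedom in the denominator; `n_tested` plays the role of pj and `p` counts all predictors in the larger model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Simulated predictors and outcome (hypothetical values).
x1 = rng.normal(size=n)
lat, lon = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 1.0 * lat - 1.0 * lon + rng.normal(size=n)


def sse_of_fit(y, X):
    """Error sum of squares from the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)


ones = np.ones(n)
X_small = np.column_stack([ones, x1])             # excludes lat and lon
X_large = np.column_stack([ones, x1, lat, lon])   # adds lat and lon

sse_s = sse_of_fit(y, X_small)
sse_l = sse_of_fit(y, X_large)

n_tested = 2                 # number of coefficients set to 0 under H0
p = X_large.shape[1] - 1     # predictors in the larger model, excluding intercept

S = ((sse_s - sse_l) / n_tested) / (sse_l / (n - p - 1))
print(f"SSE_S = {sse_s:.2f}, SSE_L = {sse_l:.2f}, S = {S:.2f}")
```

S would then be compared with an upper quantile of the F distribution with (n_tested, n-p-1) degrees of freedom. If SciPy is available, `scipy.stats.f.ppf(0.95, n_tested, n - p - 1)` gives the 5% critical value.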
Signal/Noise Decomposition
An augmented version for global testing

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
  X1         SST-SSE_S             p1                        (SST-SSE_S)/p1
  X2|X1      SSE_S-SSE_L           p2                        (SSE_S-SSE_L)/p2
Error        SSE_L                 n-p-1                     SSE_L/(n-p-1)
Total        SST = SSR + SSE       n-1

F = MSR(2|1)/MSE
Note: go back and forth to the geometry slide (ok to Boxing)
R-squared: Another view
- From last lecture: ECDF Corr(Y, $\hat{Y}$) squared
- More conventional: $R^2$ = SSR/SST
- Geometry justifies why they are the same
  - $\mathrm{Cov}(Y, \hat{Y}) = \mathrm{Cov}(Y - \hat{Y} + \hat{Y},\ \hat{Y}) = \mathrm{Cov}(e, \hat{Y}) + \mathrm{Var}(\hat{Y})$
  - Covariance = inner product, so the first term = 0
- A measure of precision with which regression model describes individual responses
- Note: illustrate on plot; complete the argument
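One way to complete the argument is sketched below. It assumes the model includes an intercept (so the fitted values have sample mean $\bar{Y}$ and the residuals e are orthogonal to them) and reads Cov and Var as sample quantities with divisor n-1.

```latex
\begin{align*}
\mathrm{Cov}(Y,\hat{Y}) &= \mathrm{Cov}(e,\hat{Y}) + \mathrm{Var}(\hat{Y}) = \mathrm{Var}(\hat{Y})
  && \text{since } \mathrm{Cov}(e,\hat{Y}) = 0 \\
\mathrm{Corr}(Y,\hat{Y})^2
  &= \frac{\mathrm{Cov}(Y,\hat{Y})^2}{\mathrm{Var}(Y)\,\mathrm{Var}(\hat{Y})}
   = \frac{\mathrm{Var}(\hat{Y})^2}{\mathrm{Var}(Y)\,\mathrm{Var}(\hat{Y})}
   = \frac{\mathrm{Var}(\hat{Y})}{\mathrm{Var}(Y)} \\
  &= \frac{\sum_{i}(\hat{Y}_i-\bar{Y})^2/(n-1)}{\sum_{i}(Y_i-\bar{Y})^2/(n-1)}
   = \frac{SSR}{SST} = R^2
\end{align*}
```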
Outline: A few more topics
- Collinearity
- Overfitting
- Influence
- Mediation
- Multiple comparisons
Main points
- Confounding occurs when an apparent association between a predictor and outcome reflects the association of each with a third variable
- A primary goal of regression is to “adjust” for confounding
- Least squares decomposition of Y into fit and residual provides an appealing statistical testing framework
  - An association of an outcome with predictors is evidenced if SS due to regression is large relative to SSE
  - Geometry: orthogonal decomposition provides convenient sampling distributions and a view of $R^2$
  - ANOVA