Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics


Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics
Karen Bandeen-Roche, PhD, Department of Biostatistics, Johns Hopkins University
July 14, 2011, Introduction to Statistical Measurement and Modeling

Data examples: Boxing and neurological injury
Scientific question: Does amateur boxing lead to decline in neurological performance?
Some related statistical questions:
  Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?
  Is boxing-associated decline independent of initial cognition and age?
  Is there a threshold of boxing that initiates harm?

Boxing data

Outline
Topic #1: Confounding
  Handling this is crucial if we are to draw correct conclusions about risk factors
Topic #2: Signal / noise decomposition
  Signal: regression model predictions
  Noise: residual variation
  Another way of approaching inference, precision of prediction

Topic #1: Confounding
To confound means to “confuse”: the comparison is between groups that are otherwise not similar in ways that affect the outcome.
Example: coffee drinking and smoking re CVD. Also known as “lurking variables.”

Confounding example: Drowning and eating ice cream
[Scatterplot: drowning rate vs. ice cream eaten, showing a positive association]

Confounding (JHU Intro to Clinical Research, July 2010)
Epidemiology definition: A characteristic “C” is a confounder if it is associated (related) with both the outcome (Y: drowning) and the risk factor (X: ice cream) and is not causally in between.
[Diagram: ice cream consumption → drowning rate, with the association marked “??”]

Confounding
Statistical definition: A characteristic “C” is a confounder if the strength of the relationship between the outcome (Y: drowning rate) and the risk factor (X: ice cream eaten) differs with, versus without, adjustment for C.
[Diagram: ice cream eaten → drowning rate, with outdoor temperature as the confounder]
(Recall what “adjustment” means: direct vs. indirect effect.)
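The statistical definition can be checked numerically. Below is a minimal sketch, assuming simulated data (all variable names are hypothetical, not the lecture's boxing data): the outcome depends only on the confounder, yet the unadjusted risk-factor coefficient is far from zero and moves toward zero once the confounder is adjusted for.

```python
# Minimal sketch of confounding on simulated data (hypothetical names):
# drowning depends on temperature, not on ice cream, yet the unadjusted
# ice-cream coefficient is far from zero.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(size=n)                    # confounder C: outdoor temperature
ice_cream = 2.0 * temp + rng.normal(size=n)  # risk factor X, driven by C
drowning = 1.5 * temp + rng.normal(size=n)   # outcome Y: depends on C, not X

def ols_coefs(y, xs):
    """Least-squares coefficients for y ~ intercept + given predictors."""
    X = np.column_stack([np.ones(len(y))] + xs)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

crude = ols_coefs(drowning, [ice_cream])[1]           # without adjustment
adjusted = ols_coefs(drowning, [ice_cream, temp])[1]  # adjusted for C
# crude is near 0.6 (confounded); adjusted is near the true value, 0
```

Because the crude and adjusted coefficients differ, temperature is a confounder under the statistical definition above.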

Confounding example: Drowning and eating ice cream
[Scatterplot: drowning rate vs. ice cream eaten, with points labeled as warm-temperature and cool-temperature strata]

Effect modification
A characteristic “E” is an effect modifier if the strength of the relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs within levels of E.
[Diagram: ice cream consumption → drowning rate]
Examples: birth control pills and smoking re CVD; here, outdoor temperature.

Effect modification example: Drowning and eating ice cream
[Scatterplot: drowning rate vs. ice cream eaten, with separate warm-temperature and cool-temperature strata showing different slopes]
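The defining feature of effect modification, as opposed to confounding, is that the X–Y slope itself differs across levels of E. A sketch on simulated data (hypothetical names):

```python
# Sketch of effect modification on simulated data (hypothetical names):
# the slope of drowning on ice cream differs between warm and cool days.
import numpy as np

rng = np.random.default_rng(1)
n = 500
warm = rng.integers(0, 2, size=n)  # effect modifier E (1 = warm day)
ice_cream = rng.normal(size=n)
drowning = 2.0 * warm * ice_cream + rng.normal(size=n)  # slope 2 if warm, 0 if cool

def slope(y, x):
    """Simple least-squares slope of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

slope_warm = slope(drowning[warm == 1], ice_cream[warm == 1])
slope_cool = slope(drowning[warm == 0], ice_cream[warm == 0])
# slope_warm is near 2, slope_cool near 0: the slope differs within levels of E
```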

Topic #2: Signal/Noise Decomposition
Lovely due to the geometry of least squares
Facilitates testing involving multiple parameters at once
Provides insight into R-squared

Signal/Noise Decomposition
First step: decomposition of variance
  “Regression” part: variance of the fitted values ŷᵢ
  “Error” or “Residual” part: variance of the residuals eᵢ
  Together these determine the “total” variance of the Yᵢ
“Sums of Squares” (SS) rather than variance per se:
  Regression SS (SSR): Σᵢ (ŷᵢ − ȳ)²
  Error SS (SSE): Σᵢ (yᵢ − ŷᵢ)²
  Total SS (SST): Σᵢ (yᵢ − ȳ)²
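The three sums of squares, and the decomposition they obey, can be verified numerically. A sketch with simulated data (hypothetical example, not the lecture's boxing data):

```python
# Sketch: compute SSR, SSE, SST for a least-squares fit on simulated data
# and verify the decomposition SST = SSR + SSE.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5, 0.25]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

SST = np.sum((y - y.mean()) ** 2)     # total SS
SSR = np.sum((yhat - y.mean()) ** 2)  # regression ("signal") SS
SSE = np.sum((y - yhat) ** 2)         # residual ("noise") SS

r_squared = SSR / SST                 # proportion of variance explained
# SSR + SSE reproduces SST up to floating-point error
```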

Signal/Noise Decomposition
Properties:
  SST = SSR + SSE
  SSR/SST = “proportion of variance explained” by regression = R-squared
    Follows from the geometry
  SSR and SSE are independent (assuming A1–A5) and have easily characterized probability distributions, providing convenient testing methods
    Follows from the geometry plus the assumptions

Signal/Noise Decomposition
SSR and SSE are independent
  Define M = span(X) and take Y as centered at its mean ȳ
  It is possible to orthogonally rotate the coordinate axes so that the first p axes ∈ M and the remaining n−p−1 axes ∈ M⊥ (Gram-Schmidt orthogonalization)
  Doing this transforms Y into TY := Z, for some orthonormal matrix T with columns {e1,...,en-1}
  Distribution of Z: N(TE[Y|X], TVar(Y)Tʹ) = N(TE[Y|X], Tσ²ITʹ) = N(TE[Y|X], σ²I), since TTʹ = I

Signal/Noise Decomposition
SSR and SSE are independent (continued)
  TY = Z, so Y = TʹZ
  SSE = squared length of the residual vector = Σ_{j>p} Zj²
  SSR = squared length of the fitted (projection) vector = Σ_{j≤p} Zj²
  The claim now follows: SSR and SSE are independent because (Z1,…,Zp) and (Zp+1,…,Zn-1) are independent
  (The SSE expression holds because the ejs are orthogonal with length 1)

Signal/Noise Decomposition
Under A1–A5, SSE, SSR, and their scaled ratio have convenient distributions
  Under A1–A2: E[Y|X] ∈ M, so E[Zj|X] = 0 for all j > p
  Recall {Z1,…,Zn-1} are mutually independent normal with variance σ²
  Thus SSE = Σ_{j>p} Zj² ~ σ² χ²_{n−p−1} under A1–A5 (a sum of k independent squared N(0,1) variables is χ²_k)

Signal/Noise Decomposition
Under A1–A5, SSE, SSR, and their scaled ratio have convenient distributions
  For j ≤ p, E[Zj|X] ≠ 0 in general
  Exception: under H0: β1 = … = βp = 0
  Then SSR = Σ_{j≤p} Zj² ~ σ² χ²_p under A1–A5, and
  F = (SSR/p) / (SSE/(n−p−1)) ~ F_{p,n−p−1}
  with numerator and denominator independent
  (An aside here on the t distribution)
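The scaled ratio (SSR/p) / (SSE/(n−p−1)) is the global F statistic. A sketch computing it from the sums of squares, on simulated data with genuinely nonzero slopes, so that F comes out large:

```python
# Sketch of the global F statistic F = (SSR/p) / (SSE/(n-p-1)) on
# simulated data (hypothetical example) where the slopes are nonzero.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(size=n)  # both slopes nonzero

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
SSR = np.sum((yhat - y.mean()) ** 2)
SSE = np.sum((y - yhat) ** 2)

MSR = SSR / p            # mean square for regression
MSE = SSE / (n - p - 1)  # mean square error: estimates sigma^2
F = MSR / MSE            # compare to the F(p, n-p-1) distribution
```

Under H0: β1 = β2 = 0, F would hover near 1; with the signal built in here it is far larger, so H0 would be rejected.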

Signal/Noise Decomposition
An organizational tool: the analysis of variance (ANOVA) table

  SOURCE       SS                 df       Mean square (SS/df)
  Regression   SSR                p        MSR = SSR/p
  Error        SSE                n−p−1    MSE = SSE/(n−p−1)
  Total        SST = SSR + SSE    n−1

  F = MSR/MSE

“Global” hypothesis tests
These involve sets of parameters: hypotheses of the form H0: βj = 0 for all j in a defined subset of {1,...,p} vs. H1: βj ≠ 0 for at least one such j
(Note the wording of the hypothesis: all = 0 vs. any ≠ 0)
Example 1: H0: βLATITUDE = 0 and βLONGITUDE = 0 [no association between geographical location and temperature]
Example 2: H0: all polynomial or spline coefficients involving a given variable = 0 [e.g., the LONGITUDE association is linear]
Example 3: H0: all coefficients involving a variable = 0

“Global” hypothesis tests
Testing method: sequential decomposition of sums of squares
  The hypothesis to be tested is H0: βj1 = … = βjk = 0 in the full model
  Fit the model excluding xj1,...,xjk; save its SSE as SSES
  Fit the “full” (larger) model, adding xj1,...,xjk to the smaller model; save its SSE as SSEL (often the overall SSE)
  Test statistic: S = [(SSES − SSEL)/k] / [SSEL/(n−p−1)]
  Distribution under the null: F(k, n−p−1)
  Define the rejection region based on this distribution
  Compute S; reject or not according to whether S falls in the rejection region
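The recipe above can be sketched in code. In this hypothetical simulated example the predictor being tested is truly unrelated to the outcome, so the extra sum of squares, and hence the statistic S, should typically be small:

```python
# Sketch of the sequential (extra-sum-of-squares) F test: fit a reduced
# and a full model and compare their residual sums of squares.
import numpy as np

rng = np.random.default_rng(4)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)               # predictor under test, unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def sse(y, xs):
    """Residual sum of squares for y ~ intercept + given predictors."""
    X = np.column_stack([np.ones(len(y))] + xs)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

SSE_S = sse(y, [x1])      # smaller model, excluding x2
SSE_L = sse(y, [x1, x2])  # larger ("full") model
p, k = 2, 1               # slopes in the full model; slopes being tested
S = ((SSE_S - SSE_L) / k) / (SSE_L / (n - p - 1))
# compare S to the F(k, n-p-1) distribution; here S is typically small
```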

Signal/Noise Decomposition
An augmented version for global testing:

  SOURCE       SS                 df       Mean square (SS/df)
  Regression   SSR                p        SSR/p
    X1         SST − SSES         p1
    X2|X1      SSES − SSEL        p2       (SSES − SSEL)/p2
  Error        SSEL               n−p−1    SSEL/(n−p−1)
  Total        SST = SSR + SSE    n−1

  F = MSR(2|1)/MSE

R-squared – Another view
From last lecture: Corr(Y, ŷ) squared
More conventional: R² = SSR/SST
The geometry justifies why they are the same:
  Cov(Y, ŷ) = Cov(Y − ŷ + ŷ, ŷ) = Cov(e, ŷ) + Var(ŷ)
  Covariance = inner product, so the first term = 0
R² is a measure of the precision with which the regression model describes individual responses
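The two views of R-squared can be checked numerically on simulated data (hypothetical example): the squared correlation of Y with the fitted values equals SSR/SST for least squares with an intercept.

```python
# Sketch: verify Corr(Y, yhat)^2 == SSR/SST on simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

r2_ss = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2
# the geometry (residuals orthogonal to yhat) makes the two coincide
```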

Outline: A few more topics
Collinearity
Overfitting
Influence
Mediation
Multiple comparisons

Main points
Confounding occurs when an apparent association between a predictor and an outcome reflects the association of each with a third variable
  A primary goal of regression is to “adjust” for confounding
The least squares decomposition of Y into fit and residual provides an appealing statistical testing framework
  An association of an outcome with predictors is evidenced if the SS due to regression is large relative to the SSE
  Geometry: the orthogonal decomposition provides a convenient sampling distribution, a view of R², and the ANOVA table