CHAPTER 7 Linear Correlation & Regression Methods


1 CHAPTER 7 Linear Correlation & Regression Methods
7.1 - Motivation
7.2 - Correlation / Simple Linear Regression
7.3 - Extensions of Simple Linear Regression

2 Testing for association between two POPULATION variables X and Y…
Parameter Estimation via SAMPLE DATA …
Categorical variables (categories of X vs. categories of Y) → Chi-squared Test
Numerical variables → ???????
PARAMETERS: Means μ_X, μ_Y; Variances σ_X², σ_Y²; Covariance σ_XY
Examples: X = Disease status (D+, D–), Y = Exposure status (E+, E–); X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

3 Parameter Estimation via SAMPLE DATA …
Numerical variables → ???????
PARAMETERS: Means μ_X, μ_Y; Variances σ_X², σ_Y²; Covariance σ_XY (can be +, –, or 0)
STATISTICS: Means x̄, ȳ; Variances s_x², s_y²; Covariance s_xy (can be +, –, or 0)

4 Parameter Estimation via SAMPLE DATA …
Numerical variables: sample (x1, y1), (x2, y2), …, (xn, yn); scatterplot of the n data points, X horizontal, Y vertical (JAMA. 2003;290:).
PARAMETERS: Means μ_X, μ_Y; Variances σ_X², σ_Y²; Covariance σ_XY
STATISTICS: Means x̄, ȳ; Variances s_x², s_y²; Covariance s_xy (can be +, –, or 0)

5 Parameter Estimation via SAMPLE DATA …
Numerical variables: sample (x1, y1), (x2, y2), …, (xn, yn); scatterplot of the n data points (JAMA. 2003;290:).
PARAMETERS: Means μ_X, μ_Y; Variances σ_X², σ_Y²; Covariance σ_XY
STATISTICS: Means x̄, ȳ; Variances s_x², s_y²; Covariance s_xy (can be +, –, or 0)
Does this suggest a linear trend between X and Y? If so, how do we measure it?

6 Testing for LINEAR association between two population variables X and Y…
Numerical variables → ???????
PARAMETERS: Means μ_X, μ_Y; Variances σ_X², σ_Y²; Covariance σ_XY
Linear Correlation Coefficient: ρ = σ_XY / (σ_X σ_Y), always between –1 and +1

7 Parameter Estimation via SAMPLE DATA …
Numerical variables: sample (x1, y1), (x2, y2), …, (xn, yn); scatterplot of the n data points (JAMA. 2003;290:).
PARAMETERS: Means μ_X, μ_Y; Variances σ_X², σ_Y²; Covariance σ_XY; Linear Correlation Coefficient ρ = σ_XY / (σ_X σ_Y)
STATISTICS: Means x̄, ȳ; Variances s_x², s_y²; Covariance s_xy (can be +, –, or 0); Linear Correlation Coefficient r = s_xy / (s_x s_y), always between –1 and +1

8 Parameter Estimation via SAMPLE DATA …
Example in R (reformatted for brevity), drawing a sample of n = 10 points from a population grid:
> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
> y = sample(pop, 10)
> c(mean(x), mean(y))
> var(x)
> var(y)
> cov(x, y)                  # can be +, –, or 0
> cor(x, y)                  # always between –1 and +1
> plot(x, y, pch = 19)       # scatterplot (n data points)

9 Parameter Estimation via SAMPLE DATA …
Numerical variables: sample (x1, y1), …, (xn, yn); scatterplot (n data points; JAMA. 2003;290:).
Linear Correlation Coefficient r = s_xy / (s_x s_y), always between –1 and +1; r measures the strength of linear association.

10 Parameter Estimation via SAMPLE DATA …
Linear Correlation Coefficient r = s_xy / (s_x s_y), always between –1 and +1; r measures the strength of linear association.
r > 0: positive linear correlation; r < 0: negative linear correlation.

11 Parameter Estimation via SAMPLE DATA …
Linear Correlation Coefficient r, always between –1 and +1; r measures the strength of linear association.
r > 0: positive linear correlation; r < 0: negative linear correlation.

12 Parameter Estimation via SAMPLE DATA …
Linear Correlation Coefficient r, always between –1 and +1; r measures the strength of linear association.
r > 0: positive linear correlation; r < 0: negative linear correlation.

13 Parameter Estimation via SAMPLE DATA …
Linear Correlation Coefficient r, always between –1 and +1; r measures the strength of linear association.
r > 0: positive linear correlation; r < 0: negative linear correlation.
> cor(x, y)

14 Test Statistic for p-value
Testing for linear association between two numerical population variables X and Y…
Now that we have r, we can conduct HYPOTHESIS TESTING on ρ, the population Linear Correlation Coefficient: H0: ρ = 0 vs. HA: ρ ≠ 0.
Test statistic for the p-value: t = r √(n – 2) / √(1 – r²), on n – 2 df. Here t = –2.935 on 8 df, so 2 * pt(-2.935, 8) gives p-value = .0189 < .05.
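A quick computer check of this test (a minimal R sketch; the x and y values are the ten sample points used throughout this example, so the printed results are recomputed here and approximate):
> x = c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
> y = c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
> r = cor(x, y)                                  # about -0.72
> t = r * sqrt(length(x) - 2) / sqrt(1 - r^2)    # about -2.935, with n - 2 = 8 df
> 2 * pt(-abs(t), df = length(x) - 2)            # two-sided p-value, about .019
> cor.test(x, y)                                 # built-in test gives the same t and p-value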

15 “Response = Model + Error”
Parameter Estimation via SAMPLE DATA …
If such an association between X and Y exists, then for any intercept β0 and slope β1 we can write the model Y = β0 + β1 X + ε, i.e., “Response = Model + Error,” with residuals e_i = y_i – ŷ_i.
> cor(x, y)
Find estimates β̂0 and β̂1 for the “best” line ŷ = β̂0 + β̂1 x … but “best” in what sense???

16 “Response = Model + Error”
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
If such an association between X and Y exists, then for any intercept β0 and slope β1 we can write Y = β0 + β1 X + ε, with residuals e_i = y_i – ŷ_i.
> cor(x, y)
Find estimates β̂0 and β̂1 for the “best” line, the “Least Squares Regression Line,” i.e., the line that minimizes the sum of squared residuals Σ (y_i – ŷ_i)².

17 “Response = Model + Error”
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
> cor(x, y)
Find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes Σ (y_i – ŷ_i)². The least squares solution is β̂1 = s_xy / s_x² and β̂0 = ȳ – β̂1 x̄. (Check ✓)

18 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
predictor X:         1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response Y: 13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0
> cor(x, y)
Find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes the sum of squared residuals.
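A minimal R sketch of the least squares computation for this table (the numeric results are recomputed here from the data shown, so treat them as approximate):
> x = c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
> y = c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
> b1 = cov(x, y) / var(x)       # slope estimate s_xy / s_x², about -0.88
> b0 = mean(y) - b1 * mean(x)   # intercept estimate ȳ - b1 x̄, about 18.26
> c(b0, b1)
> coef(lm(y ~ x))               # the built-in fit returns the same two estimates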

19 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
predictor X:         1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response Y: 13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0
fitted responses ŷ_i = β̂0 + β̂1 x_i
> cor(x, y)
Find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes Σ (y_i – ŷ_i)².

20 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES ~ EXERCISE
predictor X, observed responses Y, and fitted responses ŷ_i = β̂0 + β̂1 x_i as above.

21 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES ~ EXERCISE
predictor X, observed responses Y, fitted responses ŷ_i, and residuals e_i = y_i – ŷ_i.

22 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES ~ EXERCISE
predictor X, observed responses Y, fitted responses ŷ_i, and residuals e_i = y_i – ŷ_i.
Find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes the sum of squared residuals Σ e_i².

23 Test Statistic for p-value?
Testing for linear association between two numerical population variables X and Y…
“Response = Model + Error”: Y = β0 + β1 X + ε.
Now that we have the estimates β̂0 and β̂1, we can conduct HYPOTHESIS TESTING on the Linear Regression Coefficients β0 and β1, e.g., H0: β1 = 0 (no linear association). Test statistic for the p-value: t = β̂1 / SE(β̂1), on n – 2 df.

24 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES ~ EXERCISE
predictor X:         1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response Y: 13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0
fitted responses ŷ_i, residuals e_i = y_i – ŷ_i
> cor(x, y)
Find estimates β̂0 and β̂1 for the “best” line, i.e., the line that minimizes the sum of squared residuals.

25 Test Statistic for p-value
Testing for linear association between two numerical population variables X and Y…
Now that we have the estimates, we can conduct HYPOTHESIS TESTING on the Linear Regression Coefficients β0 and β1 (“Response = Model + Error”). For H0: β1 = 0, the test statistic is t = β̂1 / SE(β̂1) = –2.935 on 8 df: the same t-score as for H0: ρ = 0! p-value = .0189

26 BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???
> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)     # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)
Call: lm(formula = y ~ x)
Residuals: Min, 1Q, Median, 3Q, Max
Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                          ***
x                                 -2.935   0.0189    *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
F-statistic: 8.613 on 1 and 8 DF, p-value: 0.0189
BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes…

27 ANOVA Table (recall, one-way ANOVA with k treatment groups and n observations)
Source      df     SS      MS                     F-ratio           p-value
Treatment   k–1    SSTrt   MSTrt = SSTrt/(k–1)    MSTrt/MSErr
Error       n–k    SSErr   MSErr = SSErr/(n–k)
Total       n–1    SSTot

28 ANOVA Table
Source       df    SS    MS    F-ratio    p-value
Regression   ?     ?     ?     ?          ?
Error        ?     ?     ?
Total        ?     ?

29 ANOVA Table
Source       df    SS    MS    F-ratio    p-value
Regression   1     ?     ?     ?          ?
Error        ?     ?     ?
Total        ?     ?

30 Test Statistic for p-value
Testing for linear association between two numerical population variables X and Y…
Now that we have the estimates, we can conduct HYPOTHESIS TESTING on the Linear Regression Coefficients β0 and β1 (“Response = Model + Error”). For H0: β1 = 0, the test gives the same t-score as for H0: ρ = 0! p-value = .0189

31 ANOVA Table
Source       df    SS    MS    F-ratio    p-value
Regression   1     ?     ?     ?          ?
Error        8     ?     ?
Total        9     ?

32 Parameter Estimation via SAMPLE DATA …
Numerical variables: sample (x1, y1), …, (xn, yn); scatterplot (n data points; JAMA. 2003;290:).
STATISTICS: Means x̄, ȳ; Variances s_x², s_y²

33 Parameter Estimation via SAMPLE DATA …
STATISTICS: Means x̄, ȳ; Variances s_x², s_y². Scatterplot (n data points; JAMA. 2003;290:).
SSTot = Σ (y_i – ȳ)² is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

34 Parameter Estimation via SAMPLE DATA …
SSReg = Σ (ŷ_i – ȳ)² is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

35 Parameter Estimation via SAMPLE DATA …
SSErr = Σ (y_i – ŷ_i)² is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

36 SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES ~ EXERCISE
predictor X:         1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response Y: 13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0
fitted responses ŷ_i, residuals e_i = y_i – ŷ_i
> cor(x, y)
SSReg = Σ (ŷ_i – ȳ)² = 204.2; SSTot = Σ (y_i – ȳ)² = 9 · s_y²; SSErr = Σ (y_i – ŷ_i)²

37 SSTot = SSReg + SSErr ~ EXERCISE
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
predictor X:         1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
observed response Y: 13.1 18.3 17.6 19.1 19.3 3.2  5.6  13.6  8.0   3.0
fitted responses ŷ_i, residuals e_i = y_i – ŷ_i
> cor(x, y)
SSReg = 204.2; SSErr = 189.656 (the minimum achieved by the least squares line); SSTot = SSReg + SSErr = 393.856.
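This decomposition is easy to verify in R (a sketch, assuming the x, y, and lsreg = lm(y ~ x) objects from before; the totals quoted above are recomputed and approximate):
> yhat = fitted(lsreg)
> SSTot = sum((y - mean(y))^2)      # total variability in the observed responses
> SSReg = sum((yhat - mean(y))^2)   # variability in the fitted responses
> SSErr = sum((y - yhat)^2)         # variability in the residuals
> c(SSTot, SSReg + SSErr)           # the two totals agree: SSTot = SSReg + SSErr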

38 ANOVA Table
Source       df    SS        MS                  F-ratio                          p-value
Regression   1     204.200   MSReg = SSReg/1     MSReg/MSErr ~ F_{k–1, n–k}       0 < p < 1
Error        8     SSErr     MSErr = SSErr/8
Total        9     SSTot

39 ANOVA Table
Source       df    SS        MS        F-ratio    p-value
Regression   1     204.200   204.200   8.613      0.0189
Error        8     189.656   23.707
Total        9     393.856
Same as before!

40
Source       df    SS        MS        F-ratio    p-value
Regression   1     204.200             8.613      0.0189
Error        8               23.707
Total        9
> summary(aov(lsreg))
            Df  Sum Sq  Mean Sq  F value  Pr(>F)
x            1  204.20   204.20    8.613  0.0189 *
Residuals    8  189.66    23.71

41
Source       df    SS        MS        F-ratio    p-value
Regression   1     204.200   204.200   8.61349    0.018857
Error        8     189.656   23.707
Total        9     393.856
Coefficient of Determination: r² = SSReg / SSTot = 204.200 / 393.856 = 0.5185.
The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. Moreover, r² is the square of the sample linear correlation coefficient r.

42 Coefficient of Determination
> cor(x, y)
r² = SSReg / SSTot = 0.5185: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. Moreover, r² = (cor(x, y))², the square of the sample linear correlation coefficient.
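In R this relationship is a one-liner (a sketch, assuming the same x, y, and lsreg objects as before):
> cor(x, y)^2                 # square of the sample correlation, about 0.5185
> summary(lsreg)$r.squared    # matches the Multiple R-squared reported by summary(lsreg)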

43
> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)
> abline(lsreg)
> summary(lsreg)
Call: lm(formula = y ~ x)
Residuals: Min, 1Q, Median, 3Q, Max
Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                          ***
x                                 -2.935   0.0189    *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
F-statistic: 8.613 on 1 and 8 DF, p-value: 0.0189
Coefficient of Determination: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

44 Summary of Linear Correlation and Simple Linear Regression
Given: sample (x1, y1), (x2, y2), …, (xn, yn) on two numerical variables X and Y (scatterplot; JAMA. 2003;290:).
Means x̄, ȳ; Variances s_x², s_y²; Covariance s_xy.
Linear Correlation Coefficient r = s_xy / (s_x s_y), with –1 ≤ r ≤ +1; measures the strength of linear association.
Least Squares Regression Line ŷ = β̂0 + β̂1 x, which minimizes SSErr = Σ (y_i – ŷ_i)² = SSTot – SSReg (ANOVA).
All point estimates can be upgraded to CIs for hypothesis testing, etc.

45 Summary of Linear Correlation and Simple Linear Regression
95% Confidence Intervals for the Means, Variances, and Covariance (see notes for “95% prediction intervals”).
Given: sample (x1, y1), …, (xn, yn); scatterplot with the least squares regression line, the upper 95% confidence band, and the lower 95% confidence band.
Linear Correlation Coefficient r, –1 ≤ r ≤ +1, measures the strength of linear association.
Least Squares Regression Line minimizes SSErr = SSTot – SSReg (ANOVA).
All point estimates can be upgraded to CIs for hypothesis testing, etc.
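The bands shown on this slide can be reproduced with predict() (a minimal sketch, assuming lsreg = lm(y ~ x) from before; the grid of new x values is just for plotting):
> newx = data.frame(x = seq(min(x), max(x), length = 50))
> conf = predict(lsreg, newx, interval = "confidence")   # 95% confidence band for the mean response
> pred = predict(lsreg, newx, interval = "prediction")   # 95% prediction band for a new observation
> plot(x, y, pch = 19); abline(lsreg)
> lines(newx$x, conf[, "lwr"], lty = 2); lines(newx$x, conf[, "upr"], lty = 2)
> lines(newx$x, pred[, "lwr"], lty = 3); lines(newx$x, pred[, "upr"], lty = 3)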

46 Summary of Linear Correlation and Simple Linear Regression
Given: sample (x1, y1), …, (xn, yn); Means x̄, ȳ; Variances s_x², s_y²; Covariance s_xy.
Linear Correlation Coefficient r = s_xy / (s_x s_y), –1 ≤ r ≤ +1, measures the strength of linear association.
Least Squares Regression Line minimizes SSErr = SSTot – SSReg (ANOVA).
Coefficient of Determination r² = SSReg / SSTot: the proportion of total variability modeled by the regression line’s variability.
All point estimates can be upgraded to CIs for hypothesis testing, etc.

47 Multilinear Regression
Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.
“Response = Model + Error”: Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + ε, where β1, β2, β3, … are the “main effects.”
For now, assume the “additive model,” i.e., main effects only.

48 Multilinear Regression
For each data point with predictors (x1i, x2i), the true response y_i equals the fitted response ŷ_i plus the residual, on a fitted surface over the (X1, X2) plane.
Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)!
Once calculated, how do we then test the null hypothesis? ANOVA
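A sketch of what that matrix computation looks like in R (the predictors x1 and x2 and the response y here are illustrative; the least squares solution solves the “normal equations” (XᵀX)β = Xᵀy):
> X = cbind(1, x1, x2)                  # design matrix: a column of 1's plus the predictors
> beta = solve(t(X) %*% X, t(X) %*% y)  # least squares solution of the normal equations
> beta
> coef(lm(y ~ x1 + x2))                 # the built-in fit returns the same coefficients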

49 Multilinear Regression
Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.
“Response = Model + Error”: Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + ε (“main effects”).
R code example: lsreg = lm(y ~ x1 + x2 + x3)

50 Multilinear Regression
Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.
“Response = Model + Error”: main effects, plus quadratic terms, cubic terms, etc. (“polynomial regression”).
R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))    # powers must be wrapped in I() inside a formula
R code example: lsreg = lm(y ~ x1 + x2 + x3)

51 Multilinear Regression
Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.
“Response = Model + Error”: main effects, quadratic terms, etc. (“polynomial regression”), and “interactions.”
R code example: lsreg = lm(y ~ x1*x2)                # shorthand for main effects plus interaction
R code example: lsreg = lm(y ~ x1 + x2 + x1:x2)      # equivalent, written out explicitly
R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))

52

53

54

55

56 Recall… Multiple Linear Regression with interaction, with an indicator (“dummy”) variable
Example in R (reformatted for brevity):
> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)
Coefficients:
            Estimate
(Intercept)
x
I
x:I
Suppose these are actually two subgroups (the I = 1 points and the I = 0 points), requiring two distinct linear regressions!
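Reading the two subgroup lines off the four printed coefficients (a sketch; b here just collects the estimates from the fit above):
> b = coef(lsreg)               # order: (Intercept), x, I, x:I
> c(b[1], b[2])                 # I = 0 subgroup: intercept b0, slope b1
> c(b[1] + b[3], b[2] + b[4])   # I = 1 subgroup: intercept b0 + b2, slope b1 + b3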

57 ANOVA Table (revisited)
From a sample of n data points, the fitted model is ŷ = β̂0 + β̂1 x1 + β̂2 x2 + … Note that if the null hypothesis H0: β1 = β2 = … = 0 were true, then it would follow that the model reduces to Y = β0 + ε, and the fitted response is simply ȳ.
But how are these regression coefficients calculated in general? “Normal equations” solved via computer (intensive).

58 ANOVA Table (revisited), based on n data points
Source       df    SS       MS       F              p-value
Regression         SSReg    MSReg    MSReg/MSErr
Error              SSErr    MSErr
Total              SSTot
*** How are only the statistically significant variables determined? ***

59 “MODEL SELECTION” (BE = backward elimination)
Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + … If significant, then…
Step 1. Individual t-tests on each coefficient, with p-values p1 < p2 < p4 < .05 (reject H0 for X1, X2, and X4), while the t-test for X3 accepts H0.
Step 2. Are all coefficients significant at level α? If not…

60 “MODEL SELECTION” (BE)
Step 0. Overall F-test of significance (via ANOVA) of the full model; if significant, then…
Step 1. t-tests on each coefficient: p1 < p2 < p4 < .05 (reject H0 for X1, X2, X4); the test for X3 accepts H0.
Step 2. Are all coefficients significant at level α? If not… delete that term: the model becomes Y = β0 + β1 X1 + β2 X2 + β4 X4 + …

61 “MODEL SELECTION” (BE)
Step 0. Overall F-test of significance (via ANOVA) of the full model; if significant, then…
Step 1. t-tests on each coefficient: p1 < p2 < p4 < .05 (reject H0 for X1, X2, X4); the test for X3 accepts H0.
Step 2. Are all coefficients significant at level α? If not… delete that term, and recompute new coefficients for the model Y = β0 + β1 X1 + β2 X2 + β4 X4 + …
Step 3. Repeat Steps 1-2 as necessary until all remaining coefficients are significant → reduced model
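One way to carry out these steps in R (a sketch; the data frame dat and the predictors x1 to x4 are illustrative, and drop1() with an F test is just one convenient way to get the per-term p-values):
> full = lm(y ~ x1 + x2 + x3 + x4, data = dat)
> drop1(full, test = "F")              # per-term F tests, equivalent to the t-tests above
> reduced = update(full, . ~ . - x3)   # delete the non-significant term...
> summary(reduced)                     # ...and recompute the remaining coefficients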

62 Analysis of Variance (ANOVA)
Recall ~ k ≥ 2 independent, equivariant, normally-distributed “treatment groups.”
H0: μ1 = μ2 = … = μk
MODEL ASSUMPTIONS?

63 “Regression Diagnostics”

64

65

66

67

68 “Polynomial Regression”
Model: Y = β0 + β1 X + β2 X² + β3 X³ + … + ε
(but still considered to be linear regression, because the model is linear in the beta coefficients)
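A sketch of fitting such a model in R; inside a model formula, powers of a predictor have to be wrapped in I() (or generated with poly()), since ^ otherwise has a special formula meaning:
> lsreg = lm(y ~ x + I(x^2) + I(x^3))        # cubic polynomial, still linear in the betas
> lsreg2 = lm(y ~ poly(x, 3, raw = TRUE))    # equivalent fit using poly()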

69

70

71

72

73 Re-plot data on a “log-log” scale.

74

75

76 Re-plot data on a “log” scale (of Y only).
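In R, both kinds of re-plotting need only the log argument to plot() (a sketch):
> plot(x, y, log = "xy", pch = 19)   # log-log scale: a power relation y = c·x^b plots as a straight line
> plot(x, y, log = "y", pch = 19)    # log scale on Y only: an exponential relation y = c·e^(b·x) plots as a straight line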

77 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)

78 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)

79 “MAXIMUM LIKELIHOOD ESTIMATION”
Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)
Model the “log-odds” (“logit”): ln[π / (1 – π)] = β0 + β1 X1 + β2 X2 + …, where π = P(Y = 1). The logit is an example of a general “link function.”
The coefficients are fit by MAXIMUM LIKELIHOOD ESTIMATION. (Note: not based on Least Squares, which implies a “pseudo-R²,” etc.)
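A minimal R sketch of fitting such a model by maximum likelihood (the variable names surgery, age, and sex are illustrative; surgery is coded 0/1):
> fit = glm(surgery ~ age + sex, family = binomial)   # the logit link is the default for binomial
> summary(fit)                                        # z-tests on the beta coefficients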

80 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)
“log-odds” (“logit”): ln[π / (1 – π)] = β0 + β1 X1 + … Suppose one of the predictor variables is binary… Write the model once with that predictor equal to 1 and once with it equal to 0, and SUBTRACT!

81 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)
“log-odds” (“logit”): subtracting the two expressions leaves ln(odds when the binary predictor = 1) – ln(odds when it = 0) = β, the coefficient of that predictor. SUBTRACT!

82 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)
“log-odds” (“logit”) Suppose one of the predictor variables is binary…

83 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)
“log-odds” (“logit”) Suppose one of the predictor variables is binary…

84 Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)
“log-odds” (“logit”). Suppose one of the predictor variables is binary…
ODDS RATIO: ln(OR) = β implies OR = e^β.
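In R, exponentiating the fitted coefficients recovers the estimated odds ratios (a sketch, continuing the illustrative glm fit above):
> exp(coef(fit))      # e^beta: estimated odds ratios
> exp(confint(fit))   # corresponding 95% confidence intervals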

85 Exponential and logistic growth in population dynamics
Unrestricted population growth (e.g., bacteria): population size y obeys the law dy/dt = a·y, with constant a > 0 and an initial condition y(0) = y0 → exponential growth.
Restricted population growth (disease, predation, starvation, etc.): population size y obeys the law dy/dt = a·y·(1 – y/M), with constant a > 0 and “carrying capacity” M → logistic growth.
Let survival probability π = y / M.

