Chapter 12 Simple Linear Regression and Correlation

The Simple Linear Regression Model
Estimating Model Parameters
Inferences About the Slope Parameter β1
Inferences Concerning μY|X* and the Prediction of Future Y Values
Correlation

Testing for association between two POPULATION variables X and Y…
Categorical variables (categories of X vs. categories of Y): Chi-squared Test
Examples:
X = Disease status (D+, D–), Y = Exposure status (E+, E–)
X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)
Numerical variables: ???????
PARAMETERS: Means, Variances, Covariance

Parameter Estimation via SAMPLE DATA …
Numerical variables: n paired observations (x1, y1), (x2, y2), (x3, y3), (x4, y4), …, (xn, yn)
PARAMETERS (population): Means, Variances, Covariance
STATISTICS (sample): Means, Variances, Covariance (can be +, –, or 0)
[Figure: scatterplot of the n data points, Y versus X. JAMA. 2003;290:]
Does this suggest a linear trend between X and Y? If so, how do we measure it?
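The formula images for this slide did not survive in the transcript. As a hedged reconstruction, these are the standard sample statistics that estimate the population means, variances, and covariance. The symbol names are assumptions, chosen to match what R's mean(), var(), and cov() compute:

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i,
&
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i \\
s_x^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
&
s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 \\
s_{xy} &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\end{align*}
```

The covariance is the only one of these that can be negative: its sign follows the direction of the trend in the scatterplot.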

Testing for LINEAR association between two population variables X and Y…
Numerical variables
PARAMETERS: Means, Variances, Covariance, Linear Correlation Coefficient ρ
STATISTICS: Means, Variances, Covariance (can be +, –, or 0), Linear Correlation Coefficient r
The linear correlation coefficient is always between –1 and +1.
[Figure: scatterplot of the n data points, Y versus X. JAMA. 2003;290:]
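The defining formulas were images and are missing from the transcript. A hedged reconstruction using the standard definitions (the symbols σ and s for the standard deviations are assumptions) is:

```latex
\[
\rho = \frac{\sigma_{XY}}{\sigma_X \, \sigma_Y},
\qquad
r = \frac{s_{xy}}{s_x \, s_y},
\qquad
-1 \le \rho \le +1, \quad -1 \le r \le +1 .
\]
```

Dividing the covariance by both standard deviations removes the units of X and Y, which is why r is always confined to the interval from –1 to +1.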

Example in R (reformatted for brevity):
> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
> y = sample(pop, 10)
> c(mean(x), mean(y))     # means
> var(x)                  # variances
> var(y)
> plot(x, y, pch = 19)    # scatterplot, n = 10
> cov(x, y)               # covariance (can be +, –, or 0)
> cor(x, y)               # linear correlation coefficient, always between –1 and +1

Linear Correlation Coefficient r: always between –1 and +1
r measures the strength of linear association.
r > 0: positive linear correlation. r < 0: negative linear correlation.
[Figure: example scatterplots (n data points), Y versus X, with axis pairs IQ vs. Head circum, Body Temp vs. Age, and Profit vs. Price. JAMA. 2003;290:]
A strong positive correlation exists between ice cream sales and drowning. Cause & Effect? NOT LIKELY… “Temp (F)” is a confounding variable.
> cor(x, y)

Testing for linear association between two numerical population variables X and Y…
Now that we have r, we can conduct HYPOTHESIS TESTING on the population Linear Correlation Coefficient ρ (H0: ρ = 0).
Test Statistic for p-value: t = –2.935 on 8 degrees of freedom
> 2 * pt(-2.935, 8)
p-value = .0189 < .05
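The test statistic formula is missing from the transcript. The sketch below assumes the standard one, t = r·sqrt(n – 2) / sqrt(1 – r²) with n – 2 degrees of freedom, and applies it to the ten (x, y) pairs tabulated in the least-squares slides of this deck. The variable names are illustrative.

```r
# Ten (x, y) pairs from the deck's least-squares slides
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)

r <- cor(x, y)                                # sample linear correlation coefficient
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)     # assumed standard statistic for H0: rho = 0
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value, as in 2 * pt(-2.935, 8)

print(c(r = r, t = t_stat, p = p_value))
```

Using -abs(t_stat) keeps the two-sided calculation correct whether the sample correlation is negative, as here, or positive.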

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
Parameter Estimation via SAMPLE DATA …
Linear Correlation Coefficient: r measures the strength of linear association
> cor(x, y)
If such an association between X and Y exists, then for some intercept β0 and slope β1 we have…
“Response = Model + Error”: Y = β0 + β1 X + ε
Find estimates of β0 and β1 for the “best” line. Best in what sense??? The “Least Squares Regression Line”, i.e., the line that minimizes the sum of the squared residuals.
Residuals: observed response minus fitted response
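The minimization criterion and its solution were shown as images that are not in the transcript. A hedged reconstruction, using the standard least-squares results and assumed notation (hats for estimates, e_i for residuals), is:

```latex
\begin{align*}
\text{minimize}\quad
\sum_{i=1}^{n} e_i^2 &= \sum_{i=1}^{n} \bigl(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\bigr)^2 \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
              = \frac{s_{xy}}{s_x^2}
              = r \, \frac{s_y}{s_x} \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align*}
```

The last form of the slope shows the link to correlation: the fitted slope has the same sign as r, and it is zero exactly when r is zero.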

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
predictor X            1.1   1.8   2.1   3.7   4.0   7.3   9.1  11.9  12.4  17.1
observed response Y   13.1  18.3  17.6  19.1  19.3   3.2   5.6  13.6   8.0   3.0
fitted response       ~ EXERCISE ~
residuals             ~ EXERCISE ~
> cor(x, y)
Find estimates of β0 and β1 for the “best” line, i.e., the line that minimizes the sum of the squared residuals.
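The fitted responses and residuals are left as an exercise on the slide. As a minimal sketch of one way to obtain them, the code below fits the least-squares line to the tabulated data with lm(). The object names (lsreg, fitted_y, res) are illustrative.

```r
# Data from the table above
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)           # least-squares fit of y on x
b <- coef(lsreg)             # b[1] = intercept estimate, b[2] = slope estimate
fitted_y <- fitted(lsreg)    # fitted responses on the line
res <- resid(lsreg)          # residuals = observed - fitted

# Cross-check the slope against the closed form s_xy / s_x^2
slope_closed_form <- cov(x, y) / var(x)

print(b)
print(data.frame(x, y, fitted = round(fitted_y, 3), residual = round(res, 3)))
```

With an intercept in the model, the least-squares residuals sum to zero, which gives a quick check on hand calculations.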

Testing for linear association between two numerical population variables X and Y…
Now that we have these estimates, we can conduct HYPOTHESIS TESTING on the Linear Regression Coefficients β0 and β1.
“Response = Model + Error”
Test Statistic for p-value (slope β1): same t-score as for H0: ρ = 0! p-value = .0189
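The claim that the slope test and the correlation test share one t-score can be checked numerically. The sketch below uses the deck's ten data pairs and compares the slope row of summary(lm(...)) with cor.test(). The variable names are illustrative.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)
slope_row <- coef(summary(lsreg))["x", ]   # Estimate, Std. Error, t value, Pr(>|t|)
cor_test <- cor.test(x, y)                 # t test of H0: rho = 0

t_slope <- slope_row[["t value"]]
t_cor <- unname(cor_test$statistic)

print(c(t_slope = t_slope, t_cor = t_cor))
print(c(p_slope = slope_row[["Pr(>|t|)"]], p_cor = cor_test$p.value))
```

The agreement is algebraic, not a coincidence of this data set: in simple linear regression the slope estimate is r times the ratio of the sample standard deviations.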

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)   # or lsfit(x,y)
> abline(lsreg)
> summary(lsreg)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
…
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        … ***
x                  …          …       …        … *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: … on 8 degrees of freedom
Multiple R-squared: …, Adjusted R-squared: …
F-statistic: … on 1 and 8 DF, p-value: …
BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes…

ANOVA Table (the “Treatment” row of one-way ANOVA becomes the “Regression” row)
Source       df   SS   MS   F-ratio   p-value
Regression    1    ?    ?      ?         ?
Error         8    ?    ?
Total              ?    –

STATISTICS: Means, Variances
[Figure: scatterplot of the n data points (x1, y1), …, (xn, yn). JAMA. 2003;290:]
SSTot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).
SSReg is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).
SSErr is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).
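The formulas behind these three descriptions were images and are not in the transcript. A hedged reconstruction using the standard definitions (the hat notation for fitted responses is an assumption) is:

```latex
\begin{align*}
SS_{\mathrm{Tot}} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1)\, s_y^2 \\
SS_{\mathrm{Reg}} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
SS_{\mathrm{Err}} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{align*}
```

Each sum measures spread about a reference: observed responses about their mean, fitted responses about that same mean, and observed responses about the fitted line.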

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES
predictor X            1.1   1.8   2.1   3.7   4.0   7.3   9.1  11.9  12.4  17.1
observed response Y   13.1  18.3  17.6  19.1  19.3   3.2   5.6  13.6   8.0   3.0
fitted response       ~ EXERCISE ~
residuals             ~ EXERCISE ~
> cor(x, y)
SSTot = 9 × (sample variance of Y), from the observed responses
SSReg = 204.2, from the fitted responses
SSErr = sum of the squared residuals, which the least squares line makes a minimum
SSTot = SSReg + SSErr
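As a hedged numerical check of the decomposition SSTot = SSReg + SSErr, the sketch below computes the three sums of squares directly from their standard definitions for the tabulated data. The names ss_tot, ss_reg, and ss_err are illustrative.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)
y_hat <- fitted(lsreg)

ss_tot <- sum((y - mean(y))^2)       # variability in the observed responses
ss_reg <- sum((y_hat - mean(y))^2)   # variability in the fitted responses
ss_err <- sum((y - y_hat)^2)         # variability in the residuals

print(c(SSTot = ss_tot, SSReg = ss_reg, SSErr = ss_err))
print(all.equal(ss_tot, ss_reg + ss_err))   # decomposition holds up to rounding
```

The same ss_tot can be obtained as (n – 1) times var(y), which is the "9 × sample variance" route for n = 10.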

ANOVA Table
Source       df   SS        MS        F-ratio   p-value
Regression    1   204.200   204.200   8.61349   0.018857
Error         8   189.656    23.707
Total         9   393.856   –
In general, MS = SS / df, the F-ratio is MSReg / MSErr, and it is referred to the F distribution with k – 1 and n – k degrees of freedom (here 1 and 8), giving a p-value with 0 < p < 1.
The p-value is the same as before!
> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value Pr(>F)
x            …      …       …       …      … *
Residuals    …      …       …
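The ANOVA quantities can also be pulled out programmatically. The following sketch uses anova() on the fitted model (an alternative to summary(aov(lsreg)) that yields the same table for this model) and reads the F-ratio, p-value, and error mean square by row and column name.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)
tab <- anova(lsreg)                      # rows "x" and "Residuals"

ms_reg <- tab["x", "Mean Sq"]            # regression mean square (df = 1)
ms_err <- tab["Residuals", "Mean Sq"]    # error mean square (df = 8)
f_ratio <- tab["x", "F value"]           # equals ms_reg / ms_err
p_val <- tab["x", "Pr(>F)"]

# The F-ratio is the square of the slope's t-score
t_slope <- coef(summary(lsreg))["x", "t value"]

print(tab)
print(c(F = f_ratio, t_squared = t_slope^2, p = p_val))
```

Because the numerator has one degree of freedom, the F test and the two-sided t test on the slope are the same test, which is why the p-values agree.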

Coefficient of Determination
From the ANOVA table: SSReg = 204.200 on 1 df, MSErr = 23.707 on 8 df, F-ratio = 8.61349, p-value = 0.018857.
The least squares regression line accounts for 51.85% of the total variability in the observed response (SSReg as a proportion of SSTot), with 48.15% remaining. Moreover, this proportion equals the square of the linear correlation coefficient:
> cor(x, y)

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)
> abline(lsreg)
> summary(lsreg)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
…
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        … ***
x                  …          …       …        … *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: … on 8 degrees of freedom
Multiple R-squared: …, Adjusted R-squared: …
F-statistic: … on 1 and 8 DF, p-value: …
Coefficient of Determination: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. This is the “Multiple R-squared” entry of the output.
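As a small sketch of where the 51.85% figure comes from, the code below reads the coefficient of determination from the model summary and compares it with the squared correlation and with SSReg / SSTot. The variable names are illustrative.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)

r2_summary <- summary(lsreg)$r.squared    # "Multiple R-squared" in the printout
r2_from_cor <- cor(x, y)^2                # square of the linear correlation coefficient
r2_from_ss <- sum((fitted(lsreg) - mean(y))^2) / sum((y - mean(y))^2)   # SSReg / SSTot

print(round(100 * c(summary = r2_summary, cor_squared = r2_from_cor, ss_ratio = r2_from_ss), 2))
```

The three routes agree only for simple linear regression with an intercept; with several predictors the squared pairwise correlation no longer equals R-squared.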

Summary of Linear Correlation and Simple Linear Regression
Given: paired data (x1, y1), (x2, y2), (x3, y3), (x4, y4), …, (xn, yn)
Means, Variances, Covariance
Linear Correlation Coefficient: –1 ≤ r ≤ +1, measures the strength of linear association
Least Squares Regression Line: minimizes SSErr = SSTot – SSReg (ANOVA)
Coefficient of Determination: proportion of total variability modeled by the regression line’s variability.
All point estimates can be upgraded to CIs for hypothesis testing, etc.
[Figure: scatterplot, Y versus X, with the least squares regression line and its upper and lower 95% confidence bands. JAMA. 2003;290:]
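The slide does not show how the confidence intervals and bands are computed. One possible sketch in R, assuming the usual normal-error model, uses confint() for the coefficients and predict() with interval = "confidence" for the band around the fitted line. The grid new_x and the object names are illustrative.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)

# 95% confidence intervals for the intercept and slope
ci <- confint(lsreg, level = 0.95)
print(ci)

# Pointwise 95% confidence band for the mean response along a grid of x values
new_x <- data.frame(x = seq(min(x), max(x), length.out = 50))
band <- predict(lsreg, newdata = new_x, interval = "confidence", level = 0.95)

plot(x, y, pch = 19)
abline(lsreg)
lines(new_x$x, band[, "lwr"], lty = 2)   # lower 95% confidence band
lines(new_x$x, band[, "upr"], lty = 2)   # upper 95% confidence band
```

A 95% interval for the slope that excludes 0 corresponds to rejecting the hypothesis of zero slope at the .05 level, which ties the interval back to the p-value of .0189. Replacing "confidence" with "prediction" would give the wider band for a future Y value.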

Analysis of Variance (ANOVA)
Recall ~ Analysis of Variance (ANOVA): k ≥ 2 independent, equivariant, normally-distributed “treatment groups”, with H0: μ1 = μ2 = … = μk.
MODEL ASSUMPTIONS? “Regression Diagnostics”
In regression, the errors are likewise assumed to be normally distributed with constant variance, and independent.
Residual plot: Want to see a random scatterplot evenly distributed about 0, consistent with bell curves having constant variance.
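A residual plot of the kind described can be sketched as follows for the deck's ten data pairs. Plotting residuals against fitted values and adding a normal Q-Q plot are common choices, assumed here because the original figure did not survive.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)
fit <- fitted(lsreg)
res <- resid(lsreg)

# Residuals versus fitted values: look for an even, patternless band about 0
plot(fit, res, pch = 19, xlab = "fitted response", ylab = "residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line are consistent with bell-shaped errors
qqnorm(res)
qqline(res)
```

With only ten points these plots can flag gross violations such as a funnel shape or strong curvature, but they cannot confirm the assumptions.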

When the residual plot reveals a problem:
Errors are autocorrelated; may need to use specialized “time series” methods.
“Variance stabilizing” formulas, or “Weighted” Least Squares (WLS).
Model = “Polynomial Regression” (but still considered to be linear regression in the beta coefficients).
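The model formulas on this slide were lost. As a hedged illustration of two of the named remedies, the sketch below fits a quadratic polynomial regression and a weighted least squares line to the deck's data. The quadratic degree and the weights 1/x are hypothetical choices made only for illustration, not values taken from the slides.

```r
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)                        # straight-line fit for comparison

# Polynomial regression: quadratic in x, but still linear in the beta coefficients
poly_fit <- lm(y ~ x + I(x^2))

# Weighted least squares: hypothetical weights, assuming error variance proportional to x
w <- 1 / x
wls_fit <- lm(y ~ x, weights = w)

print(coef(poly_fit))
print(coef(wls_fit))
print(c(SSErr_line = deviance(lsreg), SSErr_quadratic = deviance(poly_fit)))
```

Adding a term can never increase the residual sum of squares, so a smaller SSErr for the quadratic fit is not by itself evidence that the extra term is needed.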

186.2, 202.4, 209.0, 220.3, 234.3, 234.5, 237.3, 247.6, 256.2, 259.2, 270.0 See textbook section 12.4
