Chapter 12 Simple Linear Regression and Correlation


Chapter 12 Simple Linear Regression and Correlation
12.1 - The Simple Linear Regression Model
12.2 - Estimating Model Parameters
12.3 - Inferences About the Slope Parameter β1
12.4 - Inferences Concerning μY|X* and the Prediction of Future Y Values
12.5 - Correlation

Testing for association between two POPULATION variables X and Y, with parameter estimation via SAMPLE DATA:

Categorical variables → Chi-squared Test, based on a contingency table of the categories of X vs. the categories of Y. Examples: X = Disease status (D+, D–) vs. Y = Exposure status (E+, E–); X = # children in household (0, 1-2, 3-4, 5+) vs. Y = Income level (Low, Middle, High).

Numerical variables → ??? Here the relevant population PARAMETERS are the means μX and μY, the variances σX² and σY², and the covariance σXY.

Parameter Estimation via SAMPLE DATA … For numerical variables, the population PARAMETERS are estimated by the corresponding sample STATISTICS:

Means: μX, μY estimated by x̄ = (1/n) Σ xi and ȳ = (1/n) Σ yi
Variances: σX², σY² estimated by sx² = Σ (xi – x̄)² / (n – 1) and sy² = Σ (yi – ȳ)² / (n – 1)
Covariance: σXY estimated by sxy = Σ (xi – x̄)(yi – ȳ) / (n – 1)  (can be +, –, or 0)

Parameter Estimation via SAMPLE DATA … The n paired observations (x1, y1), (x2, y2), …, (xn, yn) are displayed as a scatterplot of Y vs. X (n data points). [Scatterplot: IQ vs. head circumference; JAMA. 2003;290:1486-1493] Does the scatterplot suggest a linear trend between X and Y? If so, how do we measure it?

Testing for LINEAR association between two population variables X and Y … The population Linear Correlation Coefficient is ρ = σXY / (σX σY); its sample estimate is r = sxy / (sx sy). Both are always between –1 and +1.

Parameter Estimation via SAMPLE DATA … Example in R (reformatted for brevity), simulating a sample of n = 10 data points:

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
> y = sample(pop, 10)
13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0
> plot(x, y, pch = 19)    # scatterplot, n = 10 data points
> c(mean(x), mean(y))
7.05 12.08
> var(x)
29.48944
> var(y)
43.76178
> cov(x, y)
-25.86667
> cor(x, y)
-0.7200451
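As a quick check (a sketch, assuming the x and y vectors above), r can be recovered directly from the sample covariance and standard deviations:

> cov(x, y) / (sd(x) * sd(y))    # sxy / (sx * sy)
[1] -0.7200451                   # matches cor(x, y)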

The Linear Correlation Coefficient r is always between –1 and +1 and measures the strength of linear association: values near +1 indicate strong positive linear correlation, values near –1 indicate strong negative linear correlation, and values near 0 indicate little or no linear correlation. Example scatterplots: IQ vs. head circumference (JAMA. 2003;290:1486-1493), body temperature vs. age, profit vs. price.

Caution: correlation is not causation. A strong positive correlation exists between ice cream sales and drownings. Cause & effect? NOT LIKELY… "Temp (F)" is a confounding variable: hot weather drives both ice cream sales and swimming (and hence drownings).

For our simulated sample, r = cor(x, y) = -0.7200451, a moderately strong negative linear correlation.

Testing for linear association between two numerical population variables X and Y … Now that we have r, we can conduct HYPOTHESIS TESTING on the population Linear Correlation Coefficient ρ. For H0: ρ = 0 vs. HA: ρ ≠ 0, the test statistic for the p-value is

t = r √(n – 2) / √(1 – r²) ~ t(n – 2) under H0.

Here t = (-0.7200451) √8 / √(1 – 0.51846) = -2.935 on 8 df, so the two-sided p-value is 2 * pt(-2.935, 8): p-value = .0189 < .05.
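A minimal check in R (assuming the x and y above); cor.test() carries out this same t-test directly:

> r = cor(x, y); n = 10
> t = r * sqrt(n - 2) / sqrt(1 - r^2)    # -2.935
> 2 * pt(-abs(t), n - 2)                 # two-sided p-value: 0.018857
> cor.test(x, y)                         # reports t = -2.935, df = 8, p-value = 0.01886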

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES … If such a linear association between X and Y exists, then it follows that for some intercept β0 and slope β1 we have the population model "Response = Model + Error":

Y = β0 + β1 X + ε

Find estimates b0 and b1 (for β0 and β1) giving the "best" line ŷ = b0 + b1 x … best in what sense??? The "Least Squares Regression Line" is the one that minimizes the sum of the squared residuals, Σ (yi – ŷi)². Calculus gives the solution

b1 = sxy / sx² and b0 = ȳ – b1 x̄.  (Check: the fitted line passes through (x̄, ȳ). ✓)
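A short sketch in R (assuming the x and y above): compute the least squares estimates from the sample statistics and compare with lm().

> b1 = cov(x, y) / var(x)        # slope: sxy / sx^2 = -0.8772
> b0 = mean(y) - b1 * mean(x)    # intercept: ybar - b1 * xbar = 18.2639
> coef(lm(y ~ x))                # returns the same estimates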

EXERCISE ~ SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES. For the predictor values and observed responses

X: 1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
Y: 13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0

use the fitted line ŷ = 18.2639 – 0.8772 x to compute the fitted responses ŷ1, …, ŷ10 and the residuals yi – ŷi. A worked sketch in R follows below.
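A sketch of the exercise in R (assuming the x, y, and fitted line above):

> lsreg = lm(y ~ x)
> yhat = fitted(lsreg)       # fitted responses b0 + b1 * xi
> e = residuals(lsreg)       # residuals yi - yhat_i
> round(cbind(x, y, yhat, e), 3)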

Testing for linear association between two numerical population variables X and Y … Now that we have these estimates, we can conduct HYPOTHESIS TESTING on the Linear Regression Coefficients β0 and β1 of the population model "Response = Model + Error", Y = β0 + β1 X + ε. In particular, for H0: β1 = 0 (no linear association), what is the test statistic for the p-value?

Test statistic for the p-value: t = b1 / SE(b1) ~ t(n – 2) under H0: β1 = 0, where SE(b1) = s / √Sxx, s² = SSErr / (n – 2), and Sxx = Σ (xi – x̄)². Here t = -0.8772 / 0.2989 = -2.935, the same t-score as the test of H0: ρ = 0! p-value = .0189.
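A hand computation in R (assuming x, y, lsreg above) that reproduces the standard error and t-score reported by summary(lsreg):

> s = sqrt(sum(residuals(lsreg)^2) / 8)    # residual standard error: 4.869
> Sxx = sum((x - mean(x))^2)               # 265.405
> se.b1 = s / sqrt(Sxx)                    # 0.2989
> coef(lsreg)[["x"]] / se.b1               # t = -2.935
> 2 * pt(-2.935, 8)                        # p-value = 0.018857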

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)    # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM??? Because this second method generalizes…

Recall the ANOVA Table for comparing k treatment groups:

Source      df      SS      MS                F-ratio         p-value
Treatment   k – 1   SSTrt   SSTrt / (k – 1)   MSTrt / MSErr   P(F ≥ F-ratio)
Error       n – k   SSErr   SSErr / (n – k)
Total       n – 1   SSTot   –

For simple linear regression (k = 2 model parameters, so F has k – 1 = 1 and n – k = 8 degrees of freedom), the same table applies with "Treatment" replaced by "Regression". For our data, df = 1 for Regression, n – 2 = 8 for Error, and n – 1 = 9 for Total; the remaining entries are to be determined:

Source       df   SS   MS   F-ratio   p-value
Regression    1    ?    ?      ?         ?
Error         8    ?    ?
Total         9    ?    –

From the sample statistics, define the total sum of squares SSTot = Σ (yi – ȳ)² = (n – 1) sy². SSTot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

Likewise, SSReg = Σ (ŷi – ȳ)² is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

And SSErr = Σ (yi – ŷi)² is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).
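A sketch in R (assuming x, y, lsreg above) computing the three sums of squares and verifying the decomposition:

> SSTot = sum((y - mean(y))^2)               # 393.856 = 9 * var(y)
> SSReg = sum((fitted(lsreg) - mean(y))^2)   # 204.200
> SSErr = sum(residuals(lsreg)^2)            # 189.656
> SSReg + SSErr                              # 393.856 = SSTot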

EXERCISE (continued) ~ For our data, SSReg = 204.2, SSErr = 189.656 (the minimum value attainable over all candidate lines), and SSTot = 9 (43.76178) = 393.856. Note the fundamental identity: SSTot = SSReg + SSErr.

ANOVA Table (general form for regression):

Source       df   SS        MS      F-ratio                           p-value
Regression    1   204.200   MSReg   MSReg / MSErr ~ F(k – 1, n – k)   0 < p < 1
Error         8   189.656   MSErr
Total         9   393.856   –

ANOVA Table (completed):

Source       df   SS        MS        F-ratio   p-value
Regression    1   204.200   204.200   8.61349   0.018857
Error         8   189.656    23.707
Total         9   393.856    –

Same as before!

This agrees with the ANOVA output in R:

> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707

Coefficient of Determination: r² = SSReg / SSTot = 204.200 / 393.856 = 0.5185. The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Moreover, r² is exactly the square of the linear correlation coefficient: cor(x, y)² = (-0.7200451)² = 0.5185. It is also the "Multiple R-squared: 0.5185" reported by summary(lsreg).
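A quick verification in R (assuming the quantities above):

> SSReg / SSTot              # 0.5185 (51.85%)
> cor(x, y)^2                # 0.5184649, same to rounding
> summary(lsreg)$r.squared   # same: 0.5184649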

Summary of Linear Correlation and Simple Linear Regression. Given paired sample data (x1, y1), …, (xn, yn) on numerical variables X and Y:

• Sample means x̄, ȳ; variances sx², sy²; covariance sxy. [Scatterplot: JAMA. 2003;290:1486-1493]
• Linear Correlation Coefficient r = sxy / (sx sy), with –1 ≤ r ≤ +1; measures the strength of linear association.
• Least Squares Regression Line ŷ = b0 + b1 x; it minimizes SSErr = Σ (yi – ŷi)² = SSTot – SSReg (ANOVA).
• Coefficient of Determination r² = SSReg / SSTot: the proportion of total variability modeled by the regression line's variability.
• All point estimates can be upgraded to 95% confidence intervals for hypothesis testing, etc.; plotted over all x, the interval for the mean response traces out upper and lower 95% confidence bands around the fitted line (see notes for "95% prediction intervals").

Recall ~ Analysis of Variance (ANOVA): k ≥ 2 independent, normally-distributed "treatment groups" with equal variances, H0: μ1 = μ2 = ⋯ = μk. What are the MODEL ASSUMPTIONS behind simple linear regression, and how do we check them? → "Regression Diagnostics"

Each fitted value ŷ may be viewed either as a point estimate of the mean response μY|x (which can be extended to a confidence interval), or as a prediction of an individual response Yi (which can be extended to a wider prediction interval).
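A sketch in R (assuming lsreg above; the value x* = 10 is hypothetical, chosen only for illustration): predict() produces both kinds of interval.

> new = data.frame(x = 10)
> predict(lsreg, new, interval = "confidence")   # 95% CI for the mean response at x = 10
> predict(lsreg, new, interval = "prediction")   # wider 95% PI for an individual future Y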

Model assumptions on the error term: ε1, …, εn are independent and normally distributed with mean 0 and equal variance σ², i.e., εi ~ N(0, σ²) and independent.

Residual plot: we want to see a random scatter of residuals, evenly distributed about 0, consistent with bell curves having constant variance; trends, curvature, or a funnel shape signal violated assumptions.
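A minimal diagnostic sketch in R (assuming lsreg above):

> plot(fitted(lsreg), residuals(lsreg), pch = 19)      # residuals vs. fitted values
> abline(h = 0, lty = 2)                               # want random scatter about 0
> qqnorm(residuals(lsreg)); qqline(residuals(lsreg))   # check normality of the errors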

When the assumptions fail, extensions of the basic model are available (see the sketch below):
• Errors are autocorrelated → may need to use specialized "time series" methods.
• Non-constant variance → "variance stabilizing" transformation formulas, or "Weighted" Least Squares (WLS).
• Curvature in the trend → "Polynomial Regression", Model: Y = β0 + β1 X + β2 X² + … + ε (but still considered to be linear regression, because it is linear in the beta coefficients).
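A sketch in R (assuming x and y above; the quadratic term and the weights are purely illustrative):

> quad = lm(y ~ x + I(x^2))       # polynomial regression: still linear in the betas
> w = 1 / x                       # hypothetical weights, e.g., variance proportional to x
> wls = lm(y ~ x, weights = w)    # weighted least squares fit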