Chapter 8: Simple Linear Regression
Yang Zhenlin

Learning Objectives
Describing the Relationship between Two Variables -- Scatter plot -- Numerical measures
Simple Linear Regression Model
Least Squares Method for Model Estimation
A Measure of Goodness of Fit: R-Square
Inference about the Regression Coefficients
Predictions -- Predicting the value of a future observation -- Predicting the mean of future observations

Introduction
We are interested in the relationship between two numerical variables X and Y. One of these variables, say X, is known in advance and is called the explanatory variable, or independent variable. The other variable, Y, is a random variable whose values, or general random behavior, are of interest; Y is therefore called the response variable, or dependent variable. If there is a strong relationship between X and Y, one can predict a future value of Y from the known future value of X through such a "relationship". To study the relation, n pairs of observations on (X, Y) are collected, denoted $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. The least squares method helps find such a relation.

Describing the Relationship
Example 8.1. Prices of used cars and the odometer readings. A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected, and the data are recorded. Construct a scatter plot of the data.
Scatter diagram: a plot of the pairs of observed values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ of variables X and Y. It is a very effective graphical tool for "revealing" the relationship between variables.
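As a quick illustration of how such a scatter diagram can be produced, here is a minimal sketch using matplotlib; the file name used_cars.csv and the column names Odometer and Price are assumptions for illustration, not part of the example data set.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; the actual Example 8.1 data are not reproduced here.
cars = pd.read_csv("used_cars.csv")

plt.scatter(cars["Odometer"], cars["Price"], s=15)
plt.xlabel("Odometer reading (miles)")
plt.ylabel("Selling price ($)")
plt.title("Scatter diagram of price vs. odometer reading")
plt.show()
```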

The plot indeed shows a negative linear relation between the price and the odometer reading.

Describing the Relationship
Besides the graphical display of the data, some numerical measures, such as the sample covariance and the sample coefficient of correlation, can be used to measure the direction and strength of the linear relationship between two variables.
Sample means: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
Sample variances: $s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$, $s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
Sample covariance: $s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
Sample correlation coefficient: $r = \dfrac{s_{XY}}{s_X s_Y}$
Together, $\bar{X}$, $\bar{Y}$, $s_X^2$, $s_Y^2$ and $s_{XY}$ are called the "five statistics summary" of the data.
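A minimal sketch of these measures in code (NumPy assumed; the function name is illustrative only):

```python
import numpy as np

def five_statistics_summary(x, y):
    """Sample means, variances, covariance and correlation (n - 1 divisors)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    s2_x = x.var(ddof=1)                 # sample variance of X
    s2_y = y.var(ddof=1)                 # sample variance of Y
    s_xy = np.cov(x, y, ddof=1)[0, 1]    # sample covariance
    r = s_xy / np.sqrt(s2_x * s2_y)      # sample correlation coefficient
    return xbar, ybar, s2_x, s2_y, s_xy, r
```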

Describing the Relationship
Example 8.2. Continuing with Example 8.1, find the five statistics summary and comment on the linear relationship between price and odometer reading.
Solution: As r = …, there exists a strong negative linear relation …

Describing the Relationship
Cov(X, Y) > 0: strong positive linear relationship; the scatter diagram shows a clear upward trend.
Cov(X, Y) < 0: strong negative linear relationship; the scatter diagram shows a clear downward trend.
Cov(X, Y) = 0: no linear relationship; the scatter diagram shows either no pattern, or a non-linear pattern.
The sample coefficient of correlation r always lies between $-1$ and $+1$: r near $+1$ indicates a strong positive linear relationship, r near $-1$ a strong negative one, and r near 0 no linear relationship.

Simple Linear Regression Model
The simple linear regression model takes the form
$$Y = \beta_0 + \beta_1 X + \varepsilon,$$
where Y is the dependent variable, X is the independent variable, $\beta_0$ is the y-intercept, $\beta_1$ is the slope of the line (rise over run), and $\varepsilon$ is the error variable. $\beta_0$ and $\beta_1$ are unknown population parameters and therefore need to be estimated from the data.
The scatter diagram in Example 8.1 shows a general trend that as the odometer reading increases, the price of the used car decreases; but the relation is not deterministic, as cars with the same odometer reading can have different prices. Thus, price is also altered by some unknown random errors!
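As an illustration of the model, the sketch below simulates data from $Y = \beta_0 + \beta_1 X + \varepsilon$; the parameter values are purely hypothetical and are not the estimates from Example 8.1.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Purely hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma_eps = 17000.0, -0.06, 300.0

n = 100
x = rng.uniform(15000, 50000, size=n)      # odometer readings (miles)
eps = rng.normal(0.0, sigma_eps, size=n)   # error variable with mean 0
y = beta0 + beta1 * x + eps                # simple linear regression model
```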

Simple Linear Regression Model
To learn this theoretical relationship, and in particular to estimate the parameters $\beta_0$ and $\beta_1$, a random sample of n experimental units is selected, and the values of (X, Y) for each unit are observed, giving $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. These n pairs of observations satisfy
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$
As Y is a random variable, so must be $\varepsilon$. Due to the random sampling mechanism, the $\{Y_i\}$ must be independent, and so are the $\{\varepsilon_i\}$. Further, it is reasonable to assume that $E(\varepsilon_i) = 0$, $i = 1, 2, \ldots, n$, for if the means were not zero, the non-zero constant could be absorbed into $\beta_0$. Thus,
$$E(Y_i) = \beta_0 + \beta_1 X_i.$$

Least Squares Estimation
Based on the observed data, we are seeking a line that best fits the data when two variables are related to one another. We define the "best fit line" as the line for which the sum of squared differences between it and the data points is minimized. Different lines generate different errors, and thus different sums of squared errors. There is a line that minimizes the sum of squared errors, and in this sense it is the best line.

Least Squares Estimation
Let $\hat{y} = b_0 + b_1 x$ be a fitted line. Finding the best line that minimizes the sum of squared errors is equivalent to finding the intercept $b_0$ and the slope $b_1$ that minimize the sum of squared differences between the actual Y value of point i, $y_i$, and the value of point i calculated from the equation, $\hat{y}_i = b_0 + b_1 x_i$. That is, to minimize
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$

Least Squares Estimation
Taking partial derivatives and setting them to zero,
$$\frac{\partial\, \text{SSE}}{\partial b_0} = -2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0, \qquad \frac{\partial\, \text{SSE}}{\partial b_1} = -2\sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0,$$
leads to the normal equations
$$\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2.$$
Substituting $b_0 = \bar{y} - b_1 \bar{x}$ from the first equation into the second and solving for $b_1$:

Least Squares Estimation
And the solutions:
$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{XY}}{s_X^2}, \qquad b_0 = \bar{y} - b_1 \bar{x},$$
which gives the least squares equation:
$$\hat{y} = b_0 + b_1 x.$$
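A minimal sketch of these formulas in code (NumPy assumed; the function name is illustrative only):

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) computed from the least squares formulas above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
    b0 = ybar - b1 * xbar                                           # intercept
    return b0, b1

# Quick check against NumPy's built-in degree-1 polynomial fit:
# b1_np, b0_np = np.polyfit(x, y, deg=1)
```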

Least Squares Estimation
Example 8.3. Continuing with Example 8.2, find the least squares line relating the odometer reading to the price of the used car.
Solution: The estimated coefficients are $b_0 = \ldots$ and $b_1 = \ldots$ The least squares equation is $\hat{y} = \ldots$
Interpretation of $b_1 = \ldots$: for one additional mile on the odometer, it is estimated that the average price of the cars decreases by $…

Least Squares Estimation
Interpreting the Linear Regression Equation
The slope: this is the estimated slope of the line. For each additional mile on the odometer, the price decreases by an average of $…
The intercept: it is estimated as $… Do not interpret the intercept as the "price of cars that have not been driven"; there are no data with odometer readings near zero, so the line should not be extrapolated there.

Least Squares Estimation
Properties of the Least Squares Estimators. For the simple linear regression model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,$$
where the $\{\varepsilon_i\}$ are independent with $E(\varepsilon_i) = 0$, the least squares estimators $b_0$ and $b_1$ are unbiased estimators of $\beta_0$ and $\beta_1$. To see this, note that $E(Y_i) = \beta_0 + \beta_1 X_i$. More on white board in class.
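For reference, a sketch of the unbiasedness argument for $b_1$ (the case of $b_0$ follows similarly); this reconstruction is not taken from the slides:

```latex
\begin{align*}
b_1 &= \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}
     = \sum_{i=1}^{n} w_i Y_i,
     \qquad w_i = \frac{X_i-\bar{X}}{\sum_{j=1}^{n}(X_j-\bar{X})^2}, \\
E(b_1) &= \sum_{i=1}^{n} w_i\, E(Y_i)
        = \sum_{i=1}^{n} w_i (\beta_0 + \beta_1 X_i)
        = \beta_0 \sum_{i=1}^{n} w_i + \beta_1 \sum_{i=1}^{n} w_i X_i
        = \beta_1,
\end{align*}
% using \sum_i w_i = 0 and \sum_i w_i X_i = 1.
```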

Measure of Goodness of Fit
Sum of Squares due to Errors (SSE). This is the sum of squared differences between the points and the regression line. It can serve as a measure of how well the line fits the data. SSE is defined by
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
A shortcut formula:
$$\text{SSE} = (n-1)\left(s_Y^2 - \frac{s_{XY}^2}{s_X^2}\right).$$

Measure of Goodness of Fit
Coefficient of Determination $R^2$. It is a measure of the strength of the linear relationship between the response Y and the explanatory variable(s) X, and is defined as
$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} \qquad \text{or, with a single predictor,} \qquad R^2 = \frac{s_{XY}^2}{s_X^2\, s_Y^2}.$$
The first definition is a general one and applies to linear regression models with multiple predictors. It simplifies to the second definition when there is only one predictor X. In the case of simple linear regression, $R^2$ is also the square of the sample correlation coefficient r.

Measure of Goodness of Fit
To understand the significance of the coefficient of determination, note that
$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{\text{SSR}} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{SSE}},$$
where SST is the total variation (sum of squares) in Y, SSR is the sum of squares due to regression, and SSE is the sum of squares due to error. It follows that
$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}.$$
$R^2$ measures the proportion of the variation in Y that is explained by the variation in X, or by the model. $R^2$ takes on any value between zero and one. $R^2 = 1$: perfect match between the line and the data points. $R^2 = 0$: there is no linear relationship between X and Y.
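A short sketch of these quantities in code (NumPy assumed; the function name is illustrative, and b0, b1 are the least squares estimates computed as above):

```python
import numpy as np

def goodness_of_fit(x, y, b0, b1):
    """Return SSE, SSR, SST and R^2 for a fitted line y_hat = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)          # unexplained variation
    ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the model
    sst = np.sum((y - y.mean()) ** 2)       # total variation (= SSR + SSE)
    r2 = 1.0 - sse / sst
    return sse, ssr, sst, r2
```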

Inferences for the Model
Error Variable: Required Conditions. The error $\varepsilon$ is a critical part of the regression model. For formal statistical inferences for the model, four requirements involving the distribution of $\varepsilon$ must be satisfied:
-- The probability distribution of $\varepsilon$ is normal.
-- The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
-- The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of X.
-- The errors associated with different observations on Y are all independent.
It follows that the response Y is normally distributed with mean $E(Y) = \beta_0 + \beta_1 X$ and standard deviation $\sigma_\varepsilon$, and that the random sample of n observations $\{Y_1, Y_2, \ldots, Y_n\}$ made on Y are independent.

Inferences for the Model
(Figure: the line $E(y|x) = \beta_0 + \beta_1 x$ evaluated at $x_1$, $x_2$, $x_3$, with a normal error distribution around each mean.) The standard deviation remains constant, but the mean value changes with x. Changing the X value increases (or decreases, if $\beta_1 < 0$) the mean of Y, but does not change the distributional shape of it.

Inferences for the Model
Estimate of the Error Standard Deviation $\sigma_\varepsilon$. The mean error is equal to zero. If $\sigma_\varepsilon$ is small, the errors tend to be close to zero (close to the mean error), and then the model fits the data well. Therefore, we can also use $\sigma_\varepsilon$ as a measure of the suitability of using a linear model. However, $\sigma_\varepsilon$ is unknown and has to be estimated. As SSE is the sum of squared errors, it leads naturally to an estimator
$$s_\varepsilon^2 = \frac{\text{SSE}}{n-2}, \qquad s_\varepsilon = \sqrt{\frac{\text{SSE}}{n-2}}.$$
It can be shown that $E(s_\varepsilon^2) = \sigma_\varepsilon^2$, i.e., $s_\varepsilon^2$ is an unbiased estimator of $\sigma_\varepsilon^2$.
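In code (an illustrative helper, NumPy assumed; b0 and b1 are the least squares estimates):

```python
import numpy as np

def standard_error_of_estimate(x, y, b0, b1):
    """Return s_eps = sqrt(SSE / (n - 2)), the estimate of sigma_eps."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (b0 + b1 * x)
    return np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))
```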

Inferences for the Model
Example 8.4. Calculate the estimate of the error standard deviation and the coefficient of determination for Example 8.1, and describe what they tell you about the model fit.
Solution: $s_\varepsilon = \ldots$ and $R^2 = \ldots$, using the SSE and the five statistics summary calculated earlier. It is hard to assess the model based on $s_\varepsilon$ alone, even when compared with the mean value of Y.

Inferences for the Model
65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
Some Theoretical Results. If the errors $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}$ are independent and identically distributed as $N(0, \sigma_\varepsilon^2)$, then we have (a) … (b) … (c) …

Inferences for the Model
Testing the Slope. We can draw inference about $\beta_1$ from $b_1$ by testing
$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0 \ (\text{or} > 0, \text{or} < 0).$$
The implication of this test is clear: if $H_0$ is rejected, one can conclude that there is sufficient evidence to show that Y and X are linearly related; otherwise, they are not. The same question can be answered by constructing a confidence interval for $\beta_1$. From the theoretical result given earlier and the results presented in Chapter 5b regarding the t-distribution, it is immediate to see that
$$t = \frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2},$$
a statistic for testing the slope parameter or constructing a confidence interval for it.

Inferences for the Model
A $100(1-\alpha)\%$ confidence interval for $\beta_1$ is given as
$$b_1 \pm t_{\alpha/2,\, n-2}\; s_{b_1}.$$
Apparently, the quantity $s_{b_1} = s_\varepsilon / \sqrt{(n-1)s_X^2}$ is an estimate of the standard deviation of $b_1$, and is thus referred to as the estimated standard error of $b_1$.
Inference concerning the intercept parameter $\beta_0$ can be carried out in a similar manner, but it is not as interesting and important as for the slope parameter $\beta_1$.
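A hedged sketch of the slope test and confidence interval in code (NumPy and SciPy assumed; the function name is illustrative only):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, alpha=0.05):
    """t test of H0: beta1 = 0 (two-sided) and a 100(1 - alpha)% CI for beta1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))        # standard error of estimate
    s_b1 = s_eps / np.sqrt(np.sum((x - x.mean()) ** 2))  # estimated standard error of b1
    t_stat = b1 / s_b1                                   # test statistic under H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)        # 100(1 - alpha)% CI for beta1
    return t_stat, p_value, ci
```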

Inferences for the Model
Example 8.5. Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 8.4. Use $\alpha$ = 5%.
Solution: $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$. With $\nu = n - 2 = 98$ degrees of freedom, the rejection region is $t > t_{98}(.025)$ or $t < -t_{98}(.025)$, where $t_{98}(.025) \approx 1.984$. As $t = \ldots < -1.984$, reject $H_0$ at the 5% level of significance. Yes, there is enough evidence to …
A 95% CI for $\beta_1$: …

Predictions
Before using the regression model, we need to assess how well it fits the data. If we are satisfied with how well the model fits the data, we can use it to predict a future value of $Y_0$, or the mean of $Y_0$, based on the future value $X_0$. This is in fact an important application of a regression model.
The simple linear regression model can be easily extended to include more predictor variables; e.g., in the examples presented, the price of a used car is affected not only by its odometer reading, but also by its 'age', color, etc. These constitute important topics in an advanced course: Applied Regression Methods (STAT312).
The end. Thank you.
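The slides leave the prediction formulas to the course; as a hedged sketch based on the standard results (not taken from the slides), the point prediction is $\hat{y}_0 = b_0 + b_1 x_0$, a $100(1-\alpha)\%$ prediction interval for a single future $Y_0$ is $\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_\varepsilon \sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{(n-1)s_X^2}}$, and the interval for the mean of future observations drops the leading 1 under the square root. In code (NumPy and SciPy assumed; the function name is illustrative only):

```python
import numpy as np
from scipy import stats

def predict_at(x, y, x0, alpha=0.05, mean_response=False):
    """Point prediction at x0 plus a prediction interval (or a CI for the mean response)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    s_eps = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    y0_hat = b0 + b1 * x0                                         # point prediction
    leverage = 1.0 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    extra = 0.0 if mean_response else 1.0                         # extra 1 for a single future Y0
    half = stats.t.ppf(1 - alpha / 2, df=n - 2) * s_eps * np.sqrt(extra + leverage)
    return y0_hat, (y0_hat - half, y0_hat + half)
```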