Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 11: Simple Linear Regression

Similar presentations


Presentation on theme: "Chapter 11: Simple Linear Regression"— Presentation transcript:

1 Chapter 11: Simple Linear Regression

2 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
Where We’ve Been Presented methods for estimating and testing population parameters for a single sample Extended those methods to allow for a comparison of population parameters for multiple samples McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

3 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
Where We’re Going Introduce the straight-line linear regression model as a means of relating one quantitative variable to another quantitative variable Introduce the correlation coefficient as a means of relating one quantitative variable to another quantitative variable Assess how well the simple linear regression model fits the sample data Use the simple linear regression model to predict the value of one variable given the value of another variable McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

4 11.1: Probabilistic Models
There may be a deterministic reality connecting two variables, y and x But we may not know exactly what that reality is, or there may be an imprecise, or random, connection between the variables. The unknown/unknowable influence is referred to as the random error So our probabilistic models refer to a specific connection between variables, as well as influences we can’t specify exactly in each case: y = f(x) + random error McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

5 11.1: Probabilistic Models
The relationship between home runs and runs in baseball seems at first glance to be deterministic … McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

6 11.1: Probabilistic Models
But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

7 11.1: Probabilistic Models
General Form of Probabilistic Models y = Deterministic component + Random error where y is the variable of interest, and the mean value of the random error is assumed to be 0: E(y) = Deterministic component. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

8 11.1: Probabilistic Models
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

9 11.1: Probabilistic Models
The goal of regression analysis is to find the straight line that comes closest to all of the points in the scatter plot simultaneously. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

10 11.1: Probabilistic Models
A First-Order Probabilistic Model y = 0 + 1x +  where y = dependent variable x = independent variable 0 + 1x = E(y) = deterministic component  = random error component 0 = y – intercept 1 = slope of the line McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11 11.1: Probabilistic Models
0, the y – intercept, and 1, the slope of the line, are population parameters, and invariably unknown. Regression analysis is designed to estimate these parameters. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

12 11.2: Fitting the Model: The Least Squares Approach
Step 1 Hypothesize the deterministic component of the probabilistic model E(y) = 0 + 1x Step 2 Use sample data to estimate the unknown parameters in the model McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

13 11.2: Fitting the Model: The Least Squares Approach
Values on the line are the predicted values of total offerings given the average offering. The distances between the scattered dots and the line are the errors of prediction. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

14 11.2: Fitting the Model: The Least Squares Approach
Values on the line are the predicted values of total offerings given the average offering. The line’s estimated parameters are the values that minimize the sum of the squared errors of prediction, and the method of finding those values is called the method of least squares. The distances between the scattered dots and the line are the errors of prediction. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

15 11.2: Fitting the Model: The Least Squares Approach
Estimates: Deviation: SSE: McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

16 11.2: Fitting the Model: The Least Squares Approach
The least squares line is the line that has the following two properties: The sum of the errors (SE) equals 0. The sum of squared errors (SSE) is smaller than that for any other straight-line model. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

17 11.2: Fitting the Model: The Least Squares Approach
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

18 11.2: Fitting the Model: The Least Squares Approach
Can home runs be used to predict errors? Is there a relationship between the number of home runs a team hits and the quality of its fielding? McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

19 11.2: Fitting the Model: The Least Squares Approach
Home Runs (x) Errors (y) xi2 xiyi 158 126 24964 19908 155 87 24025 13485 139 65 19321 9035 191 95 36481 18145 124 119 15625 14756 xi = 767 yi = 492 xi2 = xiyi = 75329 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

20 11.2: Fitting the Model: The Least Squares Approach
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

21 11.2: Fitting the Model: The Least Squares Approach
These results suggest that teams which hit more home runs are (slightly) better fielders (maybe not what we expected). There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

22 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.3: Model Assumptions Assumptions 1. The mean of the probability distribution of  is 0. 2. The variance,  2, of the probability distribution of  is constant. 3. The probability distribution of  is normal. 4. The values of  associated with any two values of y are independent. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

23 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.3: Model Assumptions The variance,  2, is used in every test statistic and confidence interval used to evaluate the model. Invariably,  2 is unknown and must be estimated. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

24 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.3: Model Assumptions McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

25 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 Note: There may be many different patterns in the scatter plot when there is no linear relationship. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

26 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 A critical step in the evaluation of the model is to test whether 1 = 0 y y y x x x Positive Relationship No Relationship Negative Relationship 1 > 1 = 1 < 0 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

27 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 H0 : 1 = 0 Ha : 1 ≠ 0 y y y x x x Positive Relationship No Relationship Negative Relationship 1 > 1 = 1 < 0 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

28 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 The four assumptions described above produce a normal sampling distribution for the slope estimate: called the estimated standard error of the least squares slope estimate. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

29 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

30 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 Home Runs (x) Errors (y) xi -  (xi - )2  = E(Errors|HRs) yi--  (yi-- )2 158 126 4.6 21.16 98.14 27.86 776.4 155 87 1.6 2.56 98.31 -11.31 127.9 139 65 -14.4 207.4 99.23 -34.23 1171 191 95 37.6 1414 96.25 -1.245 1.55 124 119 -29.4 864.4 100.1 18.92 357.8 xi = 767 yi = 492 SSxx= 2509 SSE = 2435  = 153.4 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

31 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 Since the t-value does not lead to rejection of the null hypothesis, we can conclude that A different set of data may yield different results. There is a more complicated relationship. There is no relationship (non-rejection does not lead to this conclusion automatically). McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

32 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 Interpreting p-Values for  Software packages report two-tailed p-values. To conduct one-tailed tests of hypotheses, the reported p-values must be adjusted: McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

33 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 A Confidence Interval on 1 where the estimated standard error is and t/2 is based on (n – 2) degrees of freedom McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

34 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 In the home runs and errors example, the estimated 1 was , and the estimated standard error was With 3 degrees of freedom, t = The confidence interval is, therefore, which includes 0, so there may be no relationship between the two variables. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

35 11.5: The Coefficients of Correlation and Determination
The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables. It is computed as follows: McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

36 11.5: The Coefficients of Correlation and Determination
Positive linear relationship No linear relationship Negative linear relationship y y y x x x r → r  r → -1 Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

37 11.5: The Coefficients of Correlation and Determination
In the example about homeruns and errors, SSxy= and SSxx= 2509. SSyy is computed as so McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

38 11.5: The Coefficients of Correlation and Determination
An r value that close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on  1. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

39 11.5: The Coefficients of Correlation and Determination
The coefficient of determination, r2, represents the proportion of the total sample variability around the mean of y that is explained by the linear relationship between x and y. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

40 11.5: The Coefficients of Correlation and Determination
High r 2 x provides important information about y Predictions are more accurate based on the model Low r 2 Knowing values of x does not substantially improve predictions on y There may be no relationship between x and y, or it may be more subtle than a linear relationship Predict values of y with the mean of y if no other information is available Predict values of y|x based on a hypothesized linear relationship Evaluate the power of x to predict values of y with the coefficient of determination McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

41 11.6: Using the Model for Estimation and Prediction
Statistical Inference based on the linear regression model Estimate the mean of y for a specific value of x: E(y)|x (over many experiments with this x-value) Estimate an individual value of y for a given x value (for a single experiment with this value of x) McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

42 11.6: Using the Model for Estimation and Prediction
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

43 11.6: Using the Model for Estimation and Prediction
Based on our model results, a team that hits 140 home runs is expected to make 99.9 errors: McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

44 11.6: Using the Model for Estimation and Prediction
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

45 11.6: Using the Model for Estimation and Prediction
A 95% Prediction Interval for an Individual Team’s Errors McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

46 11.6: Using the Model for Estimation and Prediction
Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because of the extra source of error. Error in E(y|xp) Error in predicting a mean value of y|xp Error in E(y|xp) Sampling error from the y population Error in predicting a specific value of y|xp McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

47 11.6: Using the Model for Estimation and Prediction
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

48 11.6: Using the Model for Estimation and Prediction
Estimating y beyond the range of values associated with the observed values of x can lead to large prediction errors. Beyond the range of observed x values, the relationship may look very different. Estimated relationship True relationship Xi Xj Range of observed values of x McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

49 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Step 1 How does the proximity of a fire house (x) affect the damages (y) from a fire? y = f(x) y = 0 +1x +  McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

50 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

51 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

52 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Step 2 The data (found in Table 11.7) produce the following estimates (in thousands of dollars): The estimated damages equal $10,280 + $4910 for each mile from the fire station, or McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

53 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Step 3 The estimate of the standard deviation, , of  is s = Most of the observed fire damages will be within 2s  4.64 thousand dollars of the predicted value McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

54 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Step 4 Test that the true slope is 0 SAS automatically performs a two-tailed test, with a reported p-value < The one-tailed p-value is < , which provides strong evidence to reject the null. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

55 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Step 4 A 95% confidence interval on 1 from the SAS output is ≤ 1 ≤ The coefficient of determination, r 2, is The coefficient of correlation, r, is McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

56 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates. We’re 95% sure the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

57 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.7: A Complete Example Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates. We’re 95% sure the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667. Since the x-values in our sample range from .7 to 6.1, predictions about y for x-values beyond this range will be unreliable. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression


Download ppt "Chapter 11: Simple Linear Regression"

Similar presentations


Ads by Google