1
CHAPTER 29: Multiple Regression*
Basic Practice of Statistics, 7th Edition
Lecture PowerPoint Slides
2
In Chapter 29, We Cover …
Parallel regression lines
Estimating parameters
Using technology
Inference for multiple regression
Interaction
The general multiple linear regression model
The woes of regression coefficients
Inference for regression parameters
Checking the conditions for inference
3
Introduction
When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we fit a regression line to the data to describe the relationship.
Previously, we did regression with just one explanatory variable; we will now call this simple linear regression to remind us that it is a special case.
In some cases, other explanatory variables might improve our understanding of the response y and help us to better predict y.
We now explore the more general case of multiple regression, which allows several explanatory variables to combine in explaining a response variable.
4
Parallel Regression Lines
Consider a scatterplot that shows two parallel straight lines linking y to x1 (one line, μy = β0 + β1x1, for each of two groups).
An indicator variable (x2) can be added to the regression equation to denote the two categories:
μy = β0 + β1x1 + β2x2
Indicator Variable
An indicator variable places individuals into one of two categories, usually coded by the values 0 and 1.
5
Parallel Regression Lines
Model with indicator variable (x2): μy = β0 + β1x1 + β2x2
when x2 = 0: μy = β0 + β1x1
when x2 = 1: μy = β0 + β1x1 + β2 = (β0 + β2) + β1x1
Note that the slopes (β1) are the same, but the intercepts may be different (the difference is determined by β2).
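The slides do not tie this model to any particular software; as an illustration, here is a minimal Python sketch (numpy and statsmodels, with made-up data) of fitting the parallel-lines model and reading off the two intercepts and the common slope. All variable names and numbers below are hypothetical, not from the text's example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: two groups sharing one slope but with different intercepts
x1 = rng.uniform(1, 26, size=60)           # quantitative explanatory variable
x2 = rng.integers(0, 2, size=60)           # 0/1 indicator for the two groups
y = 80 - 0.7 * x1 + 15 * x2 + rng.normal(0, 3, size=60)

X = sm.add_constant(np.column_stack([x1, x2]))  # columns: const, x1, x2
fit = sm.OLS(y, X).fit()

b0, b1, b2 = fit.params
print(f"group x2=0: y-hat = {b0:.3f} + ({b1:.3f}) x1")
print(f"group x2=1: y-hat = {b0 + b2:.3f} + ({b1:.3f}) x1")  # same slope b1
```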
6
Example: Percent of possible jurors reporting for jury duty versus reporting date (coded "1" for the first two weeks of the year up to "26" for the last two weeks of the year), for 1998 and 2000:
7
Example: Separate lines fit without using the indicator variable (one per year):
ŷ = 95.571 − 0.765x1
ŷ = 76.426 − 0.668x1
8
Example: Lines using the indicator variable to force equal slopes:
overall model: ŷ = 77.082 − 0.717x1 + 17.833x2
when x2 = 1: ŷ = 94.915 − 0.717x1
when x2 = 0: ŷ = 77.082 − 0.717x1
(The coefficient 17.833 on x2 is the difference 94.915 − 77.082 between the two intercepts.)
9
Estimating Parameters
How shall we estimate the β's in the model μy = β0 + β1x1 + β2x2?
The method of least squares obtains estimates of the βi's (denoted bi's) by choosing the values that minimize the sum of squared deviations in the y-direction:
∑(observed y − predicted y)² = ∑(y − ŷ)²
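As a sketch of what minimizing this sum means computationally: numpy's lstsq finds the b that minimizes ∑(y − ŷ)² for a given design matrix. The data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.integers(0, 2, n)
y = 3 + 2 * x1 - 1.5 * x2 + rng.normal(0, 1, n)  # hypothetical data

# Design matrix: a column of 1s (for b0), then x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)        # minimizes sum((y - X @ b)**2)

y_hat = X @ b
print("estimates b0, b1, b2:", b)
print("sum of squared residuals:", np.sum((y - y_hat) ** 2))
```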
10
Estimating Parameters
These differences between the actual y-values and the predicted y-values are called residuals; they estimate the "left-over variation" about the regression model.
The remaining parameter to estimate is σ, the standard deviation of the response variable y about the mean (assumed the same for all combinations of the x's).
The standard deviation s of the residuals is used to estimate σ; s is also called the regression standard error.
11
Estimating Parameters
regression standard error
The regression standard error for the multiple regression model ŷ = b0 + b1x1 + b2x2 is
s = √[ (1 / (n − 3)) ∑ residual² ] = √[ (1 / (n − 3)) ∑(y − ŷ)² ]
Use s to estimate the standard deviation σ of the responses about the mean given by the population regression model.
Here, note that we are estimating β0, β1, and β2; this makes our denominator (and the degrees of freedom for the regression standard error) n − 3. In general, the denominator is n − (the number of β parameters).
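A sketch of computing s directly from the residuals, using the n − 3 denominator appropriate for a model with three β parameters (the data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x1 = rng.uniform(0, 10, n)
x2 = rng.integers(0, 2, n)
y = 5 + 1.2 * x1 + 3 * x2 + rng.normal(0, 2, n)  # hypothetical data

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b

# Denominator is n minus the number of beta parameters (3 here)
s = np.sqrt(np.sum(residuals ** 2) / (n - 3))
print("regression standard error s:", s)
```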
12
Using Technology
Potential Jurors: example of output from using technology. The annotated software output shows:
parameter estimates
standard errors
ANOVA table
sum of squares (SS) due to the model
sum of squares due to error
total SS = model SS + error SS
squared multiple correlation coefficient (R²)
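The slide's annotated output comes from statistical software; a minimal statsmodels sketch produces the same pieces. The data below are synthetic stand-ins shaped like the jury example, not the actual jury data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1 = np.tile(np.arange(1, 27), 2)                   # reporting date 1..26, twice
x2 = np.repeat([0, 1], 26)                          # indicator for the two years
y = 77 - 0.7 * x1 + 18 * x2 + rng.normal(0, 5, 52)  # synthetic percentages

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(fit.params)         # parameter estimates b0, b1, b2
print(fit.bse)            # standard errors of the estimates
print(fit.ess, fit.ssr)   # model SS and error SS; total SS = ess + ssr
print(fit.rsquared)       # squared multiple correlation coefficient R^2
print(fit.summary())      # the full table, like the slide's software output
```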
13
R²
R² tells us what proportion of the variation in the response variable y is explained by using the set of explanatory variables in the multiple regression model.
squared multiple correlation coefficient
The squared multiple correlation coefficient (R²) is the square of the correlation coefficient between the observed responses y and the predicted responses ŷ; it is also equal to
R² = (variability explained by model) / (total variability in y) = (model sum of squares) / (total sum of squares)
R² is almost always given with a regression model to describe the fit of the model to the data.
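Both characterizations of R² in the definition can be checked numerically on any fitted model; a self-contained sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(0, 1, n)        # hypothetical data

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

total_ss = np.sum((y - y.mean()) ** 2)
error_ss = np.sum((y - y_hat) ** 2)
model_ss = total_ss - error_ss

print(model_ss / total_ss)                       # model SS / total SS
print(np.corrcoef(y, y_hat)[0, 1] ** 2)          # corr(y, y-hat)^2: same value
```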
14
Inference for Multiple Regression
Conditions:
Linear trend (the model is correct): scatterplots of y vs. each xi show linear patterns.
Normality: the residuals are symmetric about 0 and approximately Normal.
Constant variance (σ is the same for all values of the x's): a plot of residuals vs. ŷ shows an unstructured pattern with approximately equal spread in the y-direction.
Independence: observations are not dependent on previous observations; the residual plot shows no pattern based on the order of the observations.
15
Potential Jurors – Checking Conditions
Example: diagnostic plots for the potential jurors data (four panels): (a) linear trend, (b) Normality, (c) equal variance, (d) independence.
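A minimal matplotlib sketch of the same four diagnostic checks, using synthetic stand-in data rather than the slide's jury data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x1 = np.tile(np.arange(1, 27), 2)
x2 = np.repeat([0, 1], 26)
y = 77 - 0.7 * x1 + 18 * x2 + rng.normal(0, 5, 52)   # synthetic data

X = np.column_stack([np.ones(52), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].scatter(x1, y)                    # (a) linear trend: y vs. x1
ax[0, 1].hist(resid, bins=10)              # (b) Normality of the residuals
ax[1, 0].scatter(X @ b, resid)             # (c) equal variance: residuals vs. y-hat
ax[1, 1].plot(resid, marker="o")           # (d) independence: residuals in order
plt.show()
```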
16
Inference for Multiple Regression
For testing the null hypothesis that all of the regression coefficients (β's), except β0, are equal to zero:
F statistic for regression model
The analysis of variance F statistic for testing the null hypothesis that all of the regression coefficients (β's), except β0, are equal to zero has the form
F = (variation due to model) / (variation due to error) = (model mean square) / (error mean square)
17
Inference for Multiple Regression
If the overall F test is significant, then we may want to know which individual parameters are different from zero.
Individual t tests for coefficients
To test the null hypothesis that one of the β's in a specific regression model is zero, compute the t statistic:
t = (parameter estimate) / (standard error of estimate) = b / SE_b
If the conditions for inference are met, then the t distribution with n − 3 degrees of freedom can be used to compute confidence intervals and conduct hypothesis tests for β0, β1, and β2.
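A sketch of computing an individual t statistic and its two-sided P-value by hand with scipy, checked against what statsmodels reports (the data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 52
x1 = rng.uniform(1, 26, n)
x2 = rng.integers(0, 2, n)
y = 77 - 0.7 * x1 + 18 * x2 + rng.normal(0, 5, n)   # hypothetical data

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

b1, se_b1 = fit.params[1], fit.bse[1]
t = b1 / se_b1                                      # t statistic for beta1
p = 2 * stats.t.sf(abs(t), df=n - 3)                # two-sided P-value, n - 3 df
print(t, p)                      # matches fit.tvalues[1], fit.pvalues[1]
```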
18
Interaction
Parallel linear patterns for two categories are somewhat rare; it is more common to see two linear patterns that are not parallel.
An interaction term (x1x2) can be added to the regression equation to allow for unequal slopes:
μy = β0 + β1x1 + β2x2 + β3x1x2
Interaction means that the relationship between the mean response and one explanatory variable x1 changes when we change the value of the other explanatory variable x2.
19
Interaction
Model with interaction term (x1x2):
μy = β0 + β1x1 + β2x2 + β3x1x2
when x2 = 0: μy = β0 + β1x1
when x2 = 1: μy = β0 + β1x1 + β2 + β3x1 = (β0 + β2) + (β1 + β3)x1
Note that, in addition to the intercepts being different, the slopes are now different as well (the difference is determined by β3).
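A sketch of fitting the interaction model with a statsmodels formula; the x1:x2 term lets the two groups have different slopes. Data and coefficient values below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 80
df = pd.DataFrame({"x1": rng.uniform(0, 10, n),
                   "x2": rng.integers(0, 2, n)})
# Hypothetical data in which the two groups have different slopes
df["y"] = (5 + 1.0 * df["x1"] + 2 * df["x2"]
           + 0.8 * df["x1"] * df["x2"] + rng.normal(0, 1, n))

fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
b0, b1, b2, b3 = fit.params                 # Intercept, x1, x2, x1:x2
print(f"x2 = 0 line: intercept {b0:.2f}, slope {b1:.2f}")
print(f"x2 = 1 line: intercept {b0 + b2:.2f}, slope {b1 + b3:.2f}")
```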
20
The General Multiple Linear Regression Model
THE MULTIPLE LINEAR REGRESSION MODEL
We have observations on n individuals. Each observation consists of values of p explanatory variables x1, x2, …, xp and a response variable y. Our goal is to study or predict the behavior of y given the values of the explanatory variables.
For any set of fixed values of the explanatory variables, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
The mean response μy has a linear relationship given by the population regression model
μy = β0 + β1x1 + β2x2 + ⋯ + βpxp
The βi's are unknown parameters.
The standard deviation of y (call it σ) is the same for all values of the explanatory variables. The value of σ is unknown.
This model has p + 2 parameters that we must estimate from data: the p + 1 coefficients β0, β1, …, βp, and the standard deviation σ.
21
The Woes of Regression Coefficients
When we start to explore models with several explanatory variables, we quickly meet the big new idea of multiple regression in practice: The relationship between the response y and any one explanatory variable can change greatly depending on what other explanatory variables are present in the model.
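A tiny synthetic illustration of this point: when x2 is strongly correlated with x1, the coefficient on x1 can change dramatically once x2 enters the model. All numbers here are made up for the demonstration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # x2 strongly correlated with x1
y = 1 + 2 * x2 + rng.normal(0, 1, n)       # y actually driven by x2

alone = sm.OLS(y, sm.add_constant(x1)).fit()
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("coefficient of x1, alone:  ", alone.params[1])  # large: x1 proxies for x2
print("coefficient of x1, with x2:", both.params[1])   # near zero
```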
22
Inference for Regression Parameters
ANOVA Table:
Source   df           Sum of squares   Mean square                         F
Model    p            model SS         model MS = model SS / p             model MS / error MS
Error    n − p − 1    error SS         error MS = error SS / (n − p − 1)
Total    n − 1        total SS
23
Inference for Regression Parameters
The first formal test in most multiple regression studies is the ANOVA F test. This test is used to check whether the complete set of explanatory variables is helpful in predicting the response variable.
analysis of variance F test
The analysis of variance F statistic for testing the null hypothesis that all of the regression coefficients (β's), except β0, are equal to zero has the form
F = (variation due to model) / (variation due to error)
P-values come from the F distribution with p and n − p − 1 degrees of freedom.
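A sketch of computing the ANOVA F statistic from the mean squares and finding its P-value from the F(p, n − p − 1) distribution with scipy (hypothetical data, p = 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 50, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.8 * x1 - 0.5 * x2 + rng.normal(0, 1, n)   # hypothetical data

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

model_ss = np.sum((y_hat - y.mean()) ** 2)
error_ss = np.sum((y - y_hat) ** 2)
F = (model_ss / p) / (error_ss / (n - p - 1))       # model MS / error MS
p_value = stats.f.sf(F, p, n - p - 1)
print(F, p_value)
```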
24
Inference for Regression Parameters
Remember that an individual t statistic assesses the contribution of its variable in the presence of the other variables in this specific model.
Confidence intervals and individual t tests for coefficients
A level C confidence interval for the regression coefficient β is b ± t* SE_b. The critical value t* is obtained from the t(n − p − 1) distribution.
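A sketch of the interval b ± t* SE_b computed by hand, checked against statsmodels' conf_int (the data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(10)
n, p = 40, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + x1 + 0.5 * x2 + rng.normal(0, 1, n)        # hypothetical data

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

b, se = fit.params[1], fit.bse[1]
t_star = stats.t.ppf(0.975, df=n - p - 1)          # 95% critical value
print("by hand:    ", (b - t_star * se, b + t_star * se))
print("statsmodels:", fit.conf_int(alpha=0.05)[1])  # the same interval
```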
25
Inference for Regression Parameters
Confidence intervals and individual t tests for coefficients
The t statistic for testing the null hypothesis that a regression coefficient β is equal to zero has the form
t = (parameter estimate) / (standard error of estimate) = b / SE_b
In terms of a random variable T having the t(n − p − 1) distribution, the P-value for a test of H0 against
Ha: β > 0 is P(T ≥ t)
Ha: β < 0 is P(T ≤ t)
Ha: β ≠ 0 is 2P(T ≥ |t|)
26
Inference for Regression Parameters
CONFIDENCE AND PREDICTION INTERVALS FOR MULTIPLE REGRESSION RESPONSE
A level C confidence interval for the mean response, μy, is ŷ ± t* SE_μ̂.
A level C prediction interval for a single response, y, is ŷ ± t* SE_ŷ.
In both intervals, t* is the critical value for the t(n − p − 1) density curve with area C between −t* and t*.
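A sketch using statsmodels' get_prediction, which returns both the confidence interval for the mean response and the prediction interval for a single response at new x-values (data and x-values are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 60
x1 = rng.uniform(0, 10, n)
x2 = rng.integers(0, 2, n)
y = 10 + 1.5 * x1 - 2 * x2 + rng.normal(0, 2, n)   # hypothetical data

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

x_new = np.array([[1.0, 5.0, 1.0]])                # constant, x1 = 5, x2 = 1
pred = fit.get_prediction(x_new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for the mean response
print(pred[["obs_ci_lower", "obs_ci_upper"]])            # PI for a single response
```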
27
Checking the Conditions for Inference
Plot the response variable against each of the explanatory variables.
Plot the residuals against the predicted values and against all of the explanatory variables in the model. Look for outliers and influential observations in all residual plots.
Ideally, we would like all of the explanatory variables to be independent, and the observations on the response variable to be independent.
To check the condition that the response should vary Normally about the multiple regression model, make a histogram or stemplot of the residuals.