Multiple Regression Analysis

1 Multiple Regression Analysis
Chapter 14 Multiple Regression Analysis

2 What is the general purpose of regression?
To model the relationship between the dependent variable y (the response) and one or more independent variables x (predictors or explanatory variables). For example, some of the variation in the price of a house in a large city can be attributed to the size of the house, but other variables also contribute to the price, such as age, lot size, and number of bedrooms and bathrooms. Multiple regression can be used to fit models to data with two or more independent variables.

3 The equation to determine salary is
Consider a school district in which teachers with no prior teaching experience and no college credits beyond a bachelor's degree start at an annual salary of $38,000. Suppose that for each year of teaching experience up to 20 years, the teacher receives an additional $800, and that each unit of postgraduate credit up to 75 credits results in an additional $60 per year. Let:
y = salary of a teacher with at most 20 years of experience and at most 75 postgraduate units
x1 = number of years of experience
x2 = number of postgraduate units
The equation to determine salary is
y = 38,000 + 800x1 + 60x2
In a simple regression model, x1 and x2 would represent two observations of a single variable. In multiple regression, x1 and x2 represent two independent variables! Since y is determined entirely by x1 and x2, this is a deterministic model. But what if y is not entirely determined by the two (or more) independent variables? How can that scenario be modeled using multiple regression?
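As a quick check of the deterministic equation, here is a minimal Python sketch (the function name is ours):

```python
def salary(years_experience, postgrad_units):
    """Deterministic model: base salary plus per-year and per-unit raises."""
    assert 0 <= years_experience <= 20 and 0 <= postgrad_units <= 75
    return 38_000 + 800 * years_experience + 60 * postgrad_units

print(salary(10, 30))  # 38000 + 8000 + 1800 = 47800
```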

4 General Additive Multiple Regression Model
A general additive multiple regression model, which relates a dependent variable y to k predictor variables x1, x2, ..., xk, is given by the model equation
y = α + β1x1 + β2x2 + ... + βkxk + e
The deterministic portion α + β1x1 + ... + βkxk is called the population regression function. The random deviation e is the amount by which a point randomly deviates from the regression model; it is assumed to be normally distributed with mean value 0 and standard deviation σ for any particular values of x1, ..., xk. This implies that for fixed x1, x2, ..., xk values, y has a normal distribution with standard deviation σ. The β's are the population regression coefficients; each βi can be interpreted as the mean change in y when the predictor xi increases by 1 unit and the values of all the other predictors remain fixed.
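To make the model concrete, here is a minimal simulation sketch (all coefficient values are hypothetical): y is the population regression function plus a normal random deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Hypothetical population regression coefficients and error SD (our choices).
alpha, beta1, beta2, sigma = 10.0, 2.0, -1.5, 0.5

x1 = rng.uniform(0, 5, size=n)
x2 = rng.uniform(0, 5, size=n)
e = rng.normal(0, sigma, size=n)          # random deviation, N(0, sigma)

y = alpha + beta1 * x1 + beta2 * x2 + e   # population regression function + e
```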

5 y = GPA at end of sophomore year
Data collected in a survey of approximately 1000 second-year college students suggest that GPA at the end of the second year is related to the student's level of interaction with faculty and staff and to the student's commitment to his or her major. Let:
y = GPA at end of sophomore year
x1 = level of faculty and staff interaction (measured on a scale of 1 to 5)
x2 = level of commitment to major (measured on a scale of 1 to 5)
One possible population model might be:
y = 1.4 + .33x1 + .16x2 + e
For sophomores whose level of interaction with the faculty and staff is rated at 4.2 and whose level of commitment to major is rated at 2.1,
mean value of GPA = 1.4 + .33(4.2) + .16(2.1) = 3.12
If σ = .15, it is likely that a y value will be within 2σ (.30) of this mean value (3.12 ± .30). This interval is from 2.82 to 3.42.

6 Polynomial Regression
Suppose a scatterplot has a curved, roughly parabolic appearance. Would a line be a good fit for these data? No; it looks like a parabola (a quadratic function) would provide a good fit for the data.

7 Polynomial Regression
The kth-degree polynomial regression model
y = α + β1x + β2x² + ... + βkx^k + e
is a special case of the general multiple regression model with x1 = x, x2 = x², x3 = x³, ..., xk = x^k. The deterministic portion is the population regression function (the mean y value for fixed values of the predictors). Note that we include the random deviation e since this is a probabilistic model. Note also that the individual βi cannot be interpreted as mean changes with other predictors held fixed, since all the predictors are functions of a single variable x. The most important special case (other than the simple regression model, when k = 1) is the quadratic regression model
y = α + β1x + β2x² + e
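Because the polynomial model is just multiple regression on the derived predictors x, x², ..., it can be fit with ordinary least squares. A minimal sketch for the quadratic case (the data values are made up):

```python
import numpy as np

# Made-up (x, y) data with a curved pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 8.3, 14.8, 24.2, 35.7])

# Quadratic regression as multiple regression: predictors are x and x^2.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef   # estimates of the intercept and the two coefficients
```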

8 Many researchers have examined factors that are believed to contribute to the risk of heart attacks. One study found that hip-to-waist ratio was a better predictor of heart attacks than body-mass index. A plot of data from this study of a measure of heart-attack risk (y) versus hip-to-waist ratio (x) exhibited a curved relationship, so a quadratic regression model is consistent with the summary values given in the paper. Suppose the hip-to-waist ratio is 1.3; if σ = 0.25, what are the plausible values of the heart-attack risk measure? Substituting x = 1.3 into the fitted quadratic gives a mean value of
ŷ = a + b1(1.3) + b2(1.3)² = 1.16
It is likely that the heart-attack risk measure for a person with a hip-to-waist ratio of 1.3 is within 2σ of this mean, that is, between 0.66 and 1.66.

9 Suppose that an industrial chemist is interested in
Suppose that an industrial chemist is interested in the relationship between product yield (y) from a certain chemical reaction and two independent variables, x1 = reaction temperature and x2 = pressure at which the reaction is carried out. The chemist initially suggests that for temperatures between 80 and 110 in combination with pressure values ranging from 50 to 70, the relationship can be modeled by
y = 1200 + 15x1 - 35x2 + e
Consider the mean y value for three different particular temperature values:
x1 = 90: mean y value = 1200 + 15(90) - 35x2 = 2550 - 35x2
x1 = 95: mean y value = 1200 + 15(95) - 35x2 = 2625 - 35x2
x1 = 100: mean y value = 1200 + 15(100) - 35x2 = 2700 - 35x2
Notice each is a straight line in x2 with the same slope of -35; a plot of mean y against x2 shows three parallel lines, 2550 - 35x2 (x1 = 90), 2625 - 35x2 (x1 = 95), and 2700 - 35x2 (x1 = 100). Because chemical theory suggests that the decline in average yield when pressure x2 increases should be more rapid for a high temperature than for a low temperature, the chemist now has reason to doubt the appropriateness of the proposed model.

10 This third predictor variable is an interaction term.
Chemical Reaction Continued . . . A better model would include a third predictor variable, the product x1x2. One such model is
y = -4500 + 75x1 + 60x2 - x1x2 + e
This third predictor variable is an interaction term. Consider the mean y value for three different particular temperature values:
x1 = 90: mean y value = -4500 + 75(90) + 60x2 - 90x2 = 2250 - 30x2
x1 = 95: mean y value = -4500 + 75(95) + 60x2 - 95x2 = 2625 - 35x2
x1 = 100: mean y value = -4500 + 75(100) + 60x2 - 100x2 = 3000 - 40x2
Notice that these lines all have different slopes, as a plot against x2 makes clear: the line 3000 - 40x2 (x1 = 100) declines faster than 2625 - 35x2 (x1 = 95), which in turn declines faster than 2250 - 30x2 (x1 = 90).
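Under the interaction model above, collecting the x2 terms shows that the slope of mean y in x2 is 60 - x1, so it depends on temperature. A tiny sketch confirms the three slopes:

```python
# Slope of the mean y value in x2 under the interaction model
# y = -4500 + 75 x1 + 60 x2 - x1 x2 + e: collecting the x2 terms
# gives a slope of (60 - x1), which depends on temperature x1.
def slope_in_x2(x1):
    return 60 - x1

for temp in (90, 95, 100):
    print(temp, slope_in_x2(temp))  # -30, -35, -40: steeper at higher temps
```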

11 Interaction Between Variables
More than one interaction predictor can be included in the model when more than two independent variables are available. If the change in the mean y associated with a 1-unit increase in one independent variable (the slope) depends on the value of a second independent variable, there is interaction between these two variables. When the variables are denoted by x1 and x2, such interaction can be modeled by including x1x2, the product of the variables that interact, as a predictor variable. The general equation for a multiple regression model based on two independent variables x1 and x2 that also includes an interaction predictor is
y = α + β1x1 + β2x2 + β3x1x2 + e
Adding the quadratic terms gives the full quadratic, or complete second-order, model:
y = α + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + e
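With a formula interface such as the one in statsmodels, the interaction and quadratic terms can be written directly. A minimal sketch with made-up data (the column names and values are ours):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; column names and values are ours.
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "y":  [3, 2, 9, 8, 20, 18, 35, 31],
})

# Full quadratic (complete second-order) model:
# y = a + b1 x1 + b2 x2 + b3 x1 x2 + b4 x1^2 + b5 x2^2 + e
fit = smf.ols("y ~ x1 + x2 + x1:x2 + I(x1**2) + I(x2**2)", data=df).fit()
print(fit.params)
```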

12 Qualitative Predictor Variables
Qualitative or categorical variables can also be incorporated into a multiple regression model through the use of an indicator variable or dummy variable. An indicator variable uses the values 0 and 1 to indicate the different categories (for example, the gender of a student). If a qualitative variable has three or more categories, then multiple indicator variables are needed. In general, incorporating a categorical variable with c possible categories into a regression model requires the use of c - 1 indicator variables (for example, the location of houses in a California beach resort, taken up on the next slide).

13 What category for location would be represented by x1 = 0 and x2 = 0?
One of the factors that has an effect on the price of a house is location. We might want to incorporate location, as well as numerical predictors such as size and age, into a multiple regression model for predicting house price. California beach community houses can be classified by location into three categories: ocean view and beachfront, ocean view but not beachfront, and no ocean view. Let:
x1 = 1 if the house is ocean view and beachfront, 0 otherwise
x2 = 1 if the house has an ocean view but is not beachfront, 0 otherwise
What category for location would be represented by x1 = 0 and x2 = 0? A house with no ocean view. With x3 = size and x4 = age, we could then consider a multiple regression model of the form
y = α + β1x1 + β2x2 + β3x3 + β4x4 + e
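In practice, statistical software builds the indicator columns automatically. A minimal sketch using pandas (the short category labels are ours; whichever category is dropped serves as the baseline):

```python
import pandas as pd

# Three location categories require c - 1 = 2 indicator variables.
location = pd.Series(["beachfront", "ocean view", "no view",
                      "no view", "beachfront"])
indicators = pd.get_dummies(location, drop_first=True)
print(indicators)  # the dropped category serves as the baseline
```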

14 In this example, an observation would be (x1, x2, x3, y).
One way colleges measure success is by graduation rates. The Education Trust publishes graduation rates along with other college characteristics. Let's consider the following variables:
y = 6-year graduation rate
x1 = median SAT score of students accepted to the college
x2 = student-related expense per full-time student (in dollars)
x3 = 1 if the college has only female students or only male students, 0 if the college has both male and female students
One possible model that might be considered to describe the relationship between y and these three predictors is
y = α + β1x1 + β2x2 + β3x3 + e
Note that two of these predictors are numerical variables and one is categorical. As in simple regression, we will need to estimate the regression coefficients α, β1, β2, and β3 by calculating a, b1, b2, and b3. In simple regression, an observation is an (x, y) pair. In multiple regression, an observation consists of the k independent variables and the dependent variable, so it has k + 1 terms. In this example, an observation would be (x1, x2, x3, y).

15 Least-Squares Estimates
According to the principle of least squares, the fit of a particular estimated regression function a + b1x1 + ... + bkxk to the observed data is measured by the sum of the squared deviations between the observed y values and the y values predicted by the estimated regression function:
Σ [y - (a + b1x1 + ... + bkxk)]²
The least-squares estimates of α, β1, ..., βk are those values of a, b1, ..., bk that make this sum of squared deviations as small as possible. The least-squares estimates for a given data set are obtained by solving a system of k + 1 equations in the k + 1 unknowns a, b1, ..., bk (called the normal equations). This is difficult to do by hand, but all the commonly used statistical software packages have been programmed to solve for these.
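In matrix form the normal equations are (XᵀX)b = Xᵀy, where the columns of X are a column of 1s followed by the predictor columns. A minimal sketch with simulated data (all names and values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3

# Simulated data; the true coefficients (1, 2, -1, 0.5) are our choices.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=n)

# Solve the normal equations (X'X)b = X'y for (a, b1, ..., bk).
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # close to the true coefficients
```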

16 These are the estimates for the regression coefficients.
Graduation Rates Continued . . . Minitab output from a regression command requesting that the model y = α + β1x1 + β2x2 + β3x3 + e be fit to the small college data (found in the textbook) is given below. The values in the Coef column are the estimates for the regression coefficients. The coefficient of x1 is interpreted as the average change in 6-year graduation rate associated with a 1-unit increase in median SAT score while the type of institution and the expenditures remain fixed. What are the interpretations of the coefficients of the predictor variables x2 and x3?

The regression equation is
y = a + b1 x1 + b2 x2 + b3 x3

Predictor        Coef   SE Coef       T       P
Constant                 0.1976   -1.98   0.064
x1                                 3.30   0.004
x2                                 1.55   0.139
x3                                 2.10   0.050

S =          R-Sq = 86.1%   R-Sq(adj) = 83.8%

Analysis of Variance
Source            DF      SS      MS       F       P
Regression         3                    37.16   0.000
Residual Error    18
Total             21

17 This value is the adjusted R².
Graduation Rates Continued . . . Minitab output from the same fit of the model y = α + β1x1 + β2x2 + β3x3 + e to the small college data is shown again below. In this output, S is se, the estimated standard deviation of the random deviation e. R-Sq is the coefficient of multiple determination: the proportion of the variation in 6-year graduation rates that can be explained by the multiple regression model. R-Sq(adj) is the adjusted R².

The regression equation is
y = a + b1 x1 + b2 x2 + b3 x3

Predictor        Coef   SE Coef       T       P
Constant                 0.1976   -1.98   0.064
x1                                 3.30   0.004
x2                                 1.55   0.139
x3                                 2.10   0.050

S =          R-Sq = 86.1%   R-Sq(adj) = 83.8%

Analysis of Variance
Source            DF      SS      MS       F       P
Regression         3                    37.16   0.000
Residual Error    18
Total             21

18 Recall that SSResid is the sum of the squared residuals.
Is the model useful? We use se, R², and the adjusted R² to determine how useful the multiple regression model is. Recall that SSResid is the sum of the squared residuals, where the residuals are the differences between the observed y values and the predicted y values, and that SSTo is the sum of the squared deviations of the observed y values from the mean of y, a measure of the total variability in the y values. The estimate for the random deviation variance σ² is given by
se² = SSResid / (n - (k + 1))
The df = n - (k + 1) because (k + 1) df are lost in estimating the k + 1 coefficients α, β1, ..., βk. The coefficient of multiple determination is
R² = 1 - SSResid / SSTo
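These quantities are straightforward to compute from the residuals; a minimal sketch in Python (the function and variable names are ours):

```python
import numpy as np

def fit_stats(y, y_hat, k):
    """Return SSResid, SSTo, R^2, and s_e^2 for a model with k predictors."""
    resid = y - y_hat                      # observed minus predicted
    ss_resid = np.sum(resid ** 2)
    ss_to = np.sum((y - np.mean(y)) ** 2)  # total variability in y
    r2 = 1 - ss_resid / ss_to
    se2 = ss_resid / (len(y) - (k + 1))    # df = n - (k + 1)
    return ss_resid, ss_to, r2, se2
```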

19 Is the model useful? Continued …
The adjusted R² is computed using
adjusted R² = 1 - [(n - 1) / (n - (k + 1))] (SSResid / SSTo)
The adjusted R² takes into account the number of predictor variables. This is important because, if you use a large number of predictors, you can account for most of the variability in y even when no real relationship exists. Because the value in the square brackets exceeds 1, the adjusted R² is always smaller than R². On rare occasions, the adjusted R² may even be negative.
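A one-line helper makes the formula concrete (the function name is ours):

```python
def adjusted_r2(ss_resid, ss_to, n, k):
    """Adjusted R^2 = 1 - [(n - 1) / (n - (k + 1))] * (SSResid / SSTo)."""
    return 1 - ((n - 1) / (n - (k + 1))) * (ss_resid / ss_to)
```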

20 Is this model useful? Let’s look at these three values again.
Graduation Rates Continued . . . Is this model useful? Let's look at the three values S, R-Sq, and R-Sq(adj) in the Minitab output again (model y = α + β1x1 + β2x2 + β3x3 + e fit to the small college data from the textbook):

The regression equation is
y = a + b1 x1 + b2 x2 + b3 x3

Predictor        Coef   SE Coef       T       P
Constant                 0.1976   -1.98   0.064
x1                                 3.30   0.004
x2                                 1.55   0.139
x3                                 2.10   0.050

S =          R-Sq = 86.1%   R-Sq(adj) = 83.8%

Analysis of Variance
Source            DF      SS      MS       F       P
Regression         3                    37.16   0.000
Residual Error    18
Total             21

The value of se is small and the value of R² is large. This means that most of the variation is accounted for by the model and that the observations deviate little from the predicted y values. Also, the values of R² and the adjusted R² are close, which suggests that we haven't used too many predictors in our model.

21 F Distributions The model utility test for multiple regression is based on a probability distribution called the F distribution. Like the t and χ² distributions, the F distributions are based on df; however, an F distribution depends on both df1, the degrees of freedom for the numerator of the test statistic, and df2, the degrees of freedom for the denominator. Each different combination of df1 and df2 produces a different F distribution.

22 F Distributions Continued . . .
Here are some graphs of different F curves (for example, the F curve for df1 = 3 and df2 = 18 and the F curve for df1 = 18 and df2 = 3). The P-value is the area under the associated F curve to the right of the calculated F value. Most statistical software packages and graphing calculators will compute this P-value. All F tests in this textbook are upper-tailed.

23 F Test for Model Utility
Null hypothesis: H0: β1 = β2 = ... = βk = 0 (there is no useful linear relationship between y and ANY of the predictors)
Alternative hypothesis: Ha: at least one of β1, ..., βk is not 0 (there is a useful linear relationship between y and at least one of the predictors)
Test statistic:
F = (SSRegr / k) / (SSResid / (n - (k + 1)))
where SSRegr = SSTo - SSResid and n is the number of observations.
Assumptions: For any particular combination of predictor variable values, the distribution of e is normal with mean 0 and constant variance σ².
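As a rough illustration of how the test statistic and its P-value can be computed from the sums of squares, here is a hedged sketch (the function name and layout are ours):

```python
from scipy import stats

def model_utility_f(ss_to, ss_resid, n, k):
    """F statistic and upper-tail P-value for the model utility test."""
    ss_regr = ss_to - ss_resid
    f = (ss_regr / k) / (ss_resid / (n - (k + 1)))
    p = stats.f.sf(f, k, n - (k + 1))   # area to the right of f
    return f, p
```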

24 Graduation Rates Continued . . .
The model y = α + β1x1 + β2x2 + β3x3 + e was fitted to the small college data (found in the textbook).
H0: β1 = β2 = β3 = 0
Ha: at least one of the three β's is not 0
Assumptions: A normal probability plot of the standardized residuals is quite straight, indicating that the assumption of normality of the random deviation distribution is reasonable.

25 Graduation Rates Continued . . .
H0: β1 = β2 = β3 = 0
Ha: at least one of the three β's is not 0
Test statistic: F = 37.16 (from the Minitab output), with df1 = 3 and df2 = 18
With α = .05, the P-value ≈ 0. Since the P-value < α, we reject H0. There is convincing evidence that the multiple regression model is useful.
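The P-value can be confirmed directly from the F distribution with software; a minimal sketch using scipy, with the values taken from the output above:

```python
from scipy import stats

# Upper-tail area beyond F = 37.16 for df1 = 3 and df2 = 18.
print(stats.f.sf(37.16, 3, 18))  # prints a value very close to 0
```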

26 Graduation Rates Continued . . .
Minitab output from the fit of the model y = α + β1x1 + β2x2 + β3x3 + e to the small college data (found in the textbook) is shown again below. Notice that the sums of squares are given in the Analysis of Variance table. Dividing SSRegr by its df produces the numerator of the F test statistic (MSRegr); similarly, dividing SSResid by its df produces the denominator (MSResid). Dividing these two MS terms produces the F test statistic.

The regression equation is
y = a + b1 x1 + b2 x2 + b3 x3

Predictor        Coef   SE Coef       T       P
Constant                 0.1976   -1.98   0.064
x1                                 3.30   0.004
x2                                 1.55   0.139
x3                                 2.10   0.050

S =          R-Sq = 86.1%   R-Sq(adj) = 83.8%

Analysis of Variance
Source            DF      SS      MS       F       P
Regression         3                    37.16   0.000
Residual Error    18
Total             21

27 What factors contribute to the price of energy bars?
What factors contribute to the price of energy bars? Minitab output for data (found in the textbook) based on the following variables is shown below.
y = price
x1 = calorie content
x2 = protein content
x3 = fat content

The regression equation is
Price = a + b1 Calories + b2 Protein + b3 Fat

Predictor    Coef     SE Coef    T      P
Constant     0.2511   0.3524     0.71   0.487
Calories                         0.73   0.478
Protein                          3.58   0.003
Fat                              1.22   0.242

S =        R-Sq = 74.7%   R-Sq(adj) = 69.6%

Analysis of Variance
Source            DF    SS       MS       F       P
Regression         3    3.4453   1.1484   14.76   0.000
Residual Error    15    1.1670   0.0778
Total             18    4.6122

According to the F test for model utility, the fitted multiple regression model is useful for predicting the price of energy bars. However, looking at the t tests for the individual predictors, it appears that only protein content is useful. Let's refit the model to include only the protein predictor variable.

28 What factors contribute to the price of energy bars?
What factors contribute to the price of energy bars? Minitab output for data (found in the textbook) based on the following variables is shown below.
y = price
x2 = protein content

The regression equation is
Price = a + b2 Protein

Predictor    Coef     SE Coef   T      P
Constant     0.6072   0.1419    4.28   0.001
Protein                         6.47   0.000

S =        R-Sq = 71.1%   R-Sq(adj) = 69.4%

Analysis of Variance
Source            DF    SS       MS       F       P
Regression         1    3.2809   3.2809   41.90   0.000
Residual Error    17    1.3313   0.0783
Total             18    4.6122

According to the F test for model utility, the fitted simple regression model is also useful for predicting the price of energy bars. Since the model with just one predictor accounts for almost as much of the variation in the y values (adjusted R² of 69.4%, versus 69.6% for the multiple regression model), it is preferable to use the simpler model.
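The adjusted R² comparison can be verified directly from the sums of squares in the two ANOVA tables; a minimal sketch (the helper restates the adjusted R² formula from earlier):

```python
# Consistency check of the two energy-bar fits using the ANOVA sums
# of squares above (SSTo = 4.6122 in both models, n = 19 observations).
def adj_r2(ss_resid, ss_to, n, k):
    return 1 - ((n - 1) / (n - (k + 1))) * (ss_resid / ss_to)

print(adj_r2(1.1670, 4.6122, 19, 3))  # ~0.696: full three-predictor model
print(adj_r2(1.3313, 4.6122, 19, 1))  # ~0.694: protein-only model
```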

