Chapter 14 Multiple Regression Models


1 Chapter 14 Multiple Regression Models

2 Multiple Regression Models
A general additive multiple regression model, which relates a dependent variable $y$ to $k$ predictor variables $x_1, x_2, \dots, x_k$, is given by the model equation
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$$
The random deviation $e$ is assumed to be normally distributed with mean value 0 and variance $\sigma^2$ for any particular values of $x_1, x_2, \dots, x_k$. This implies that for fixed $x_1, x_2, \dots, x_k$ values, $y$ has a normal distribution with variance $\sigma^2$ and
$$(\text{mean } y \text{ value for fixed } x_1, x_2, \dots, x_k \text{ values}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$
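As a minimal sketch of what the model equation says, the following Python computes the mean y value for fixed predictor values and draws one simulated observation by adding a normal random deviation. The function names and the numeric values are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import random

def mean_y(alpha, betas, xs):
    """Population regression function: alpha + beta_1*x_1 + ... + beta_k*x_k."""
    return alpha + sum(b * x for b, x in zip(betas, xs))

def simulate_y(alpha, betas, xs, sigma, rng=random):
    """One observation: the mean y value plus a deviation e ~ Normal(0, sigma^2)."""
    return mean_y(alpha, betas, xs) + rng.gauss(0.0, sigma)

# Hypothetical parameter values (assumptions for illustration only)
alpha, betas, sigma = 1.0, [2.0, -1.0], 0.5
xs = [3.0, 4.0]

print(mean_y(alpha, betas, xs))             # 3.0
print(simulate_y(alpha, betas, xs, sigma))  # varies from run to run around 3.0
```

Repeated calls to `simulate_y` at the same predictor values scatter around the same mean with the same spread, which is the constant-variance assumption stated above.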

3 Multiple Regression Models
The $\beta_i$'s are called population regression coefficients; each $\beta_i$ can be interpreted as the true average change in $y$ when the predictor $x_i$ increases by 1 unit and the values of all the other predictors remain fixed. The deterministic portion $\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$ is called the population regression function.

4 Polynomial Regression Models
The $k$th-degree polynomial regression model
$$y = \alpha + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + e$$
is a special case of the general multiple regression model with $x_1 = x$, $x_2 = x^2$, …, $x_k = x^k$. The population regression function (mean value of $y$ for fixed values of the predictors) is $\alpha + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$. The most important special case other than simple linear regression ($k = 1$) is the quadratic regression model $y = \alpha + \beta_1 x + \beta_2 x^2 + e$. This model replaces the line of mean values $\alpha + \beta x$ with a parabolic curve of mean values $\alpha + \beta_1 x + \beta_2 x^2$. If $\beta_2 > 0$, the curve opens upward, whereas if $\beta_2 < 0$, the curve opens downward.
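The idea that a polynomial model is an ordinary multiple regression model with predictors x, x², …, x^k can be sketched in a few lines of Python. The helper names below are hypothetical, and the coefficient values are made up for illustration.

```python
def poly_predictors(x, k):
    """Turn a single x into the k predictors x_1 = x, x_2 = x^2, ..., x_k = x^k."""
    return [x ** j for j in range(1, k + 1)]

def quadratic_mean(alpha, beta1, beta2, x):
    """Mean y value under the quadratic regression model."""
    return alpha + beta1 * x + beta2 * x ** 2

def opens_upward(beta2):
    """The parabola of mean values opens upward when beta2 > 0."""
    return beta2 > 0

print(poly_predictors(2, 3))        # [2, 4, 8]
print(quadratic_mean(1, 2, 3, 2))   # 1 + 2*2 + 3*4 = 17
print(opens_upward(-0.5))           # False: the curve opens downward
```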

5 Interaction
If the change in the mean $y$ value associated with a 1-unit increase in one independent variable depends on the value of a second independent variable, there is interaction between these two variables. When the variables are denoted by $x_1$ and $x_2$, such interaction can be modeled by including $x_1 x_2$, the product of the variables that interact, as a predictor variable.
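A small numeric sketch may make the definition concrete. Assuming a mean function of the form α + β₁x₁ + β₂x₂ + β₃x₁x₂ (the coefficient values below are invented for illustration), the change in mean y for a 1-unit increase in x₁ works out to β₁ + β₃x₂, so it depends on x₂.

```python
def mean_y(alpha, b1, b2, b3, x1, x2):
    """Mean y value for a model with an x1*x2 interaction predictor."""
    return alpha + b1 * x1 + b2 * x2 + b3 * x1 * x2

def change_per_unit_x1(alpha, b1, b2, b3, x1, x2):
    """Change in mean y when x1 increases by 1 unit and x2 stays fixed."""
    return mean_y(alpha, b1, b2, b3, x1 + 1, x2) - mean_y(alpha, b1, b2, b3, x1, x2)

# Hypothetical coefficients (assumptions for illustration only)
coefs = (10, 2, 3, 1)  # alpha, b1, b2, b3

print(change_per_unit_x1(*coefs, x1=4, x2=0))  # 2  (= b1 + b3*0)
print(change_per_unit_x1(*coefs, x1=4, x2=5))  # 7  (= b1 + b3*5)
```

With b3 set to 0 the two printed values would be equal, which is the no-interaction case.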

6 Qualitative Predictor Variables
Up to now, we have only considered the inclusion of quantitative (numerical) predictor variables in a multiple regression model. Qualitative (categorical) predictors can also be included once their categories are coded numerically. Two types are very common:
- Dichotomous variable: one with just two possible categories, coded 0 and 1. Examples: gender {male, female}; marriage status {married, not married}.
- Ordinal variable: a categorical variable whose categories have a natural ordering. Examples: activity level {light, moderate, heavy}, coded 1, 2, and 3 respectively; education level {none, elementary, secondary, college, graduate}, coded 1, 2, 3, 4, 5 respectively (or, for that matter, any 5 consecutive integers).
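The coding described above might be carried out as in the following sketch. The dictionary and function names are hypothetical; the 0 = female, 1 = male assignment follows the lung capacity example later in this transcript, and the activity codes follow this slide.

```python
# Dichotomous variable: two categories coded 0 and 1
GENDER_CODE = {"female": 0, "male": 1}

# Ordinal variable: ordered categories coded with consecutive integers
ACTIVITY_CODE = {"light": 1, "moderate": 2, "heavy": 3}

def encode(record):
    """Convert a record with text categories into numeric predictor values."""
    return [GENDER_CODE[record["gender"]], ACTIVITY_CODE[record["activity"]]]

print(encode({"gender": "male", "activity": "heavy"}))      # [1, 3]
print(encode({"gender": "female", "activity": "light"}))    # [0, 1]
```

Coding an ordinal variable with consecutive integers treats each step up the scale as the same change in mean y, which is an assumption worth checking against the data.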

7 Least Squares Estimates
According to the principle of least squares, the fit of a particular estimated regression function $a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$ to the observed data is measured by the sum of squared deviations between the observed $y$ values and the $y$ values predicted by the estimated function:
$$\sum \left[ y - (a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k) \right]^2$$
The least squares estimates of $\alpha, \beta_1, \beta_2, \dots, \beta_k$ are those values of $a, b_1, b_2, \dots, b_k$ that make this sum of squared deviations as small as possible.
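One way to find the minimizing values is to solve the normal equations (XᵀX)b = Xᵀy. The sketch below does this in plain Python with Gaussian elimination. It is an illustration under that assumption, not the method Minitab uses internally, and the tiny data set is invented so that the exact answer is known.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def least_squares(X, y):
    """Return [a, b_1, ..., b_k] minimizing the sum of squared deviations."""
    rows = [[1.0] + list(r) for r in X]            # leading 1 for the intercept a
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def sum_sq_dev(coefs, X, y):
    """Sum of squared deviations between observed and predicted y values."""
    preds = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefs = least_squares(X, y)
print([round(c, 6) for c in coefs])   # close to [1, 2, 3]
print(sum_sq_dev(coefs, X, y))        # essentially 0 for this exact-fit data
```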

8 Predicted Values & Residuals

9 Sums of Squares

10 Estimate for $\sigma^2$

11 Coefficient of Multiple Determination, $R^2$

12 Adjusted $R^2$
Generally, a model with a large $R^2$ and a small $s_e$ is desirable. If a large number of variables (relative to the number of data points) is used, those conditions may be satisfied, but the model will be unrealistic and difficult to interpret.
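The formulas on the preceding slides did not survive in this transcript. The sketch below uses the standard textbook definitions, which are an assumption here rather than a recovery of the slide content: SSResid = Σ(y − ŷ)², SSTo = Σ(y − ȳ)², s_e² = SSResid / (n − (k + 1)), R² = 1 − SSResid/SSTo, and adjusted R² = 1 − [(n − 1)/(n − (k + 1))]·(SSResid/SSTo). The function names are hypothetical.

```python
def fit_summary(y, y_hat, k):
    """Summary measures for a fitted model with k predictors (standard definitions)."""
    n = len(y)
    y_bar = sum(y) / n
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    ss_to = sum((yi - y_bar) ** 2 for yi in y)                  # total sum of squares
    df = n - (k + 1)
    return {
        "SSResid": ss_resid,
        "SSTo": ss_to,
        "s_e": (ss_resid / df) ** 0.5,
        "R2": 1 - ss_resid / ss_to,
        "adjR2": 1 - ((n - 1) / df) * (ss_resid / ss_to),
    }

def adjusted_r2(r2, n, k):
    """Adjusted R^2 computed from R^2, sample size n, and number of predictors k."""
    return 1 - ((n - 1) / (n - (k + 1))) * (1 - r2)

# Check against the four-predictor Minitab output later in this transcript:
# R-Sq = 83.2% with n = 40 cases and k = 4 predictors is reported with R-Sq(adj) = 81.3%.
print(round(100 * adjusted_r2(0.832, 40, 4), 1))   # 81.3
```

Because the adjustment factor (n − 1)/(n − (k + 1)) grows with k, adding predictors that barely reduce SSResid lowers adjusted R² even though R² itself never decreases.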

13 F Distributions
F distributions are similar to chi-square distributions but have two parameters: a numerator df and a denominator df.

14 The F Test for Model Utility
The regression sum of squares, denoted by SSReg, is defined by SSReg = SSTo - SSResid.

15 The F Test for Model Utility

16 The F Test for Utility of the Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$
Null hypothesis: $H_0$: $\beta_1 = \beta_2 = \dots = \beta_k = 0$ (There is no useful linear relationship between $y$ and any of the predictors.)
Alternative hypothesis: $H_a$: At least one among $\beta_1, \beta_2, \dots, \beta_k$ is not zero (There is a useful linear relationship between $y$ and at least one of the predictors.)

17 The F Test for Utility of the Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$

18 The F Test for Utility of the Model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + e$
The test is upper-tailed, and the table of values that capture specified upper-tail F curve areas is used to obtain a bound or bounds on the P-value, using numerator df = $k$ and denominator df = $n - (k + 1)$.
Assumptions: For any particular combination of predictor variable values, the distribution of $e$, the random deviation, is normal with mean 0 and constant variance.
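The test statistic itself was on a slide whose content is missing from this transcript. The sketch below assumes the standard form F = (SSReg/k) / (SSResid/(n − (k + 1))), which can equivalently be written in terms of R² as (R²/k) / ((1 − R²)/(n − (k + 1))). The function names are hypothetical, and the P-value lookup is left to an F table or statistical software.

```python
def f_from_sums(ss_to, ss_resid, n, k):
    """Model utility F statistic from the sums of squares (assumed standard form)."""
    ss_reg = ss_to - ss_resid                      # SSReg = SSTo - SSResid
    return (ss_reg / k) / (ss_resid / (n - (k + 1)))

def f_from_r2(r2, n, k):
    """Same statistic written in terms of R^2; also returns (numerator df, denominator df)."""
    num_df, den_df = k, n - (k + 1)
    return (r2 / num_df) / ((1 - r2) / den_df), num_df, den_df

# Values taken from the four-predictor Minitab output later in this transcript:
# R-Sq = 83.2%, n = 40 cases, k = 4 predictors.
f, num_df, den_df = f_from_r2(0.832, 40, 4)
print(round(f, 2), num_df, den_df)   # 43.33 4 35
```

A computed F this large would then be compared with the upper-tail values tabled for 4 and 35 degrees of freedom to bound the P-value.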

19 An Example
During a summer NSF program for teachers of statistics, the participants were asked to break into groups and develop a project similar in scope to what we would like to have our students develop. One of these groups decided that it would study the lung capacity of adult humans, measured in liters. A sample of adults was not particularly easy to obtain on the campus during the summer, so we “shanghaied” everyone who was willing to stand still, be measured, and be interviewed. We used borrowed equipment (an antique liquid displacement apparatus) and collected data.

20 An Example
This group recorded a number of variables, including gender (m or f), age (yrs), height (in), weight (lbs), waist (in), chest girth (in), smoking (Y or N), and activity level (1 - light, 2 - medium, 3 - heavy), along with the lung capacity (liters).
The code for gender is 0 = Female, 1 = Male.
The code for smoking is 0 = No, 1 = Yes.
The data follow on the next slides.

21 An Example - The Data

22 An Example - The Data

23 An Example - The Data

24 Analysis - 1st with Minitab
Regression Analysis: Capacity versus Age, Height, ...
The regression equation is
Capacity = - 6.17 - 0.0140 Age + 0.149 Height + 0.00636 Weight - 0.0087 Chest - 0.0220 Waist + 0.343 Activity - 0.109 Smoke - 0.409 Gender
40 cases used 1 cases contain missing values
Predictor       Coef   SE Coef      T      P
Constant      -6.172     2.653  -2.33  0.027
Age        -0.014032  0.007000  -2.00  0.054
Height       0.14856   0.03503   4.24  0.000
Weight      0.006359  0.006094   1.04  0.305
Chest       -0.00867   0.05791  -0.15  0.882
Waist       -0.02197   0.04557  -0.48  0.633
Activity      0.3427    0.1282   2.67  0.012
Smoke        -0.1092    0.1491  -0.73  0.469
Gender       -0.4086    0.2757  -1.48  0.148
S = 0.4607   R-Sq = 84.3%   R-Sq(adj) = 80.2%
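Each value in the T column of this output is the coefficient divided by its standard error, and with 40 cases and 8 predictors the corresponding t distribution has 40 − (8 + 1) = 31 degrees of freedom. The following sketch recomputes the ratios from the Coef and SE Coef columns above; the variable and function names are hypothetical.

```python
# (Coef, SE Coef, reported T) copied from the Minitab output above
OUTPUT = {
    "Constant": (-6.172, 2.653, -2.33),
    "Age": (-0.014032, 0.007000, -2.00),
    "Height": (0.14856, 0.03503, 4.24),
    "Weight": (0.006359, 0.006094, 1.04),
    "Chest": (-0.00867, 0.05791, -0.15),
    "Waist": (-0.02197, 0.04557, -0.48),
    "Activity": (0.3427, 0.1282, 2.67),
    "Smoke": (-0.1092, 0.1491, -0.73),
    "Gender": (-0.4086, 0.2757, -1.48),
}

def t_ratio(coef, se):
    """t ratio for testing whether a single coefficient is zero."""
    return coef / se

n, k = 40, 8
print("error df =", n - (k + 1))   # error df = 31
for name, (coef, se, reported) in OUTPUT.items():
    print(f"{name:9s} computed t = {t_ratio(coef, se):6.2f}   reported T = {reported:6.2f}")
```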

25 Analysis - 2nd with Minitab
Notice that the P-values in the output suggest that only the predictors height (P-value = 0.000) and activity level (P-value = 0.012) are significant at the 0.05 level of significance. The only other variables that seem possibly significant are age (P-value = 0.054) and gender (P-value = 0.148). When stepwise regression techniques are applied using Minitab, the variables that remain significant are height, activity level, age, and gender. The output is on the next two slides.

26 Analysis - 2nd with Minitab
Stepwise Regression: Capacity versus Age, Height, ...
Alpha-to-Enter: 0.1   Alpha-to-Remove: 0.1
Response is Capacity on 8 predictors, with N = 40
N(cases with missing observations) = 1   N(all cases) = 41
Step             1        2        3        4
Constant   -10.251   -9.759   -9.787   -6.929
Height       0.209    0.191    0.198    0.161
T-Value      10.42     9.87    10.43     6.55
P-Value      0.000    0.000    0.000    0.000
Activity               0.35     0.31     0.30
T-Value                2.87     2.60     2.67
P-Value               0.007    0.013    0.011

27 Analysis - 2nd with Minitab
Step             1        2        3        4
Activity               0.35     0.31     0.30
T-Value                2.87     2.60     2.67
P-Value               0.007    0.013    0.011
Age                          -0.0109  -0.0137
T-Value                        -1.96    -2.54
P-Value                        0.057    0.016
Gender                                  -0.47
T-Value                                 -2.24
P-Value                                 0.032
S            0.534    0.490    0.472    0.448
R-Sq         74.06    78.78    80.84    83.23
R-Sq(adj)    73.38    77.63    79.24    81.32
C-p           15.1      7.8      5.8      3.0

28 Analysis - 2nd with Minitab
The resulting Minitab output from the regression analysis using those 4 predictors follows.
Regression Analysis: Capacity versus Height, Activity, Gender, Age
The regression equation is
Capacity = - 6.93 + 0.161 Height + 0.302 Activity - 0.466 Gender - 0.0137 Age
40 cases used 1 cases contain missing values
Predictor       Coef   SE Coef      T      P
Constant      -6.929     1.708  -4.06  0.000
Height       0.16079   0.02454   6.55  0.000
Activity      0.3025    0.1133   2.67  0.011
Gender       -0.4658    0.2082  -2.24  0.032
Age        -0.013744  0.005404  -2.54  0.016
S = 0.4477   R-Sq = 83.2%   R-Sq(adj) = 81.3%
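To show how the estimated regression function is used, here is a minimal sketch that plugs predictor values into the four-predictor equation above. The coefficients are copied from the Coef column; the function name and the example person (70 in tall, medium activity, gender code 1, age 30) are hypothetical and not taken from the study data.

```python
# Coefficients copied from the Coef column of the Minitab output above
COEF = {
    "Constant": -6.929,
    "Height": 0.16079,     # inches
    "Activity": 0.3025,    # 1 = light, 2 = medium, 3 = heavy
    "Gender": -0.4658,     # 0 = Female, 1 = Male
    "Age": -0.013744,      # years
}

def predict_capacity(height_in, activity, gender, age_yrs):
    """Predicted lung capacity (liters) from the estimated regression function."""
    return (COEF["Constant"]
            + COEF["Height"] * height_in
            + COEF["Activity"] * activity
            + COEF["Gender"] * gender
            + COEF["Age"] * age_yrs)

# Hypothetical person, for illustration only
print(round(predict_capacity(70, 2, 1, 30), 3))   # 4.053

# One extra year of age, other predictors fixed, changes the prediction by the Age coefficient
print(round(predict_capacity(70, 2, 1, 31) - predict_capacity(70, 2, 1, 30), 6))   # -0.013744
```

Predictions of this kind are only reasonable for predictor values inside the range covered by the 40 cases used to fit the model.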

29 Analysis - 2nd with Minitab
Consider the following graphs: the plot of residuals versus fits and the normal plot of the residuals.

30 Analysis - 2nd with Minitab

31 Analysis - 2nd with Minitab
Notice that both of these graphs appear to indicate that the assumptions made were justifiable. This multiple linear regression model appears to provide a reasonably acceptable model for estimating lung capacity.

32 Analysis - 3rd with Minitab
A number of the members of the project team felt that other variables, specifically the height/weight and chest/waist ratios as well as the square of the chest girth multiplied by the height, might be better predictor variables. When these three combination variables were calculated and added to height, activity level, age, and gender, the following Minitab output was obtained.

33 Analysis - 3rd with Minitab
Regression Analysis: Capacity versus Height, Activity, ...
The regression equation is
Capacity = - 6.22 + 0.160 Height + 0.307 Activity - 0.469 Gender - 0.0150 Age - 1.04 HT/WT + 0.01 CH/Waist - 0.000002 c2h
40 cases used 1 cases contain missing values
Predictor         Coef     SE Coef      T      P
Constant        -6.220       2.111  -2.95  0.006
Height         0.16012     0.02915   5.49  0.000
Activity        0.3072      0.1211   2.54  0.016
Gender         -0.4686      0.2245  -2.09  0.045
Age          -0.015039    0.006613  -2.27  0.030
HT/WT           -1.042       1.574  -0.66  0.512
CH/Waist         0.011       1.305   0.01  0.993
c2h        -0.00000221  0.00000737  -0.30  0.766
S = 0.4635   R-Sq = 83.6%   R-Sq(adj) = 80.0%

34 Analysis - 3rd with Minitab
None of these three variables appeared to be significant. The fact that girth² × height, which would be (approximately) proportional to the volume of the body, was not significant came as a surprise to the members of the team. As a side note, the literature on spirography suggests that height is the most significant factor in lung capacity, and this is what this particular study indicated after it was completely analyzed.

