6-4 Other Aspects of Regression


6-4 Other Aspects of Regression 6-4.1 Polynomial Models
Suppose that we wanted to test the contribution of the second-order terms to this model. In other words, what is the value of expanding the model to include the additional terms beyond the reduced model
Y = β0 + β1(T − 1212.5) + β2(R − 12.444) + ε

6-4 Other Aspects of Regression 6-4.1 Polynomial Models
Full model:
Y = β0 + β1(T − 1212.5) + β2(R − 12.444) + β12(T − 1212.5)(R − 12.444) + β11(T − 1212.5)² + β22(R − 12.444)² + ε
Reduced model:
Y = β0 + β1(T − 1212.5) + β2(R − 12.444) + ε
H0: β12 = β11 = β22 = 0
H1: at least one of these β's ≠ 0
f0 = [(SSE(RM) − SSE(FM))/(5 − 2)] / [SSE(FM)/(16 − 6)] = [(170.73 − 11.37)/3] / [11.37/10] = 46.72
Tabled F: f0.05,3,10 = 3.71 (p-value < 0.0001), so we reject H0: the second-order terms contribute significantly to the model.
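The same partial F-test can be obtained directly from the full-model fit with PROC REG's TEST statement, instead of fitting the full and reduced models separately and computing f0 by hand; a minimal sketch, using the variable names from the Example 6-9 code below:

PROC REG DATA=EX69;
   MODEL YIELD=TEMPC RATIOC TEMRATC TEMPCSQ RATIOCSQ;
   /* partial F-test of H0: all three second-order coefficients are zero */
   SECONDORD: TEST TEMRATC=0, TEMPCSQ=0, RATIOCSQ=0;
RUN; QUIT;

The F statistic and p-value reported by the TEST statement match the extra-sum-of-squares computation above.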

6-4 Other Aspects of Regression Example 6-9

OPTIONS NOOVP NODATE NONUMBER;
DATA EX69;
   INPUT YIELD TEMP RATIO;
   TEMPC=TEMP-1212.5;
   RATIOC=RATIO-12.444;
   TEMRATC=TEMPC*RATIOC;
   TEMPCSQ=TEMPC**2;
   RATIOCSQ=RATIOC**2;
CARDS;
49.0 1300 7.5
50.2 1300 9.0
50.5 1300 11.0
48.5 1300 13.5
47.5 1300 17.0
44.5 1300 23.0
28.0 1200 5.3
31.5 1200 7.5
34.5 1200 11.0
35.0 1200 13.5
38.0 1200 17.0
38.5 1200 23.0
15.0 1100 5.3
17.0 1100 7.5
20.5 1100 11.0
29.5 1100 17.0
;
PROC REG DATA=EX69;
   MODEL YIELD=TEMPC RATIOC TEMRATC TEMPCSQ RATIOCSQ/VIF;
   TITLE 'QUADRATIC REGRESSION MODEL - FULL MODEL';
   MODEL YIELD=TEMPC RATIOC/VIF;
   TITLE 'LINEAR REGRESSION MODEL - REDUCED MODEL';
RUN; QUIT;


6-4 Other Aspects of Regression Residual Plots
(b) The variance of the observations may be increasing with time or with the magnitude of yi or xi. A transformation of the response y (√y, ln y, or 1/y) is often used to eliminate this problem.
(c) Plots of residuals against ŷi and xi can also indicate inequality of variance.
(d) Indicates model inadequacy: higher-order terms should be added to the model, a transformation of the x-variable or the y-variable (or both) should be considered, or other regressors should be considered (e.g., a quadratic or exponential model).

6-4 Other Aspects of Regression Example

OPTIONS NOOVP NODATE NONUMBER;
DATA BIDS;
   INFILE 'C:\Users\korea\Desktop\Working Folder 2017\imen214-stats\ch06\data\bids.dat';
   INPUT PRICE QUANTITY BIDS;
   LOGPRICE=LOG(PRICE);
   RECPRICE=1/PRICE;
   QUANSQ=QUANTITY**2;
ODS GRAPHICS ON;
PROC SGPLOT DATA=BIDS;
   SCATTER X=QUANTITY Y=PRICE;
   TITLE 'Scatter Plot of PRICE vs. QUANTITY';
PROC REG DATA=BIDS;
   MODEL PRICE=QUANTITY;
   TITLE 'LINEAR REGRESSION OF PRICE VS. QUANTITY';
   MODEL LOGPRICE=QUANTITY;
   TITLE 'LINEAR REGRESSION OF LOGPRICE VS. QUANTITY';
   MODEL RECPRICE=QUANTITY;
   TITLE 'LINEAR REGRESSION OF RECPRICE VS. QUANTITY';
   MODEL PRICE=QUANTITY QUANSQ;
   TITLE 'QUADRATIC REGRESSION OF PRICE VS. QUANTITY';
RUN;
ODS GRAPHICS OFF;
QUIT;

Contents of bids.dat (PRICE QUANTITY BIDS; some fields are missing as shown):
153.32 1 4    74.11 7.2 10    29.72 16.7 5    54.67 11.9    68.39 9.3    119.04 3.7
116.14 1.7 6    146.49 0.1 9    81.81 7.8    19.58 18.4    141.08 2.9    101.72 4.7
24.88 17.4    19.43    39.63 11.2    151.13 1.6 7    79.18 7.3    204.94 0.2
81.06 6.8    37.62 11.4 8    17.13 20 3    37.81 13.4    130.72 1.8 2    26.07 18.5
39.59 14.7    66.2 9.1


6-4 Other Aspects of Regression 6-4.2 Categorical Regressors
Many problems involve qualitative or categorical variables. The usual method of handling the different levels of a qualitative variable is to use indicator variables. For example, to introduce the effect of two different operators into a regression model, we could define an indicator variable as follows: x = 0 if the observation comes from operator 1, and x = 1 if the observation comes from operator 2.

6-4 Other Aspects of Regression Example 6-10
Y = gas mileage, x1 = engine displacement, x2 = horsepower,
x3 = 0 if automatic transmission, 1 if manual transmission
Y = β0 + β1x1 + β2x2 + β3x3 + ε
If automatic (x3 = 0): Y = β0 + β1x1 + β2x2 + ε
If manual (x3 = 1): Y = β0 + β1x1 + β2x2 + β3 + ε = (β0 + β3) + β1x1 + β2x2 + ε
This may be unreasonable, because it assumes transmission type shifts the mean mileage by a constant β3 while the effects of x1 and x2 do not depend on x3.
Interaction model: Y = β0 + β1x1 + β2x2 + β3x3 + β13x1x3 + β23x2x3 + ε
For manual (x3 = 1):
Y = β0 + β1x1 + β2x2 + β3 + β13x1 + β23x2 + ε = (β0 + β3) + (β1 + β13)x1 + (β2 + β23)x2 + ε
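A sketch of fitting both models for Example 6-10 in SAS; the data set name MILEAGE and the variable names MPG, X1, X2, X3 are assumptions for illustration, not data from the text:

DATA MILEAGE2;
   SET MILEAGE;                      /* assumed to contain MPG, X1, X2, X3 */
   X1X3=X1*X3;                       /* displacement-by-transmission interaction */
   X2X3=X2*X3;                       /* horsepower-by-transmission interaction */
PROC REG DATA=MILEAGE2;
   MODEL MPG=X1 X2 X3;               /* additive model: common slopes, shifted intercept */
   MODEL MPG=X1 X2 X3 X1X3 X2X3;     /* interaction model: slopes differ by transmission */
RUN; QUIT;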

6-4 Other Aspects of Regression Dummy Variables
Many times a qualitative variable is needed in a regression model. This can be accomplished by creating dummy variables, or indicator variables. If a qualitative variable has r levels, you will need r − 1 dummy variables. (Notice that in ANOVA, a treatment with r levels has r − 1 degrees of freedom.) The ith dummy variable is defined as
Xi = 1 if the observation is in the ith level of the qualitative variable, and Xi = 0 otherwise, for i = 1, 2, ..., r − 1.
This can be done automatically in PROC GLM by using the CLASS statement, as we did in ANOVA. Any dummy variables defined with respect to a qualitative variable must be treated as a group: individual t-tests are not meaningful, and partial F-tests must be performed on the group of dummy variables.
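For instance, the two equivalent ways of coding a qualitative variable can be sketched as follows; the names MYDATA, Y, X, and OPERATOR (with levels 1 and 2) are assumptions for illustration:

DATA MYDATA2;
   SET MYDATA;
   OP2=(OPERATOR=2);      /* hand-built dummy: 1 for operator 2, 0 for operator 1 */
PROC REG DATA=MYDATA2;
   MODEL Y=X OP2;
PROC GLM DATA=MYDATA;     /* same fit; GLM builds the indicator columns itself */
   CLASS OPERATOR;
   MODEL Y=OPERATOR X / SOLUTION;
RUN; QUIT;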

6-4 Other Aspects of Regression Example 6-11

OPTIONS NOOVP NODATE NONUMBER;
DATA EX611;
   INPUT FORM SCENT COLOR RESIDUE REGION QUALITY @@;
   IF REGION=1 THEN REGION1=0; ELSE REGION1=1;
   /* REGION=1 gives REGION1=0 (EAST); REGION=2 gives REGION1=1 (WEST) */
   FR=FORM*REGION1;
   RR=RESIDUE*REGION1;
CARDS;
6.3 5.3 4.8 3.1 1 91   4.4 4.9 3.5 3.9 1 87   3.9 5.3 4.8 4.7 1 82
5.1 4.2 3.1 3.6 1 83   5.6 5.1 5.5 5.1 1 83   4.6 4.7 5.1 4.1 1 84
4.8 4.8 4.8 3.3 1 90   6.5 4.5 4.3 5.2 1 84   8.7 4.3 3.9 2.9 1 97
8.3 3.9 4.7 3.9 1 93   5.1 4.3 4.5 3.6 1 82   3.3 5.4 4.3 3.6 1 84
5.9 5.7 7.2 4.1 2 87   7.7 6.6 6.7 5.6 2 80   7.1 4.4 5.8 4.1 2 84
5.5 5.6 5.6 4.4 2 84   6.3 5.4 4.8 4.6 2 82   4.3 5.5 5.5 4.1 2 79
4.6 4.1 4.3 3.1 2 81   3.4 5.0 3.4 3.4 2 83   6.4 5.4 6.6 4.8 2 81
5.5 5.3 5.3 3.8 2 84   4.7 4.1 5.0 3.7 2 83   4.1 4.0 4.1 4.0 2 80
;
PROC REG DATA=EX611;
   MODEL QUALITY=FORM RESIDUE REGION1;
   TITLE 'MODEL WITH DUMMY VARIABLE';
   MODEL QUALITY=FORM RESIDUE REGION1 FR RR;
   TITLE 'INTERACTION MODEL WITH DUMMY VARIABLE';
RUN; QUIT;


6-3 Multiple Regression Example

OPTIONS NOOVP NODATE NONUMBER;
DATA APPRAISE;
   INPUT PRICE UNITS AGE SIZE PARKING AREA COND$ @@;
   IF COND='F' THEN COND1=1; ELSE COND1=0;
   IF COND='G' THEN COND2=1; ELSE COND2=0;
CARDS;
90300 4 82 4635 0 4266 F      384000 20 13 17798 0 14391 G   157500 5 66 5913 0 6615 G
676200 26 64 7750 6 34144 E   165000 5 55 5150 0 6120 G      300000 10 65 12506 0 14552 G
108750 4 82 7160 0 3040 G     276538 11 23 5120 0 7881 G     420000 20 18 11745 20 12600 G
950000 62 71 21000 3 39448 G  560000 26 74 11221 0 30000 G   268000 13 56 7818 13 8088 F
290000 9 76 4900 0 11315 E    173200 6 21 5424 6 4461 G      323650 11 24 11834 8 9000 G
162500 5 19 5246 5 3828 G     353500 20 62 11223 2 13680 F   134400 4 70 5834 0 4680 E
187000 8 19 9075 0 7392 G     93600 4 82 6864 0 3840 F       110000 4 50 4510 0 3092 G
573200 14 10 11192 0 23704 E  79300 4 82 7425 0 3876 F       272000 5 82 7500 0 9542 E
;
ODS GRAPHICS ON;
PROC REG DATA=APPRAISE;
   MODEL PRICE=UNITS AGE AREA COND1 COND2/R;
   TITLE 'REDUCED MODEL WITH DUMMY VARIABLE';
RUN;
ODS GRAPHICS OFF;
QUIT;

6-3 Multiple Regression

Model           R²       Adj R²   MSE
Full Model      0.9801   0.9746   34123
Reduced Model   0.9771   0.9737   34721
With dummy      0.9860   0.9821   28673


6-4 Other Aspects of Regression Analysis of Covariance (ANCOVA)
Suppose we have the following setup, with r treatments and n (X, Y) pairs observed in each:

Treatment      1            2         ⋯       r
            X1,1 Y1,1   X2,1 Y2,1     ⋯   Xr,1 Yr,1
              ⋮            ⋮                  ⋮
            X1,n Y1,n   X2,n Y2,n     ⋯   Xr,n Yr,n

Suppose X and Y are linearly related. We are interested in comparing the means of Y at the different levels of the treatment. Suppose a plot of the data looks like the following.
[Scatter plot of Y versus X with points labeled by treatment level 1-4]
Analysis of covariance (ANCOVA) is a blend of ANOVA and regression. As in ANOVA, it tests whether the means (treatment effects) of r independent populations differ, but it is especially useful when the response Y is believed to be functionally related to another variable (here only a linear relationship is considered).

6-4 Other Aspects of Regression Why Use Covariates?
Concomitant variables or covariates are used to adjust for factors that influence the Y measurements. In randomized block designs we did the same thing, but there we could control the value of the block variable; now we assume we can measure the variable but not control it. The plot on the previous page demonstrates why covariates are needed in some situations. If the covariate (X) were ignored, we would most likely conclude that treatment level 3 resulted in a larger mean than levels 1 and 4 but not different from level 2. If the linear relation is extended, we see that the value of Y in level 3 could very well be less than that of 1, nearly equal to that of 4, and surely less than that of 2. One assumption we need, equivalent to the no-interaction assumption in two-way ANOVA, is that the slopes of the linear relationship between X and Y are the same in each treatment level.

6-4 Other Aspects of Regression Checking for Equal Slopes
The model we fit first:
Treatment 1:     Y1j = β0 + α1 + βXj + (αβ)1Xj + ε1j     (intercept β0 + α1, slope β + (αβ)1)
  ⋮
Treatment r−1:   Y(r−1)j = β0 + αr−1 + βXj + (αβ)r−1Xj + ε(r−1)j     (intercept β0 + αr−1, slope β + (αβ)r−1)
Treatment r:     Yrj = β0 + βXj + εrj     (intercept β0, slope β)
The test of equal slopes is
H0: (αβ)1 = (αβ)2 = ⋯ = (αβ)r−1 = 0
H1: at least one is not zero
If we fail to reject H0, we refit the model without the interaction term and test
H0: α1 = α2 = ⋯ = αr−1 = 0
H1: not all are zero

6-4 Other Aspects of Regression EXAMPLE
Four different formulations of an industrial glue are being tested. The tensile strength of the glue is also related to the thickness with which it is applied. Five observations on strength (y) and thickness (x, in 0.01 inches) are obtained for each formulation. The data are shown in the following table.

Glue Formulation      1            2            3            4
                    y     x      y     x      y     x      y     x
                   46.5   13    48.7   12    46.3   15    44.7   16
                   45.9   14    49.0   10    47.1   14    43.0   15
                   49.8   12    50.1   11    48.9   11    51.0   10
                   46.1   12    48.5   12    48.2   11    48.1   12
                   44.3   14    45.2   14    50.3   10    48.6   11

6-4 Other Aspects of Regression Example

OPTIONS NOOVP NODATE NONUMBER;
DATA GLUE;
   INPUT FORMULA STRENGTH THICK @@;
CARDS;
1 46.5 13  1 45.9 14  1 49.8 12  1 46.1 12  1 44.3 14
2 48.7 12  2 49.0 10  2 50.1 11  2 48.5 12  2 45.2 14
3 46.3 15  3 47.1 14  3 48.9 11  3 48.2 11  3 50.3 10
4 44.7 16  4 43.0 15  4 51.0 10  4 48.1 12  4 48.6 11
;
ODS GRAPHICS ON;
PROC GLM DATA=GLUE;
   CLASS FORMULA;
   MODEL STRENGTH=FORMULA THICK FORMULA*THICK;
   TITLE 'ANALYSIS OF COVARIANCE WITH INTERACTION';      /* test for parallelism (equal slopes) */
PROC GLM DATA=GLUE;
   CLASS FORMULA;
   MODEL STRENGTH=FORMULA THICK/SOLUTION;
   LSMEANS FORMULA/PDIFF STDERR;
   TITLE 'ANALYSIS OF COVARIANCE WITHOUT INTERACTION';   /* test for equal adjusted treatment means */
RUN; QUIT;

SOLUTION produces a solution to the normal equations (parameter estimates). PROC GLM displays a solution by default when the model involves no classification variables, so this option is needed only to see the solution for models with classification effects.
PDIFF requests p-values for differences of the LS-means.
STDERR produces the standard error of the LS-means and the probability level for the hypothesis H0: LS-mean = 0.

6-4 Other Aspects of Regression
PROC GLM DATA=GLUE;
   CLASS FORMULA;
   MODEL STRENGTH=FORMULA THICK FORMULA*THICK;
   TITLE 'ANALYSIS OF COVARIANCE WITH INTERACTION';
Question: are the four formulation lines parallel, i.e., is there no formulation-by-thickness interaction?


6-4 Other Aspects of Regression
PROC GLM DATA=GLUE;
   CLASS FORMULA;
   MODEL STRENGTH=FORMULA THICK/SOLUTION;
   LSMEANS FORMULA/PDIFF STDERR;
   TITLE 'ANALYSIS OF COVARIANCE WITHOUT INTERACTION';
Questions: do the parallel lines determined above share the same intercept? Is the common slope of the parallel lines zero?


6-4 Other Aspects of Regression 6-4.3 Variable Selection Procedures
Best Subsets Regressions
Selection criteria: R², MSE, and Cp, where
Cp = SSE(p)/σ̂²(FM) − n + 2p
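As a worked instance of the formula, a small DATA step computes Cp for the reduced model of Example 6-9, where SSE(p) = 170.73 with p = 3 parameters, n = 16, and the full-model mean square is 11.37/10:

DATA CPCALC;
   SSEP=170.73;           /* SSE of the candidate subset model */
   MSEFM=11.37/10;        /* full-model estimate of sigma-squared */
   N=16; P=3;             /* sample size; parameters incl. intercept */
   CP=SSEP/MSEFM-N+2*P;   /* about 140, far above p=3: a poor subset */
   PUT CP=;               /* writes CP to the SAS log */
RUN;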

6-4 Other Aspects of Regression 6-4.3 Variable Selection Procedures
Backward Elimination
Start with all regressors in the model. At each step a t-test is performed on each regressor, and the regressor with the smallest absolute t-value is eliminated first if it is not significant. Minitab uses α = 0.10 as the default cut-off. Result for the soap data of Example 6-11: form, residue, region.

6-4 Other Aspects of Regression 6-4.3 Variable Selection Procedures
Forward Selection
Start with no regressors in the model. At each step the candidate regressor with the largest absolute t-value is added first if it is significant. Minitab uses α = 0.25 as the default cut-off. Result for the soap data: form, residue, region, scent.

6-4 Other Aspects of Regression 6-4.3 Variable Selection Procedures
Stepwise Regression
Begins with a forward step, then applies backward elimination after each variable is added; the entry and removal cut-offs are equal (t_in = t_out). Minitab uses α = 0.15 as the default cut-off. Result for the soap data: form, residue, region.
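In current SAS releases the same three procedures can be requested through PROC REG's SELECTION= option (the legacy PROC STEPWISE used in the example below also works); a sketch for the Example 6-11 soap data with the cut-offs quoted above:

PROC REG DATA=EX611;
   MODEL QUALITY=FORM SCENT COLOR RESIDUE REGION1 / SELECTION=FORWARD SLENTRY=0.25;
   MODEL QUALITY=FORM SCENT COLOR RESIDUE REGION1 / SELECTION=BACKWARD SLSTAY=0.10;
   MODEL QUALITY=FORM SCENT COLOR RESIDUE REGION1 / SELECTION=STEPWISE SLENTRY=0.15 SLSTAY=0.15;
RUN; QUIT;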

6-4 Other Aspects of Regression Example

OPTIONS NODATE NOOVP NONUMBER;
DATA SALES;
   INFILE 'C:\Users\korea\Desktop\Working Folder 2017\imen214-stats\ch06\data\sales.dat';
   INPUT SALES TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING;
PROC CORR DATA=SALES;
   VAR SALES;
   WITH TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING;
   TITLE 'CORRELATIONS OF DEPENDENT WITH INDEPENDENTS';
PROC CORR DATA=SALES;
   VAR TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING;
   TITLE 'CORRELATIONS BETWEEN INDEPENDENT VARIABLES';
PROC REG DATA=SALES;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING/VIF R;
   TITLE 'REGRESSION MODEL WITH ALL VARIABLES';
PROC RSQUARE DATA=SALES CP;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING/ADJRSQ RMSE SSE SELECT=10;
   TITLE 'ALL POSSIBLE REGRESSIONS';
PROC STEPWISE DATA=SALES;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING/FORWARD;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING/BACKWARD;
   TITLE 'STEPWISE REGRESSION USING BACKWARD ELIMINATION';
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING;
   TITLE 'STEPWISE REGRESSION THE STEPWISE TECHNIQUE';
PROC REG DATA=SALES;   /* candidate models for comparison */
   MODEL SALES=POTENT ADVERT SHARE ACCOUNTS/R;
   MODEL SALES=POTENT ADVERT SHARE CHANGE ACCOUNTS/R;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE/R;
   MODEL SALES=TIME POTENT ADVERT SHARE ACCOUNTS/R;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE WORKLOAD/R;
RUN; QUIT;


6-4 Other Aspects of Regression All Possible Regressions
This is the brute-force method of modeling. It is feasible if the number of independent variables is small (less than 10 or so) and the sample size is not too large. Some of the common quantities to look at are:
1) R-square should be large, and it should increase appreciably when an additional variable is added.
2) Adj R-square should not be much less than R-square, and it should increase when a useful variable is added.
3) Mallows Cp should be approximately the number of parameters in the model (including the y-intercept). This is a good measure for narrowing down the possible models quickly; then use 1) and 2) to pick the final models.
4) The model should make sense.
Note: Many of the better methods of model selection are too time-consuming to use on all possible regressions. A number of good models can be chosen first, and the better methods then applied to them.
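In SAS this search can be run with PROC RSQUARE, as in the earlier sales example, or through PROC REG's SELECTION=RSQUARE option; a sketch using the sales data variable names:

PROC REG DATA=SALES;
   MODEL SALES=TIME POTENT ADVERT SHARE CHANGE ACCOUNTS WORKLOAD RATING
         / SELECTION=RSQUARE ADJRSQ CP BEST=10;   /* 10 best subsets of each size */
RUN; QUIT;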


6-4 Other Aspects of Regression Stepwise Regression
Forward Selection: Begins with no variables in the model. Calculates a simple linear model for each X and adds the most significant one (if its p-value is below the stated entry level). Then calculates all models containing the already-added variables plus each remaining variable in turn, and adds the most significant candidate (if its p-value is below the stated level). This process continues until no more variables can be added.
Backward Elimination: The model with all variables is fit first. The least significant variable is removed (if its p-value is greater than the specified limit) and the model is refit without it. This process continues until no more variables can be removed.

6-4 Other Aspects of Regression Stepwise Regression
Stepwise Technique: This technique is a variation on the forward selection technique. After a variable is added, the least significant variable already in the model is removed if it has a p-value greater than the specified limit. This accounts for multicollinearity to some degree.
Typically you do not run a stepwise procedure if you run all possible regressions, and vice versa. Stepwise procedures are more economical than all possible regressions for large data sets. There is no guarantee that the stepwise procedures will end up with the same model, or with the "best" model.


6-4 Other Aspects of Regression Press Statistic
The main purpose of many regression analyses is to predict Y for a future set of X's. The problem is that we have only the present Y's and X's with which to build the model, but we would like to evaluate the model by how well it estimates Y's at new X's. The PRESS statistic tries to overcome this problem. It is similar to DFFITS in that you remove one observation at a time: the parameters are then estimated from the remaining observations, and Ŷ is calculated at the X's of the removed observation. Once these Ŷi's are calculated for each observation (call them Ŷi*), the PRESS statistic can be calculated:
PRESS = Σi=1..n (Yi − Ŷi*)²
Notice that this is very similar to SSE; it is very computation-intensive, however. The PRESS statistic is obtained in SAS by using the R option on the MODEL statement.
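A sketch of one way to compute PRESS without n separate fits, using the identity that the ith PRESS residual equals ei/(1 − hii); the R= and H= keywords on PROC REG's OUTPUT statement request the raw residual and the leverage, and the model shown is one of the candidates from the comparison table at the end of this section:

PROC REG DATA=SALES;
   MODEL SALES=POTENT ADVERT SHARE ACCOUNTS;
   OUTPUT OUT=DIAG R=RESID H=LEV;     /* residual e_i and hat diagonal h_ii */
RUN; QUIT;
DATA DIAG;
   SET DIAG;
   PRESS2=(RESID/(1-LEV))**2;         /* squared PRESS residual */
PROC MEANS DATA=DIAG SUM;
   VAR PRESS2;                        /* the sum is the PRESS statistic */
RUN;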

6-4 Other Aspects of Regression Validation Data Split
Split the data into a fitting portion and a validation portion; this should be done randomly. Perform the model-fitting routine as discussed earlier using the fitting portion only. For each viable model, compute the SSE using the observations in the validation portion; the best model is the one that minimizes this validation SSE. Finally, recalculate the chosen model using the entire data set. Notice that this procedure requires a data set large enough to split off a validation portion and still have adequate data to evaluate models. The process is tedious in SAS, requiring multiple runs or some programming.
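A sketch of the split and of scoring the validation portion; the 2/3 fitting fraction and the seed are arbitrary choices, and setting the held-out responses to missing lets PROC REG predict those rows without letting them influence the fit:

DATA FITPART VALPART;
   SET SALES;
   IF RANUNI(1234)<2/3 THEN OUTPUT FITPART;   /* about 2/3 for fitting */
   ELSE OUTPUT VALPART;                       /* about 1/3 held out */
DATA VALPART;
   SET VALPART;
   TRUESALE=SALES;                            /* keep the true response */
   SALES=.;                                   /* missing Y: predicted, not fit */
DATA BOTH;
   SET FITPART VALPART;
PROC REG DATA=BOTH;
   MODEL SALES=POTENT ADVERT SHARE ACCOUNTS;  /* one candidate model */
   OUTPUT OUT=PRED P=YHAT;
RUN; QUIT;
DATA VSSE;
   SET PRED;
   IF TRUESALE NE .;                          /* validation rows only */
   SQERR=(TRUESALE-YHAT)**2;
PROC MEANS DATA=VSSE SUM;
   VAR SQERR;                                 /* validation SSE for this model */
RUN;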


6-4 Other Aspects of Regression

Model                                        R²       Adj R²   MSE        PRESS
potent advert share accounts                 0.9004   0.8805   453.8362   5804450
potent advert share change accounts          0.9119   0.8888   437.9516   5470022
time potent advert share change              0.9108   0.8873   440.7473   5681706
time potent advert share accounts            0.9064   0.8817   451.6049   6339858
time potent advert share change workload     0.9109   0.8812   452.6253   6286583

No single model stands out; confidence intervals for prediction, or a preference for parsimony, might help decide which model is best.