Learning Objectives Describe the linear regression model State the regression modeling steps Explain least squares Compute regression coefficients Describe residual analysis Predict the response variable Understand correlational analysis As a result of this class, you will be able to... 2
Probabilistic Models Hypothesize 2 components Deterministic Random error Example: Sales volume is 10 times advertising spending plus random error Y = 10X + e Random error may be due to factors other than advertising 6
Types of Probabilistic Models 7
Regression Models Answer ‘What is the relationship between the variables?’ Equation used 1 numerical dependent (response) variable What is to be predicted 1 or more numerical or categorical independent (explanatory) variables Used mainly for prediction 8
Regression Modeling Steps Define problem or question Specify model Collect data Do descriptive data analysis Estimate unknown parameters Evaluate model Use model for prediction 9
Problem Definition Most critical step What are the model objectives? Don’t want right answer to wrong question What are the model objectives? Who will use the model? What will be the benefits? Are resources available (data etc.)? How will the results be implemented? 12
Specifying the Model Define variables Conceptual (e.g., advertising, price) Empirical (e.g., list price, regular price) Measurement (e.g., $, units) Hypothesize nature of relationship Expected effects (i.e., coefficients’ signs) Functional form (linear or non-linear) Interactions 15
Model Specification Is Based on Theory Economic & business theory Mathematical theory Previous research ‘Common sense’ 16
Types of Regression Models This teleology is based on the number of explanatory variables & nature of relationship between X & Y. 25
Linear Equations High School Teacher © 1984-1994 T/Maker Co. 28
Linear Regression Model Relationship between variables is a linear function Population Y-Intercept Population Slope Random Error Dependent (Response) Variable Independent (Explanatory) Variable 29
Sample Linear Regression Model ei = Random error Unsampled observation Observed value 36
Scatter Diagram Plot of all (Xi, Yi) pairs Suggests how well model will fit 39
Thinking Challenge How would you draw a line through the points? How do you determine which line ‘fits best’? Alone Group Class 42
Least Squares ‘Best fit’ means difference between actual Y values & predicted Y values are a minimum But positive differences off-set negative LS minimizes the sum of the squared differences (or errors) 51
Least Squares Graphically 52
Coefficient Equations Sample regression equation # (Xi, Yi) pairs Sample slope Average Xi’s, then square Sample Y-intercept 53
Computation Table 54
Interpretation of Coefficients Slope (b1) Estimated Y changes by b1 for each 1 unit increase in X Example: If b1 = 2, then Sales (Y) is expected to increase by 2 for each 1 unit increase in Advertising (X) Y-Intercept (b0) Average value of Y when X = 0 Example: If b0 = 4, then average Sales (Y) is expected to be 4 when Advertising (X) is 0 55
Parameter Estimation Example You’re a marketing analyst for Hasbro Toys. You gather the following data: Ad $ Sales (Units) 1 1 2 1 3 2 4 2 5 4 What is the relationship between sales & advertising? 56
Scatter Diagram Sales vs. Advertising 57
Parameter Estimation Solution Table 58
Coefficient Interpretation Solution Slope (b1) Sales Volume (Y) is expected to increase by .7 units for each $1 increase in Advertising (X) Y-Intercept (b0) Average value of Sales Volume (Y) is -.10 units when Advertising (X) is 0 Difficult to explain to Marketing Manager Expect some sales without advertising 60
Parameter Estimation Excel Output bP b0 b1 61
Evaluating the Model How well does the model describe the relationship between the variables? Closeness of ‘best fit’ Closer the points to the line the better Assumptions met Significance of parameter estimates 71
Evaluating Model Steps Examine variation measures Do residual analysis Test coefficients for significance 72
Random Error Variation Variation of actual Y from predicted Y Measured by standard error of estimate Sample standard deviation of e Denoted SYX Affects several factors Parameter significance Prediction accuracy 75
Standard Error of Estimate The mean error is 0. 76
Measures of Variation in Regression Total sum of squares (SST) Measures variation of observed Yi around the mean`Y Explained variation (SSR) Variation due to relationship between X & Y Unexplained variation (SSE) Variation due to other factors 77
Variation Measures Yi Unexplained sum of squares (Yi - Yi)2 ^ Total sum of squares (Yi -`Y)2 Explained sum of squares (Yi -`Y)2 ^ 78
Coefficient of Determination Proportion of variation ‘explained’ by relationship between X & Y 0 £ r2 £ 1 79
r 2 Examples r2 = 1 r2 = 1 r2 = .8 r2 = 0 80
Adjusted Coefficient of Determination Proportion of variation ‘explained’ by relationship between X & Y Reflects Sample size Number of independent variables 81
Coef. of Determination Excel Output r2 adjusted for number of explanatory variables & sample size SYX 86
Residual Analysis Graphical analysis of residuals Purposes Plot residuals vs. Xi values Residuals are also called errors Difference between actual Yi & predicted Yi Purposes Examine functional form (linear vs. non-linear model) Evaluate violations of assumptions 89
Linear Regression Assumptions Normality Y values are normally distributed for each X Probability distribution of error is normal Homoscedasticity (constant variance) Independence of errors Linearity 90
Residual Plot for Functional Form Add X2 Term Correct Specification 92
Residual Plot for Homoscedasticity Heteroscedasticity Correct Specification Fan-shaped. Standardized residuals used typically. 93
Residual Plot for Independence Not Independent Correct Specification Plots reflect sequence data were collected. 94
Residual Analysis Excel Output The plot is standardized (student) residuals for each observation. For observation 5, the standardized residual is large. You can save the residuals & do descriptive analysis on them, including a normal probability plot. There are not enough observations here to make further analysis meaningful. 95
Residual Plot Excel Output
Test of Slope Coefficient Tests if there is a linear relationship between X & Y Involves population slope b1 Hypotheses H0: b1 = 0 (No linear relationship) H1: b1 ¹ 0 (Linear relationship) Theoretical basis is sampling distribution of slopes 101
Test of Slope Parameter Solution H0: b1 = 0 H1: b1 ¹ 0 a = .05 df = 5 - 2 = 3 Critical Value(s): Test Statistic: Decision: Conclusion: Reject at a = .05 There is evidence of a relationship 109
Test Statistic Solution 110
Test of Slope Parameter Excel Output ‘Standard Error’ is the estimated standard deviation of the sampling distribution, sbP. bP Sb t = bP /Sb P P P-Value 111
Prediction With Regression Models Types of predictions Point estimates Interval estimates What is predicted Population mean response (mYX) for given X Point on population regression line Individual response (Yi) for given X 114
What Is Predicted 115
Factors Affecting Interval Width Level of confidence (1 - a) Width increases as confidence increases Data dispersion (SYX) Width increases as variation increases Sample size Width decreases as sample size increases Distance of Xgiven from mean`X Width increases as distance increases 117
Regression Cautions Violated assumptions Relevancy of historical data Level of significance Extrapolation Cause & effect Relevancy of Historical Data Even if interpolating, conditions may have changed. Level of Significance r2 may be high, but at what level? Extrapolation Prediction Outside the Range of X Values Used to Develop Equation Interpolation Prediction Within the Range of X Values Used to Develop Equation Based on smallest & largest X Values Cause & Effect The # of teachers is highly correlated with liquor consumption due to population size! 126
Extrapolation Extrapolation Prediction Outside the Range of X Values Used to Develop Equation Interpolation Prediction Within the Range of X Values Used to Develop Equation Based on smallest & largest X Values 127
Cause & Effect Liquor Consumption # Teachers The # of teachers is highly correlated with liquor consumption due to population size! # Teachers 128
Types of Probabilistic Models 130
Correlation Models Answer ‘How strong is the linear relationship between 2 variables?’ Coefficient of correlation used Population correlation coefficient denoted r (rho) Values range from -1 to +1 Measures degree of association Used mainly for understanding 131
Sample Coefficient of Correlation Pearson Product-Moment Coefficient of Correlation: 132
Correlation & Regression Line 141
Test of Correlation Coefficient Shows if there is a linear relationship between 2 numerical variables Same conclusion as testing population slope b1 Hypotheses H0: r = 0 (No correlation) H1: r ¹ 0 (Correlation) 142
Conclusion Described the linear regression model Stated the regression modeling steps Explained least squares Computed regression coefficients Described residual analysis Predicted the response variable As a result of this class, you will be able to... 143
Learning Objectives Explain the linear multiple regression model Interpret linear multiple regression computer output Explain multicollinearity As a result of this class, you will be able to... 2
Multiple Regression Models 10
Linear Multiple Regression Model Relationship between 1 dependent & 2 or more independent variables is a linear function Population Y-intercept Population slopes Random error Dependent (response) variable Independent (explanatory) variables 11
Population Multiple Regression Model Bivariate model 12
Sample Multiple Regression Model Bivariate model 13
Regression Modeling Steps Define problem or question Specify model Collect data Do descriptive data analysis Estimate unknown parameters Evaluate model Use model for prediction 14
Linear Multiple Regression Model Parameter Estimation Linear Multiple Regression Model 15
Multiple Linear Regression Equations Too complicated by hand! Ouch! 16
Interpretation of Estimated Coefficients Slope (bP) Estimated Y changes by bP for each 1 unit increase in XP holding all other variables constant Example: If b1 = 2, then Sales (Y) is expected to increase by 2 for each 1 unit increase in Advertising (X1) given the Number of Sales Rep’s (X2) Y-Intercept (b0) Average value of Y when XP = 0 17
Parameter Estimation Example You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00). You’ve collected the following data: Resp Size Circ 1 1 2 4 8 8 1 3 1 3 5 7 2 6 4 4 10 6 Is this model specified correctly? What other variables could be used (color, photo’s etc.)? 18
Parameter Estimation Excel Output bP b0 b2 b1 19
Interpretation of Coefficients Solution Slope (b1) # Responses to Ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in Ad Size holding Circulation constant Slope (b2) # Responses to Ad is expected to increase by .2805 (28.05) for each 1 unit (1,000) increase in Circulation holding Ad Size constant Y-intercept is difficult to interpret. How can you have any responses with no circulation? 20
Evaluating the Model 21
Regression Modeling Steps Define problem or question Specify model Collect data Do descriptive data analysis Estimate unknown parameters Evaluate model Use model for prediction F 22
Evaluating Multiple Regression Model Steps Examine variation measures Do residual analysis Test parameter significance Overall model Portions of model Individual coefficients Test for multicollinearity 23
Coef. of Determination Excel Output r2Y.12 r2adj means 95.61% of variation in Y is due to Ad Size & Circulation SYX 29
Coefficient of Partial Determination Proportion of variation in Y ‘explained’ by variable XP holding all others constant Must estimate separate models Denoted r2Y1.2 in two X variables case Coefficient of partial determination of X1 with Y holding X2 constant Useful in selecting X variables 30
r 2Y1.2 Excel Output ANOVA df SS Regression 2 9.2497 Residual 3 0.2503 Total 5 9.5000 32
Testing Parameters 33
Evaluating Multiple Regression Model Steps Expanded! Examine variation measures Do residual analysis Test parameter significance Overall model Portions of model Individual coefficients Test for multicollinearity F New! New! New! 34
Testing Overall Significance Shows if there is a linear relationship between all X variables together & Y Uses F test statistic Hypotheses H0: b1 = b2 = ... = bP = 0 No linear relationship H1: At least one coefficient is not 0 At least one X variable affects Y Less chance of error than separate t-tests on each coefficient. Doing a series of t-tests leads to a higher overall Type I error than a. 35
Overall Significance Excel Output n - P -1 MSR / MSE n - 1 P-value 36
Testing Model Portions Examines the contribution of a set of X variables to the relationship with Y Null hypothesis: Variables in set do not improve significantly the model when all other variables are included Must estimate separate models Used in selecting X variables 37
Testing Model Portions Test Statistic Test H0: b1 = 0 in a 2 variable model From ANOVA section of regression for From ANOVA section of regression for 38
Multicollinearity 39
Evaluating Multiple Regression Model Steps Expanded! Examine variation measures Do residual analysis Test parameter significance Overall model Portions of model Individual coefficients Test for multicollinearity New! New! New! F 40
Multicollinearity High correlation between X variables Coefficients measure combined effect Leads to unstable coefficients depending on X variables in model Always exists; matter of degree Example: Using both Sales & Profit as explanatory variables in same model 41
Detecting Multicollinearity Examine correlation matrix Correlations between pairs of X variables are more than with Y variable Examine variance inflation factor (VIF) If VIFj > 5, multicollinearity exists Few remedies Obtain new sample data Eliminate one correlated X variable 42
Correlation Matrix Excel Output rY1 rY2 r12 43
VIF Excel Output Regress X1 on X2 44
This Class... Please take a moment to answer the following questions in writing: What was the most important thing you learned in class today? What do you still have questions about? How can today’s class be improved? As a result of this class, you will be able to... 144 10