BA 555 Practical Business Analysis. Agenda: Linear Regression Analysis; Case Study: Cost of Manufacturing Computers; Multiple Regression Analysis; Dummy Variables.
Regression Analysis: A technique to examine the relationship between an outcome variable (dependent variable, Y) and a group of explanatory variables (independent variables, X1, X2, …, Xk). The model allows us to understand (quantify) the effect of each X on Y. It also allows us to predict Y based on X1, X2, …, Xk.
Types of Relationship: linear and nonlinear. Simple linear relationship: Y = β0 + β1 X + ε. Multiple linear relationship: Y = β0 + β1 X1 + β2 X2 + … + βk Xk + ε. Nonlinear relationships: Y = α0 exp(β1 X + ε); Y = β0 + β1 X1 + β2 X1² + ε; etc. We will focus only on linear relationships.
Simple Linear Regression Model. Population: true effect of X on Y. Sample: estimated effect of X on Y. Key questions: 1. Does X have any effect on Y? 2. If yes, how large is the effect? 3. Given X, what is the estimated Y? ASSOCIATION ≠ CAUSALITY.
Least Squares Method. The least squares line is the "best-fitting" straight line found by a statistical procedure that minimizes the sum of squares of the deviations of the observed values of Y from those predicted by the line. [Figure panels: a bad fit; a fit in which the deviations are minimized.]
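The least squares idea can be sketched in a few lines of Python. The data below are hypothetical (not the case data), and the helper names `fit_line` and `sum_sq_dev` are illustrative assumptions rather than anything from the course packet.

```python
# Hypothetical (x, y) observations, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_sq_dev(xs, ys, b0, b1):
    """Sum of squared deviations of observed y from the line's predictions."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


b0, b1 = fit_line(xs, ys)
best = sum_sq_dev(xs, ys, b0, b1)
# Any other line, e.g. one with a slightly different slope, fits worse.
worse = sum_sq_dev(xs, ys, b0, b1 + 0.1)
print(round(b0, 4), round(b1, 4), best < worse)
```

Changing the slope or the intercept away from the fitted values can only increase the sum of squared deviations, which is what "best-fitting" means here.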
Case: Cost of Manufacturing Computers (pp. 13 – 45)
A manufacturer produces computers. The goal is to quantify cost drivers and to understand the variation in production costs from week to week. The following production variables were recorded:
COST: the total weekly production cost (in $ millions).
UNITS: the total number of units (in thousands) produced during the week.
LABOR: the total weekly direct labor cost (in $10K).
SWITCH: the total number of times that the production process was reconfigured for different types of computers.
FACTA: 1 if the observation is from factory A; 0 if from factory B.
Raw Data (p. 14) How many possible regression models can we build?
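One way to approach the counting question is to enumerate every non-empty subset of the four explanatory variables. The sketch below counts only models built from the recorded variables as they are; interaction terms or transformed variables, which it ignores, would add more candidates.

```python
from itertools import combinations

# Candidate explanatory variables from the case description.
predictors = ["UNITS", "LABOR", "SWITCH", "FACTA"]

# Every non-empty subset of predictors defines one candidate model for COST.
models = [
    subset
    for size in range(1, len(predictors) + 1)
    for subset in combinations(predictors, size)
]

print(len(models))  # 2**4 - 1 = 15
for subset in models:
    print("COST ~ " + " + ".join(subset))
```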
Simple Linear Regression Model (pp. 17 – 26). Research questions: Is Labor a significant cost driver? How accurately can Labor predict Cost?
Initial Analysis (pp. 15 – 16): summary statistics + plots (e.g., histograms + scatter plots) + correlations. Things to look for: (1) Features of the data (e.g., data range, outliers), using summary statistics and graphs. We do not want to extrapolate outside the data range because the relationship there is unknown (or unestablished). (2) Is the assumption of linearity appropriate? Is there inter-dependence among the variables? Any potential problem? Use scatter plots and correlations.
Correlation (p. 15). Is the assumption of linearity appropriate? ρ (rho): population correlation (its value is most likely unknown). r: sample correlation (its value can be calculated from the sample). Correlation is a measure of the strength of a linear relationship. Correlation falls between –1 and 1. There is no linear relationship if the correlation is close to 0. But, … [Figure: scatterplot panels for r = –1, –1 < r < 0, r = 0, 0 < r < 1, and r = 1.]
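A minimal sketch of the sample correlation calculation, using two small hypothetical datasets. The first lies exactly on a line; the second follows y = x², a strong but nonlinear relationship, and illustrates that a correlation of 0 rules out only a linear relationship. The function name `sample_correlation` is an illustrative assumption.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Pearson sample correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a perfect straight line, then a perfect parabola.
r_line = sample_correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_curve = sample_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_line, r_curve)
```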
Scatterplot (p. 16) and Correlation (p. 15): checking the linearity assumption. [Output annotations: sample size; p-value for H0: ρ = 0 vs. Ha: ρ ≠ 0.] Is 0.9297 a ρ or an r?
Hypothesis Testing for β (pp. 18 – 19). Key Q1: Does X have any effect on Y? [Output annotations: β0 or b0? b0. β1 or b1? b1. Standard errors: Sb0, Sb1.] H0: β1 = 0 vs. Ha: β1 ≠ 0. ** Divide the p-value by 2 for a one-sided test. Make sure there is at least weak evidence for doing this step. Degrees of freedom = n – k – 1, where n = sample size and k = number of Xs.
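The bookkeeping on this slide can be sketched as follows. The t-statistic is the estimated slope divided by its standard error, and the halving rule for a one-sided test applies only when the estimate points in the direction stated by Ha; otherwise the one-sided p-value is 1 minus half the two-sided value. All numbers and function names below are hypothetical illustrations, not values from the case output.

```python
def degrees_of_freedom(n, k):
    """Degrees of freedom for the t-test: n - k - 1."""
    return n - k - 1


def t_statistic(b1, se_b1):
    """Test statistic for H0: slope = 0, i.e. estimate / standard error."""
    return b1 / se_b1


def one_sided_p_value(two_sided_p, estimate_supports_ha):
    """Convert a two-sided p-value to a one-sided one.

    Halve it only if the estimate's sign agrees with Ha (the "weak
    evidence" check); otherwise the one-sided p-value is large.
    """
    if estimate_supports_ha:
        return two_sided_p / 2
    return 1 - two_sided_p / 2


# Hypothetical inputs, for illustration only.
print(degrees_of_freedom(n=30, k=1))
print(t_statistic(b1=0.5, se_b1=0.2))
# A hypothetical two-sided p-value as a package might report it.
print(one_sided_p_value(0.04, estimate_supports_ha=True))
print(one_sided_p_value(0.04, estimate_supports_ha=False))
```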
Confidence Interval Estimation for β (pp. 19 – 20). Key Q2: How large is the effect? Q1: Does Labor have any impact on Cost? → Hypothesis testing. Q2: If so, how large is the impact? → Confidence interval estimation. [Output annotations: b0, b1, Sb0, Sb1.] Degrees of freedom = n – k – 1, where k = number of independent variables.
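A confidence interval for a coefficient has the form estimate ± t-critical × standard error. The sketch below assumes the t-critical value is looked up from a table for n – k – 1 degrees of freedom. The slope 0.0081 is the Labor coefficient quoted in the prediction example; the standard error and the choice of 50 degrees of freedom are hypothetical.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) = estimate -/+ t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


b1 = 0.0081      # Labor slope quoted in the prediction example
se_b1 = 0.0023   # hypothetical standard error, for illustration only
t_crit = 2.009   # approximate 95% two-sided t value for 50 df (table lookup)

lower, upper = confidence_interval(b1, se_b1, t_crit)
print(lower, upper)
```

If the interval excludes 0, the same data would also reject H0: slope = 0 at the matching significance level.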
Prediction (pp. 25 – 26). Key Q3: What is the Y-prediction? (1) What is the predicted production cost of a given week, say, Week 21 of the year, in which Labor = 5 (i.e., $50,000)? Point estimate: predicted cost = b0 + b1 (5) = 1.0867 + 0.0081 (5) = 1.12724 (million dollars). Margin of error? → Prediction interval. (2) What is the average production cost of a typical week in which Labor = 5? Point estimate: estimated cost = b0 + b1 (5) = 1.0867 + 0.0081 (5) = 1.12724 (million dollars). Margin of error? → Confidence interval.
Prediction vs. Confidence Intervals (pp. 25 – 26). [Figure: prediction and confidence intervals plotted around the fitted line.] Variation (margin of error) on both ends seems larger. Implication?
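The widening at both ends follows from the standard error formulas for simple regression: both contain a term that grows with the squared distance of the chosen X from the sample mean of X, and the prediction version adds 1 under the square root for the variation of a single new observation. A minimal sketch on hypothetical data (the function name `interval_std_errors` is an illustrative assumption):

```python
from math import sqrt

# Hypothetical (x, y) observations, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def interval_std_errors(xs, ys, x0):
    """Return (se_mean, se_pred) at x0 for a simple regression fit.

    se_mean drives the confidence interval for the average Y at x0;
    se_pred drives the prediction interval for one new Y at x0.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))  # standard error of estimate, k = 1
    distance_term = 1 / n + (x0 - x_bar) ** 2 / sxx
    return s * sqrt(distance_term), s * sqrt(1 + distance_term)


for x0 in (1.0, 3.0, 5.0):
    se_mean, se_pred = interval_std_errors(xs, ys, x0)
    print(x0, round(se_mean, 4), round(se_pred, 4))
```

Both standard errors are smallest at the mean of X and grow toward the edges of the data, which is one more reason to avoid extrapolating.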
Analysis of Variance (p. 21) - Not very useful in simple regression. - Useful in multiple regression.
Sum of Squares (p. 22). Syy = total variation in Y. SSE = remaining variation that cannot be explained by the model. SSR = Syy – SSE = variation in Y that has been explained by the model.
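The decomposition can be checked numerically. The sketch below fits a line to hypothetical data, computes the three sums of squares, and also reports R-sq = SSR / Syy, the fraction of the variation in Y that the model explains.

```python
# Hypothetical (x, y) observations, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares fit of y = b0 + b1 * x.
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

syy = sum((y - y_bar) ** 2 for y in ys)               # total variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
ssr = syy - sse                                       # explained
r_sq = ssr / syy

print(round(syy, 4), round(sse, 4), round(ssr, 4), round(r_sq, 4))
```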
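Fit Statistics (pp. 23 – 24). r × r = 0.45199 × 0.45199 = 0.204295. In simple regression, R-sq equals the squared sample correlation between X and Y (here about 20.43% for the Labor model).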
Another Simple Regression Model: Cost = β0 + β1 Units + ε (p. 27). A better model? Why?
Multiple Regression Model: Cost = β0 + β1 Units + β2 Labor + ε (p. 29). Topics: test of global fit (p. 29), marginal effect (p. 30), adjusted R-sq (p. 30).
R-sq vs. Adjusted R-sq
Independent variables    R-sq      Adjusted R-sq
Labor                    20.43%    18.84%
Units                    86.44%    86.17%
Switch                    0.05%    -1.95%
Labor, Units             86.51%    85.96%
Units, Switch            88.20%    87.72%
Labor, Switch            21.32%    18.11%
Labor, Units, Switch     88.21%    87.48%
Remember! There are still many more models to try.
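Adjusted R-sq penalizes R-sq for the number of independent variables through the standard formula 1 – (1 – R-sq)(n – 1)/(n – k – 1). The sketch below applies it to the R-sq values in the table. The sample size is not stated on the slide; n = 52 is an assumption, chosen because it approximately reproduces the listed adjusted values.

```python
def adjusted_r_sq(r_sq, n, k):
    """Adjusted R-sq for a model with k independent variables."""
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)


N_ASSUMED = 52  # assumption: not stated in the source, inferred from the table

# (model, k, R-sq, adjusted R-sq as listed in the table)
table = [
    ("Labor", 1, 0.2043, 0.1884),
    ("Units", 1, 0.8644, 0.8617),
    ("Switch", 1, 0.0005, -0.0195),
    ("Labor, Units", 2, 0.8651, 0.8596),
    ("Units, Switch", 2, 0.8820, 0.8772),
    ("Labor, Switch", 2, 0.2132, 0.1811),
    ("Labor, Units, Switch", 3, 0.8821, 0.8748),
]

for name, k, r_sq, listed in table:
    computed = adjusted_r_sq(r_sq, N_ASSUMED, k)
    print(f"{name:22s} computed {computed:8.4f}   listed {listed:8.4f}")
```

Adding Labor to the Units model raises R-sq slightly but lowers adjusted R-sq, which is the kind of comparison the adjusted measure is designed for.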
Test of Global Fit (p. 29). H0: the model is useless (β1 = β2 = 0). Ha: the model is not completely useless (at least one βi ≠ 0). [Output annotations: variation explained by the model that consists of 2 Xs; variation explained, on average, by each independent variable.] If the F-ratio is large → H0 or Ha? If the F-ratio is small → H0 or Ha? (Please read pp. 39–41, 47 for finding the cutoff.)
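The F-ratio compares the variation explained per independent variable, SSR / k, with the unexplained variation per degree of freedom, SSE / (n – k – 1). A minimal sketch with hypothetical sums of squares follows; the second function gives the equivalent form in terms of R-sq. Both function names are illustrative assumptions.

```python
def f_ratio(ssr, sse, n, k):
    """F = (SSR / k) / (SSE / (n - k - 1))."""
    mean_sq_regression = ssr / k
    mean_sq_error = sse / (n - k - 1)
    return mean_sq_regression / mean_sq_error


def f_ratio_from_r_sq(r_sq, n, k):
    """Same statistic written with R-sq = SSR / (SSR + SSE)."""
    return (r_sq / k) / ((1 - r_sq) / (n - k - 1))


# Hypothetical sums of squares, for illustration only.
print(f_ratio(ssr=8.0, sse=2.0, n=13, k=2))
print(f_ratio_from_r_sq(r_sq=0.8, n=13, k=2))
```

A large F-ratio means the model explains much more variation than would be expected by chance, which is evidence against H0.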
Residual Analysis (pp. 33 – 34). The three conditions required for the validity of the regression analysis are: (1) the error variable is normally distributed with mean = 0; (2) the error variance is constant for all values of X; (3) the errors are independent of each other. How can we identify any violation?
Residual Analysis (pp. 33 – 34). We do not have ε (the random error), but we can calculate residuals from the sample. Residual = actual Y – estimated Y. Examining the residuals (or standardized residuals) helps detect violations of the required conditions.
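A minimal sketch of the residual calculation. The actual and estimated values are hypothetical. The standardized residual used here simply divides each residual by the standard error of estimate, sqrt(SSE / (n – k – 1)); this is a simplifying assumption, and statistical packages may apply additional adjustments.

```python
from math import sqrt

# Hypothetical actual and estimated Y values, for illustration only.
actual = [1.10, 1.15, 1.09, 1.20]
estimated = [1.12, 1.13, 1.10, 1.19]
k = 1  # number of independent variables in the (hypothetical) model

# Residual = actual Y - estimated Y.
residuals = [y - y_hat for y, y_hat in zip(actual, estimated)]

n = len(residuals)
sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - k - 1))  # standard error of estimate

# Simplified standardized residuals (assumption: plain division by s).
standardized = [e / s for e in residuals]

for e, z in zip(residuals, standardized):
    print(round(e, 4), round(z, 3))
```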
Residuals, Standardized Residuals, and Studentized Residuals (p.33)
The random error ε is normally distributed with mean = 0 (p. 34)
The error variance σε² is constant for all values of X and estimated Y (p. 34). Constant spread!
Constant Variance. When the requirement of a constant variance is violated, we have a condition of heteroscedasticity. Diagnose heteroscedasticity by plotting the residuals against the predicted y (ŷ), the actual y, and each independent variable X. [Figure: residuals plotted against ŷ; the spread increases with ŷ.]
The errors are independent of each other (p.34) Do NOT want to see any pattern.
Non-Independence of Error Variables. [Figure: two plots of residuals against time.] Note the runs of positive residuals, replaced by runs of negative residuals. Note the oscillating behavior of the residuals around zero.
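Both patterns can be spotted informally by counting runs of same-signed residuals in time order: very few runs correspond to long stretches of positive then negative residuals, while nearly as many runs as observations corresponds to oscillation around zero. The sketch below is only an informal check, not a formal test, and the function name `count_sign_runs` and the residual series are hypothetical.

```python
def count_sign_runs(residuals):
    """Count maximal runs of same-signed residuals (zeros are skipped)."""
    signs = [1 if e > 0 else -1 for e in residuals if e != 0]
    if not signs:
        return 0
    runs = 1
    for previous, current in zip(signs, signs[1:]):
        if current != previous:
            runs += 1
    return runs


# Hypothetical residual series in time order, for illustration only.
long_runs = [0.3, 0.5, 0.2, 0.4, -0.1, -0.6, -0.3, -0.2]
oscillating = [0.3, -0.4, 0.2, -0.5, 0.1, -0.2, 0.4, -0.3]

print(count_sign_runs(long_runs))
print(count_sign_runs(oscillating))
```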
Residual Plots with FACTA (p.34) Which factory is more efficient?
Dummy/Indicator Variables (p. 36). Qualitative variables are handled in a regression analysis by the use of 0-1 variables. These 0-1 variables are also referred to as "dummy" variables. They indicate which category the corresponding observation belongs to. Use k – 1 dummy variables for a qualitative variable with k categories. Gender = "M" or "F" → needs one dummy variable. Training Level = "A", "B", or "C" → needs 2 dummy variables.
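The k – 1 rule can be sketched as a small encoder. One category is chosen as the base level and gets no dummy of its own; an observation in the base level has all dummies equal to 0. The function name, the `is_` prefix, and the choice of base levels below are illustrative assumptions.

```python
def encode_dummies(value, levels, base):
    """Encode one qualitative value as k - 1 dummy (0/1) variables.

    levels: all k categories; base: the category left without a dummy.
    """
    if value not in levels or base not in levels:
        raise ValueError("value and base must both be listed in levels")
    return {
        f"is_{level}": int(value == level)
        for level in levels
        if level != base
    }


# Training Level has 3 categories, so 2 dummies (base level "A" assumed).
print(encode_dummies("B", ["A", "B", "C"], base="A"))
print(encode_dummies("A", ["A", "B", "C"], base="A"))
# Gender has 2 categories, so 1 dummy (base level "F" assumed).
print(encode_dummies("M", ["M", "F"], base="F"))
```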
Dummy Variables (pp. 36 – 38). A Parallel Lines Model: Cost = β0 + β1 Units + β2 FactA + ε. Least squares line: Estimated Cost = 0.86 + 0.27 Units – 0.0068 FactA. Two lines? Base level?
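One way to read the fitted parallel lines model is to substitute the two possible values of FactA. Because FactA = 0 denotes factory B, factory B acts as the base level. The worked substitution below uses only the rounded coefficients shown on the slide.

```latex
\begin{align*}
\text{FactA} = 0 \ (\text{factory B}):\quad
  & \widehat{\text{Cost}} = 0.86 + 0.27\,\text{Units} \\
\text{FactA} = 1 \ (\text{factory A}):\quad
  & \widehat{\text{Cost}} = (0.86 - 0.0068) + 0.27\,\text{Units}
    = 0.8532 + 0.27\,\text{Units}
\end{align*}
```

The two lines share the slope 0.27 and differ only in intercept, by the FactA coefficient of –0.0068.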
Dummy Variables (pp. 36 – 38). An Interaction Model: Cost = β0 + β1 Units + β2 FactA + β3 Units_FactA + ε. Least squares line: Estimated Cost = 0.87 + 0.26 Units – 0.023 FactA + 0.016 Units_FactA.
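In the interaction model each factory gets its own intercept and its own slope. The sketch below evaluates the fitted equation with the rounded coefficients from the slide, assuming that Units_FactA is the product Units × FactA (the name suggests this, but the slide does not define it). The function names are illustrative.

```python
def estimated_cost(units, fact_a):
    """Fitted interaction model; units in thousands, cost in $ millions.

    Assumes Units_FactA = units * fact_a.
    """
    return 0.87 + 0.26 * units - 0.023 * fact_a + 0.016 * units * fact_a


def line_for_factory(fact_a):
    """Return (intercept, slope) of the cost line for one factory."""
    intercept = 0.87 - 0.023 * fact_a
    slope = 0.26 + 0.016 * fact_a
    return intercept, slope


print("Factory B (FactA = 0):", line_for_factory(0))
print("Factory A (FactA = 1):", line_for_factory(1))
```

Unlike the parallel lines model, the slopes differ here: the Units_FactA coefficient is the difference between the two factories' slopes.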
Models that I have tried (p. 41)
Statgraphics: Prediction/Confidence Intervals for Y
Simple Regression Analysis: Relate / Simple Regression. X = independent variable, Y = dependent variable. For prediction, click on the Tabular option icon and check Forecasts. Right-click to change X values.
Multiple Regression Analysis: Relate / Multiple Regression. For prediction, enter values of the Xs in the Data Window and leave the corresponding Y blank. Click on the Tabular option icon and check Reports.
Saving intermediate results (e.g., studentized residuals): click the icon and check the results to save.
Removing outliers: highlight the point to remove on the plot and click the Exclude icon.
Regression Analysis Summary (pp. 43 – 44)