Shonda Kuiper Grinnell College
Statistical techniques taught in introductory statistics courses typically involve one response variable and one explanatory variable. The response variable measures the outcome of a study. Explanatory variables explain changes in the response variable.
Each variable can be classified as either categorical or quantitative. Categorical data place individuals into one of several groups (such as red/blue/white, male/female, or yes/no). Quantitative data consist of numerical values for which most arithmetic operations make sense. The type of each variable determines the appropriate technique:
Categorical explanatory and categorical response: chi-square test, two-proportion test
Categorical explanatory and quantitative response: two-sample t-test, ANOVA
Quantitative explanatory and categorical response: logistic regression
Quantitative explanatory and quantitative response: regression
The theoretical model used in the two-sample t-test is designed to account for the two group means (μ1 and μ2) and random error:

observed response = mean value + random error
y_ij = μ_i + ε_ij, where i = 1, 2 and j = 1, 2, 3, 4

The same relationship can be written as a regression model with an indicator variable x for group membership, indexing the observations with a single subscript:

y_i = β_0 + β_1 x_i + ε_i, where i = 1, 2, …, 8

or as an ANOVA model in terms of a grand mean and group effects:

y_ij = μ + α_i + ε_ij, where i = 1, 2 and j = 1, 2, 3, 4

When there are only two groups (and we make the same assumptions), all three models are algebraically equivalent.
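The equivalence can be checked numerically. The following minimal sketch (hypothetical data; scipy assumed available; none of this is from the original slides) compares a pooled two-sample t-test with a regression on a 0/1 group indicator: both give the same p-value and the same t statistic up to sign.

```python
# Sketch: two-sample t-test vs. regression on a 0/1 group indicator (hypothetical data).
import numpy as np
from scipy import stats

group1 = np.array([10.1, 11.4, 9.8, 10.9])   # i = 1, j = 1,...,4
group2 = np.array([12.3, 13.0, 11.8, 12.6])  # i = 2, j = 1,...,4

# Pooled two-sample t-test
t_stat, p_ttest = stats.ttest_ind(group1, group2, equal_var=True)

# Equivalent regression: y_i = b0 + b1*x_i, with x_i = 0 for group 1 and 1 for group 2
y = np.concatenate([group1, group2])          # i = 1, ..., 8
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
slope, intercept, r, p_reg, se = stats.linregress(x, y)

print(t_stat, p_ttest)    # t statistic and p-value from the t-test
print(slope / se, p_reg)  # same t statistic (opposite sign) and same p-value
```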
Multiple regression analysis can be used to serve different goals, and the goals will influence the type of analysis that is conducted. The most common goals of multiple regression are to:
Describe: A model may be developed to describe the relationship between multiple explanatory variables and the response variable.
Predict: A regression model may be used to generalize to observations outside the sample.
Confirm: Theories are often developed about which variables or combinations of variables should be included in a model. Hypothesis tests can be used to evaluate the relationship between the explanatory variables and the response.
Build a multiple regression model to predict the retail price of cars.
Price = – 0.22 Mileage,   R-Sq: 4.1%
Slope coefficient (b1): t-test p-value = 0.004
Questions:
What happens to Price as Mileage increases?
Since b1 = –0.22 is small, can we conclude it is unimportant?
Does mileage help you predict price? What does the p-value tell you?
Does mileage help you predict price? What does the R-Sq value tell you?
Are there outliers or influential observations?
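As a concrete illustration, here is a minimal sketch of fitting this kind of model, assuming (hypothetically) that the car data sit in a file cars.csv with columns Price and Mileage; the file and column names are placeholders, not from the slides.

```python
# Sketch: simple linear regression of Price on Mileage (hypothetical file/column names).
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")
fit = smf.ols("Price ~ Mileage", data=cars).fit()

print(fit.params["Mileage"])    # slope b1: estimated change in Price per additional mile
print(fit.pvalues["Mileage"])   # p-value for the t-test of H0: beta1 = 0
print(fit.rsquared)             # R-Sq: proportion of the variation in Price explained by Mileage
```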
What happens when all the points fall on the regression line? The residual sum of squares is 0, so R-Sq = 100%.
What happens when the regression line does not help us estimate Y? R-Sq is close to 0%.
R²adj includes a penalty when more terms are included in the model:
R²adj = 1 – (1 – R²)(n – 1)/(n – p)
where n is the sample size and p is the number of coefficients (including the constant term: β0, β1, β2, β3, …, βp–1).
When many terms are in the model, p is larger, so (n – 1)/(n – p) is larger and R²adj is smaller.
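A short sketch of this calculation, with made-up numbers purely for illustration:

```python
# Sketch: adjusted R-squared penalty for adding terms (numbers are made up).
def adjusted_r2(r2, n, p):
    # r2: ordinary R-squared, n: sample size,
    # p: number of coefficients, including the constant term
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Same R-squared, but a model with more coefficients is penalized more heavily.
print(adjusted_r2(0.40, 100, 3))   # about 0.388
print(adjusted_r2(0.40, 100, 8))   # about 0.354
```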
Build a multiple regression model to predict the retail price of cars.
Potential explanatory variables: Mileage, Cylinder, Liter, Leather, Cruise, Doors, Sound
Using Mileage alone: R² = 2%
Using all seven variables: Price = Cruise + Cyl – 1543 Doors + Leather – 787 Liter – 0.17 Mileage + Sound,   R² = 44.6%
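A sketch of how the full model could be fit, again assuming a hypothetical cars.csv whose column names match the variables on the slide:

```python
# Sketch: multiple regression of Price on all seven candidate explanatory variables
# (hypothetical file and column names).
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")
full = smf.ols("Price ~ Mileage + Cyl + Liter + Leather + Cruise + Doors + Sound",
               data=cars).fit()

print(full.rsquared)       # R-squared for the full model
print(full.rsquared_adj)   # adjusted R-squared
print(full.summary())      # coefficient estimates with t statistics and p-values
```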
Step Forward Regression (Forward Selection): Which single explanatory variable best predicts Price?
Price = Cruise,          R² = 18.56%
Price = Cyl,             R² = 32.39%
Price = – 0.17 Mileage,  R² = 2.04%
Price = Liter,           R² = 31.15%
Price = – Sound,         R² = 1.55%
Price = Leather,         R² = 2.47%
Price = Doors,           R² = 1.93%
Step Forward Regression: Which combination of two terms best predicts Price? Starting from the best single-term model, Price = Cyl (R² = 32.39%), and adding one more term (adjusted R² in parentheses):
Price = Cyl + Cruise,          R² = 38.4% (38.2%)
Price = Cyl – 0.152 Mileage,   R² = 34% (33.8%)
Price = Cyl + Liter,           R² = 32.6% (32.4%)
Continuing to add the best remaining term at each step:
Price = Cyl + 6362 Cruise + Leather,   R² = 40.4% (40.2%)
Price = Cyl + 6492 Cruise + Leather – 0.17 Mileage,   R² = 42.3% (42%)
Price = Cyl + 6320 Cruise + Leather – 0.17 Mileage – 1402 Doors,   R² = 43.7% (43.3%)
Price = Cyl + Cruise + Leather – 0.17 Mileage – 1463 Doors – 2024 Sound,   R² = 44.6% (44.15%)
Price = Cyl + Cruise + Leather – 787 Liter – 0.17 Mileage – 1543 Doors + Sound,   R² = 44.6% (44.14%)
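The forward-selection search shown above can be written as a short loop. This is a rough illustration (no stopping rule; same hypothetical data file and column names as before), not the exact procedure used to produce the slide output.

```python
# Sketch: forward selection by R-squared. At each step, add whichever remaining
# variable increases R-squared the most (hypothetical file and column names).
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")
candidates = ["Mileage", "Cyl", "Liter", "Leather", "Cruise", "Doors", "Sound"]
selected = []

while candidates:
    best = max(candidates,
               key=lambda v: smf.ols("Price ~ " + " + ".join(selected + [v]),
                                     data=cars).fit().rsquared)
    selected.append(best)
    candidates.remove(best)
    fit = smf.ols("Price ~ " + " + ".join(selected), data=cars).fit()
    # In practice you would stop when adjusted R-squared (or another criterion) stops improving.
    print(selected, round(fit.rsquared, 4), round(fit.rsquared_adj, 4))
```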
Step Backward Regression (Backward Elimination): start with the full model and remove one term at a time.
Price = Cyl + Cruise + Leather – 787 Liter – 0.17 Mileage – 1543 Doors + Sound,   R² = 44.6% (44.14%)
Price = Cyl + Cruise + Leather – 0.17 Mileage – 1463 Doors – 2024 Sound,   R² = 44.6% (44.15%)
Other criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and Mallows' Cp, are often used to find the best model, as are bidirectional stepwise procedures that combine forward and backward steps.
Best Subsets Regression: every possible combination of explanatory variables is compared. In the best subsets output for these data, Liter is the second best single predictor of price.
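A brute-force sketch of best subsets regression, comparing every combination of the seven variables by adjusted R² (same hypothetical data file and column names):

```python
# Sketch: best subsets regression by brute force. For each model size k, fit every
# k-variable model and keep the one with the highest adjusted R-squared.
from itertools import combinations
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")
variables = ["Mileage", "Cyl", "Liter", "Leather", "Cruise", "Doors", "Sound"]

for k in range(1, len(variables) + 1):
    best_subset, best_fit = None, None
    for subset in combinations(variables, k):
        fit = smf.ols("Price ~ " + " + ".join(subset), data=cars).fit()
        if best_fit is None or fit.rsquared_adj > best_fit.rsquared_adj:
            best_subset, best_fit = subset, fit
    print(k, best_subset, round(best_fit.rsquared_adj, 4))
```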
Important Cautions:
Stepwise regression techniques can often ignore very important explanatory variables; best subsets is often preferable.
Both best subsets and stepwise regression methods only consider linear relationships between the response and explanatory variables.
Residual graphs are still essential in validating whether the model is appropriate. Transformations, interactions, and quadratic terms can often improve the model (see the sketch after this list).
Whenever these iterative variable selection techniques are used, the p-values corresponding to the significance of each individual coefficient are not reliable.
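For the residual checks mentioned in the cautions, a minimal sketch (matplotlib and statsmodels assumed; the particular fitted model here is just an example):

```python
# Sketch: residual plots for checking whether a regression model is appropriate.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")   # hypothetical file and column names
fit = smf.ols("Price ~ Mileage + Cyl + Cruise + Leather + Doors + Sound", data=cars).fit()

# Residuals vs. fitted values: curvature suggests transformations or quadratic terms;
# a funnel shape suggests non-constant variance.
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normal probability (Q-Q) plot of the residuals.
sm.qqplot(fit.resid, line="s")
plt.show()
```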