General Qualitative Data, and “Dummy Variables”
How might we have represented “make-of-car” in the motorpool case, had there been more than just two makes?
– Assume that Make takes four categorical values (Ford, Honda, BMW, and Sterling). Choose one value as the “foundation” case. Create three 0/1 (“yes”/“no”, so-called “dummy”) variables for the other three cases. These three variables jointly represent the four-valued qualitative Make variable. Here are the details.
– We’ll use this representational trick to include “day of game” (Friday, Saturday, or Sunday) in a model that predicts attendance at a professional indoor soccer team’s home games. Here is the example.
– Using this trick requires that we extend the “significance level” (with respect to whether a variable “belongs” in the model) to groups of variables. This is done via “analysis of variance” (ANOVA).
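The dummy-variable encoding described above can be sketched in a few lines of Python. This is only an illustration: the helper name `dummy_encode` is hypothetical, and the choice of Ford as the foundation case is an assumption, since the slide does not say which make plays that role.

```python
# The four makes named on the slide.
MAKES = ["Ford", "Honda", "BMW", "Sterling"]

def dummy_encode(value, categories, foundation):
    """Return one 0/1 dummy per non-foundation category (hypothetical helper)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # The foundation case gets no column; it is the row where every dummy is 0.
    return {f"D_{c}": int(value == c) for c in categories if c != foundation}

# Assumption: Ford is taken as the foundation case.
for make in MAKES:
    print(make, dummy_encode(make, MAKES, foundation="Ford"))
```

Each make maps to a distinct pattern of three 0/1 values, which is why three dummies are enough to represent a four-valued variable.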
Discounts on Car Purchases: Does Salesperson Identity Matter?
– Assume there are five salesfolk: Andy, Bob, Chuck, Dave, and Ed.
– Take one (e.g., Andy) as the foundation case, and add four new “dummy” variables:
  D_B = 1 only if Bob, 0 otherwise
  D_C = 1 only if Chuck, 0 otherwise
  D_D = 1 only if Dave, 0 otherwise
  D_E = 1 only if Ed, 0 otherwise
– The coefficient of each (in the most-complete model) will differentiate the average discount that each salesperson gives a customer from the average discount Andy would give the same customer.
Does Salesperson Identity Matter?
– Imagine that, after adding the new variables (four new columns of data) to your model, the regression yields:
  Discount_pred = … Age – … Income … Sex + 240 D_B + (–300) D_C + (–50) D_D + 370 D_E
– With similar customers, you’d expect Bob to give a discount $240 higher than would Andy.
– With similar customers, you’d expect Chuck to give a discount $300 lower than would Andy, $540 lower than would Bob, $250 lower than would Dave, and $670 lower than would Ed.
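The comparisons above are just differences between dummy coefficients, which a short Python sketch can make explicit. The function name `discount_gap` is hypothetical. Andy's coefficient is 0 because he is the foundation case, and Ed's value of 370 is inferred from the stated $670 gap between Chuck and Ed rather than read directly from the equation.

```python
# Dummy coefficients relative to Andy (the foundation case, hence 0).
# Ed's 370 is inferred: Chuck is $300 below Andy and $670 below Ed.
coef = {"Andy": 0, "Bob": 240, "Chuck": -300, "Dave": -50, "Ed": 370}

def discount_gap(a, b):
    """Expected discount of salesperson a minus that of b, for the same customer."""
    return coef[a] - coef[b]

for other in ["Andy", "Bob", "Dave", "Ed"]:
    print(f"Chuck vs {other}: {discount_gap('Chuck', other):+d}")
```

The customer terms (Age, Income, Sex) cancel in every such difference, which is why they are not needed for the comparison.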
Does “Salesperson” Interact with “Sex”?
Are some of the salesfolk better at selling to a particular Sex of customer?
– Add D_B, D_C, D_D, D_E, and the products D_B Sex, D_C Sex, D_D Sex, D_E Sex to the model.
– Imagine that your regression yields:
  Discount_pred = … Age – … Income … Sex + 240 D_B – 350 D_C + 75 D_D + 10 D_E – 375 (D_B Sex) – 150 (D_C Sex) – 50 (D_D Sex) … (D_E Sex)
– Interpret this back in the “conceptual” model:
  Discount_pred = … Age – … Income … Sex + (240 – 375 Sex) D_B + (–350 – 150 Sex) D_C + (75 – 50 Sex) D_D + (10 … Sex) D_E
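As a rough sketch of what “adding the interaction variables” means for the data, each new column is simply the product of a salesperson dummy and Sex. The function `design_row` and the column-naming scheme are assumptions made for illustration only.

```python
# Andy is the foundation case, so he gets no dummy column.
SALESPEOPLE = ["Andy", "Bob", "Chuck", "Dave", "Ed"]

def design_row(salesperson, sex):
    """Build the dummy and interaction columns for one sale (hypothetical helper).

    sex is coded 0 for male and 1 for female, as on the slides.
    """
    row = {"Sex": sex}
    for name in SALESPEOPLE[1:]:
        dummy = int(salesperson == name)
        row[f"D_{name[0]}"] = dummy            # e.g. D_B for Bob
        row[f"D_{name[0]}*Sex"] = dummy * sex  # interaction column
    return row

print(design_row("Bob", 1))
print(design_row("Andy", 1))
```

An interaction column is 1 only when both conditions hold (that salesperson, and a female customer), so its coefficient measures how much that salesperson's gap from Andy shifts for female customers.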
Discount_pred = … Age – … Income … Sex + (240 – 375 Sex) D_B + (–350 – 150 Sex) D_C + (75 – 50 Sex) D_D + (10 … Sex) D_E
– Given a male (Sex = 0) customer, you’d expect Bob (D_B = 1) to give a greater discount than Andy, by $240 – $375·0 = $240.
– Given a female (Sex = 1) customer, you’d expect Bob to give a smaller discount than Andy: the difference is $240 – $375·1 = –$135, i.e., $135 less.
– Chuck has been giving smaller discounts to both men and women than has Andy, and Dave and Ed have been giving larger discounts than Andy to both sexes.
– We could take the same approach to investigate whether “Salesperson” interacts with Age, by also including D_B Age, D_C Age, D_D Age, D_E Age in our model.
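The sex-specific gaps read off the conceptual model can be checked with a small Python sketch. The name `gap_vs_andy` is hypothetical, and Ed is left out because his interaction coefficient is missing from the slide text.

```python
# (main effect, interaction with Sex) for each salesperson dummy.
# Ed is omitted: his interaction coefficient is missing from the slide text.
effects = {"Bob": (240, -375), "Chuck": (-350, -150), "Dave": (75, -50)}

def gap_vs_andy(person, sex):
    """Expected discount relative to Andy; sex is 0 (male) or 1 (female)."""
    main, interaction = effects[person]
    return main + interaction * sex

for person in effects:
    print(person, "male:", gap_vs_andy(person, 0), "female:", gap_vs_andy(person, 1))
```

Bob's gap changes sign between the two groups, while Chuck's stays negative and Dave's stays positive, matching the bullets above.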
Outliers
An outlier is a sample observation which fails to “fit” with the rest of the sample data. Such observations may distort the results of an entire study.
– Types of outliers (three)
– Identification of outliers (via “model analysis”)
– Dealing with outliers (perhaps yielding a better model)
These issues are dealt with here.