STATS 330: Lecture 17
Factors
In the models discussed so far, all explanatory variables have been numeric. Now we want to incorporate categorical variables into our models. In R, categorical variables are called factors.
Example
Consider an experiment to measure the rate of metal removal in a machining process on a lathe. The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and the hardness of the material being machined (a continuous measurement).
Data
[Data table: 15 observations of hardness, setting and rate, with 5 observations at each setting (slow, medium, fast)]
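Since the numerical values in the data table are not reproduced here, a hypothetical stand-in for metal.df can be built in R to experiment with the models on the following slides (all numbers below are invented for illustration; only the names metal.df, hardness, setting and rate come from the lecture):

```r
# Hypothetical stand-in for the lecture's metal removal data:
# 5 observations at each lathe setting, a common slope in hardness
# (all numerical values are invented for illustration)
set.seed(330)
hardness <- rep(c(20, 30, 40, 50, 60), times = 3)
setting  <- rep(c("slow", "medium", "fast"), each = 5)
rate     <- 40 + 15 * (setting == "medium") + 30 * (setting == "fast") +
            2 * hardness + rnorm(15, sd = 2)
# stringsAsFactors = TRUE mimics the pre-R-4.0 default the lecture relies on,
# so that setting is stored as a factor rather than character
metal.df <- data.frame(hardness, setting, rate, stringsAsFactors = TRUE)
str(metal.df)
```

Note that factor levels are ordered alphabetically, so "fast" becomes the first level even though the slow observations come first in the data.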
Model
A model consisting of 3 parallel lines seems appropriate:

  fast:   rate = μF + β × hardness
  medium: rate = μM + β × hardness
  slow:   rate = μS + β × hardness

Note: same slope β, i.e. parallel lines; different intercepts μF, μM, μS.
Baseline version
We can regard the fast setting as a baseline and express the other settings as "baseline plus offsets":

  μF                 (baseline)
  μM = μF + δM       (offset δM for the medium line)
  μS = μF + δS       (offset δS for the slow line)
Baseline version (2)
We can then write the model as

  fast:   rate = μF + β × hardness
  medium: rate = (μF + δM) + β × hardness
  slow:   rate = (μF + δS) + β × hardness
"Deviation from mean" version
Now let μ = (μF + μM + μS)/3 be the mean of the intercepts μF, μM and μS. Define offsets from this mean:

  αF = μF − μ       (offset of the "fast" line intercept from the mean)
  αM = μM − μ
  αS = μS − μ
"Deviation from mean" version (2)
Then

  fast:   rate = (μ + αF) + β × hardness
  medium: rate = (μ + αM) + β × hardness
  slow:   rate = (μ + αS) + β × hardness

Thus μ is now the "average intercept", and there are 3 offsets, one for each line. The 3 offsets add to zero: αF + αM + αS = 0. This is the form used in the Stage 2 course.
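In R these two parameterisations correspond to two contrast codings: the default "treatment" contrasts give the baseline form, while sum-to-zero contrasts (contr.sum) give the "deviation from mean" form. A small sketch of the two contrast matrices:

```r
# Baseline ("treatment") contrasts: the first level is the reference,
# the columns code offsets from it
contr.treatment(c("fast", "medium", "slow"))

# Sum-to-zero contrasts: the implied per-level offsets add to zero,
# matching the "deviation from mean" parameterisation
contr.sum(c("fast", "medium", "slow"))

# A factor can be fitted with either coding, e.g.
#   lm(rate ~ setting + hardness, data = metal.df,
#      contrasts = list(setting = "contr.sum"))
```

Both codings describe the same three fitted lines; only the interpretation of the individual coefficients changes.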
Dummy variables
Back to the baseline form. We can combine the 3 "baseline" equations into one by using "dummy variables". Define

  med  = 1 if setting = "medium" and 0 otherwise
  slow = 1 if setting = "slow" and 0 otherwise

Then we can write the model as

  rate = μF + δM × med + δS × slow + β × hardness + ε
Fitting
The model can be fitted as usual using lm:

> med  <- ifelse(metal.df$setting == "medium", 1, 0)
> slow <- ifelse(metal.df$setting == "slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data = metal.df))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        *
med                …          …       …      ***
slow               …          …       …      ***
hardness           …          …       …      ***
Fitting (2)
Thus:
  the baseline ("fast") line has intercept equal to the (Intercept) estimate;
  the "medium" line has intercept = (Intercept) + the med coefficient;
  the "slow" line has intercept = (Intercept) + the slow coefficient.
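As a concrete sketch of these calculations (using simulated stand-in data, since the numerical output is not reproduced above), the three intercepts can be recovered from the coefficient vector:

```r
# Simulated stand-in data (values invented for illustration):
# true intercepts are fast = 70, medium = 55, slow = 40, slope = 2
set.seed(1)
hardness <- rep(c(20, 30, 40, 50, 60), times = 3)
setting  <- rep(c("slow", "medium", "fast"), each = 5)
rate     <- 40 + 15 * (setting == "medium") + 30 * (setting == "fast") +
            2 * hardness + rnorm(15)

med  <- ifelse(setting == "medium", 1, 0)
slow <- ifelse(setting == "slow", 1, 0)
fit  <- lm(rate ~ med + slow + hardness)

b <- coef(fit)
intercept.fast   <- b[["(Intercept)"]]               # baseline
intercept.medium <- b[["(Intercept)"]] + b[["med"]]  # baseline + offset
intercept.slow   <- b[["(Intercept)"]] + b[["slow"]] # baseline + offset
c(fast = intercept.fast, medium = intercept.medium, slow = intercept.slow)
```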
[Figure: fitted parallel lines, showing the baseline line and the offsets for the medium and slow lines]
Fitting (3)
Making dummy variables is a pain. Fortunately R allows us to write

> summary(lm(rate ~ setting + hardness, data = metal.df))

              Estimate Std. Error t value Pr(>|t|)
(Intercept)          …          …       …        *
settingmedium        …          …       …      ***
settingslow          …          …       …      ***
hardness             …          …       …      ***

and get the same result, provided the variable setting is a factor.
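The two approaches really are the same model. A quick check (again with simulated stand-in data, values invented for illustration) confirms that the dummy-variable fit and the factor fit give identical fitted values:

```r
# Simulated stand-in data
set.seed(2)
hardness <- rep(c(20, 30, 40, 50, 60), times = 3)
setting  <- factor(rep(c("slow", "medium", "fast"), each = 5))
rate     <- 40 + 15 * (setting == "medium") + 30 * (setting == "fast") +
            2 * hardness + rnorm(15)

# Hand-built dummy variables versus the factor formula
med  <- ifelse(setting == "medium", 1, 0)
slow <- ifelse(setting == "slow", 1, 0)
fit.dummy  <- lm(rate ~ med + slow + hardness)
fit.factor <- lm(rate ~ setting + hardness)

# Both parameterisations span the same model space,
# so the fitted values agree exactly (up to rounding error)
all.equal(unname(fitted(fit.dummy)), unname(fitted(fit.factor)))  # TRUE
```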
Factors
Since the data for setting in the input data was character data, the variable setting was automatically recognised as a factor. In fact the 3 settings were 1000, 1200 and 1400 rpm. What would happen if the input data had used these numerical values? Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.
Factors (2)
> rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))
> summary(lm(rate ~ rpm + hardness, data = metal.df))

Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …      ***
rpm                …          …       …      ***
hardness           …          …       …      ***

When rpm = 1000, the fitted relationship is

  rate = (Intercept) + 1000 × (rpm coefficient) + (hardness coefficient) × hardness

i.e. a single combined intercept plus (hardness coefficient) × hardness.
Factors (3)
            Intercept               Slope
            factor   non-factor     factor   non-factor
Fast             …            …          …            …
Medium           …            …          …            …
Slow             …            …          …            …

The non-factor model constrains the 3 intercepts to be equally spaced: successive settings are 200 rpm apart, so successive intercepts must differ by exactly 200 × (rpm coefficient). OK for this data set, but not in general.
Factors (4)
To avoid this, we could recode the variable as character, or (easier) use the factor function to coerce the numerical data into a factor:

> rpm.as.factor <- factor(rpm)
Factors (5)
We can fit the "factor" model using the R code

> rpm.as.factor <- factor(rpm)
> summary(lm(rate ~ rpm.as.factor + hardness, data = metal.df))

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)              …          …       …      ***
rpm.as.factor1200        …          …       …      ***
rpm.as.factor1400        …          …       …      ***
hardness                 …          …       …      ***

These estimates are different!! What's going on??
Levels
The different values of a factor are called "levels". The levels of the factor setting are fast, medium and slow:

> levels(setting)
[1] "fast"   "medium" "slow"

The levels of the factor rpm.as.factor are 1000, 1200 and 1400:

> levels(rpm.as.factor)
[1] "1000" "1200" "1400"
Levels (2)
By default, the levels are listed in alphabetical order, and the first level is selected as the baseline. Thus, using setting, the baseline is "fast"; using rpm.as.factor, the baseline is "1000".
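As an aside, besides respecifying the full level order with factor() (as on the next slide), the relevel function changes just the baseline level and leaves the rest of the order alone:

```r
setting <- factor(rep(c("fast", "medium", "slow"), each = 5))
levels(setting)            # "fast" "medium" "slow" (alphabetical)

# relevel() moves the chosen level to the front, making it the baseline
setting2 <- relevel(setting, ref = "slow")
levels(setting2)           # "slow" "fast" "medium"
```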
Levels (3)
We can change the baseline by reordering the levels with the factor function:

> rpm.newbaseline <- factor(rpm, levels = c("1400", "1200", "1000"))
> summary(lm(rate ~ rpm.newbaseline + hardness, data = metal.df))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                …          …       …        *
rpm.newbaseline1200        …          …       …      ***
rpm.newbaseline1000        …          …       …      ***
hardness                   …          …       …      ***
Non-parallel lines
What if the lines aren't parallel? Then the slopes (the betas) differ between settings, and the model becomes

  fast:   rate = μF + βF × hardness
  medium: rate = μM + βM × hardness
  slow:   rate = μS + βS × hardness
Baseline version for the betas
As before, we can regard the fast setting as a baseline and express the other slopes as "baseline plus offsets":

  βF                 (baseline slope)
  βM = βF + γM       (slope offset γM for the medium line)
  βS = βF + γS       (slope offset γS for the slow line)
Baseline version for both parameters
We can then write the model as

  fast:   rate = μF + βF × hardness
  medium: rate = (μF + δM) + (βF + γM) × hardness
  slow:   rate = (μF + δS) + (βF + γS) × hardness
Dummy variables for both parameters
As before, we can combine these 3 equations into one by using "dummy variables". Define med and slow as before, and

  h.med  = hardness × med
  h.slow = hardness × slow

Then we can write the model as

  rate = μF + δM × med + δS × slow + βF × hardness + γM × h.med + γS × h.slow + ε
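With simulated stand-in data (values invented for illustration), one can verify that these hand-built interaction dummies reproduce the fit of the formula interface shown on the next slide:

```r
# Simulated stand-in data with setting-dependent slopes
set.seed(3)
hardness <- rep(c(20, 30, 40, 50, 60), times = 3)
setting  <- factor(rep(c("slow", "medium", "fast"), each = 5))
rate     <- 40 + 15 * (setting == "medium") + 30 * (setting == "fast") +
            (2 + 0.5 * (setting == "slow")) * hardness + rnorm(15)

# Hand-built dummy variables and their hardness interactions
med    <- ifelse(setting == "medium", 1, 0)
slow   <- ifelse(setting == "slow", 1, 0)
h.med  <- hardness * med
h.slow <- hardness * slow

fit.dummy   <- lm(rate ~ med + slow + hardness + h.med + h.slow)
fit.formula <- lm(rate ~ setting * hardness)

# Same model space, so identical fitted values (up to rounding error)
all.equal(unname(fitted(fit.dummy)), unname(fitted(fit.formula)))  # TRUE
```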
Fitting in R
The model formula for this non-parallel model is

  rate ~ setting + hardness + setting:hardness

or, even more compactly,

  rate ~ setting * hardness

> summary(lm(rate ~ setting * hardness, data = metal.df))

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)                   …          …       …
settingmedium                 …          …       …
settingslow                   …          …       …
hardness                      …          …       …      ***
settingmedium:hardness        …          …       …
settingslow:hardness          …          …       …
Is the non-parallel model necessary?
This amounts to testing whether the slope offsets γM and γS are zero or, equivalently, whether the parallel model

  rate ~ setting + hardness

is an adequate submodel of the non-parallel model

  rate ~ setting * hardness

As in Lecture 6, we use the anova function to compare the two models:
> model1 <- lm(rate ~ setting + hardness, data = metal.df)
> model2 <- lm(rate ~ setting * hardness, data = metal.df)
> anova(model1, model2)
Analysis of Variance Table

Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df RSS Df Sum of Sq F Pr(>F)
1      …   …
2      …   …  2         … …      …

Conclusion: since the F-value is small and the p-value is large, we conclude that the submodel (i.e. the parallel lines model) is adequate.