
1 STATS 330: Lecture 17 (12/22/2015)

2 Factors
• In the models discussed so far, all explanatory variables have been numeric.
• Now we want to incorporate categorical variables into our models.
• In R, categorical variables are called factors.
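A minimal sketch of what this looks like in R (the variable name is illustrative):

```r
# Character data coerced to a factor; levels are sorted alphabetically
setting <- factor(c("slow", "slow", "medium", "fast"))
class(setting)   # "factor"
levels(setting)  # "fast" "medium" "slow"
```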

3 Example
• Consider an experiment to measure the rate of metal removal in a machining process on a lathe.
• The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and on the hardness of the material being machined (a continuous measurement).

4 Data

   hardness setting rate
1       120    slow   68
2       140    slow   90
3       150    slow   98
4       125    slow   77
5       136    slow   88
6       165  medium  122
7       140  medium  104
8       120  medium   75
9       125  medium   84
10      133  medium   95
11      175    fast  138
12      132    fast  102
13      124    fast   93
14      141    fast  112
15      130    fast  100
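The metal.df data frame can be reconstructed from the table above. Note that `stringsAsFactors = TRUE` reproduces the pre-R-4.0 behaviour the lecture relies on; since R 4.0, character columns are no longer converted to factors automatically.

```r
# Rebuilding metal.df from the table above (a sketch; the original data
# would have been read from a file with read.table or similar)
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = rep(c("slow", "medium", "fast"), each = 5),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100),
  stringsAsFactors = TRUE  # ensure setting is stored as a factor
)
str(metal.df)
```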

5 [Figure: plot of the data; no text content recoverable]

6 Model
A model consisting of 3 parallel lines seems appropriate:

rate = μF + β × hardness   (fast)
rate = μM + β × hardness   (medium)
rate = μS + β × hardness   (slow)

Note the same slope β (i.e. parallel lines) but different intercepts μF, μM, μS.

7 Baseline version
We can regard the fast setting as a baseline and express the other settings as "baseline plus offsets":

μF             (baseline)
μM = μF + δM   (offset δM for the medium line)
μS = μF + δS   (offset δS for the slow line)

8 Baseline version (2)
We can then write the model as

rate = μF + β × hardness        (fast)
rate = μF + δM + β × hardness   (medium)
rate = μF + δS + β × hardness   (slow)

9 "Deviation from mean" version
Now let μ be the mean of μF, μM and μS (the mean of the intercepts). Define

αF = μF - μ   (offset of the "fast" line intercept)
αM = μM - μ
αS = μS - μ

10 "Deviation from mean" version (2)
Then

rate = μ + αF + β × hardness   (fast)
rate = μ + αM + β × hardness   (medium)
rate = μ + αS + β × hardness   (slow)

Thus, μ is now the "average intercept", and there are 3 offsets, one for each line. The 3 offsets add to zero: αF + αM + αS = 0. This is the form used in the Stage 2 course.

11 Dummy variables
Back to the baseline form. We can combine the 3 "baseline" equations into one by using "dummy variables". Define

med  = 1 if setting = "medium" and 0 otherwise
slow = 1 if setting = "slow" and 0 otherwise

Then we can write the model as

rate = μF + δM × med + δS × slow + β × hardness

12 Fitting
The model can be fitted as usual using lm:

> med <- ifelse(metal.df$setting == "medium", 1, 0)
> slow <- ifelse(metal.df$setting == "slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data = metal.df))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.17042    7.15425  -3.099 0.010124 *
med          -9.44980    1.87275  -5.046 0.000374 ***
slow        -19.00757    1.88875 -10.064 6.94e-07 ***
hardness      0.93426    0.05008  18.654 1.13e-09 ***

13 Fitting (2)
Thus:
• the baseline ("fast") line has intercept -22.17042
• the "medium" line has intercept -22.17042 - 9.44980 = -31.62022
• the "slow" line has intercept -22.17042 - 19.00757 = -41.17799

14 [Figure: the three fitted lines, showing the baseline line and the offsets δM and δS]

15 Fitting (3)
Making dummy variables is a pain. Fortunately R allows us to write

> summary(lm(rate ~ setting + hardness))

               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -22.17042    7.15425  -3.099 0.010124 *
settingmedium  -9.44980    1.87275  -5.046 0.000374 ***
settingslow   -19.00757    1.88875 -10.064 6.94e-07 ***
hardness        0.93426    0.05008  18.654 1.13e-09 ***

and get the same result, provided the variable setting is a factor.
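The dummy columns R constructs behind the scenes can be inspected with model.matrix(); this sketch uses a cut-down version of the data:

```r
# model.matrix() shows the design matrix lm() builds from the formula:
# an intercept, one dummy per non-baseline level, and hardness
metal.small <- data.frame(
  hardness = c(120, 165, 175),
  setting  = factor(c("slow", "medium", "fast"))
)
model.matrix(~ setting + hardness, data = metal.small)
# columns: (Intercept), settingmedium, settingslow, hardness
```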

16 Factors
• Since the data for setting in the input data was character data, the variable setting was automatically recognized as a factor.
• In fact the 3 settings were 1000, 1200 and 1400 rpm (slow = 1000, medium = 1200, fast = 1400). What would happen if the input data had used these (numerical) values?
• Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.

17 Factors (2)

> rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))
> summary(lm(rate ~ rpm + hardness, data = metal.df))

Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -88.674624   7.837602  -11.31 9.29e-08 ***
rpm           0.047519   0.004521   10.51 2.09e-07 ***
hardness      0.934226   0.047944   19.49 1.89e-10 ***

When rpm = 1000, the relationship is
-88.674624 + 0.047519 × 1000 + 0.934226 × hardness
i.e. -41.15562 + 0.934226 × hardness

18 Factors (3)

         Intercept                 Slope
         factor      non-factor    factor    non-factor
Fast     -22.17042   -22.14802     0.93426   0.93423
Medium   -31.62022   -31.65182     0.93426   0.93423
Slow     -41.17799   -41.15562     0.93426   0.93423

The non-factor model constrains the 3 intercepts to be equally spaced. OK for this data set, but not in general.

19 Factors (4)
To avoid this, we could either:
• recode the variable as character, or
• (easier) use the factor function to coerce the numerical data into a factor:
  rpm.as.factor <- factor(rpm)
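A sketch of the coercion, together with a classic trap worth knowing: as.numeric() applied directly to a factor returns the internal level codes, not the original values.

```r
rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))
rpm.as.factor <- factor(rpm)

levels(rpm.as.factor)                        # "1000" "1200" "1400"
as.numeric(rpm.as.factor)[1]                 # 1 (the level code, not 1000!)
as.numeric(as.character(rpm.as.factor))[1]   # 1000 (the actual value)
```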

20 Factors (5)
We can fit the "factor" model using the R code

> rpm.as.factor <- factor(rpm)
> summary(lm(rate ~ rpm.as.factor + hardness))

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -41.17799    6.84927  -6.012 8.77e-05 ***
rpm.as.factor1200   9.55777    1.86692   5.120 0.000334 ***
rpm.as.factor1400  19.00757    1.88875  10.064 6.94e-07 ***
hardness            0.93426    0.05008  18.654 1.13e-09 ***

These estimates are different! What's going on?

21 Levels
• The different values of a factor are called "levels".
• The levels of the factor setting are fast, medium, slow:
  > levels(setting)
  [1] "fast"   "medium" "slow"
• The levels of the factor rpm.as.factor are 1000, 1200, 1400:
  > levels(rpm.as.factor)
  [1] "1000" "1200" "1400"

22 Levels (2)
• By default, the levels are listed in alphabetical order.
• The first level is selected as the baseline.
• Thus, using setting, the baseline is "fast"; using rpm.as.factor, the baseline is "1000".

23 Levels (3)
We can change the order of the levels using the factor function:

> rpm.newbaseline <- factor(rpm, levels = c("1400", "1200", "1000"))
> summary(lm(rate ~ rpm.newbaseline + hardness, data = metal.df))

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -22.17042    7.15425  -3.099 0.010124 *
rpm.newbaseline1200  -9.44980    1.87275  -5.046 0.000374 ***
rpm.newbaseline1000 -19.00757    1.88875 -10.064 6.94e-07 ***
hardness              0.93426    0.05008  18.654 1.13e-09 ***
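An equivalent, commonly used alternative is relevel(), which moves one chosen level to the front (the baseline) without spelling out the whole order:

```r
# relevel() makes "1400" the baseline; the remaining levels keep
# their original (alphabetical) order
rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))
rpm.as.factor <- factor(rpm)
rpm.newbaseline <- relevel(rpm.as.factor, ref = "1400")
levels(rpm.newbaseline)  # "1400" "1000" "1200"
```

The fitted lines are identical whichever way the baseline is set; only the labelling and order of the coefficient rows change.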

24 Non-parallel lines
What if the lines aren't parallel? Then the slopes β are different: the model becomes

rate = μF + βF × hardness   (fast)
rate = μM + βM × hardness   (medium)
rate = μS + βS × hardness   (slow)

25 Baseline version for the betas
As before, we can regard the fast setting as a baseline and express the other slopes as "baseline plus offsets":

βF             (baseline slope)
βM = βF + γM   (slope offset γM for the medium line)
βS = βF + γS   (slope offset γS for the slow line)

26 Baseline version for both parameters
We can then write the model as

rate = μF + βF × hardness               (fast)
rate = μF + δM + (βF + γM) × hardness   (medium)
rate = μF + δS + (βF + γS) × hardness   (slow)

27 Dummy variables for both parameters
As before, we can combine these 3 equations into one by using "dummy variables". Define med and slow as before, and

h.med  = hardness × med
h.slow = hardness × slow

Then we can write the model as

rate = μF + δM × med + δS × slow + βF × hardness + γM × h.med + γS × h.slow
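The hand-built interaction dummies can be sketched as follows (small illustrative vectors, not the full data set):

```r
# Interaction dummies: hardness enters separately for each non-baseline group
hardness <- c(120, 140, 165, 140, 175, 132)
setting  <- c("slow", "slow", "medium", "medium", "fast", "fast")

med    <- ifelse(setting == "medium", 1, 0)
slow   <- ifelse(setting == "slow", 1, 0)
h.med  <- hardness * med   # hardness where setting is medium, else 0
h.slow <- hardness * slow  # hardness where setting is slow, else 0

cbind(med, slow, h.med, h.slow)
```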

28 Fitting in R
The model formula for this non-parallel model is

rate ~ setting + hardness + setting:hardness

or, even more compactly,

rate ~ setting * hardness

> summary(lm(rate ~ setting * hardness))

                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            -12.18162   10.32795  -1.179   0.2684
settingmedium          -30.15725   15.49375  -1.946   0.0834 .
settingslow            -33.60120   19.58902  -1.715   0.1204
hardness                 0.86312    0.07295  11.831 8.69e-07 ***
settingmedium:hardness   0.14961    0.11125   1.345   0.2116
settingslow:hardness     0.10546    0.14356   0.735   0.4813

29 Is the non-parallel model necessary?
This amounts to testing whether γM and γS are zero, or, equivalently, whether the parallel model

rate ~ setting + hardness

is an adequate submodel of the non-parallel model

rate ~ setting * hardness

As in Lecture 6, we use the anova function to compare the two models:

30
> model1 <- lm(rate ~ setting + hardness)
> model2 <- lm(rate ~ setting * hardness)
> anova(model1, model2)

Analysis of Variance Table

Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     11 95.451
2      9 78.807  2    16.644 0.9504 0.4222

Conclusion: since the F-value is small and the p-value 0.4222 is large, we conclude that the submodel (i.e. the parallel lines model) is adequate.
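The F statistic and p-value in the table can be checked by hand from the two residual sums of squares:

```r
# F = ((RSS1 - RSS2) / df.diff) / (RSS2 / df2), using the values above
rss1 <- 95.451   # residual SS, parallel model (11 df)
rss2 <- 78.807   # residual SS, non-parallel model (9 df)
f <- ((rss1 - rss2) / 2) / (rss2 / 9)
p <- pf(f, df1 = 2, df2 = 9, lower.tail = FALSE)
round(c(F = f, p = p), 4)   # F ≈ 0.9504, p ≈ 0.42, matching the table
```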

