Published by Jesse Kennedy. Modified over 9 years ago.
1
STATS 330: Lecture 17 (12/22/2015)
2
Factors
In the models discussed so far, all explanatory variables have been numeric.
Now we want to incorporate categorical variables into our models.
In R, categorical variables are called factors.
3
Example
Consider an experiment to measure the rate of metal removal in a machining process on a lathe. The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and the hardness of the material being machined (a continuous measurement).
4
Data
   hardness setting rate
1       120    slow   68
2       140    slow   90
3       150    slow   98
4       125    slow   77
5       136    slow   88
6       165  medium  122
7       140  medium  104
8       120  medium   75
9       125  medium   84
10      133  medium   95
11      175    fast  138
12      132    fast  102
13      124    fast   93
14      141    fast  112
15      130    fast  100
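For readers who want to follow along, the table above can be typed into R directly. This is a minimal sketch: the name `metal.df` is taken from the later slides, but how the lecturer actually loaded the data is not shown, so the `data.frame` call here is an assumption.

```r
# Data transcribed from the slide. `metal.df` is the name used in later
# slides; building it with data.frame() is an assumption, not the
# lecturer's own import code.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Quick look at the structure and the mean rate at each speed setting
str(metal.df)
group.means <- tapply(metal.df$rate, metal.df$setting, mean)
print(group.means)
```

The explicit `factor()` call makes `setting` a factor regardless of the R version in use.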
5
[Figure: image not included in the transcript]
6
Model
A model consisting of 3 parallel lines seems appropriate, one line per speed setting:
Same slope for every setting (i.e. parallel lines)
Different intercepts
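The equations on this slide were images and did not survive in the transcript. A sketch of the three-line model follows; the symbols ($\alpha_F$, $\alpha_M$, $\alpha_S$ for the intercepts and $\beta$ for the common slope) are assumed notation, not necessarily the lecturer's.

```latex
\begin{aligned}
\text{fast:}   \quad & \mathrm{E}(\text{rate}) = \alpha_F + \beta \cdot \text{hardness} \\
\text{medium:} \quad & \mathrm{E}(\text{rate}) = \alpha_M + \beta \cdot \text{hardness} \\
\text{slow:}   \quad & \mathrm{E}(\text{rate}) = \alpha_S + \beta \cdot \text{hardness}
\end{aligned}
```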
7
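Baseline version
We can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”: the medium intercept is the baseline intercept plus an offset for the medium line, and the slow intercept is the baseline intercept plus an offset for the slow line.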
8
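Baseline version (2)
We can then write the model as three lines with a common slope: the fast line uses the baseline intercept, the medium line uses the baseline intercept plus the medium offset, and the slow line uses the baseline intercept plus the slow offset.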
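In symbols, the baseline form might be written as below. The notation is assumed, since the slide's own equations were lost: $\alpha$ is the baseline (fast) intercept, $\delta_M$ and $\delta_S$ are the offsets for the medium and slow lines, and $\beta$ is the common slope.

```latex
\begin{aligned}
\text{fast:}   \quad & \mathrm{E}(\text{rate}) = \alpha + \beta \cdot \text{hardness} \\
\text{medium:} \quad & \mathrm{E}(\text{rate}) = (\alpha + \delta_M) + \beta \cdot \text{hardness} \\
\text{slow:}   \quad & \mathrm{E}(\text{rate}) = (\alpha + \delta_S) + \beta \cdot \text{hardness}
\end{aligned}
```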
9
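“Deviation from mean” version
Now take the mean of the three intercepts (fast, medium and slow). For each line, define its offset as that line's intercept minus the mean of the intercepts; for example, the “fast” offset is the “fast” line intercept minus the mean of intercepts.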
10
“Deviation from mean” version (2)
Then each line's intercept is the mean intercept plus that line's offset.
Thus, the constant term is now the “average intercept”, and there are 3 offsets, one for each line.
The 3 offsets add to zero.
This is the form used in the Stage 2 course.
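A sketch of the “deviation from mean” form in symbols, using assumed notation (the original symbols were lost in extraction): $\alpha_F$, $\alpha_M$, $\alpha_S$ are the three intercepts, $\bar{\alpha}$ is their mean, and $\tau_j$ is the offset for line $j$.

```latex
\begin{aligned}
\bar{\alpha} &= \tfrac{1}{3}\,(\alpha_F + \alpha_M + \alpha_S), \qquad
\tau_j = \alpha_j - \bar{\alpha}, \quad j \in \{F, M, S\} \\
\mathrm{E}(\text{rate}) &= \bar{\alpha} + \tau_j + \beta \cdot \text{hardness}
  \quad \text{for setting } j \\
\tau_F + \tau_M + \tau_S &= (\alpha_F + \alpha_M + \alpha_S) - 3\bar{\alpha} = 0
\end{aligned}
```

The last line shows why the three offsets must add to zero: subtracting the mean from each intercept removes exactly the total.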
11
Dummy variables
Back to baseline form: we can combine the 3 “baseline” equations into one by using “dummy variables”. Define
med = 1 if setting = “medium” and 0 otherwise
slow = 1 if setting = “slow” and 0 otherwise
Then we can write the model as a single equation: the baseline intercept, plus the medium offset times med, plus the slow offset times slow, plus the slope times hardness.
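Written out, the single-equation form could look like this. The symbols are assumed notation ($\alpha$ baseline intercept, $\delta_M$ and $\delta_S$ offsets, $\beta$ slope); only `med`, `slow` and `hardness` come from the slide.

```latex
\mathrm{E}(\text{rate}) = \alpha + \delta_M \cdot \text{med} + \delta_S \cdot \text{slow}
  + \beta \cdot \text{hardness}
```

Setting med = slow = 0 gives the fast line, med = 1 gives the medium line with intercept $\alpha + \delta_M$, and slow = 1 gives the slow line with intercept $\alpha + \delta_S$.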
12
Fitting
The model can be fitted as usual using lm:
> med <- ifelse(metal.df$setting == "medium", 1, 0)
> slow <- ifelse(metal.df$setting == "slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data = metal.df))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.17042    7.15425  -3.099 0.010124 *
med          -9.44980    1.87275  -5.046 0.000374 ***
slow        -19.00757    1.88875 -10.064 6.94e-07 ***
hardness      0.93426    0.05008  18.654 1.13e-09 ***
13
Fitting (2)
Thus, the baseline (“fast”) line has intercept -22.17042
The “medium” line has intercept -22.17042 - 9.44980 = -31.62022
The “slow” line has intercept -22.17042 - 19.00757 = -41.17799
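The same arithmetic can be done from the fitted coefficients rather than by hand. A minimal sketch, assuming the data frame is rebuilt from the slide's table (the `data.frame` call and the name `intercepts` are illustrative, not from the lecture):

```r
# Data transcribed from the lecture's table (construction is an assumption)
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Parallel-lines model; "fast" is the baseline level
fit <- lm(rate ~ setting + hardness, data = metal.df)
b <- coef(fit)

# Intercept of each line = baseline intercept + that line's offset
intercepts <- c(
  fast   = b[["(Intercept)"]],
  medium = b[["(Intercept)"]] + b[["settingmedium"]],
  slow   = b[["(Intercept)"]] + b[["settingslow"]]
)
print(round(intercepts, 5))
```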
14
[Figure with annotations: baseline, Offset m, Offset s]
15
Fitting (3)
Making dummy variables is a pain. Fortunately R allows us to write
> summary(lm(rate ~ setting + hardness, data = metal.df))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -22.17042    7.15425  -3.099 0.010124 *
settingmedium  -9.44980    1.87275  -5.046 0.000374 ***
settingslow   -19.00757    1.88875 -10.064 6.94e-07 ***
hardness        0.93426    0.05008  18.654 1.13e-09 ***
and get the same result, provided the variable setting is a factor.
16
Factors
Since the data for setting in the input data was character data, the variable setting was automatically recognized as a factor. (This was the default when reading data in R versions before 4.0.0. From 4.0.0 on, character columns stay character unless stringsAsFactors = TRUE is given, although lm still treats a character variable as categorical.)
In fact the 3 settings were 1000, 1200, 1400 rpm. What would happen if the input data had used these (numerical) values?
Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.
17
Factors (2)
> rpm = rep(c(1000,1200,1400), c(5,5,5))
> summary(lm(rate ~ rpm + hardness, data = metal.df))
Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -88.674624   7.837602  -11.31 9.29e-08 ***
rpm           0.047519   0.004521   10.51 2.09e-07 ***
hardness      0.934226   0.047944   19.49 1.89e-10 ***
When rpm = 1000, the relationship is
rate = -88.674624 + 0.047519 * 1000 + 0.934226 * hardness
i.e.
rate = -41.15562 + 0.934226 * hardness
18
Factors (3)
          Intercept               Slope
          factor     non-factor   factor   non-factor
Fast     -22.17042  -22.14802     0.93426  0.93423
Medium   -31.62022  -31.65182     0.93426  0.93423
Slow     -41.17799  -41.15562     0.93426  0.93423
The non-factor model constrains the 3 intercepts to be equally spaced. This is OK for this data set, but not in general.
19
Factors (4)
To avoid this, we could recode the variable as character, or (easier) use the factor function to coerce the numerical data into a factor:
> rpm.as.factor = factor(rpm)
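One way to see the difference between the numeric and the factor version is to compare the design matrices R builds for each. This is a small illustrative sketch; the use of `model.matrix` and the names `n.cols.numeric` and `n.cols.factor` are additions for illustration, not part of the lecture.

```r
# The speed settings coded as numbers, as on the earlier slide
rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))

# Coerce to a factor: each distinct value becomes a level
rpm.as.factor <- factor(rpm)
print(levels(rpm.as.factor))

# Numeric rpm contributes one column (a single slope);
# the factor contributes two dummy columns (offsets from the baseline).
n.cols.numeric <- ncol(model.matrix(~ rpm))
n.cols.factor  <- ncol(model.matrix(~ rpm.as.factor))
print(c(numeric = n.cols.numeric, factor = n.cols.factor))
```

With the intercept column included, the numeric version has 2 columns and the factor version has 3, which is why the first fits a plane and the second fits separate intercepts.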
20
Factors (5)
We can fit the “factor” model using the R code
> rpm.as.factor = factor(rpm)
> summary(lm(rate ~ rpm.as.factor + hardness, data = metal.df))
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -41.17799    6.84927  -6.012 8.77e-05 ***
rpm.as.factor1200   9.55777    1.86692   5.120 0.000334 ***
rpm.as.factor1400  19.00757    1.88875  10.064 6.94e-07 ***
hardness            0.93426    0.05008  18.654 1.13e-09 ***
These estimates are different!! What’s going on??
21
Levels
The different values of a factor are called “levels”.
The levels of the factor setting are fast, medium, slow:
> levels(metal.df$setting)
[1] "fast"   "medium" "slow"
The levels of the factor rpm.as.factor are 1000, 1200, 1400:
> levels(rpm.as.factor)
[1] "1000" "1200" "1400"
22
Levels (2)
By default, the levels are listed in sorted order (alphabetical for character data, increasing for numbers).
The first level is selected as the baseline.
Thus, using setting, the baseline is “fast”.
Using rpm.as.factor, the baseline is “1000”, which is the slow setting. That is why the estimates differ.
23
Levels (3)
We can change the order of the levels using the factor function:
> rpm.newbaseline <- factor(rpm, levels = c("1400", "1200", "1000"))
> summary(lm(rate ~ rpm.newbaseline + hardness, data = metal.df))
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -22.17042    7.15425  -3.099 0.010124 *
rpm.newbaseline1200  -9.44980    1.87275  -5.046 0.000374 ***
rpm.newbaseline1000 -19.00757    1.88875 -10.064 6.94e-07 ***
hardness              0.93426    0.05008  18.654 1.13e-09 ***
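Base R also offers `relevel`, which moves one chosen level to the front and leaves the rest in their existing order. The lecture uses `factor(..., levels = ...)`; the sketch below is an alternative, not the lecturer's method, and the data frame construction and the name `rpm.ref1400` are assumptions.

```r
# Data transcribed from the lecture's table (construction is an assumption)
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)
rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))

# Make 1400 rpm (the fast setting) the baseline level
rpm.ref1400 <- relevel(factor(rpm), ref = "1400")
print(levels(rpm.ref1400))

# The intercept is now the intercept of the 1400 rpm line
fit <- lm(rate ~ rpm.ref1400 + hardness, data = metal.df)
baseline.intercept <- coef(fit)[["(Intercept)"]]
print(round(baseline.intercept, 5))
```

Only the baseline matters for the intercept, so this gives the same baseline intercept as the reordering shown on the slide, although the remaining levels appear in a different order.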
24
Non-parallel lines
What if the lines aren’t parallel? Then the betas (slopes) are different: the model becomes three lines, each with its own intercept and its own slope.
25
Baseline version for the betas
As before, we can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”: the medium slope is the baseline slope plus an offset for the medium line slope, and the slow slope is the baseline slope plus an offset for the slow line slope.
26
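Baseline version for both parameters
We can then write the model with “baseline plus offset” forms for both parameters: the fast line uses the baseline intercept and baseline slope, while the medium and slow lines each add their own intercept offset and slope offset.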
27
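Dummy variables for both parameters
As before, we can combine these 3 equations into one by using “dummy variables”. Define med and slow as before, and
h.med = hardness × med
h.slow = hardness × slow
Then we can write the model as a single equation with six coefficients: the baseline intercept, the intercept offsets multiplying med and slow, the baseline slope multiplying hardness, and the slope offsets multiplying h.med and h.slow.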
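A sketch of that single equation, in assumed notation since the slide's equation was lost: $\alpha$ and $\beta$ are the baseline (fast) intercept and slope, $\delta_M$, $\delta_S$ are the intercept offsets, and $\gamma_M$, $\gamma_S$ are the slope offsets.

```latex
\begin{aligned}
\mathrm{E}(\text{rate}) &= \alpha + \delta_M \cdot \text{med} + \delta_S \cdot \text{slow}
  + \beta \cdot \text{hardness}
  + \gamma_M \cdot \text{h.med} + \gamma_S \cdot \text{h.slow} \\[4pt]
\text{fast:}   \quad & \mathrm{E}(\text{rate}) = \alpha + \beta \cdot \text{hardness} \\
\text{medium:} \quad & \mathrm{E}(\text{rate}) = (\alpha + \delta_M) + (\beta + \gamma_M) \cdot \text{hardness} \\
\text{slow:}   \quad & \mathrm{E}(\text{rate}) = (\alpha + \delta_S) + (\beta + \gamma_S) \cdot \text{hardness}
\end{aligned}
```

The parallel-lines model is the special case $\gamma_M = \gamma_S = 0$.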
28
Fitting in R
The model formula for this non-parallel model is
rate ~ setting + hardness + setting:hardness
or, even more compactly,
rate ~ setting * hardness
> summary(lm(rate ~ setting * hardness, data = metal.df))
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            -12.18162   10.32795  -1.179   0.2684
settingmedium          -30.15725   15.49375  -1.946   0.0834 .
settingslow            -33.60120   19.58902  -1.715   0.1204
hardness                 0.86312    0.07295  11.831 8.69e-07 ***
settingmedium:hardness   0.14961    0.11125   1.345   0.2116
settingslow:hardness     0.10546    0.14356   0.735   0.4813
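To check that the two formulas describe the same model, and to turn the interaction coefficients into a slope for each line, something like the following could be used. It is a sketch under the assumption that the data frame is rebuilt from the slide's table; the names `fit.star`, `fit.colon` and `slopes` are illustrative.

```r
# Data transcribed from the lecture's table (construction is an assumption)
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# The compact and the expanded formula give the same fit
fit.star  <- lm(rate ~ setting * hardness, data = metal.df)
fit.colon <- lm(rate ~ setting + hardness + setting:hardness, data = metal.df)
same.fit <- isTRUE(all.equal(coef(fit.star), coef(fit.colon)))
print(same.fit)

# Slope of each line = baseline slope + that line's slope offset
b <- coef(fit.star)
slopes <- c(
  fast   = b[["hardness"]],
  medium = b[["hardness"]] + b[["settingmedium:hardness"]],
  slow   = b[["hardness"]] + b[["settingslow:hardness"]]
)
print(round(slopes, 5))
```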
29
Is the non-parallel model necessary?
This amounts to testing whether the slope offsets for the medium and slow lines are both zero, or, equivalently, whether the parallel model
rate ~ setting + hardness
is an adequate submodel of the non-parallel model
rate ~ setting * hardness
As in Lecture 6, we use the anova function to compare the two models:
30
> model1 <- lm(rate ~ setting + hardness, data = metal.df)
> model2 <- lm(rate ~ setting * hardness, data = metal.df)
> anova(model1, model2)
Analysis of Variance Table
Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     11 95.451
2      9 78.807  2    16.644 0.9504 0.4222
Conclusion: since the F-value is small and the p-value 0.4222 is large, we conclude that the submodel (i.e. the parallel lines model) is adequate.
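The F statistic in the table can be reproduced from the two residual sums of squares, which makes explicit what `anova` is comparing. A minimal sketch, assuming the data frame is rebuilt from the slide's table; the names `f.stat`, `p.value` and `f.by.hand` are illustrative.

```r
# Data transcribed from the lecture's table (construction is an assumption)
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

model1 <- lm(rate ~ setting + hardness, data = metal.df)  # parallel lines
model2 <- lm(rate ~ setting * hardness, data = metal.df)  # non-parallel lines

# Extract the test statistic and p-value from the anova table
cmp <- anova(model1, model2)
f.stat  <- cmp$F[2]
p.value <- cmp[["Pr(>F)"]][2]

# Same F by hand: drop in RSS per extra parameter, divided by the
# residual mean square of the larger model
rss1 <- sum(resid(model1)^2)
rss2 <- sum(resid(model2)^2)
extra.params <- df.residual(model1) - df.residual(model2)
f.by.hand <- ((rss1 - rss2) / extra.params) / (rss2 / df.residual(model2))

print(round(c(F = f.stat, F.by.hand = f.by.hand, p = p.value), 4))
```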