F test for Lack of Fit: the lack of fit test


1 F test for Lack of Fit: the lack of fit test

2 Review: use the General Linear Test (GLT) approach to test the slope
Ho: $\beta_1 = 0$ versus Ha: $\beta_1 \neq 0$

Full model (under Ha): $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, with $SSE_F = \sum (Y_i - \hat{Y}_i)^2 = SSE$ and $df_F = n - 2$.

Reduced model (under Ho): $Y_i = \beta_0 + \epsilon_i$, where $\beta_0$ is estimated by the grand mean $\bar{Y}$, with $SSE_R = \sum (Y_i - \bar{Y})^2 = SSTO$ and $df_R = n - 1$.

"Significant reduction in SSE?"

$$F^* = \frac{(SSE_R - SSE_F) / (df_R - df_F)}{SSE_F / df_F} = \frac{MSR}{MSE} \sim F(1, n-2)$$

The test statistic of the general linear test in simple linear regression is identical to the ANOVA test statistic.

In the previous topic, we mentioned that ANOVA is an example of the general test for a linear statistical model, or GLT approach. Simply put, the general linear test approach builds two models: a full model and a reduced model. The full model is constructed under Ha, and the reduced model is constructed under Ho. The full model usually has more parameters, while the reduced model has fewer. Because it has more parameters, the full model tends to leave less unexplained variation in Y than the reduced model. On the other hand, if the reduction in unexplained variation in Y from the reduced model to the full model is not significant, we do not support the full model (Ha), and hence do not reject the reduced model (Ho).

In the ANOVA test on the slope, the null hypothesis says beta1 is 0, and the alternative hypothesis says beta1 is not 0. The full model is the model under Ha: Y = beta0 + beta1 X + random error. The reduced model is constructed under Ho: Y = beta0 + random error. The F test statistic compares the reduction in the unexplained variation in Y (SSE reduced - SSE full) with the unexplained variation in Y under the full model (SSE full). Each sum of squares is divided by its degrees of freedom so that the test statistic follows an F distribution, which makes this an F test. In this GLT setting, the test statistic, MSR/MSE, is identical to the ANOVA F statistic in simple linear regression.
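As a rough illustration of the general linear test, the F* ratio can be written as a small Python function. This is only a sketch: the function name `general_linear_test` is made up for this example, and the numbers plugged in are the rounded sums of squares from the bank example ANOVA table that appears later in this talk.

```python
def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """Return F* = [(SSE_R - SSE_F) / (df_R - df_F)] / [SSE_F / df_F]."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Slope test for the bank example (rounded values from the later ANOVA table):
#   reduced model: SSE_R = SSTO = 19883, df_R = n - 1 = 10
#   full model:    SSE_F = SSE  = 14742, df_F = n - 2 = 9
f_star = general_linear_test(19883, 10, 14742, 9)
print(round(f_star, 2))
```

The same function applies to any pair of nested models, as long as the reduced model has more error degrees of freedom than the full model.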

3 The F test for Lack of Fit
In this topic, we introduce another F test, called the lack of fit test. The lack of fit test is a formal test for determining whether a specific type of regression function adequately fits the data. We again assume that the observations are independent and normally distributed, and that the variance of the random error, sigma squared, is the same across all X levels. Compared with the F test for the slope, a unique requirement of the lack of fit test is that we need several observations at one or more X levels (called replicates). This does not mean we need replicates at every X level, but we need them for at least some X levels. We cannot perform the lack of fit test when there is only one Y per X level.

4 The Bank example
We now look at the bank example. Eleven similar branches of a bank offered gifts for setting up money market accounts, and we are interested in the relationship between the specified minimum deposit and the number of new accounts opened.

5 Notation: the case where x = 75

Minimum deposit (X) | Number of new accounts (Y)
75 | 28, 42
100 | 112, 136
125 | 160, 150
150 | 152
175 | 156, 124
200 | 124, 104

$Y_{11}$ denotes the first measurement (28) made at the first X level (75).
$Y_{21}$ denotes the second measurement (42) made at the first X level (75).
$\bar{Y}_1$ denotes the average (35) of all Y values at the first X level (75).
$\hat{Y}_{11}$ denotes the predicted response $b_0 + b_1 X \approx 87.2$ for the first measurement at the first X level (75).
$\hat{Y}_{21}$ denotes the predicted response $b_0 + b_1 X \approx 87.2$ for the second measurement at the first X level (75).

Most of the X levels have two replicates. The exception is X = 150, where there is only one observation, Y = 152.
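The notation can be made concrete with a short Python sketch. Two caveats: the data list below is a reconstruction of the flattened slide table (the second values at X = 125 and X = 200 are inferred from the level mean of 155 and the sums of squares reported later in the talk, so treat them as an assumption), and the names `bank`, `levels`, `level_means` and `fitted` are invented for this example. With the unrounded least squares coefficients, the fitted value at X = 75 comes out near 87.2.

```python
from collections import defaultdict

# (minimum deposit X, number of new accounts Y); reconstructed, see caveat above
bank = [(75, 28), (75, 42), (100, 112), (100, 136), (125, 160), (125, 150),
        (150, 152), (175, 156), (175, 124), (200, 124), (200, 104)]

# Group the replicates by X level: levels[75] == [Y_11, Y_21] == [28, 42]
levels = defaultdict(list)
for x, y in bank:
    levels[x].append(y)

# Level means, Ybar_j
level_means = {x: sum(ys) / len(ys) for x, ys in levels.items()}

# Ordinary least squares estimates b0 and b1 for the straight-line model
n = len(bank)
x_bar = sum(x for x, _ in bank) / n
y_bar = sum(y for _, y in bank) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in bank)
      / sum((x - x_bar) ** 2 for x, _ in bank))
b0 = y_bar - b1 * x_bar


def fitted(x):
    """Predicted response Yhat = b0 + b1 * x."""
    return b0 + b1 * x


print(levels[75])                # the two replicates at the first X level
print(level_means[75])           # Ybar_1
print(round(fitted(75), 1))      # Yhat_11 = Yhat_21
```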

6 Notation
The number of X levels is c (= 6).

7 The F test of ANOVA for Ho: $\beta_1 = 0$ versus Ha: $\beta_1 \neq 0$
Q: Does X have a significant linear impact on Y?

Source of Variation | SS | df | MS | F | Conclusion
Regression | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | $1$ | $MSR = SSR/1$ | $MSR/MSE \sim F(1, n-2)$ | Reject Ho means X has a significant linear impact on Y
Error | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | $n-2$ | $MSE = SSE/(n-2)$ | |
Total | $SSTO = \sum (Y_i - \bar{Y})^2$ | $n-1$ | | |

The bank example: do not reject Ho; X has no significant linear impact on Y. There is no evidence to reject $\beta_1 = 0$.

The F test of ANOVA for the slope that we saw earlier computes the ratio of MSR to MSE. Rejecting Ho means X has a significant linear impact on Y. In this example, the F ratio is 3.14 and the p value is 0.11. Since the p value is greater than 0.05, we do not reject Ho and conclude that there is no evidence that X has a linear impact on Y. For this question, it means the minimum deposit has no significant linear impact on the number of accounts opened. Since the gift offered is positively related to the minimum deposit, we can say that a better gift might not be incentive enough for opening more accounts. The fitted line appears to be nearly horizontal (beta1 = 0).

8 There is no evidence to reject Ho: $\beta_1 = 0$
[Figure: scatter plot of the bank data with the fitted line and the level means $\bar{Y}_1$ to $\bar{Y}_6$, with X = 130 marked and the annotation "Low impact, poor fit". A second sketch is annotated "Low impact, can still be a good fit".]

So the fitted line is rather flat, but there appear to be more issues with the fit.

We have found that beta1 is not significantly different from 0. Referring to the scatter plot, we see that the line is rather flat (beta1 = 0). But there appears to be a bigger issue with fitting these data with a straight line, because the relationship does not look linear. When the minimum deposit is less than about 130, customers value the gift reward positively: more accounts are opened as the reward (and hence the minimum deposit) increases. Beyond that point, increasing the reward (and the deposit required) no longer brings in new accounts. As a predictive model, the linear model does not seem to be a good choice: at every level the prediction is either too high or too low relative to the sample average, the only exception being level 5 (X = 175).

X has low impact on Y, but the model is also a poor fit. Can a low-impact model have a good fit? Yes, as shown in the sketch plot. That line has a similar slope, but since the average at each level of X is approximately predicted by the line, it is a good fit for the data. To assess the actual fit of the model, we consider the lack-of-fit test.

9 The lack of fit test Ho: $E(Y) (= \mu) = \beta_0 + \beta_1 X$, Ha: $E(Y) (= \mu) \neq \beta_0 + \beta_1 X$
Q: Does the current (linear) model fit the data, or is there lack of fit?

Full model (under Ha): $Y_{ij} = \mu_j + \epsilon_{ij}$, with $SSE_F = \sum\sum (Y_{ij} - \bar{Y}_j)^2 = SSPE$ and $df_F = n - c$.

Reduced model (under Ho): $Y_{ij} = \beta_0 + \beta_1 X_j + \epsilon_{ij}$, with $SSE_R = \sum\sum (Y_{ij} - \hat{Y}_{ij})^2 = SSE$ and $df_R = n - 2$.

$$F^* = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} = \frac{(SSE - SSPE)/\big((n-2) - (n-c)\big)}{SSPE/(n-c)} = \frac{SSLF/(c-2)}{SSPE/(n-c)} = \frac{MSLF}{MSPE} \sim F(c-2, n-c)$$

SSPE: pure error sum of squares
MSPE: pure error mean square
SSLF: lack of fit sum of squares
MSLF: lack of fit mean square

The lack of fit test is a test of whether a model fits the data. If the model fits, the mean of the response variable at each level can be predicted with the linear function. Otherwise, the means of the response variable cannot be predicted with such a linear function. The full model is built under Ha: the true mean of Y at each level is denoted by mu, and under Ha mu is not given by a linear function of X. The reduced model is the usual linear model. It is interesting to see that the reduced model in the lack of fit test is actually the full model in the GLT test for the slope, with df = n - 2. If the reduction in SSE from the reduced model to the full model is not significant, we will not reject the null hypothesis. The lack of fit test is also a GLT approach. Next we will discuss how the variation is partitioned for the lack of fit test.

10 Partition the variances
In the lack-of-fit test, we partition the variation and study it under different models. Recall that in the previous ANOVA, the total variation in Y (SSTO) is partitioned into SSE and SSR. SSE is the variation in Y that cannot be explained by X.

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

SSTO = SSE + SSR
("total sum of squares" = "error sum of squares" + "regression sum of squares")

11 Partition the residual errors for lack of fit
Error deviation: $Y_i - \hat{Y}_i$

$$SSE = \sum (Y_i - \hat{Y}_i)^2$$

The lack of fit test takes the SSE and partitions it further. In other words, we are trying to find the reason for the unexplained variation in Y by studying SSE.

12 Partition the residual errors for lack of fit
Ho: $E(Y) (= \mu) = \beta_0 + \beta_1 X$ ("no lack of fit")
Ha: $E(Y) (= \mu) \neq \beta_0 + \beta_1 X$ ("lack of fit")

Ho: $\mu_j = \hat{Y}_j$
Ha: $\mu_j \neq \hat{Y}_j$ ("lack of fit")

Use $\bar{Y}_j - \hat{Y}_j$ to estimate "lack of fit".

$$\sum\sum (Y_{ij} - \hat{Y}_{ij})^2 = \sum\sum (Y_{ij} - \bar{Y}_j)^2 + \sum\sum (\bar{Y}_j - \hat{Y}_{ij})^2$$

SSE = SSPE + SSLF
$n - 2 = (n - c) + (c - 2)$

When there is only a single measurement at each X level, SSPE = 0.

Take the third X level, X = 125. In one branch 160 accounts were opened, and on average 155 accounts were opened at this level of X. The predicted number of accounts at this level is 112 on the line. The unexplained deviation between this case (160) and the predicted value (112) is 160 - 112 = 48, which is the error deviation. Of this deviation of 48, how much is caused by lack of fit, and how much is caused by random error? As seen on the right, if the model lacks fit, we would expect the true mean mu to deviate from the estimated mean, Yhat. In this case, the lack of fit deviation is the difference between the sample mean (155) and the predicted mean (112): 155 - 112 = 43. The rest of the deviation, 48 - 43 = 5, is therefore caused by random error, and it can also be computed as the difference between the actual observation (160) and its sample mean (155).

Summing the squared deviations over all observations gives the partition SSE = SS (pure error) + SS (lack of fit). The degrees of freedom can be partitioned in the same way. When there is only a single measurement per X level, every Y equals its level mean Ybar, so SSPE = 0. We then cannot tell SSE and SSLF apart, and cannot perform the lack of fit test.
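The partition can be checked numerically with a short Python sketch. It assumes the bank data as reconstructed from the slide table (the second values at X = 125 and X = 200 are inferred, not read directly from the transcript), and all variable names are invented for this example. Because the slide rounds the fitted value at X = 125 to 112, the unrounded deviations printed here are about 48.4 and 43.4 rather than 48 and 43.

```python
from collections import defaultdict

# (minimum deposit X, new accounts Y); reconstructed bank data (assumption)
bank = [(75, 28), (75, 42), (100, 112), (100, 136), (125, 160), (125, 150),
        (150, 152), (175, 156), (175, 124), (200, 124), (200, 104)]
n = len(bank)

# Least squares line (the reduced model under Ho)
x_bar = sum(x for x, _ in bank) / n
y_bar = sum(y for _, y in bank) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in bank)
      / sum((x - x_bar) ** 2 for x, _ in bank))
b0 = y_bar - b1 * x_bar

# Level means Ybar_j (the full model under Ha)
groups = defaultdict(list)
for x, y in bank:
    groups[x].append(y)
level_mean = {x: sum(ys) / len(ys) for x, ys in groups.items()}
c = len(level_mean)

# Decomposition for one observation: Y = 160 at X = 125
x0, y0 = 125, 160
error_dev = y0 - (b0 + b1 * x0)                     # Y_ij - Yhat_ij
pure_error_dev = y0 - level_mean[x0]                # Y_ij - Ybar_j
lack_of_fit_dev = level_mean[x0] - (b0 + b1 * x0)   # Ybar_j - Yhat_ij
print(round(error_dev, 1), round(pure_error_dev, 1), round(lack_of_fit_dev, 1))

# Sums of squares over all observations
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in bank)
sspe = sum((y - level_mean[x]) ** 2 for x, y in bank)
sslf = sum((level_mean[x] - (b0 + b1 * x)) ** 2 for x, y in bank)
print(round(sse), round(sspe), round(sslf))   # SSE, SSPE, SSLF
print(n - 2, n - c, c - 2)                    # matching degrees of freedom
```

Up to floating point error, `sse` equals `sspe + sslf`, and the degrees of freedom satisfy n - 2 = (n - c) + (c - 2).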

13 ANOVA table

Source of Variation | SS | df | MS | F | Conclusion
Regression | $SSR = \sum\sum (\hat{Y}_{ij} - \bar{Y})^2$ | $1$ | $MSR = SSR/1$ | $MSR/MSE \sim F(1, n-2)$ | Reject Ho means X has a significant linear impact on Y
Error | $SSE = \sum\sum (Y_{ij} - \hat{Y}_{ij})^2$ | $n-2$ | $MSE = SSE/(n-2)$ | |
Lack of fit (in Error) | $SSLF = \sum\sum (\bar{Y}_j - \hat{Y}_{ij})^2$ | $c-2$ | $MSLF = SSLF/(c-2)$ | $MSLF/MSPE \sim F(c-2, n-c)$ | Reject Ho means the current model does not fit the data
Pure error (in Error) | $SSPE = \sum\sum (Y_{ij} - \bar{Y}_j)^2$ | $n-c$ | $MSPE = SSPE/(n-c)$ | |
Total | $SSTO = \sum\sum (Y_{ij} - \bar{Y})^2$ | $n-1$ | | |

The original ANOVA table can be extended to include the lack of fit test. SSTO is first partitioned into SSR and SSE. The SSE is then further partitioned into SSLF and SSPE. The test statistic compares the mean square due to lack of fit with the mean square due to pure random error. If it is greater than the critical value, we conclude that the lack of fit is significant and the model does not fit the data.

14 The bank example (n=11, c=6)
Source of Variation | SS | df | MS | F | Conclusion
Regression | 5141 | 1 | 5141 | | X (has / does not have) a significant linear impact on Y
Error | 14742 | 11 - 2 = 9 | 1638 | |
Lack of fit (in Error) | 13594 | 6 - 2 = 4 | 3398.5 | | The current linear model (fits / does not fit) the data
Pure error (in Error) | 1148 | 11 - 6 = 5 | 229.6 | |
Total | 19883 | 10 | | |

Now let's see whether the linear model shows lack of fit for the bank data.

15 The bank example (n=11, c=6)
Source of Variation | SS | df | MS | F | Conclusion
Regression | 5141 | 1 | 5141 | 5141/1638 = 3.14 (p = 0.11) | X does not have a significant linear impact on Y
Error | 14742 | 11 - 2 = 9 | 1638 | |
Lack of fit (in Error) | 13594 | 6 - 2 = 4 | 3398.5 | 3398.5/229.6 = 14.8 (p = 0.0056) | The current linear model does not fit the data
Pure error (in Error) | 1148 | 11 - 6 = 5 | 229.6 | |
Total | 19883 | 10 | | |

In the bank example, the total sample size is 11 and the number of X levels is 6. We can use this information to confirm the degrees of freedom for every sum of squares. For the first hypothesis, on the slope, MSR/MSE is 3.14 and the exact p value is 0.11, so we do not reject Ho and conclude that X does not have a significant linear impact on Y. For the lack of fit test, the test statistic MSLF/MSPE is 14.8 and the exact p value is 0.0056, so we reject Ho and conclude that the current linear model does not fit the data.

16 The bank example (n=11, c=6)
Source of Variation | SS | df | MS | F | Conclusion
Regression | 5141 | 1 | 5141 | 3.14 (p = 0.11) | X does not have a significant linear impact on Y
Error | 14742 | 11 - 2 = 9 | 1638 | |
Lack of fit (in Error) | 13594 | 6 - 2 = 4 | 3398.5 | 14.8 (p = 0.0056) | The current linear model does not fit the data
Pure error (in Error) | 1148 | 11 - 6 = 5 | 229.6 | |
Total | 19883 | 10 | | |

Build the reduced model under Ho: $\hat{Y} = \beta_0 + \beta_1 X$
Build the full model under Ha: $\hat{Y} = \mu$
MSLF/MSPE = 14.801

Now we demonstrate how to use R to do the lack of fit test. First build the reduced model under Ho, which assumes the linear model is a good fit. Hence the reduced model is the same as the regular regression model: Yhat = beta0 + beta1 X. Then build the full model under Ha, which assumes the linear model is not a good fit. The full model is then Yhat = mu. In R, this model regresses Y on each level of X, with X treated as a factor. Last, the anova function compares the reduced model with the full model. In the lack of fit output, we see that the total SSE is 14742 with 9 degrees of freedom in the regular regression model (the reduced model). The lack of fit sum of squares is 13594 and the pure error sum of squares is 1148. The F statistic is then computed as 14.801 and the p value is 0.0056.
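The R session itself is not part of this transcript, so the following is not the code from the slide. It is a rough Python analogue of the same three steps (fit the reduced straight-line model, fit the full model with one mean per X level, compare the two), written with the standard library only. The function name `lack_of_fit_test` is made up for this sketch, and the bank data are the reconstructed values (an assumption).

```python
from collections import defaultdict


def lack_of_fit_test(x, y):
    """Return (F*, numerator df, denominator df) for the lack of fit test."""
    n = len(x)

    # Step 1: reduced model under Ho, the least squares line Yhat = b0 + b1 * X
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    sse_reduced = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # SSE
    df_reduced = n - 2

    # Step 2: full model under Ha, one mean per X level (X treated as a factor)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    means = {xi: sum(ys) / len(ys) for xi, ys in groups.items()}
    sse_full = sum((yi - means[xi]) ** 2 for xi, yi in zip(x, y))  # SSPE
    c = len(groups)
    df_full = n - c

    if df_full == 0:
        raise ValueError("no replicates: SSPE = 0, the test cannot be computed")
    if c <= 2:
        raise ValueError("need more than two X levels to test lack of fit")

    # Step 3: general linear test comparing the two models
    f_star = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    return f_star, df_reduced - df_full, df_full


# Reconstructed bank data (assumption): minimum deposit and new accounts
x = [75, 75, 100, 100, 125, 125, 150, 175, 175, 200, 200]
y = [28, 42, 112, 136, 160, 150, 152, 156, 124, 124, 104]

f_star, df_num, df_den = lack_of_fit_test(x, y)
print(round(f_star, 2), df_num, df_den)
```

The p value is left out because the Python standard library has no F distribution; a statistics package would be needed for that step.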

17 Practice problem 1: fill in the missing (??) in the ANOVA table from a SLR
Source of Variation | SS | df | MS | F | Conclusion
Regression: 12.597 ??
Error:
Lack of fit (in Error): 3
Pure error (in Error): 0.157
Total: 15.522 14

Now, as self practice, please try to complete the following two exercises on your own before checking the solutions at the end of the talk.

18 Practice problem 2: complete the ANOVA table according to the R output
Source of Variation | SS | df | MS | F | Conclusion
Regression | | | | |
Error | | | | |
Lack of fit (in Error) | | | | |
Pure error (in Error) | | | | |
Total | | | | |

19 Lack of fit test is not valid when there is no replication
$SSPE = \sum\sum (Y_{ij} - \bar{Y}_j)^2 = 0$
Solution: grouping

Now that we understand how to do the lack of fit test, let's talk about its restrictions. The lack of fit test is not valid when no level of X has replicates, that is, when there is only one observation of Y per X. In that case, each observation Y and the mean of Y at its level are the same value, so SSPE = 0. For example, suppose we have data without replicates, with one Y per X. We see that SSPE = 0, and the F statistic cannot be computed.

One solution is to create replicates manually. We group the X values into four groups, each with 3 or 2 values. X values of ... and 40 are now replaced by their average, 30; X values of 50, 60 and 70 are now replaced by their average, 60; and so on. We have created 4 levels (c = 4), so the degrees of freedom for SSPE are n - c = 11 - 4 = 7 and the degrees of freedom for SSLF are c - 2 = 4 - 2 = 2. Based on the resulting F statistic and p value, we do not reject Ho, and conclude that the model fits the data well.

20 Practice problem 1 Solution: fill in the missing (??) in the ANOVA table from a SLR
Conclusions: significant impact; lack of fit.

Solution for self-practice question 1.

21 Practice problem 2 solution: complete the ANOVA table according to the R output
Source of Variation | SS | df | MS | F | Conclusion
Regression: 1, 70.21, significant linear impact
Error: SS = 77.983, df = 23, MS = 3.391
Lack of fit (in Error): SS = 22.749, df = 3, MS = 7.583, F = 2.75, no lack of fit
Pure error (in Error): SS = 55.234, df = 20, MS = 2.762
Total: 24

Solution for practice question 2.
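The error rows of this solution can be checked with a few lines of Python. The sketch below only uses numbers that appear in the solution table (the lack of fit and pure error sums of squares and their degrees of freedom); the variable names are made up for this example.

```python
# Values taken from the solution table above
sslf, df_lf = 22.749, 3    # lack of fit
sspe, df_pe = 55.234, 20   # pure error

# SSE and its degrees of freedom are the sums of the two parts
sse = sslf + sspe
df_e = df_lf + df_pe

# Mean squares and the lack of fit F statistic
mse = sse / df_e
mslf = sslf / df_lf
mspe = sspe / df_pe
f_lof = mslf / mspe

print(round(sse, 3), df_e, round(mse, 3))
print(round(mslf, 3), round(mspe, 3), round(f_lof, 2))
```

The printed values should agree with the Error, Lack of fit and Pure error rows of the table.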

