1
F test for Lack of Fit The lack of fit test.
2
Review: use the General Linear Test (GLT) approach to test the slope
$H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$

Full model (under Ha): $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with $SSE(F) = \sum (Y_i - \hat{Y}_i)^2 = SSE$ and $df_F = n - 2$.
Reduced model (under Ho): $Y_i = \beta_0 + \varepsilon_i = \mu_{\text{grand mean}} + \varepsilon_i$, with $SSE(R) = \sum (Y_i - \bar{Y})^2 = SSTO$ and $df_R = n - 1$.

In the previous topic, we mentioned that ANOVA is an example of the general test for a linear statistical model, or GLT approach. Simply speaking, the general linear test approach creates two models, a full model and a reduced model. The full model is constructed under Ha, and the reduced model is constructed under Ho. The full model usually has more parameters and so appears longer, while the reduced model has fewer parameters and is shorter. Because it has more parameters, the full model tends to leave less unexplained variation in Y and to be more useful than the reduced model. On the other hand, if the unexplained variation in Y is not significantly smaller under the full model than under the reduced model, we do not support the full model (Ha), and hence do not reject the reduced model (Ho).

In the ANOVA test on the slope, the null hypothesis says beta1 is 0, and the alternative hypothesis says beta1 is not 0. The full model is the model under Ha: Y = beta0 + beta1 X + random error. The reduced model is the model under Ho: Y = beta0 + random error. The F test statistic is the ratio of the reduction in unexplained variation in Y (SSE reduced minus SSE full) to the unexplained variation left under the full model (SSE full). Each quantity is divided by its degrees of freedom so that the test statistic follows the F distribution, which is why this is an F test. In this GLT setting, the test statistic, MSR/MSE, is identical to the ANOVA F statistic in simple linear regression.

"Significant reduction in SSE?"
$$F^* = \frac{\bigl(SSE(R) - SSE(F)\bigr)/(df_R - df_F)}{SSE(F)/df_F} = \frac{MSR}{MSE} \sim F(1, n-2)$$

The test statistic of the general linear test in simple linear regression is identical to the ANOVA test statistic.
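As a worked check, and assuming the standard decomposition SSTO = SSR + SSE, substituting the two models' error sums of squares and degrees of freedom into the general linear test statistic shows how it collapses to the ANOVA ratio:

```latex
\begin{align*}
F^* &= \frac{\bigl(SSE(R) - SSE(F)\bigr)/(df_R - df_F)}{SSE(F)/df_F} \\
    &= \frac{(SSTO - SSE)/\bigl((n-1) - (n-2)\bigr)}{SSE/(n-2)} \\
    &= \frac{SSR/1}{SSE/(n-2)} = \frac{MSR}{MSE}
\end{align*}
```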
3
The F test for Lack of Fit
In this topic, we introduce another F test, called the lack of fit test. The lack of fit test is a formal test for determining whether a specific type of regression function adequately fits the data. As before, we assume that the observations are independent and normally distributed, and that the variance of the random error, sigma squared, is the same across all X levels. Compared with the F test for the slope, the lack of fit test has one additional requirement: we need several observations (called replicates) at one or more X levels. This does not mean every X level needs replicates, but at least some X levels must have them. We cannot perform the lack of fit test when there is only one Y per X level.
4
The Bank example
We now look at the bank example. Eleven similar branches of a bank offered gifts for setting up money market accounts, and we are interested in the relationship between the specified minimum deposit and the number of new accounts opened.
5
Notation, in the case where n = 11

Minimum deposit (X) | Number of new accounts (Y)
75 | 28, 42
100 | 112, 136
125 | 160, 150
150 | 152
175 | 156, 124
200 | 124, 104

$Y_{11}$ denotes the first measurement (28) made at the first X level (75).
$Y_{21}$ denotes the second measurement (42) made at the first X level (75).
$\bar{Y}_1$ denotes the average (= 35) of all Y values at the first X level (75).
$\hat{Y}_{11}$ denotes the predicted response $b_0 + b_1 X = 87.5$ for the first measurement at the first X level (75).
$\hat{Y}_{21}$ denotes the predicted response $b_0 + b_1 X = 87.5$ for the second measurement at the first X level (75).

Most X levels have two replicates. The exception is X = 150, where there is only one observation, Y = 152.
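The notation can be made concrete with a small Python sketch that computes the level means and the fitted values. The data list is an assumption: it is reconstructed from the flattened slide table, with the second observation at X = 125 (150) and at X = 200 (124) inferred from the level mean and the sums of squares quoted later in the talk. With unrounded coefficients the fitted value at X = 75 comes out slightly below the 87.5 shown on the slide, which appears to use rounded coefficients.

```python
from collections import defaultdict

# Bank data as (minimum deposit X, number of new accounts Y).
# ASSUMPTION: reconstructed from the slide table; two values are inferred.
data = [(75, 28), (75, 42), (100, 112), (100, 136), (125, 160), (125, 150),
        (150, 152), (175, 156), (175, 124), (200, 124), (200, 104)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Least-squares slope (b1) and intercept (b0) of the straight-line model.
sxx = sum((x - x_bar) ** 2 for x, _ in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Collect the replicates observed at each X level.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)

# Y-bar_j: mean response at level j; Y-hat: fitted value at that level.
level_means = {x: sum(ys) / len(ys) for x, ys in sorted(groups.items())}
fitted = {x: b0 + b1 * x for x in level_means}
c = len(level_means)  # number of distinct X levels

for x in level_means:
    print(f"X={x}: replicates={groups[x]}, "
          f"level mean={level_means[x]:.1f}, fitted={fitted[x]:.1f}")
```

Every observation at the same X level shares one fitted value, which is why the slide gives the same prediction for both measurements at X = 75.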
6
Notation
The number of X levels is c (= 6).
7
The F test of ANOVA for $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$
Q: Does X have a significant linear impact on Y?

Source of Variation | SS | df | MS | F | Conclusion
Regression | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | 1 | MSR = SSR/1 | MSR/MSE ~ F(1, n-2) | Rejecting Ho means X has a significant linear impact on Y
Error | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | n-2 | MSE = SSE/(n-2) | |
Total | $SSTO = \sum (Y_i - \bar{Y})^2$ | n-1 | | |

The bank example
The ANOVA F test for the slope that we saw earlier computes the ratio of MSR to MSE. Rejecting Ho means X has a significant linear impact on Y. In this example, the F ratio is 3.14 and the p value is 0.11. Since the p value is greater than 0.05, we do not reject Ho: there is no evidence that X has a linear impact on Y. For this question, it means there is no evidence that the minimum deposit has a linear impact on the number of accounts opened. Since the gift offered is positively related to the minimum deposit, we can say that a better gift might not be incentive enough for opening more accounts. The fitted line appears to be close to horizontal (beta1 = 0).

Do not reject Ho: no evidence that X has a linear impact on Y.
There is no evidence to reject $\beta_1 = 0$.
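The ANOVA quantities for the slope test can be reproduced with a short Python sketch. The data list is an assumption (reconstructed from the notation slide, with two values inferred from the summary statistics in this talk); the p value is not computed here because that would need an F distribution routine outside the standard library.

```python
# Bank data as (minimum deposit X, number of new accounts Y).
# ASSUMPTION: reconstructed from the slide table; two values are inferred.
data = [(75, 28), (75, 42), (100, 112), (100, 136), (125, 160), (125, 150),
        (150, 152), (175, 156), (175, 124), (200, 124), (200, 104)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Least-squares fit of Y = b0 + b1 * X.
sxx = sum((x - x_bar) ** 2 for x, _ in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sums of squares from the ANOVA table.
ssto = sum((y - y_bar) ** 2 for _, y in data)            # total
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in data)     # error
ssr = sum(((b0 + b1 * x) - y_bar) ** 2 for x, _ in data) # regression

msr = ssr / 1
mse = sse / (n - 2)
f_star = msr / mse

print(f"SSR={ssr:.0f}, SSE={sse:.0f}, SSTO={ssto:.0f}")
print(f"MSR={msr:.0f}, MSE={mse:.0f}, F*={f_star:.2f} on (1, {n - 2}) df")
```

The resulting F ratio should agree with the 3.14 quoted for the bank example, up to rounding.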
8
There is no evidence to reject $H_0: \beta_1 = 0$
So the fitted line is rather flat, but there appear to be more issues with the fit.

[Figure: scatter plot of the bank data with the fitted line, the six X levels labeled, and a reference mark at X = 130. Annotation: "Low impact, poor fit".]

Now we have that beta1 is not significantly different from 0. Referring to the scatter plot, we see that the line is rather flat (beta1 = 0). But there appears to be a bigger issue with fitting a straight line to these data, because the relationship does not look linear. When the minimum deposit is less than 130, customers value the gift positively: more accounts are opened as the reward (and hence the required minimum deposit) increases. Beyond that point, increasing the reward (and the required deposit) no longer brings in new accounts. As a predictive model, the linear model does not seem to be a good choice: at every X level the prediction is either too high or too low compared with the sample average, the only exception being level 5 (X = 175). X has low impact on Y, but the model is also a poor fit.

Can a low-impact model have a good fit? Yes, as shown in the sketch plot. That line has a similar slope, but since the average at each X level is approximately predicted by the line, it is a good fit for the data. To assess the actual fit of the model, we consider the lack of fit test.

[Figure: sketch of a nearly flat line with points scattered closely around it. Annotation: "Low impact, can still be a good fit".]
9
The lack of fit test: $H_0: E(Y)\,(=\mu) = \beta_0 + \beta_1 X$ versus $H_a: E(Y)\,(=\mu) \neq \beta_0 + \beta_1 X$
Q: Does the current (linear) model fit the data, or is there lack of fit?

Full model (under Ha): $Y_{ij} = \mu_j + \varepsilon_{ij}$, with $SSE(F) = \sum\sum (Y_{ij} - \bar{Y}_j)^2 = SSPE$ and $df_F = n - c$.
Reduced model (under Ho): $Y_{ij} = \beta_0 + \beta_1 X_j + \varepsilon_{ij}$, with $SSE(R) = \sum\sum (Y_{ij} - \hat{Y}_{ij})^2 = SSE$ and $df_R = n - 2$.

The lack of fit test is a test of whether a model fits the data. If the model fits, the mean of the response variable at each X level can be predicted by the linear function. Otherwise, the means of the response variable cannot be predicted by such a linear function. The full model is built under Ha: the true mean of Y at each level is denoted by mu, and mu is not constrained to follow a linear function. The reduced model is the usual linear model. It is interesting to see that the reduced model in the lack of fit test is actually the full model in the GLT test for the slope, with df = n - 2. If the reduction in SSE from the reduced model to the full model is not significant, we do not reject the null hypothesis. The lack of fit test is therefore also a GLT approach. Next we will discuss how the variation is partitioned for the lack of fit test.

$$F = \frac{\bigl(SSE(R) - SSE(F)\bigr)/(df_R - df_F)}{SSE(F)/df_F} = \frac{(SSE - SSPE)/\bigl((n-2) - (n-c)\bigr)}{SSPE/(n-c)} = \frac{SSLF/(c-2)}{SSPE/(n-c)} = \frac{MSLF}{MSPE} \sim F(c-2, n-c)$$

SSPE: pure error sum of squares
MSPE: pure error mean square
SSLF: lack of fit sum of squares
MSLF: lack of fit mean square
10
Partition the variances
In the lack of fit test, we partition and study the variation under the different models. Recall that in the previous ANOVA, the total variation in Y (SSTO) is partitioned into SSE and SSR. SSE is the variation in Y that cannot be explained by X.

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$
SSTO = SSE + SSR
"Total sum of squares" = "error sum of squares" + "regression sum of squares"
11
Partition the residual errors for lack of fit
Residual: $Y_i - \hat{Y}_i$
The lack of fit test takes the SSE and partitions it further. In other words, we try to find the source of the unexplained variation in Y by studying SSE.
$$SSE = \sum (Y_i - \hat{Y}_i)^2$$
12
Partition the residual errors for lack of fit
$H_0: E(Y)\,(=\mu) = \beta_0 + \beta_1 X$ ("no lack of fit")
$H_a: E(Y)\,(=\mu) \neq \beta_0 + \beta_1 X$ ("lack of fit")
$H_0: \mu_j = \hat{Y}_j$ versus $H_a: \mu_j \neq \hat{Y}_j$ ("lack of fit")
Use $\bar{Y}_j - \hat{Y}_j$ to estimate "lack of fit".

Take the third X level, X = 125. One branch opened 160 accounts, and on average 155 accounts were opened at this level of X. The predicted number of accounts at this level is 112 on the line. The unexplained deviation between this case (160) and the predicted value (112) is 160 - 112 = 48, which is the error deviation. Of this deviation of 48, how much is caused by lack of fit, and how much is caused by random error? If the model lacks fit, we would expect the true mean mu to deviate from the estimated mean, Yhat. In this case, the lack of fit deviation is the difference between the sample mean (155) and the predicted mean (112): 155 - 112 = 43. The rest of the deviation, 48 - 43 = 5, is therefore caused by random error, and can also be computed as the difference between the actual observation (160) and its sample mean (155).

Summing the squared deviations over all observations gives the partition SSE = SS(pure error) + SS(lack of fit). The degrees of freedom can be partitioned in the same way.

$$\sum\sum (Y_{ij} - \hat{Y}_{ij})^2 = \sum\sum (Y_{ij} - \bar{Y}_j)^2 + \sum\sum (\bar{Y}_j - \hat{Y}_{ij})^2$$
SSE = SSPE + SSLF
$n - 2 = (n - c) + (c - 2)$

When there is only a single measurement at each X level, every Y equals its level mean Ybar, so SSPE = 0. We cannot then tell SSE and SSLF apart, and cannot perform the lack of fit test.
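The partition can be checked numerically with a minimal Python sketch. The data list is an assumption (reconstructed from the notation slide, with two values inferred from the summary statistics in this talk). With unrounded fitted values the deviations for the observation Y = 160 at X = 125 are close to, but not exactly, the rounded 48 = 5 + 43 used on the slide.

```python
from collections import defaultdict

# Bank data as (minimum deposit X, number of new accounts Y).
# ASSUMPTION: reconstructed from the slide table; two values are inferred.
data = [(75, 28), (75, 42), (100, 112), (100, 136), (125, 160), (125, 150),
        (150, 152), (175, 156), (175, 124), (200, 124), (200, 104)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
sxx = sum((x - x_bar) ** 2 for x, _ in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Mean response at each X level (Y-bar_j).
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
level_mean = {x: sum(ys) / len(ys) for x, ys in groups.items()}

# SSE and its two components, summed over all observations.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in data)
sspe = sum((y - level_mean[x]) ** 2 for x, y in data)
sslf = sum((level_mean[x] - (b0 + b1 * x)) ** 2 for x, y in data)

# Deviation split for the single observation Y = 160 at X = 125.
x0, y0 = 125, 160
error_dev = y0 - (b0 + b1 * x0)            # observation minus fitted value
pure_dev = y0 - level_mean[x0]             # observation minus level mean
lof_dev = level_mean[x0] - (b0 + b1 * x0)  # level mean minus fitted value

print(f"SSE={sse:.1f}, SSPE={sspe:.1f}, SSLF={sslf:.1f}")
print(f"deviation {error_dev:.1f} = pure {pure_dev:.1f} + lack of fit {lof_dev:.1f}")
```

The identity holds observation by observation for the deviations, and in total for the sums of squares because the cross-product terms vanish within each X level.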
13
ANOVA table

Source of Variation | SS | df | MS | F | Conclusion
Regression | $SSR = \sum\sum (\hat{Y}_{ij} - \bar{Y})^2$ | 1 | MSR = SSR/1 | MSR/MSE ~ F(1, n-2) | Rejecting Ho means X has a significant linear impact on Y
Error | $SSE = \sum\sum (Y_{ij} - \hat{Y}_{ij})^2$ | n-2 | MSE = SSE/(n-2) | |
Lack of fit (in Error) | $SSLF = \sum\sum (\bar{Y}_j - \hat{Y}_{ij})^2$ | c-2 | MSLF = SSLF/(c-2) | MSLF/MSPE ~ F(c-2, n-c) | Rejecting Ho means the current model does not fit the data
Pure error (in Error) | $SSPE = \sum\sum (Y_{ij} - \bar{Y}_j)^2$ | n-c | MSPE = SSPE/(n-c) | |
Total | $SSTO = \sum\sum (Y_{ij} - \bar{Y})^2$ | n-1 | | |

The original ANOVA table can be extended to include the lack of fit test. SSTO is first partitioned into SSR and SSE. The SSE is then further partitioned into SSLF and SSPE. The test statistic compares the mean square due to lack of fit with the mean square due to pure random error. If it is greater than the critical value, we conclude that the lack of fit is significant and the model does not fit the data.
14
The bank example (n=11, c=6)
Source of Variation | SS | df | MS | F | Conclusion
Regression | 5141 | 1 | 5141 | | X (has / does not have) a significant linear impact on Y
Error | 14742 | 11-2 = 9 | 1638 | |
Lack of fit (in Error) | 13594 | 6-2 = 4 | 3398.5 | | The current linear model (fits / does not fit) the data
Pure error (in Error) | 1148 | 11-6 = 5 | 229.6 | |
Total | 19883 | 10 | | |

Now let's see whether the linear model shows lack of fit for the bank data.
15
The bank example (n=11, c=6)
Source of Variation | SS | df | MS | F | Conclusion
Regression | 5141 | 1 | 5141 | 5141/1638 = 3.14 (p = 0.11) | X does not have a significant linear impact on Y
Error | 14742 | 11-2 = 9 | 1638 | |
Lack of fit (in Error) | 13594 | 6-2 = 4 | 3398.5 | 3398.5/229.6 = 14.8 (p = 0.0056) | The current linear model does not fit the data
Pure error (in Error) | 1148 | 11-6 = 5 | 229.6 | |
Total | 19883 | 10 | | |

In the bank example, the total sample size is 11 and the number of X levels is 6. We can use this information to confirm the degrees of freedom for every sum of squares. For the first hypothesis, on the slope, MSR/MSE is 3.14 and the exact p value is 0.11, so we do not reject Ho and conclude that X does not have a significant linear impact on Y. For the lack of fit test, the test statistic MSLF/MSPE is 14.8 and the exact p value is 0.0056, so we reject Ho and conclude that the current linear model does not fit the data.
16
The bank example (n=11, c=6)
Source of Variation | SS | df | MS | F | Conclusion
Regression | 5141 | 1 | 5141 | 3.14 (p = 0.11) | X does not have a significant linear impact on Y
Error | 14742 | 11-2 = 9 | 1638 | |
Lack of fit (in Error) | 13594 | 6-2 = 4 | 3398.5 | 14.8 (p = 0.0056) | The current linear model does not fit the data
Pure error (in Error) | 1148 | 11-6 = 5 | 229.6 | |
Total | 19883 | 10 | | |

Build the reduced model under Ho: $\hat{Y} = \beta_0 + \beta_1 X$
Build the full model under Ha: $\hat{Y} = \mu$
MSLF/MSPE = 14.801

Now we demonstrate how to use R to do the lack of fit test. First, build the reduced model under Ho, which assumes the linear model is a good fit. The reduced model is therefore the same as the regular regression model: Yhat = beta0 + beta1 X. Then build the full model under Ha, which assumes the linear model is not a good fit. The full model is then Yhat = mu. In R, this model regresses Y on each level of X (X treated as a factor). Last, the anova function compares the reduced model with the full model, which gives the lack of fit test. In the regular regression model (the reduced model), the total SSE is 14742 with 9 degrees of freedom. The lack of fit sum of squares is 13594 and the pure error sum of squares is 1148. The F statistic is then computed as MSLF/MSPE = 14.801, and the p value is 0.0056.
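The comparison of a reduced and a full model boils down to one formula, sketched here in Python rather than R. The function name `general_linear_test` is a hypothetical helper, not part of any library, and the inputs are the rounded sums of squares from the bank ANOVA table, so the result matches the reported 14.801 only to about one decimal place.

```python
def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """Return (F, numerator df, denominator df) for nested models.

    Hypothetical helper: F is the drop in SSE per extra parameter,
    divided by the full model's mean squared error.
    """
    df_num = df_reduced - df_full
    ms_num = (sse_reduced - sse_full) / df_num
    ms_den = sse_full / df_full
    return ms_num / ms_den, df_num, df_full


# Lack of fit test: reduced = straight line (SSE, df 9),
# full = one mean per X level (SSPE, df 5).
f_lof, lof_df1, lof_df2 = general_linear_test(14742, 9, 1148, 5)

# Slope test: reduced = grand mean only (SSTO, df 10),
# full = straight line (SSE, df 9).
f_slope, slope_df1, slope_df2 = general_linear_test(19883, 10, 14742, 9)

print(f"lack of fit: F = {f_lof:.2f} on ({lof_df1}, {lof_df2}) df")
print(f"slope:       F = {f_slope:.2f} on ({slope_df1}, {slope_df2}) df")
```

Using one function for both tests reflects the point made earlier in the talk: the straight-line model is the full model for the slope test and the reduced model for the lack of fit test.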
17
Practice problem 1: fill in the missing (??) in the ANOVA table from a SLR

Source of Variation | SS | df | MS | F | Conclusion
Regression | 12.597 | ?? | ?? | ?? | ??
Error | ?? | ?? | ?? | |
Lack of fit (in Error) | ?? | 3 | ?? | ?? | ??
Pure error (in Error) | 0.157 | ?? | ?? | |
Total | 15.522 | 14 | | |

Now, as self practice, please try to complete the following two exercises on your own before checking the solutions at the end of the talk.
18
Practice problem 2: complete the ANOVA table according to the R output
Source of Variation | SS | df | MS | F | Conclusion
Regression | | | | |
Error | | | | |
Lack of fit (in Error) | | | | |
Pure error (in Error) | | | | |
Total | | | | |
19
The lack of fit test is not valid without replication
$$SSPE = \sum\sum (Y_{ij} - \bar{Y}_j)^2 = 0$$
Solution: grouping

Now that we understand how to do the lack of fit test, let's talk about its restrictions. The lack of fit test is not valid when no X level has replicates, that is, when there is only one observation of Y per X. In that case, each observation Y and the mean of Y at its level are the same value, so SSPE = 0. For example, suppose a data set has no replicates, with one Y per X. Then SSPE = 0, and the F statistic cannot be computed.

One solution is to create replicates manually. We group the X values into four groups of 3 or 2 values each. The X values up to 40 are replaced by their average, 30; the X values 50, 60, 70 are replaced by their average, 60; and so on. This creates 4 levels (c = 4), so the degrees of freedom for SSPE are n - c = 11 - 4 = 7 and the degrees of freedom for SSLF are c - 2 = 4 - 2 = 2. Based on the resulting F statistic and p value, we do not reject Ho, and conclude that the model fits the data well.
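The grouping step can be sketched in Python as follows. The X values are hypothetical: eleven evenly spaced values chosen only so that the first two group averages match the 30 and 60 mentioned on the slide. The helper `group_levels` is likewise an illustrative name, not a library function.

```python
def group_levels(xs, group_sizes):
    """Replace consecutive sorted X values by the mean of their group."""
    xs = sorted(xs)
    if sum(group_sizes) != len(xs):
        raise ValueError("group sizes must cover every observation")
    grouped, start = [], 0
    for size in group_sizes:
        chunk = xs[start:start + size]
        grouped.extend([sum(chunk) / size] * size)
        start += size
    return grouped


# HYPOTHETICAL data: 11 distinct X values, one observation each.
xs = list(range(20, 121, 10))

# Four groups of 3, 3, 3 and 2 neighbouring values.
grouped_x = group_levels(xs, [3, 3, 3, 2])

n = len(grouped_x)
c = len(set(grouped_x))  # number of X levels after grouping
df_pure_error = n - c
df_lack_of_fit = c - 2

print(grouped_x)
print(f"n={n}, c={c}, df(SSPE)={df_pure_error}, df(SSLF)={df_lack_of_fit}")
```

After grouping, each new X level has two or three Y values, so SSPE is no longer forced to zero. The price is that the pure error now also absorbs whatever real change in the mean occurs within a group.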
20
Practice problem 1 Solution: fill in the missing (??) in the ANOVA table from a SLR
Conclusions: significant impact; lack of fit.
Solution for self practice question 1.
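The missing entries follow from the additivity of the sums of squares and degrees of freedom, as in this Python sketch. It rests on an assumption about the flattened problem table: 12.597 is read as SSR, 15.522 as SSTO with 14 degrees of freedom, 3 as the lack of fit degrees of freedom, and 0.157 as SSPE. Under that reading the results are consistent with the stated conclusions.

```python
# Given entries (ASSUMED column placement, see the note above).
ssr, ssto, df_total = 12.597, 15.522, 14
df_lack_of_fit = 3
sspe = 0.157

# Degrees of freedom.
n = df_total + 1                        # total df is n - 1
df_regression = 1                       # simple linear regression
df_error = n - 2
c = df_lack_of_fit + 2                  # lack of fit df is c - 2
df_pure_error = n - c

# Sums of squares by subtraction.
sse = ssto - ssr
sslf = sse - sspe

# Mean squares and F ratios.
msr = ssr / df_regression
mse = sse / df_error
mslf = sslf / df_lack_of_fit
mspe = sspe / df_pure_error
f_slope = msr / mse                     # compare with F(1, n - 2)
f_lack_of_fit = mslf / mspe             # compare with F(c - 2, n - c)

print(f"n={n}, c={c}, df error={df_error}, df pure error={df_pure_error}")
print(f"SSE={sse:.3f}, SSLF={sslf:.3f}")
print(f"MSE={mse:.3f}, MSLF={mslf:.4f}, MSPE={mspe:.4f}")
print(f"F slope={f_slope:.2f}, F lack of fit={f_lack_of_fit:.2f}")
```

Both F ratios are large relative to typical critical values for their degrees of freedom, which matches "significant impact" and "lack of fit".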
21
Practice problem 2 solution: complete the ANOVA table according to the R output
Source of Variation | SS | df | MS | F | Conclusion
Regression | | 1 | | 70.21 | Significant linear impact
Error | 77.983 | 23 | 3.391 | |
Lack of fit (in Error) | 22.749 | 3 | 7.583 | 2.75 | No lack of fit
Pure error (in Error) | 55.234 | 20 | 2.762 | |
Total | | 24 | | |

Solution for practice question 2.
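The error rows of this solution can be cross-checked with a few lines of Python, using only the values shown in the table (the regression row is left out because its sum of squares is not given):

```python
# Values taken from the solution table.
sslf, df_lof = 22.749, 3      # lack of fit
sspe, df_pe = 55.234, 20      # pure error
sse, df_error = 77.983, 23    # error

# The two components must add up to the error row.
ss_check = sslf + sspe
df_check = df_lof + df_pe

# Mean squares and the lack of fit F ratio.
mslf = sslf / df_lof
mspe = sspe / df_pe
mse = sse / df_error
f_lack_of_fit = mslf / mspe

print(f"SSLF + SSPE = {ss_check:.3f}, df = {df_check}")
print(f"MSLF={mslf:.3f}, MSPE={mspe:.3f}, MSE={mse:.3f}")
print(f"F lack of fit = {f_lack_of_fit:.2f} on ({df_lof}, {df_pe}) df")
```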