The basic idea of the Analysis of Variance (ANOVA) approach to regression analysis


1 The basic idea of Analysis of Variance (ANOVA) approach to Regression analysis
In the previous topics, we developed the basic regression model and demonstrated how to make predictions. In this topic, we view regression analysis from the perspective of analysis of variance. The new perspective will not enable us to do anything new, but the analysis of variance approach will come into its own when we take up multiple regression models and other types of linear statistical models.

2 Partitioning variance in the total sum of squares
𝒀 π’Š βˆ’ 𝒀 The analysis of variance approach is based on the partitioning of the total sums of squares and degrees of freedom associated with the response variable Y. These deviations are shown by the vertical lines in figure (a). The total variation, denoted by SSTO, is the sum of the squared deviations. [S] SSTO is the sum of deviations from the mean. SSTO stands for total sum of squares. If all Y observations are the same, SSTO=0. The greater the variation among Yi, the larger is SSTO. Thus SSTO is a measure of uncertainty in Y. When we utilize the predictor variable X, the variation reflecting the uncertainty in Y is that of the Y observations around the fitted regression line: Y-Yhat. These deviations are shown by the vertical lines in figure (b). The measure of variation in Y when X is taken into account is the sum of the squared deviations, which is the familiar SSE. [S] SSE denotes the error sum of squares. If all Y observations fall on the fitted regression line, SSE=0. This means variation in Y can be perfectly explained by X. The greater the variation of the Y observations around the fitted regression line, the larger is SSE. The difference between SSTO and SSE is another sum of squares, SSR, where SSR stands for regression sum of squares. These deviations are shown by the vertical lines in figure (c). Each deviation is simply the difference between the fitted value of the regression line and the mean of the fitted values Ybar. If the regression line is horizontal so that Ybar=Y bar. Then SSR=0. Otherwise SSR is positive. SSR may be considered the part of variation in Y associated with the regression line. The larger SSR is in relation to SSTO, the more variation in Y is explained by the regression model, and the better model it is. [B] SSR is sometime denoted by SSM, which stands for sum of squared of the model. 𝚺 𝒀 π’Š βˆ’ 𝒀 𝟐 = 𝚺 𝒀 π’Š βˆ’ 𝒀 π’Š 𝟐 𝚺 𝒀 π’Š βˆ’ 𝒀 𝟐 𝑺𝑺𝑻𝑢 = 𝑺𝑺𝑬 𝑺𝑺𝑹 Also known as SSM (model) β€œTotal sum of squares” β€œ error sum of squares” β€œregression sum of squares”

3 Partitioning degrees of freedom
𝚺 𝒀 π’Š βˆ’ 𝒀 𝟐 =𝚺 𝒀 π’Š βˆ’ 𝒀 π’Š 𝟐 𝚺 𝒀 π’Š βˆ’ 𝒀 𝟐 𝑺𝑺𝑻𝑢 = 𝑺𝑺𝑬 𝑺𝑺𝑹 Degree of freedom π‘›βˆ’ = π‘›βˆ’ 1 Corresponding to the portioning of the total sum of square SSTO, there is a partitioning of the associated degrees of freedom (df) SSTO has n-1 degree of freedom since the sum Yi-Ybar is 0, and one df is lost because the sample mean is used to estimate the population mean. SSE has n-2 df. Two df are lost because the two parameters, intercept and slope are estimated in the regression function. SSR has 1 df. In the two df associated with the regression line, one df is lost because the the deviation Ξ£ π‘Œ βˆ’ π‘Œ =0

4 Degrees of freedom of error (an intuitive view)
Q: What is the minimum number of data points required to estimate this regression?

Y = Ξ²β‚€ + β₁X + Ξ΅, and Ξ΅ = Y βˆ’ Ξ²β‚€ βˆ’ β₁X
n = 2: dfE = 0;  n = 3: dfE = 1
So dfE = n βˆ’ 2 = n βˆ’ p βˆ’ 1, where p is the number of predictors (p = 1 in this case).

Now let us look at the concept of degrees of freedom in an intuitive way, starting with a simple linear regression with one independent variable X and one dependent variable Y, and asking what the minimum data requirement is for estimating this regression. For example, say Y is a person's height, and we try to estimate it from the person's weight, X. You can expect that someone who weighs more might be taller, but it is not going to be a one-to-one relationship; there will be some error associated with it. A regression essentially draws a line of best fit through all of the observations. But how many people do you need in the study to run a regression? With one observation, you cannot: there is no line of best fit through a single point. So you think, all right, we need two observations. But both of those points are used to build the regression line, and the fact that all points lie on the line does not mean the line is a good fit; in fact, the relationship between X and Y cannot be assessed when every observation is used up in building the model. It is only with a third observation that the model gains some freedom to be assessed: we have one degree of freedom because that third observation allows the fitted line to actually differ from the points themselves. Throw in another observation and we have two degrees of freedom, since two additional observations give the model a bit more flexibility. In general, with n observations, 2 of them are used to build the linear model, and the remaining n βˆ’ 2 are the degrees of freedom left for assessing the model, i.e., for assessing the random errors.
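This can be seen directly in R. A sketch with made-up numbers: with p = 1, two points pin the line down exactly (zero residual df), and a third point is what first leaves a df for error.

```r
x2 <- c(1, 2); y2 <- c(1.3, 2.9)
df.residual(lm(y2 ~ x2))     # 0: the line passes through both points exactly

x3 <- c(1, 2, 3); y3 <- c(1.3, 2.9, 3.1)
df.residual(lm(y3 ~ x3))     # 1: dfE = n - 2 = 3 - 2
```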

5 Degrees of freedom of error (an intuitive view)
Q: What is the minimum number of data points required to estimate this regression? What degrees of freedom are left?

Y = Ξ²β‚€ + β₁X₁ + Ξ²β‚‚Xβ‚‚ + Ξ΅, and Ξ΅ = Y βˆ’ Ξ²β‚€ βˆ’ β₁X₁ βˆ’ Ξ²β‚‚Xβ‚‚
n = 3: dfE = 0;  n = 4: dfE = 1
So dfE = n βˆ’ 3 = n βˆ’ p βˆ’ 1, where p is the number of predictors (p = 2 in this case).

Now suppose we throw a second variable, Xβ‚‚, into the mix. Say Y is the height, X₁ is the weight, and Xβ‚‚ is the person's mother's height. In this case, what is the minimum data requirement for estimating the regression? Visually, we are in a three-dimensional space: X₁ on the horizontal axis, Xβ‚‚ on an axis coming out of the page, and Y going up. Let us start with 3 points. Unlike the two-dimensional case, with two X variables we are fitting a plane of best fit through the points, and any three points in three-dimensional space can have a plane cut through all of them. When we introduce a fourth point, the plane gains some freedom relative to those four points, meaning we have one degree of freedom. So for this three-dimensional model, we need at least 3 points to build the model and 4 points to get one degree of freedom; with 5 points we have two degrees of freedom. The degrees of freedom with 2 X variables is therefore n βˆ’ 3: the additional X variable costs us one more degree of freedom for building the model. This leads to the formula dfE = n βˆ’ p βˆ’ 1, where n is the number of observations and p is the number of X variables.
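Again as a sketch with made-up numbers, the same check for p = 2 predictors confirms dfE = n βˆ’ p βˆ’ 1:

```r
x1 <- c(1, 2, 3, 4)
x2 <- c(2, 1, 4, 3)            # chosen so x2 is not collinear with x1
y  <- c(1.1, 0.9, 2.2, 1.8)
df.residual(lm(y ~ x1 + x2))   # 1 = n - p - 1 = 4 - 2 - 1
```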

6 The Analysis of Variance (ANOVA):
Source of Variation    SS                                      df       MS                    E{MS}
Regression             SSR = Ξ£(Ŷ_i βˆ’ YΜ„)Β² = b₁² Ξ£(X_i βˆ’ XΜ„)Β²    1        MSR = SSR / 1         σ² + β₁² Ξ£(X_i βˆ’ XΜ„)Β²
Error                  SSE = Ξ£(Y_i βˆ’ Ŷ_i)Β²                     n βˆ’ 2    MSE = SSE / (n βˆ’ 2)   σ²
Total                  SSTO = Ξ£(Y_i βˆ’ YΜ„)Β²                      n βˆ’ 1

The breakdowns of the total sum of squares and the associated degrees of freedom are displayed in the form of an analysis of variance table (ANOVA table). Note that SSR can also be computed as b₁² SSX, where SSX = Ξ£(X_i βˆ’ XΜ„)Β². The mean squares of interest are the sums of squares divided by their degrees of freedom. In addition, the ANOVA table contains a column of expected mean squares. The mean of the sampling distribution of MSE is E{MSE} = σ², so MSE is an unbiased estimator of the error variance σ², whether or not X and Y are linearly related. The mean of the sampling distribution of MSR equals σ² only when β₁ = 0, which happens when the regression line is flat and overlaps the horizontal line at YΜ„; otherwise MSR tends to be larger than MSE. This suggests that a comparison of MSR and MSE is useful for testing whether or not β₁ = 0: if MSR and MSE are about the same, β₁ may well be 0 and the regression line flat; if MSR is substantially greater than MSE, β₁ is likely not 0 and the regression line is not flat. This is indeed the basic idea underlying the ANOVA test, namely the F test.

Comments: E{MSE} = σ², so MSE is an unbiased estimator of the error variance. E{MSR} = σ² + β₁² Ξ£(X_i βˆ’ XΜ„)Β² > σ² when β₁ β‰  0, hence MSR will tend to be larger than MSE.
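In R, anova() on a fitted simple regression prints exactly this table (SS, df, MS, F). A sketch with simulated data, which also confirms the identity SSR = b₁² Ξ£(X_i βˆ’ XΜ„)Β²:

```r
set.seed(2)
x <- runif(25, 0, 10)
y <- 2 + 0.5 * x + rnorm(25)
fit <- lm(y ~ x)

anova(fit)               # rows: x (regression) and Residuals (error)

b1  <- coef(fit)["x"]    # estimated slope
ssx <- sum((x - mean(x))^2)
b1^2 * ssx               # equals the regression SS in the table above
```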

7 The F test of Ho: β₁ = 0 versus Ha: β₁ β‰  0
Source of Variation    SS                     df       MS                    E{MS}
Regression             SSR = Ξ£(Ŷ_i βˆ’ YΜ„)Β²     1        MSR = SSR / 1         σ² + β₁² Ξ£(X_i βˆ’ XΜ„)Β²
Error                  SSE = Ξ£(Y_i βˆ’ Ŷ_i)Β²    n βˆ’ 2    MSE = SSE / (n βˆ’ 2)   σ²
Total                  SSTO = Ξ£(Y_i βˆ’ YΜ„)Β²     n βˆ’ 1

The test statistic for the analysis of variance approach is denoted F* (or Fs). F* is the ratio of MSR to MSE and follows an F distribution with degrees of freedom 1 and n βˆ’ 2:

F* = MSR / MSE ~ F(1, n βˆ’ 2)

Reject Ho if F* > F(1 βˆ’ Ξ±; 1, n βˆ’ 2).

8 Example 1: Ho: β₁ = 0 versus Ha: β₁ β‰  0
Source of Variation    SS        df    MS    F
Regression             252378      1
Error                   54825     23
Total                  307203     24

F* = MSR / MSE = ______ ~ F(___, ___)

For example, suppose a partial ANOVA table is given as above; we fill in the mean squares and the F statistic. At level Ξ± = 0.05, reject Ho if F* > F(0.95; 1, 23). Conclude that there is a linear association between X and Y.

9 Example 1: Ho: β₁ = 0 versus Ha: β₁ β‰  0
Source of Variation    SS        df    MS        F
Regression             252378      1    252378    105.88
Error                   54825     23      2384
Total                  307203     24

MSR = 252378 / 1 = 252378;  MSE = 54825 / 23 = 2384;  F* = 252378 / 2384 = 105.88

F* = MSR / MSE = 105.88 ~ F(1, 23)

We want a two-sided test of whether β₁ equals 0, and a partial ANOVA table is given. Reject Ho if F* > F(0.95; 1, 23). Since F* = 105.88 exceeds the critical value, conclude that there is a linear association between X and Y.
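The arithmetic for this table is easy to reproduce in R:

```r
msr <- 252378 / 1    # mean square for regression
mse <- 54825 / 23    # mean square error, about 2384
msr / mse            # F* = 105.88
```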

10 𝑭 βˆ— =πŸπŸŽπŸ“.πŸ–πŸ–~ 𝑭 (𝟏,πŸπŸ‘) Method 1: use R Reject Ho if
𝑭 βˆ— > 𝑭 (𝟎.πŸ—πŸ“;𝟏,πŸπŸ‘)= 4.28 Method 2: use F table Or, reject Ho if 𝑭 βˆ— > 𝑭 (𝟎.πŸ—πŸ“;𝟏,𝟐𝟎)= 4.35 The estimated π‘·βˆ’π’—π’‚π’π’–π’†<𝟎.𝟎𝟎𝟏, since 𝑭 βˆ— >𝑭 𝟎.πŸ—πŸ—πŸ—;𝟏,𝟐𝟎 =πŸπŸ’.πŸ–πŸ Now let’s find the critical value F. Using the qf function, we compute the exact critical value to be 4.28.
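For Method 1, the qf and pf functions give the exact critical value and p-value:

```r
qf(0.95, df1 = 1, df2 = 23)                        # 4.28, the 95th percentile of F(1, 23)
pf(105.88, df1 = 1, df2 = 23, lower.tail = FALSE)  # p-value, far below 0.001
```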

11 Example 2: An investigative study collected 16 observations from random locations in the Wabash River, counting the amount of particulates (X) and the number of fish (Y) at each location. Complete the following ANOVA table for the regression analysis. Does the result suggest that the amount of particulates in the water (across locations) affects the number of fish?

Source of Variation    SS      df    MS    F
Regression             4.5
Error
Total                  85.2

F* = MSR / MSE = ______ ~ F(______)
What is the F(______) if we use the F table?
Reject Ho if F* > ______ (use R) or ______ (use the F table).
Conclude that there ______ (is / isn't) a linear association between the amount of particulates and the number of fish.

12 Example 2 (continued): completing the ANOVA table for the Wabash River study (n = 16).

Source of Variation    SS      df    MS     F
Regression             4.5      1    4.5    0.781
Error                  80.7    14    5.76
Total                  85.2    15

F* = MSR / MSE = 0.781 ~ F(1, 14), or compare against F(1, 12) if using the F table.
Reject Ho if F* > 4.60 (from R) or 4.75 (from the F table with denominator df = 12). Since F* = 0.781 does not exceed the critical value, conclude that there isn't a linear association between the amount of particulates in the water and the number of fish.
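A short R sketch completes the table and reaches the same conclusion:

```r
ssr <- 4.5; ssto <- 85.2; n <- 16
sse <- ssto - ssr          # 80.7
mse <- sse / (n - 2)       # 5.76
fstar <- (ssr / 1) / mse   # 0.781
qf(0.95, 1, n - 2)         # 4.60; F* = 0.781 < 4.60, so do not reject Ho
```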

13 Equivalence of a two-sided F test (ANOVA) and t test (SLR)
π»π‘œ: 𝛽 1 = π»π‘Ž: 𝛽 1 β‰ 0 The T test statistic 𝑑 βˆ— = 𝑏 1 𝑠 𝑏 1 ~𝑑(π‘›βˆ’2) The F test statistic 𝐹 βˆ— = 𝑀𝑆𝑅 𝑀𝑆𝐸 ~ 𝐹 (1,π‘›βˆ’2) 𝐹 βˆ— = 𝑀𝑆𝑅 𝑀𝑆𝐸 = 𝑏 1 2 Ξ£ 𝑋 𝑖 βˆ’ 𝑋 2 𝑀𝑆𝐸 = 𝑏 𝑠 2 𝑏 1 = 𝑑 βˆ— 2 𝑆𝑖𝑛𝑐𝑒 𝑠 2 𝑏 1 =𝑀𝑆𝐸/Ξ£ 𝑋 𝑖 βˆ’ 𝑋 2 For a given alpha level, the F test for beta1 is equivalent algebraically to the two-tailed t test. To see this, recall that the t test statistic is b1 over standard error b1. The F statistic is MSR/MSE where MSR=b1^2 times SSX, with some rearrangements, we can appreciate that F statistic is actually the square of the t statistic. The critical values are also equivalent in the sense that t^2 =F For example, at alpha of 0.05 level, df3=23, the t critical value of 2.069, and we square this value to get 4.28 which is the F critical value. Thus, at any given significant level, we can use either the t test or the F test for testing beta1. Note that F test and T test are only equivalent in the SLR test [S] and two sided test [S]. The T test and F tests are equivalent in SLR 𝐹 βˆ— = 𝑑 βˆ— 2 for two sided test, the critical values: 𝑑 1βˆ’ 𝛼 2 ,π‘›βˆ’2 2 =𝐹(1βˆ’π›Ό;1, π‘›βˆ’2) For example, at 𝛼=0.05, 𝑑𝑓𝑒=23: 𝑑 ;23 2 = =4.28=𝐹 0.95, 1, 23

14 F Test (ANOVA) and T test are not always equivalent
1. They are equivalent in the simple linear regression (SLR) problem, and will not be so for multiple regression.
2. They are equivalent only when Ho: β₁ = 0. A hypothesis Ho: β₁ = β₁* (β‰  0) can still be tested with a t test, but under Ho: β₁ = β₁* β‰  0 the statistic F* has a non-central F distribution, which requires extra steps and is not covered in this course.
3. In SLR, the t test is more flexible and more commonly used than the F test. We will continue to compare them in MLR.

The F test and t test are not always equivalent. They are equivalent in the simple linear regression problem, and will not be so for multiple regression. They are equivalent when the null hypothesis compares β₁ to 0; when the null hypothesis compares β₁ with a nonzero constant, the t test is still applicable, but the F statistic then follows a more complicated non-central F distribution and requires extra steps. In SLR, the t test is more flexible and more commonly used than the F test. We will continue to compare them in MLR.

15 Introducing the General Linear Test (GLT) approach
Ho: β₁ = 0 versus Ha: β₁ β‰  0

Full model (under Ha): Y_i = Ξ²β‚€ + β₁X_i + Ξ΅_i
SSE(F) = Ξ£(Y_i βˆ’ Ŷ_i)Β² = SSE, df_F = n βˆ’ 2

Reduced model (under Ho): Y_i = Ξ²β‚€ + Ξ΅_i
SSE(R) = Ξ£(Y_i βˆ’ YΜ„)Β² = SSTO, df_R = n βˆ’ 1

ANOVA is an example of the general linear test (GLT) approach for a linear statistical model. This approach is very useful in building complicated regression models, and the idea is easy to understand here in the context of simple linear regression. We begin with the model considered appropriate for the data, the full or unrestricted model. In simple linear regression, the full model has all the parameters: Y = Ξ²β‚€ + β₁X + Ξ΅. The full model is the model under Ha; under it, the error sum of squares is SSE(F) = SSE with df n βˆ’ 2, as before. The F in the notation SSE(F) means "full." The reduced model has a reduced set of parameters: under Ho, β₁ = 0, and the full model reduces to Y = Ξ²β‚€ + Ξ΅. Under the reduced model the regression line is horizontal and Ŷ_i = YΜ„; it can be shown that bβ‚€ = YΜ„, so the error sum of squares for this model is SSE(R) = Ξ£(Y_i βˆ’ YΜ„)Β² = SSTO, with df n βˆ’ 1. The R in the notation SSE(R) means "reduced."

The logic now is to compare the two error sums of squares, SSE(F) and SSE(R). The larger the difference between them, the more the full model differs from the reduced model, suggesting that Ha holds. In other words, if the full model has a much smaller residual error SSE(F) than the reduced model, the full model does a much better job of reducing the random error, and X explains a significant amount of the variation in Y. "Significant reduction in SSE?"

The actual test statistic is a function of the difference SSE(R) βˆ’ SSE(F), which follows an F distribution if Ho is true:

F* = [ (SSE(R) βˆ’ SSE(F)) / (df_R βˆ’ df_F) ] / [ SSE(F) / df_F ] = MSR / MSE ~ F(1, n βˆ’ 2)

The test statistic of the general linear test in simple linear regression is identical to the ANOVA test statistic.
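In R, the GLT is carried out by fitting both models and comparing them with anova(); for SLR this reproduces the usual ANOVA F test. A sketch with simulated data:

```r
set.seed(4)
x <- runif(25, 0, 10)
y <- 2 + 0.5 * x + rnorm(25)

reduced <- lm(y ~ 1)       # under Ho: SSE(R) = SSTO, df = n - 1
full    <- lm(y ~ x)       # under Ha: SSE(F) = SSE,  df = n - 2
anova(reduced, full)       # same F* and p-value as anova(full)
```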

16 General Linear Test can be extended to multiple parameters (β₁, Ξ²β‚‚, …)
Given the number of additional parameters in the full (more complex) model compared to the reduced model, does the full model yield a larger reduction in SSE than we would expect to get by adding a similar number of unrelated (i.e., useless) predictor variables?

F* = [ (SSE(R) βˆ’ SSE(F)) / (df_R βˆ’ df_F) ] / [ SSE(F) / df_F ] ~ F(df_R βˆ’ df_F, df_F)

The general linear test is easily extended to assess multiple parameters β₁, Ξ²β‚‚, … It is a very general tool, and we will see it again in multiple linear regression.
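The same anova() comparison handles several parameters at once; here is a sketch testing Ho: Ξ²β‚‚ = β₃ = 0 with simulated predictors (names illustrative):

```r
set.seed(5)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

reduced <- lm(y ~ x1)
full    <- lm(y ~ x1 + x2 + x3)
anova(reduced, full)       # F* ~ F(df_R - df_F, df_F) = F(2, n - 4) under Ho
```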

17 Summary

The basic idea of ANOVA is to study the sources and proportions of variance in the data.
The F test (ANOVA) and the t test (simple linear regression) are not always equivalent.
The ANOVA test extends to the general linear test/model (GLT/GLM).

In summary, in this topic we discussed the basic idea of ANOVA, compared the F and t tests, and introduced the general linear test (GLT), which is also referred to as the GLM, the general linear model.

