Topic 7: Analysis of Variance
Outline
– Partitioning sums of squares
– Breakdown of degrees of freedom
– Expected mean squares (EMS)
– F test
– ANOVA table
– General linear test
– Pearson correlation / R²
Analysis of Variance
Organize results arithmetically
Total sum of squares in Y is SSTO = Σ(Yi − Ȳ)²
Partition this into two sources
– Model (explained by regression)
– Error (unexplained / residual)
Total Sum of Squares
SSTO = Σ(Yi − Ȳ)², with dfT = n − 1 and MST = SSTO/dfT
MST is the usual estimate of the variance of Y if there are no explanatory variables
SAS uses the term Corrected Total for this source
Uncorrected is ΣYi²
"Corrected" means that we subtract off the mean before squaring
Model Sum of Squares
SSR = Σ(Ŷi − Ȳ)²
dfR = 1 (due to the addition of the slope)
MSR = SSR/dfR
KNNL uses "regression" for what SAS calls "model", so SSR (KNNL) is the same as SS Model
Error Sum of Squares
SSE = Σ(Yi − Ŷi)²
dfE = n − 2 (we estimate both slope and intercept)
MSE = SSE/dfE
MSE is an estimate of the variance of Y taking into account (or conditioning on) the explanatory variable(s)
MSE = s²
ANOVA Table

Source       df     SS     MS
Regression   1      SSR    MSR = SSR/dfR
Error        n − 2  SSE    MSE = SSE/dfE
________________________________________
Total        n − 1  SSTO
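The partition SSTO = SSR + SSE can be checked numerically. Below is a minimal Python sketch (the data vectors x and y are hypothetical, chosen only for illustration):

```python
# ANOVA decomposition for simple linear regression on made-up data.
def anova_simple_regression(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Least-squares slope and intercept
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    # Partition the total sum of squares: SSTO = SSR + SSE
    ssto = sum((yi - ybar) ** 2 for yi in y)
    ssr = sum((yh - ybar) ** 2 for yh in yhat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    msr = ssr / 1            # df_R = 1
    mse = sse / (n - 2)      # df_E = n - 2
    return ssto, ssr, sse, msr, mse

x = [1, 2, 3, 4, 5]          # hypothetical data
y = [2.1, 3.9, 6.2, 8.1, 9.8]
ssto, ssr, sse, msr, mse = anova_simple_regression(x, y)
assert abs(ssto - (ssr + sse)) < 1e-9   # the partition holds exactly
```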
Expected Mean Squares
MSR and MSE are random variables
E(MSE) = σ²
E(MSR) = σ² + β1² Σ(Xi − X̄)²
When H0: β1 = 0 is true, E(MSR) = E(MSE)
F test
F* = MSR/MSE ~ F(dfR, dfE) = F(1, n − 2) when H0: β1 = 0 is true (see KNNL)
When H0: β1 = 0 is false, MSR tends to be larger than MSE, so we reject H0 when F* is large
Reject if F* > F(1 − α; dfR, dfE) = F(0.95; 1, n − 2)
In practice we use P-values
F test
When H0: β1 = 0 is false, F* has a noncentral F distribution; this can be used to calculate power
Recall that t* = b1/s(b1) tests H0: β1 = 0
It can be shown that (t*)² = F* (pg 71)
The two approaches give the same P-value
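The equivalence (t*)² = F* is easy to verify numerically. A short sketch, using the same hypothetical data as before:

```python
# Numerical check that (t*)^2 = F* in simple linear regression,
# where t* = b1 / s(b1) and F* = MSR / MSE (data are made up).
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = b1 ** 2 * sxx                   # SSR for simple regression
mse = sse / (n - 2)
f_star = (ssr / 1) / mse              # F* = MSR / MSE
t_star = b1 / math.sqrt(mse / sxx)    # s(b1) = sqrt(MSE / Sxx)
assert abs(t_star ** 2 - f_star) < 1e-6
```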
ANOVA Table

Source   df     SS     MS    F         P
Model    1      SSM    MSM   MSM/MSE   0.##
Error    n − 2  SSE    MSE
Total    n − 1  SSTO

Note: "Model" is used here instead of "Regression", which is more similar to SAS output
Examples
Tower of Pisa study (n = 13 cases)
proc reg data=a1;
  model lean=year;
run;
Toluca lot size study (n = 25 cases)
proc reg data=toluca;
  model hours=lotsize;
run;
Pisa Output
Number of Observations Read: 13
Number of Observations Used: 13

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                                                           <.0001
Error
Corrected Total
Pisa Output
Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept
year                                                             <.0001

(30.07)² = 904.2 (rounding error)
Toluca Output
Number of Observations Read: 25
Number of Observations Used: 25

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                                                           <.0001
Error
Corrected Total
Toluca Output
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept
lotsize                                                          <.0001

Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var

(10.29)² = 105.88
General Linear Test
A different view of the same problem
We want to compare two models
– Yi = β0 + β1Xi + ei (full model)
– Yi = β0 + ei (reduced model)
Compare the two models using the error sum of squares; the better model will have a smaller mean square error
General Linear Test
Let SSE(F) = SSE for the full model and SSE(R) = SSE for the reduced model
F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]
Compare with F(1 − α; dfR − dfF, dfF)
Simple Linear Regression
dfR = n − 1, dfF = n − 2, dfR − dfF = 1
F* = (SSTO − SSE)/MSE = SSR/MSE
Same test as before; this approach is more general
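For simple regression, the reduced model Yi = β0 + ei fits every observation with Ȳ, so SSE(R) = SSTO, and the general linear test statistic reduces to SSR/MSE. A sketch on hypothetical data:

```python
# General linear test for simple regression (made-up data):
# reduced model Yi = b0 + ei has SSE(R) = SSTO; full model adds X.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sse_full = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # df_F = n - 2
sse_red = sum((yi - ybar) ** 2 for yi in y)                         # df_R = n - 1, equals SSTO
# F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]
f = ((sse_red - sse_full) / ((n - 1) - (n - 2))) / (sse_full / (n - 2))
# Same statistic as SSR / MSE from the ANOVA table
ssr = b1 ** 2 * sxx
mse = sse_full / (n - 2)
assert abs(f - ssr / mse) < 1e-9
```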
Pearson Correlation
r is the usual correlation coefficient
It is a number between −1 and +1 that measures the strength of the linear relationship between two variables
Pearson Correlation
Notice that b1 = r (sY/sX), so b1 and r always have the same sign
The test of H0: β1 = 0 is similar to a test of H0: ρ = 0
R² and r²
R² = SSR/SSTO = 1 − SSE/SSTO
The ratio of explained to total variation
R² and r²
We use R² when the number of explanatory variables is arbitrary (simple and multiple regression)
r² = R² only for simple regression
R² is often multiplied by 100 and thereby expressed as a percent
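The identity r² = R² for simple regression, together with b1 = r(sY/sX), can be checked directly. A sketch on the same hypothetical data:

```python
# For simple regression (made-up data): r^2 equals R^2 = SSR/SSTO,
# and b1 = r * (s_Y / s_X).
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)          # SSTO
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
r = sxy / math.sqrt(sxx * syy)                   # Pearson correlation
b1 = sxy / sxx
ssr = b1 ** 2 * sxx
r_squared = ssr / syy                            # R^2 = SSR / SSTO
assert abs(r ** 2 - r_squared) < 1e-9
assert abs(b1 - r * math.sqrt(syy / sxx)) < 1e-9  # b1 = r * (s_Y / s_X)
```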
R² and r²
R² always increases when additional explanatory variables are added to the model
Adjusted R² = 1 − [(n − 1)/(n − p)](SSE/SSTO) "penalizes" larger models
It doesn't necessarily get larger
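A minimal sketch of the adjusted R² formula, where p is the number of regression parameters (p = 2 for simple regression: intercept plus slope); the SSE and SSTO values plugged in below are hypothetical:

```python
# Adjusted R^2 = 1 - [(n - 1)/(n - p)] * SSE/SSTO, where p is the
# number of parameters in the model (intercept included).
def adjusted_r2(sse, ssto, n, p):
    return 1 - ((n - 1) / (n - p)) * (sse / ssto)

# Hypothetical values: SSE = 0.092, SSTO = 38.508, n = 5, p = 2
print(round(adjusted_r2(0.092, 38.508, 5, 2), 4))
```

Because the (n − 1)/(n − p) factor grows as parameters are added, adjusted R² can fall when a new variable reduces SSE only slightly.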
Pisa Output
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                                                           <.0001
Error
Corrected Total

R-Square (SAS) = SSM/SSTO = 15804/15997 = 0.9879
Toluca Output
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model                                                           <.0001
Error
Corrected Total

R-Square (SAS) = SSM/SSTO
Background Reading
You may find Sections 2.10 and 2.11 interesting
– 2.10 provides cautionary remarks; we will discuss these as they arise
– 2.11 discusses the bivariate Normal distribution: similarities and differences, confidence interval for r
Program topic7.sas has the code to generate the ANOVA output
Read Chapter 3