Topic 23: Diagnostics and Remedies
Outline
  Diagnostics
    – residual checks
  ANOVA remedial measures
Diagnostics Overview
  We will take the diagnostics and remedial measures that we learned for regression and adapt them to the ANOVA setting
  Many things are essentially the same
  Some things require modification
Residuals
  Predicted values are the cell means: $\hat{Y}_{ij} = \bar{Y}_{i\cdot}$
  Residuals are the differences between the observed values and the cell means: $e_{ij} = Y_{ij} - \bar{Y}_{i\cdot}$
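As a minimal sketch of how these residuals could be obtained in SAS: the data set name a1 is an assumption, and the variable names eff and abrand are borrowed from the rust inhibitor example later in these slides. The OUTPUT statement writes the cell means and residuals to a new data set.

```sas
/* Assumption: a1 holds the response eff and the factor abrand */
proc glm data=a1;
  class abrand;
  model eff=abrand;
  /* pred = cell mean for the level, resid = eff minus that cell mean */
  output out=a2 p=pred r=resid;
run;
quit;
```

The output data set a2 then contains everything needed for the residual plots.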
Basic plots
  Plot the data vs the factor levels (the values of the explanatory variables)
  Plot the residuals vs the factor levels
  Construct a normal quantile plot and/or histogram of the residuals
KNNL Example
  KNNL p 777
  Compare 4 brands of rust inhibitor (X has r=4 levels)
  Response variable is a measure of the effectiveness of the inhibitor
  There are 10 units per brand (n=10)
Plots
  Data versus the factor
  Residuals versus the factor
  Normal quantile plot of the residuals
Plots vs the factor
  symbol1 v=circle i=none;
  proc gplot data=a2;
    plot (eff resid)*abrand;
  run;
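The normal quantile plot and histogram of the residuals could be produced along the following lines. This is a sketch that assumes a2 contains the residual variable resid, as in the plotting step above; it is not necessarily the code used to make the plots in these slides.

```sas
/* Assumption: a2 contains the residuals in the variable resid */
proc univariate data=a2 noprint;
  var resid;
  /* normal quantile plot with a reference line from the estimated mean and sd */
  qqplot resid / normal(mu=est sigma=est);
  /* histogram with a fitted normal curve overlaid */
  histogram resid / normal;
run;
```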
Data vs the factor
  Means look different
  …common spread in Y’s
Residuals vs the factor
  Odd distribution of points
QQ-plot
  The departure from the line is due to the odd spread (almost no spread for some points, large spread for others)
  Can try a nonparametric analysis (see the last slides)
General Summary
  Look for
    – Outliers
    – Variance that depends on level
    – Non-normal errors
  Plot residuals vs time and other variables if available
Homogeneity tests
  Homogeneity of variance (homoscedasticity)
  $H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_r^2$
  $H_1$: not all $\sigma_i^2$ are equal
  Several significance tests are available
Homogeneity tests
  Text discusses Hartley, modified Levene
  SAS has several, including Bartlett’s (essentially the likelihood ratio test) and several versions of Levene
Homogeneity tests
  There is a problem with assumptions
    – ANOVA is robust with respect to moderate deviations from Normality
    – ANOVA results can be sensitive to the homogeneity of variance assumption
  Some homogeneity tests are sensitive to the Normality assumption
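Levene’s Test
  Do ANOVA on the squared residuals from the original ANOVA
  Modified Levene’s test uses absolute values of the deviations instead of squares; in KNNL (the Brown-Forsythe version) the deviations are taken from the level medians rather than the level means
  Modified Levene’s test is recommended
  Another quick and dirty rule of thumb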
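To make the mechanics concrete, here is a sketch of the median-based version done by hand. The data set a1 and the variables strength and type are borrowed from the solder example that follows; the names med, ymed, dev, and absdev are hypothetical. The F test for type in the final ANOVA is the homogeneity test.

```sas
/* Assumption: a1 has the response strength and the factor type */
proc sort data=a1;
  by type;
run;

/* Median of the response within each level, one row per level */
proc means data=a1 noprint;
  by type;
  var strength;
  output out=med median=ymed;
run;

/* Absolute deviation of each observation from its level median */
data dev;
  merge a1 med(keep=type ymed);
  by type;
  absdev = abs(strength - ymed);
run;

/* One-way ANOVA on the absolute deviations */
proc glm data=dev;
  class type;
  model absdev=type;
run;
quit;
```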
KNNL Example
  KNNL p 785
  Compare the strengths of 5 types of solder flux (X has r=5 levels)
  Response variable is the pull strength, the force in pounds required to break the joint
  There are 8 solder joints per flux (n=8)
Scatterplot
Levene’s Test
  proc glm data=a1;
    class type;
    model strength=type;
    means type / hovtest=levene(type=abs);
  run;
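For comparison, the other homogeneity tests mentioned earlier can be requested the same way, one MEANS statement per test. Note that hovtest=levene(type=abs) takes absolute deviations from the level means, while hovtest=bf takes them from the level medians. This sketch reuses the data set and variable names from the step above.

```sas
proc glm data=a1;
  class type;
  model strength=type;
  means type / hovtest=bf;                   /* Brown-Forsythe: absolute deviations from medians */
  means type / hovtest=levene(type=square);  /* Levene on squared residuals */
  means type / hovtest=bartlett;             /* Bartlett's test */
run;
quit;
```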
ANOVA Table
  Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
  Model                                                      <.0001
  Error
  Corrected Total
  Common variance estimated to be 2.11
Output
  Levene's Test ANOVA of Absolute Deviations
  Source  DF  F Value  Pr > F
  type
  Error   35
  We reject the null hypothesis and conclude that the variance is not constant
Means and SDs
  Level of type  N  strength Mean  strength Std Dev
Remedies
  Delete outliers
    – Is their removal important?
  Use weights (weighted regression)
  Transformations
  Nonparametric procedures
What to do here?
  Not really any obvious outliers
  Do not see a pattern of increasing or decreasing variance, or skewed distributions
  Will consider
    – Weighted ANOVA
    – Mixed model ANOVA
Weighted least squares
  We used this with regression
    – Obtained a model for how the sd depends on the explanatory variable (plotted absolute value of residual vs x)
    – Then used weights inversely proportional to the estimated variance
Weighted Least Squares
  Here we can compute the variance for each level
  Use the reciprocals of these variances as weights in PROC GLM
  We will illustrate with the soldering example from KNNL
Obtain the variances and weights
  proc sort data=a1; by type; run;   /* BY processing requires sorted data */
  proc means data=a1;
    var strength;
    by type;
    output out=a2 var=s2;
  run;
  data a2;
    set a2;
    wt=1/s2;
  run;
  NOTE. Data set a2 has 5 cases
Proc Means Output
  Level of type  N  strength Mean  strength Std Dev
Merge and then use the weights in PROC GLM
  data a3;
    merge a1 a2;
    by type;
  run;
  proc glm data=a3;
    class type;
    model strength=type;
    weight wt;
    lsmeans type / cl;
  run;
Output
  Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
  Model                                                      <.0001
  Error
  Corrected Total
  With these weights the data have been standardized to have a variance of 1
LSMEANS Output
  type  strength LSMEAN  Standard Error  Pr > |t|  95% Confidence Limits
  Because of the weights, each standard error is based only on the sample variance of its own level
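A short derivation shows why the weighted analysis behaves this way. It is a sketch under the assumption that the weights are exactly $w_i = 1/s_i^2$, where $s_i^2$ is the sample variance of level $i$ (divisor $n_i - 1$), and $N = \sum_i n_i$ is the total sample size. Because the weights are constant within a level, the fitted values are still the cell means $\bar{Y}_{i\cdot}$.

```latex
\[
\mathrm{MSE}_w
  = \frac{\sum_{i=1}^{r}\sum_{j=1}^{n_i} w_i \,(Y_{ij}-\bar{Y}_{i\cdot})^2}{N-r}
  = \frac{\sum_{i=1}^{r} (n_i-1)\, s_i^2 / s_i^2}{N-r}
  = \frac{N-r}{N-r}
  = 1
\]
\[
\mathrm{SE}(\bar{Y}_{i\cdot})
  = \sqrt{\frac{\mathrm{MSE}_w}{n_i\, w_i}}
  = \frac{s_i}{\sqrt{n_i}}
\]
```

So the weighted error mean square is 1 by construction, and the standard error of each level mean depends only on that level's own sample variance. In the soldering example, $N - r = 40 - 5 = 35$.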
Mixed Model ANOVA
  Relax the assumption of constant variance rather than including a “known” weight
  This involves moving to a mixed model procedure
  This topic will not be on the exam, but you should be aware of these modeling capabilities
SAS Code
  proc glimmix data=a1;
    class type;
    model strength=type / ddfm=kr;
    random residual / group=type;
  run;
  This allows the variance to differ in each level, and a degrees of freedom adjustment is used to account for this
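The same heterogeneous-variance model can also be written in PROC MIXED, where the REPEATED statement with a GROUP= option plays the role of RANDOM RESIDUAL. This is offered as an alternative sketch, not as the code used for the output in these slides.

```sas
proc mixed data=a1;
  class type;
  model strength=type / ddfm=kr;
  repeated / group=type;   /* separate residual variance for each level of type */
run;
```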
GLIMMIX OUTPUT
  Fit Statistics
    -2 Res Log Likelihood
    AIC (smaller is better)
    AICC (smaller is better)
    BIC (smaller is better)
    CAIC (smaller is better)
    HQIC (smaller is better)
    Generalized Chi-Square    35.00
    Gener. Chi-Square / DF    1.00
  Covariance Parameter Estimates
    Cov Parm       Group  Estimate  Standard Error
    Residual (VC)  type
    Residual (VC)  type
    Residual (VC)  type
    Residual (VC)  type
    Residual (VC)  type
  Type III Tests of Fixed Effects
    Effect  Num DF  Den DF  F Value  Pr > F
    type                             <.0001
  Really 3 groups of variances
SAS Code
  proc glimmix data=a1;
    class type;
    model strength=type / ddfm=kr;
    random residual / group=type1;
  run;
  Type1 was created to identify Types 1 and 2, Type 3, and Types 4 and 5 as 3 groups
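The slides do not show how type1 was built. One way it might be created is sketched below, assuming type is numeric and coded 1 through 5; the group codes 1, 2, 3 are hypothetical labels.

```sas
/* Assumption: type is numeric with values 1-5; adjust the comparisons if it is character */
data a1;
  set a1;
  if type in (1, 2) then type1 = 1;        /* Types 1 and 2 share a variance */
  else if type = 3 then type1 = 2;         /* Type 3 on its own */
  else if type in (4, 5) then type1 = 3;   /* Types 4 and 5 share a variance */
run;
```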
GLIMMIX OUTPUT
  Fit Statistics
    -2 Res Log Likelihood
    AIC (smaller is better)
    AICC (smaller is better)
    BIC (smaller is better)
    CAIC (smaller is better)
    HQIC (smaller is better)
    Generalized Chi-Square    35.00
    Gener. Chi-Square / DF    1.00
  Covariance Parameter Estimates
    Cov Parm       Group  Estimate  Standard Error
    Residual (VC)  Grp
    Residual (VC)  Grp
    Residual (VC)  Grp
  Type III Tests of Fixed Effects
    Effect  Num DF  Den DF  F Value  Pr > F
    type                             <.0001
  Better BIC, but the same general conclusion about type
Transformation Guides
  When $\sigma_i^2$ is proportional to $\mu_i$, use $\sqrt{y}$
  When $\sigma_i$ is proportional to $\mu_i$, use $\log(y)$
  When $\sigma_i$ is proportional to $\mu_i^2$, use $1/y$
  For proportions, use $\arcsin(\sqrt{y})$
    – arsin(sqrt(y)) in a SAS data step
  Box-Cox transformation
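In a data step these transformations might look like the sketch below. The data set names raw and trans and the variables y (a positive response) and p (a proportion between 0 and 1) are hypothetical.

```sas
/* Assumption: raw has a positive response y and a proportion p in [0, 1] */
data trans;
  set raw;
  sqrt_y  = sqrt(y);          /* variance proportional to the mean */
  log_y   = log(y);           /* sd proportional to the mean; natural log, needs y > 0 */
  recip_y = 1 / y;            /* sd proportional to the squared mean */
  asin_p  = arsin(sqrt(p));   /* proportions */
run;
```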
Example
  Consider the study on KNNL p 790
  Y: time between computer failures
  X: three locations
  data a3;
    infile 'u:\.www\datasets512\CH18TA05.txt';
    input time location interval;
  run;
  symbol1 v=circle;
  proc gplot;
    plot time*location;
  run;
Scatterplot
  Outlier or skewed distribution?
  Can consider a transformation first
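Box-Cox Transformation
  Can regress the log of the level sd on the log of the level mean; with slope $b_1$, the power to raise Y to is $1 - b_1$
  Can try various “convenient” powers
  Can use SAS directly to calculate the power
Regression of logsig on logmu
  Fitted line: E(logsig) = $b_0 + b_1$ logmu
  Power $1 - b_1$ should be ≈ 0.20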
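The regression approach could be carried out roughly as follows. This is a sketch that assumes a3 holds time and location as read in the example above; the data set name b and the variables mu and sig are hypothetical, while logmu and logsig follow the names used on the next slide.

```sas
/* One row per location with the sample mean and sd of time */
proc means data=a3 noprint nway;
  class location;
  var time;
  output out=b mean=mu std=sig;
run;

/* Logs of the level means and sds */
data b;
  set b;
  logmu  = log(mu);
  logsig = log(sig);
run;

/* Slope b1 of logsig on logmu; the suggested power is 1 - b1 */
proc reg data=b;
  model logsig = logmu;
run;
quit;
```

With only three locations the slope is a rough guide at best, which is one reason to also check the likelihood-based Box-Cox estimate.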
Using SAS
  proc transreg data=a3;
    model boxcox(time / lambda=-2 to 2 by .2) = class(location);
  run;
Output
  Box-Cox Transformation Information for time
  Lambda  R-Square  Log Like
  < - Best Lambda
  * - 95% Confidence Interval
  + - Convenient Lambda
Transforming data in SAS
  data a3;
    set a3;
    transtime = time**0.20;
  run;
  symbol1 v=circle i=none;
  proc gplot;
    plot transtime*location;
  run;
Much more constant spread in data!
Nonparametric approach
  Based on ranks
  See KNNL section 18.7, p 795
  See the SAS procedure NPAR1WAY
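A minimal call for the rust inhibitor data might look like this. The data set name a1 is an assumption; eff and abrand are the response and factor from that example. The WILCOXON option requests rank-sum scores, and with more than two levels the procedure reports the Kruskal-Wallis test.

```sas
/* Assumption: a1 holds the rust inhibitor data with eff and abrand */
proc npar1way data=a1 wilcoxon;
  class abrand;
  var eff;
run;
```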
Rust Inhibitor Analysis
  Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
  Model                                                      <.0001
  Error
  Corrected Total
  Highly significant F test. Even if there is a violation of Normality, the evidence is overwhelming
Nonparametric Analysis
  Wilcoxon Scores (Rank Sums) for Variable eff, Classified by Variable abrand
  abrand  N  Sum of Scores  Expected Under H0  Std Dev Under H0  Mean Score
  Average scores were used for ties.
  Kruskal-Wallis Test
    Chi-Square
    DF               3
    Pr > Chi-Square  <.0001
Last slide
  We’ve finished most of Chapters 17 and 18.
  We used program topic23.sas to generate the output.