Introduction to SAS Essentials
Mastering SAS for Data Analytics
Alan Elliott and Wayne Woodward
SAS ESSENTIALS -- Elliott & Woodward

Chapter 14: ANALYSIS OF VARIANCE Part 2

LEARNING OBJECTIVES
- To be able to use SAS® procedures to perform analysis of covariance
- To be able to use SAS procedures to perform two-factor ANOVA using PROC GLM
- To be able to perform two-factor ANOVA using PROC MIXED
- To be able to use SAS procedures to perform repeated measures with a grouping factor

14.1 ANALYSIS OF COVARIANCE
An analysis of covariance (ANCOVA) is a combination of analysis of variance and regression. The covariate is a quantitative variable that is related to the dependent variable. The covariate is not controlled by the investigator but is some value intrinsic to the subject (or entity). In ANCOVA, the group means are adjusted by the covariate, and these adjusted means are compared with each other. Including this covariate variable in the model may explain some of the variability, resulting in a more powerful statistical test.
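
As a compact restatement of the idea above, one common way to write a one-factor ANCOVA model is sketched here. The notation is ours and is not taken from the text: Y_ij is the response for subject j in group i, X_ij is the covariate, tau_i is the group effect, and beta is the slope shared by all groups.

```latex
Y_{ij} = \mu + \tau_i + \beta\,(X_{ij} - \bar{X}_{..}) + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)

% Adjusted mean for group i: the observed group mean shifted to the overall covariate mean
\bar{Y}_{i,\mathrm{adj}} = \bar{Y}_{i.} - \hat{\beta}\,(\bar{X}_{i.} - \bar{X}_{..})
```

Under this form, two groups with the same covariate mean have adjusted means that differ exactly as their raw means do; the adjustment matters when the groups differ on the covariate.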

An Example ANCOVA Analysis
A fifth-grade math teacher randomly assigns the 18 students in her class to three different teaching methods for individualized instruction of a certain concept. The outcome of interest is the score on an exam (FINAL) covering the concept after the instruction period. An exam of basic math skills (EXAM) is given to the students before beginning the instruction. (This is the covariate.) The variable FINAL is compared across the three teaching methods. The covariate EXAM is assumed to be linearly related to FINAL. The ANCOVA tests the null hypothesis that the FINAL means for the three methods, adjusted by EXAM, are not different.

Data for the Example
METHOD  EXAM  FINAL
Etc…
Notice that METHOD is the teaching method, EXAM is the exam given at the start of the class (the covariate), and FINAL is the final test outcome.

The ANCOVA Steps…
In this text, this ANCOVA analysis is shown as a multistep approach.
1. Perform a test to determine if the FINAL-by-EXAM linear relationships by METHOD are parallel. That is, when you compare the regression lines for each of the three METHODs, the lines should theoretically be parallel. If the lines are sufficiently nonparallel, ANCOVA is not the appropriate analysis to perform. The statistical test used for parallelism is an F-test.
2. If the F-test in step 1 does not reject parallelism, then another F-test is used to compare the adjusted means across METHOD.
3. If there are differences in means, then appropriate multiple comparisons can be performed to determine which groups differ.
Do the Hands On Example p 329 (AGLM ANCOVA.SAS)

Steps for ANCOVA Analysis
Part 1: this model includes an interaction term (EXAM*METHOD) that tests for parallelism.

    PROC GLM;
      CLASS METHOD;
      MODEL FINAL=EXAM METHOD EXAM*METHOD;
      TITLE 'Analysis of Covariance Example';
    RUN;

If the interaction term is not significant, drop it from the model.

    PROC GLM;
      CLASS METHOD;
      MODEL FINAL=EXAM METHOD;
      LSMEANS METHOD/STDERR PDIFF ADJUST=SCHEFFE;
    RUN;
    QUIT;

The LSMEANS statement performs a post hoc test to compare the (adjusted) METHOD means (used if the H0 of equal means is rejected).

Step 1: Preliminary Results
The test for interaction is non-significant (which is good!). The test for interaction asks the question: are these lines close to parallel? When you do not reject H0 for EXAM*METHOD (when p > 0.05), you conclude that the lines are close to parallel, so you can consider them parallel in the primary analysis.
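
A plot can complement the F-test for parallelism. The following is a minimal sketch, assuming the example data are stored in a data set called ANCOVA_DATA (a hypothetical name; the slides do not give one). It uses the REG statement of PROC SGPLOT with the GROUP= option to draw a separately fitted line for each METHOD.

```sas
/* Sketch only: ANCOVA_DATA is a hypothetical data set name holding
   the variables METHOD, EXAM and FINAL from the example. */
PROC SGPLOT DATA=ANCOVA_DATA;
   /* One separately fitted regression line per teaching method */
   REG X=EXAM Y=FINAL / GROUP=METHOD;
   TITLE 'FINAL by EXAM with a Fitted Line for Each METHOD';
RUN;
```

Lines that look roughly parallel are consistent with a non-significant EXAM*METHOD term; the plot is a visual aid and does not replace the test.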

Step 2: Analysis of Covariance
With the interaction out of the model, you now test for a significant METHOD effect. In this case p < 0.05, so you conclude that the (adjusted) means by METHOD are different. By assuming that the lines are parallel, the ANCOVA treats the regression lines for the three METHODs as having a common slope and differing only in intercept.

Post Hoc Comparisons
Because you conclude that the (adjusted) means by METHOD are different, you perform a post hoc multiple comparison test to determine which means are different. These results come from the code:

    LSMEANS METHOD/STDERR PDIFF ADJUST=SCHEFFE;

From this, conclude that means 1 and 3 are different (p=0.0056), means 2 and 3 are different (p=0.0007), and means 1 and 2 are not different (p=0.684).

Overall Conclusion
Because a high FINAL score is the goal, there is evidence to support the contention that the adjusted mean for METHOD 3 is significantly higher than the adjusted means for METHODs 1 and 2, and thus that METHOD 3 is the preferred method.

Two-Factor ANOVA Using PROC GLM
A two-way ANOVA is an analysis that allows you to simultaneously evaluate the effects of two experimental variables (factors). Each factor is a "grouping" variable such as type of treatment, gender, brand, and so on. The two-way ANOVA tests determine whether the factors are important (significant) either separately (called main effects) or in combination (via an interaction), which is the combined effect of the two factors.

p x q factorial design
The dimensions of a factorial design depend on how many levels of each factor are used. For example, a design in which the first factor has two categories and the second has three categories is called a 2 x 3 (2 by 3) factorial design. Generally, the two-way ANOVA is called a p x q factorial design.

Understanding Fixed and Random Factors
FIXED FACTOR: When the factor levels completely define all the classifications of interest. (Example: gender.)
RANDOM FACTOR: When the levels used in the experiment are randomly selected from the population of possible levels. (Example: three different doses of a drug, when there are many possible doses.)
Two-way ANOVA models are classified as:
- Both factors fixed: Model I ANOVA
- Both factors random: Model II ANOVA
- One random, one fixed: Model III ANOVA

Interaction Effect
Interaction implies that the pattern of means across groups is inconsistent (as illustrated in the figure). If there is an interaction effect, main effects cannot be examined directly because the interaction effect shows that differences across one main effect are not consistent across all levels of the other factor.
Figure panels: Interaction; No Interaction.

Typical Hypotheses for Model I Design
Test for an interaction effect:
H0: There is no interaction effect.
Ha: There is an interaction effect.
If there is no interaction effect, it makes sense to test hypotheses about main effects. The "main effects" hypotheses for factor A are as follows:
H0: Means are equal across levels of A summed over B.
Ha: Means are not equal across levels of A summed over B.
A similar hypothesis is used for B main effects.
Do Hands on Example p 335 (AGLM 2FACTOR.SAS)
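
The hypotheses above can also be stated against a model equation. The following is a sketch of the usual fixed-effects two-factor model; the symbols are our notation, not the text's: alpha_i is the effect of level i of factor A, beta_j the effect of level j of factor B, and (alpha beta)_ij the interaction term.

```latex
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1,\dots,p, \quad j = 1,\dots,q

% Interaction test
H_0 : (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\qquad
H_a : (\alpha\beta)_{ij} \neq 0 \ \text{for at least one } (i, j)
```

In this notation the main-effect test for A asks whether all alpha_i are zero, and similarly for B.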

Code for a Two-Factor ANOVA

    PROC GLM;
      CLASS CONDITION STATUS;
      MODEL RESPONSE=CONDITION STATUS CONDITION*STATUS;
      MEANS CONDITION STATUS CONDITION*STATUS;
    RUN;
    QUIT;

The MODEL statement defines the two-way analysis with the two factors CONDITION and STATUS and an interaction test, CONDITION*STATUS. The MEANS statement displays simple statistics by factor.

Results for a Two-Way ANOVA
Results are in the Type III SS table of the output. Typically, you first check to see if there is a significant interaction effect. In this case p=0.79, so you conclude that there is NO significant interaction effect. If there is no interaction, examine the main effects tests in the ANOVA table by comparing marginal means.

Examine Marginal Effects (since no interaction)
Marginal effects tests indicate no CONDITION effect (p=0.67) and a statistically significant STATUS effect (p=0.0056). You conclude that there is no difference in means across CONDITION, but there is a difference in means across STATUS.

Graphical Results for Two-Way ANOVA
Because there are only two categories of STATUS, we do not need to perform multiple comparisons. The clear conclusion is that those in STATUS1 (i.e., those who had previously been trying to lose weight; mean loss = -6.9) had significantly more weight loss on average than those in STATUS2 (mean loss = -0.92).
Figure: mean loss for each level of STATUS.
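
In this example the interaction was not significant, so the marginal means were compared. Had the interaction been significant, one option would be to compare the levels of one factor within each level of the other. The following is a minimal sketch of that idea, reusing the variable names from the two-factor example and the SLICE= option of the LSMEANS statement; it is an illustration and not part of the original hands-on code.

```sas
/* Sketch only: reuses the variable names from the two-factor example.
   No DATA= option is given, so the most recently created data set
   is used, as in the original code. */
PROC GLM;
   CLASS CONDITION STATUS;
   MODEL RESPONSE=CONDITION STATUS CONDITION*STATUS;
   /* Test the CONDITION effect separately within each level of STATUS */
   LSMEANS CONDITION*STATUS / SLICE=STATUS;
RUN;
QUIT;
```

Each slice is an F-test of CONDITION at a single STATUS level, which is a reasonable follow-up when the overall main effects cannot be interpreted directly.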

14.2 GOING DEEPER: TWO-FACTOR ANOVA USING PROC MIXED
PROC MIXED performs mixed-model analysis of variance using features not available in PROC GLM.
PROC GLM
- Is designed primarily for fixed effects models.
- Calculates based on ordinary least squares and method of moments.
- Defines all effects as fixed and adjusts for random effects after estimation.
- Can perform mixed model analysis, but some results are not optimally calculated.
PROC MIXED
- Is designed for mixed effects models.
- Uses generalized least squares for fixed effects and restricted maximum likelihood (REML) to estimate variance components.
- Allows selection of correlation models.
Do Hands on Example p 338 (AMIXED1.SAS)

Code for PROC MIXED ANOVA

    PROC MEANS MAXDEC=2 MEAN DATA=HEIGHTS;
      CLASS FAMILY GENDER;
      VAR HEIGHT;
    RUN;
    PROC MIXED;
      CLASS FAMILY GENDER;
      MODEL HEIGHT = GENDER;
      RANDOM FAMILY FAMILY*GENDER;
    RUN;

PROC MEANS displays simple statistics by factor. The CLASS statement names the categorical factors. Note how the RANDOM statement indicates which factors are random. (Any interaction with a random factor is random.)

How the PROC MIXED statement is constructed

    CLASS FAMILY GENDER;
    MODEL HEIGHT = GENDER;
    RANDOM FAMILY FAMILY*GENDER;

FAMILY and GENDER appear in the CLASS statement because they are both grouping-type factors. The MODEL statement includes only the fixed factor GENDER. The RANDOM statement includes the random factors FAMILY and FAMILY*GENDER.

Output for Analysis (Means)
The first output is the means by factor.

Output Testing the Fixed Factor
This shows the results of a hypothesis test that there is no difference in means by GENDER, which is marginally non-significant (p = 0.07). SAS reports no test involving FAMILY.
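
Although no F-test is reported for the random factor, PROC MIXED does estimate its variance components. The following is a minimal sketch, not part of the original hands-on code, that adds the COVTEST option to print asymptotic standard errors and Wald tests for those estimates. METHOD=REML is written out for clarity even though it is the procedure's default.

```sas
/* Sketch only: same model as the example, with two options added.
   METHOD=REML is the default and is spelled out for readability;
   COVTEST requests standard errors and Wald tests for the
   FAMILY and FAMILY*GENDER variance components. */
PROC MIXED DATA=HEIGHTS METHOD=REML COVTEST;
   CLASS FAMILY GENDER;
   MODEL HEIGHT = GENDER;
   RANDOM FAMILY FAMILY*GENDER;
RUN;
```

Wald tests on variance components are large-sample approximations, so they should be read with caution in a small data set.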

What If RANDOM Had Not Been Properly Assigned?
Consider the model statement

    MODEL HEIGHT = GENDER FAMILY GENDER*FAMILY;

with FAMILY and GENDER*FAMILY placed in the MODEL statement rather than in the RANDOM statement. This model results in a DIFFERENT ANSWER! In this case GENDER is significant.

14.3 GOING DEEPER: REPEATED MEASURES WITH A GROUPING FACTOR
A common design in medical research and other settings is to observe data for the same subject under different circumstances or over time (longitudinal). Although it is also possible to perform this analysis using PROC GLM, the approach used in PROC MIXED is preferred. In addition, a distinct advantage of using PROC MIXED is that it allows missing values in the design, whereas PROC GLM deletes an entire record from analysis if there is one missing observation.
Do Hands On Example p 341 (AREPEAT1.SAS)

Data for Repeated Measures Analysis
Suppose you observe a response to a drug over 4 hours for seven subjects, three male and four female. In this case, the repeated measure, HOUR, is longitudinal.
SUB  GENDER  HOUR1  HOUR2  HOUR3  HOUR4
(Data table: one row per subject, three M followed by four F; one HOUR value is flagged as missing.)

Create a Revised Data Set
The data must be manipulated to put them in the right format for the analysis. In this case, the variables OUTCOME and TIME are created from HOUR1-HOUR4 and output to a data set named REPMIXED.

    DATA REPMIXED(KEEP= SUBJECT GENDER TIME OUTCOME);
      INPUT SUBJECT GENDER $ HOUR1-HOUR4;
      OUTCOME = HOUR1; TIME = 1; OUTPUT;
      OUTCOME = HOUR2; TIME = 2; OUTPUT;
      OUTCOME = HOUR3; TIME = 3; OUTPUT;
      OUTCOME = HOUR4; TIME = 4; OUTPUT;
      DATALINES;
    Etc…

The code does the following:
(a) assigns each of the four HOUR values to the variable OUTCOME;
(b) creates a variable called TIME that contains the time marker (1-4);
(c) for each of these assignments, outputs the variables to the new data set REPMIXED using the OUTPUT statement.
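
The same wide-to-long restructuring can be written more compactly with an ARRAY and a DO loop. The following is a minimal sketch, assuming the wide data have already been read into a data set called WIDE with one row per subject; WIDE and REPMIXED2 are hypothetical names used only for this illustration.

```sas
/* Sketch only: WIDE is a hypothetical data set that already holds
   SUBJECT, GENDER and HOUR1-HOUR4, one row per subject. */
DATA REPMIXED2(KEEP=SUBJECT GENDER TIME OUTCOME);
   SET WIDE;
   ARRAY HOURS{4} HOUR1-HOUR4;
   DO TIME = 1 TO 4;
      OUTCOME = HOURS{TIME};   /* value observed at this hour */
      OUTPUT;                  /* one row per subject per hour */
   END;
RUN;
```

The result has four rows per subject, matching the layout produced by the four explicit OUTPUT statements; the loop form scales more easily if more time points are added.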

The revised data set
The slide shows a partial listing of the revised data set.

Basic Model for this Analysis

    PROC MIXED DATA=REPMIXED;
      CLASS GENDER TIME SUBJECT;
      MODEL OUTCOME=GENDER TIME GENDER*TIME;
      REPEATED / TYPE=UN SUB=SUBJECT;
    RUN;

All factors are FIXED in this analysis. The REPEATED statement tells SAS that the measurements are repeated within SUBJECT (there are four observations per subject). The TYPE=UN option tells SAS to use an unstructured covariance matrix.

Covariance Structures
The best fitting model for an analysis may depend on how the variance components are structured, that is, on which structure best explains the variance in the model. Some covariance structures include:
- AR(1) (autoregressive) assumes that measurements closer together in time are more highly correlated, with the correlation declining exponentially as the time between them increases. That is, measurements at TIME1 and TIME2 are more highly correlated than are measurements at TIME1 and TIME3.
- CS (compound symmetry) assumes homogeneous variances and a correlation that is constant regardless of how far apart the measurements are.
- UN (unstructured) allows all variances and covariances to be different.
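
The three structures can be summarized by the covariance they imply between measurements at times s and t on the same subject. The following is a sketch in our own notation (sigma squared for a common variance, rho for a correlation, sigma_st for a free parameter); SAS parameterizes CS slightly differently, but the implied pattern is the same.

```latex
\text{CS:}\quad
\operatorname{Cov}(Y_s, Y_t) =
\begin{cases}
\sigma^2 & s = t \\
\sigma^2 \rho & s \neq t
\end{cases}

\text{AR(1):}\quad
\operatorname{Cov}(Y_s, Y_t) = \sigma^2 \rho^{|s - t|}

\text{UN:}\quad
\operatorname{Cov}(Y_s, Y_t) = \sigma_{st}
```

With four time points, CS and AR(1) each use 2 covariance parameters, while UN uses 4(4+1)/2 = 10, which is why a fit criterion that penalizes the number of parameters is useful for choosing among them.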

What is the best Covariance Structure?
Look at several possible models. Each model has a different TYPE= selected.

    PROC MIXED DATA=REPMIXED;
      CLASS GENDER TIME SUBJECT;
      MODEL OUTCOME=GENDER TIME GENDER*TIME;
      REPEATED / TYPE=UN SUB=SUBJECT;
    RUN;
    PROC MIXED DATA=REPMIXED;
      CLASS GENDER TIME SUBJECT;
      MODEL OUTCOME=GENDER TIME GENDER*TIME;
      REPEATED / TYPE=CS SUB=SUBJECT;
    RUN;
    PROC MIXED DATA=REPMIXED;
      CLASS GENDER TIME SUBJECT;
      MODEL OUTCOME=GENDER TIME GENDER*TIME;
      REPEATED / TYPE=AR(1) SUB=SUBJECT;
    RUN;

Results from Multiple Models
In this example, based on the AIC criterion, the AR(1) specification best fits the data. (Smaller is better.)
- AIC (unstructured) = 71.0
- AIC (compound symmetry) = 77.6
- AIC (autoregressive) = 70.2 (best result)
AIC is found in the fit statistics table of the output; the table shown on the slide is from the TYPE=AR(1) run.
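
Rather than reading AIC off the printed output for each run, the fit statistics table can be captured in a data set with ODS OUTPUT. The following is a minimal sketch for the AR(1) run; FIT_AR1 is a hypothetical data set name, and the same pattern would be repeated for the other TYPE= choices.

```sas
/* Sketch only: FIT_AR1 is a hypothetical output data set name. */
ODS OUTPUT FitStatistics=FIT_AR1;
PROC MIXED DATA=REPMIXED;
   CLASS GENDER TIME SUBJECT;
   MODEL OUTCOME=GENDER TIME GENDER*TIME;
   REPEATED / TYPE=AR(1) SUB=SUBJECT;
RUN;

/* The captured table has one row per fit criterion, including AIC. */
PROC PRINT DATA=FIT_AR1;
RUN;
```

Saving one such data set per covariance structure makes it straightforward to line up the criteria side by side.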

Choose which Analysis to Use
The results from the TYPE=AR(1) analysis are used as the final model, since it had the best fit. Conclude that there is no interaction effect, that there is a TIME effect, and no GENDER effect (or a marginal GENDER effect).

Visual Examination of the Results
Use PROC GPLOT to display the means by factor (using the AREPEAT2 code). First output the means using PROC MEANS; then, using the means saved in the data set FORPLOT, plot the results with PROC GPLOT.

    PROC SORT DATA=REPMIXED; BY GENDER TIME; RUN;
    PROC MEANS NOPRINT;
      BY GENDER TIME;
      VAR OUTCOME;   /* restrict the means to OUTCOME */
      OUTPUT OUT=FORPLOT MEAN=;
    RUN;
    TITLE "Repeated Measures Example";
    PROC GPLOT DATA=FORPLOT;
      PLOT OUTCOME*GENDER=TIME;
      SYMBOL1 V=CIRCLE I=JOIN L=1 C=BLACK;
      SYMBOL2 V=DOT I=JOIN L=2 C=BLUE;
      SYMBOL3 V=STAR I=JOIN L=2 C=RED;
      SYMBOL4 V=SQUARE I=JOIN L=2 C=GREEN;
    RUN;

Results of PROC GPLOT
The plot gives visual confirmation that there is a TIME difference; at least it appears that TIME2 (blue) is always lower than TIME3 (red).

A Second Plot
Use PROC PLOT to produce a second plot showing the same data from a different perspective.

Comparisons
To determine which TIMEs are significantly different, include the following statement in the PROC MIXED code that uses the AR(1) covariance structure:

    LSMEANS TIME/PDIFF;

The output indicates a significant difference in marginal means from TIME1 to TIME2 (difference = 1.36, p = 0.0016), and so on.
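
For reference, here is how the LSMEANS statement fits into the full AR(1) model. This simply combines the statements already given for this analysis; nothing beyond them is assumed.

```sas
/* AR(1) repeated measures model with pairwise TIME comparisons */
PROC MIXED DATA=REPMIXED;
   CLASS GENDER TIME SUBJECT;
   MODEL OUTCOME=GENDER TIME GENDER*TIME;
   REPEATED / TYPE=AR(1) SUB=SUBJECT;
   LSMEANS TIME / PDIFF;   /* p-values for all pairwise TIME differences */
RUN;
```

The PDIFF p-values here are unadjusted; an ADJUST= option on the LSMEANS statement can be added if a multiple comparison adjustment is wanted.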

14.4 SUMMARY
This chapter discusses the use of PROC GLM to perform ANCOVA and two-factor ANOVA with fixed effects. We also introduce PROC MIXED and discuss its use for the two-factor mixed model and for a repeated measures analysis with a grouping variable.
Continue to Chapter 15: NONPARAMETRIC ANALYSIS