Use of Proc GLM to Analyze Experimental Data

Animal Science 500 Lecture No. October , 2010

PROC GLM The GLM procedure uses the method of least squares to fit general linear models. Among the statistical methods available in PROC GLM are: Regression, Analysis of variance, Analysis of covariance, Multivariate analysis of variance (MANOVA), and partial correlation. SAS/STAT(R) 9.22 User's Guide

PROC GLM PROC GLM analyzes data within the framework of general linear models. PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables. The independent variables can be either classification variables, which divide the observations into discrete groups, or continuous variables. SAS/STAT(R) 9.22 User's Guide

PROC GLM Thus, the GLM procedure can be used for many different analyses, including the following: simple regression; multiple regression; analysis of variance (ANOVA), especially for unbalanced data; analysis of covariance; response surface models; weighted regression; polynomial regression; partial correlation; multivariate analysis of variance (MANOVA); repeated measures analysis of variance. SAS/STAT(R) 9.22 User's Guide

PROC GLM PROC GLM enables you to specify any degree of interaction (crossed effects) and nested effects. It also provides for polynomial, continuous-by-class, and continuous-nesting-class effects. Through the concept of estimability, the GLM procedure can provide tests of hypotheses for the effects of a linear model regardless of the number of missing cells or the extent of confounding. PROC GLM displays the sum of squares (SS) associated with each hypothesis tested and, upon request, the form of the estimable functions employed in the test. PROC GLM can produce the general form of all estimable functions. SAS/STAT(R) 9.22 User's Guide

PROC GLM The REPEATED statement enables you to specify effects in the model that represent repeated measurements on the same experimental unit for the same response, providing both univariate and multivariate tests of hypotheses. The RANDOM statement enables you to specify random effects in the model; expected mean squares are produced for each Type I, Type II, Type III, Type IV, and contrast mean square used in the analysis. Upon request, tests that use appropriate mean squares or linear combinations of mean squares as error terms are performed. SAS/STAT(R) 9.22 User's Guide
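A minimal sketch of the RANDOM statement with its TEST option, assuming a hypothetical randomized block data set named rcb. The data set name, the variable names block, trt, and y, and all data values are invented purely for illustration and do not come from the lecture.

```sas
/* Hypothetical data: 3 blocks x 3 treatments; values invented for illustration */
data rcb;
  input block trt y;
  datalines;
1 1 10.2
1 2 11.5
1 3 12.1
2 1  9.8
2 2 11.0
2 3 12.4
3 1 10.5
3 2 11.9
3 3 12.8
;
run;

proc glm data=rcb;
  class block trt;
  model y = block trt;
  /* Declare block as a random effect; TEST requests F tests that use
     error terms chosen from the expected mean squares */
  random block / test;
run; quit;
```

Without the TEST option, the RANDOM statement only prints the expected mean squares and the default F tests still use the residual mean square.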

PROC GLM The ESTIMATE statement enables you to specify an L vector for estimating a linear function of the parameters, Lβ. The CONTRAST statement enables you to specify a contrast vector or matrix for testing the hypothesis that Lβ = 0. When specified, the contrasts are also incorporated into analyses that use the MANOVA and REPEATED statements. The MANOVA statement enables you to specify both the hypothesis effects and the error effect to use for a multivariate analysis of variance. SAS/STAT(R) 9.22 User's Guide
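The ESTIMATE and CONTRAST statements can be sketched on a one-way layout. The data set oneway, the treatment labels, and the response values below are hypothetical and invented only so that the example is self-contained.

```sas
/* Hypothetical one-way data: three treatments, two observations each */
data oneway;
  input trt $ y;
  datalines;
control 10.1
control  9.7
drugA   11.4
drugA   11.9
drugB   12.3
drugB   12.8
;
run;

proc glm data=oneway;
  class trt;
  model y = trt;
  /* Class levels sort as control, drugA, drugB, so coefficients follow that order */
  estimate 'drugA - control' trt -1 1 0;
  /* Tests whether control differs from the average of the two drugs */
  contrast 'control vs. average of drugs' trt 2 -1 -1;
run; quit;
```

ESTIMATE reports the value of the linear function with its standard error and t test, while CONTRAST reports a sum of squares and an F test for the hypothesis that the function equals zero.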

PROC GLM PROC GLM can create an output data set containing the input data set in addition to predicted values, residuals, and other diagnostic measures. PROC GLM can be used interactively. After you specify and fit a model, you can execute a variety of statements without recomputing the model parameters or sums of squares. SAS/STAT(R) 9.22 User's Guide

PROC GLM For analysis involving multiple dependent variables but not the MANOVA or REPEATED statements, a missing value in one dependent variable does not eliminate the observation from the analysis for other dependent variables. PROC GLM automatically groups together those variables that have the same pattern of missing values within the data set or within a BY group. This ensures that the analysis for each dependent variable brings into use all possible observations. SAS/STAT(R) 9.22 User's Guide

Estimable Functions SAS output often flags an estimate as "Non-est" (non-estimable).
What does this mean?

Estimability Generalized inverses are used to obtain solutions for effects in general linear models. There are many generalized inverses, so many different sets of solutions are possible. Estimable functions are unique and do not depend on the generalized inverse used to obtain the solutions. To analyze data properly, that is, to answer the hypothesis being tested, the scientist should know which functions of the model parameters are being estimated.

Estimability The hypothesis being tested is NOT about the absolute value for a level of a factor in the model. Usually we ask, or hypothesize, that two means are different or that some treatment differs from a control. Hence the differences are estimable functions, NOT the values (solutions) for the individual levels.
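As a worked illustration, consider a one-way model. This is a simplifying assumption, and the subscripts and the constant c are introduced here, not taken from the lecture. The sketch shows why a treatment difference is estimable while a single effect is not.

```latex
\begin{align*}
E(Y_{ij}) &= \mu + \alpha_i \\
\alpha_1 - \alpha_2 &= (\mu + \alpha_1) - (\mu + \alpha_2) = E(Y_{1j}) - E(Y_{2j}) \\
(\mu,\ \alpha_i) &\mapsto (\mu + c,\ \alpha_i - c)
  \quad\text{leaves every } E(Y_{ij}) \text{ unchanged, but changes } \alpha_1 \text{ to } \alpha_1 - c
\end{align*}
```

The data cannot distinguish between these parameter sets, so a single effect such as α1 has no unique value, whereas the difference α1 − α2 is the same under every solution.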

The General Linear Model
The main effects general linear model can be parameterized as Y_ij = µ + α_i + b_j + ε_ij, where Y_ij is the observation for the ith level of α and the jth level of b, µ is the overall mean (an unknown fixed parameter), α_i is the effect of the ith level of α (its deviation from µ), b_j is the effect of the jth level of b (its deviation from µ), and ε_ij is the experimental error, distributed N(0, σ²).

The General Linear Model
In matrix terminology, the general linear model may be expressed as Y = Xβ + ε, where Y is the observed data vector, X is the design matrix, β is a vector of unknown fixed-effect parameters, and ε is the vector of errors.
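The least squares solution for this matrix form can be sketched as follows. The symbol b for a solution vector, the superscript minus sign for a generalized inverse, and L for a coefficient vector are notational assumptions added here.

```latex
\begin{align*}
X'X\,b &= X'Y && \text{(normal equations)} \\
b &= (X'X)^{-}X'Y && \text{(one solution for each choice of generalized inverse)} \\
\widehat{L\beta} &= L\,b && \text{(identical for every solution } b \text{ when } L\beta \text{ is estimable)}
\end{align*}
```

When classification effects make X less than full column rank, X'X has no ordinary inverse, which is why a generalized inverse is needed and why only estimable functions have unique estimates.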

Programming the General Linear Model
In the GLM procedure, the OUTPUT statement saves the input data plus the residuals, predicted values, studentized residuals, and other diagnostics in a data set, here called resdat.
PROC GLM;
  class machine operator;
  Model yield=machine|operator;
  output out=resdat r=resid p=pred student=stdres rstudent=rstud cookd=cksd h=lev;
Run; Quit;

Assumptions of the general linear model
var(ε) = σ²I; var(Y) = σ²I; E(Y) = Xβ

Assumptions of the Linear Regression Model
Linear functional form; fixed independent variables; independent observations; representative sample and proper specification of the model (no omitted variables); normality of the residuals or errors; equality of variance of the errors (homogeneity of residual variance); no multicollinearity; no autocorrelation of the errors; no outlier distortion.

Explanation of the Assumptions
Linear functional form: a linear model does not detect curvilinear relationships.
Independent observations: the observations should be a representative sample from some larger population. If the observations are not independent, the resulting autocorrelation inflates the t, r, and F statistics, which in turn distorts the significance tests.
Normality of the residuals: permits proper significance testing, as in ANOVA and other statistical procedures.
Equal variance (no heterogeneous variance): heteroskedasticity precludes generalization and external validity, and it also distorts the significance tests being used.
Multicollinearity (many traits exhibit collinearity): makes the parameter estimates unstable by inflating their standard errors, and can prevent the analysis from running or converging (getting your answers).
Outliers: severe or several outliers will distort the results and may bias them. If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

SAS test for residual normality
Proc univariate data=resdat normal plot;
  var resid;
Run; Quit;

Graphically examining residuals for homogeneity
Proc gplot data=resdat;
  plot resid * pred;
Run; Quit;
Examine the plot for a lack of pattern.

Testing for outliers
Proc freq data=resdat;
  tables stdres cksd;
Run; Quit;
1. Look for standardized residuals greater than 3.5 or less than -3.5.
2. Look for a high Cook's D (greater than 4*p/(n-p-1)), where n is the number of observations and p is the number of model parameters.
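Rather than reading through frequency tables, the two screening rules can be applied directly in a DATA step. This is a minimal sketch that assumes the resdat data set created by the OUTPUT statement earlier; the macro variables n and p and their values are hypothetical placeholders to be replaced with the values for your own model.

```sas
%let n = 24;   /* hypothetical: number of observations */
%let p = 4;    /* hypothetical: number of model parameters */

data flagged;
  set resdat;
  /* Cook's D cutoff as given in the lecture: 4*p/(n-p-1) */
  cutoff = 4 * &p / (&n - &p - 1);
  big_resid = (abs(stdres) > 3.5);   /* 1 if |studentized residual| exceeds 3.5 */
  big_cookd = (cksd > cutoff);       /* 1 if Cook's D exceeds the cutoff */
  if big_resid or big_cookd;         /* keep only the flagged observations */
run;

proc print data=flagged;
  var resid pred stdres cksd cutoff big_resid big_cookd;
run;
```

An empty listing means no observation exceeded either threshold.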

Class Statement Variables listed in the CLASS statement are referred to as class variables. The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis. Class variables represent the levels of some factor or effect, for example: treatment (1, ..., n); season (spring, summer, fall, and winter, coded 1 through 4); breed; color; sex; line; day; laboratory.

Evaluating outliers 1. Check coding to spot typos. 2. Correct typos.
3. If an outlying observation is correct, examine the DFFITS statistic to determine how much influence the outlier has on the fit. This shows the standardized influence of the observation on the fit. If the outlier's influence is severe, consider removing it by setting it to a missing observation ( . ).
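One way to carry out step 3 is sketched below, assuming the machine-by-operator model used earlier. The data set name mydata, the output names infl and dfits, and the observation number 17 are hypothetical and only illustrate the mechanics.

```sas
/* Save DFFITS for each observation with the OUTPUT statement */
proc glm data=mydata;            /* mydata: hypothetical input data set */
  class machine operator;
  model yield = machine|operator;
  output out=infl dffits=dfits;  /* dfits holds the DFFITS statistic */
run; quit;

/* After reviewing infl, set the response to missing for a chosen observation.
   Row 17 is a hypothetical choice, not a recommendation. */
data cleaned;
  set mydata;
  if _n_ = 17 then yield = .;
run;
```

Setting the response to missing keeps the record in the data set while excluding it from the fit, so the decision stays visible and reversible.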

Getting started with GLM

PROC GLM Syntax
PROC GLM <options> ;
  CLASS variables </ option> ;
  MODEL dependent-variables=independent-effects </ options> ;

Positional Requirements for PROC GLM Statements
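ABSORB: must precede the first RUN statement
BY: must precede the first RUN statement
CLASS: must precede the MODEL statement
CONTRAST: must precede the MANOVA, REPEATED, or RANDOM statement; must follow the MODEL statement
ESTIMATE: must follow the MODEL statement
FREQ: must precede the first RUN statement
ID: must precede the first RUN statement
LSMEANS: must follow the MODEL statement
MANOVA: must follow the CONTRAST or MODEL statement
MEANS: must follow the MODEL statement
MODEL: must precede the CONTRAST, ESTIMATE, LSMEANS, or MEANS statement; must follow the CLASS statement
OUTPUT: must follow the MODEL statement
RANDOM: must follow the CONTRAST or MODEL statement
REPEATED: must follow the CONTRAST, MODEL, or TEST statement
TEST: must precede the MANOVA or REPEATED statement; must follow the MODEL statement
WEIGHT: must precede the first RUN statement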

Statements in the GLM Procedure
ABSORB: absorbs classification effects in a model
BY: specifies variables to define subgroups for the analysis
CLASS: declares classification variables
CONTRAST: constructs and tests linear functions of the parameters
ESTIMATE: estimates linear functions of the parameters
FREQ: specifies a frequency variable
ID: identifies observations on output
LSMEANS: computes least squares (marginal) means
MANOVA: performs a multivariate analysis of variance
MEANS: computes and optionally compares arithmetic means
MODEL: defines the model to be fit
OUTPUT: requests an output data set containing diagnostics for each observation
RANDOM: declares certain effects to be random and computes expected mean squares
REPEATED: performs multivariate and univariate repeated measures analysis of variance
STORE: requests that the procedure save the context and results of the statistical analysis into an item store
TEST: constructs tests that use the sums of squares for effects and the error term you specify
WEIGHT: specifies a variable for weighting observations

Class Variables Class variables are usually things you would like to account for in your model. They can be numeric or character, and they can even hold continuous values. They are generally not used in regression analyses: what meaning would they have there?

Class Statement Options
ASCENDING sorts the class variable in ascending order. DESCENDING sorts the class variable in descending order. Other CLASS statement options are generally specific to the procedure (PROC) being used, so they are not all covered here.

Discrete Variables A discrete variable is one that cannot take on all values within the limits of the variable; it is often limited to whole numbers. For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5; the variable cannot have the value 1.7. By contrast, a variable such as a person's height can take on any value. Discrete variables are of two types: unorderable (also called nominal) and orderable (also called ordinal).

Discrete Variables Such data are sometimes called categorical, because the observations fall into one of a number of categories. Examples include any trait where you score the value: lameness scores; body condition scores; soundness scoring (reproductive, feet and leg); behavioral traits (fear test, back test, vocal scores); body lesion scores.

Discrete Variables When, if ever, do discrete variables become continuous? Is a trait like number born alive considered discrete or continuous?

Example Variables Data:
The dependent variable (what is being measured) is aerial biomass and there are five substrate measurements: (These are the independent variables) Salinity, Acidity, Potassium, Sodium, and Zinc.

Covariates A covariate is an independent variable that contributes variation to the dependent variable of interest. The researcher wants to account for the covariate differences that occur for each observation. A covariate may be of direct interest, or it may be a confounding or interacting type of variable.

Covariates Examples: weight of animal at measurement; age of animal at measurement; age of animal at weaning; parity of sow for number born alive and weaning weight; days of lactation for milk weight.

Covariates A covariate may influence the dependent variable in the following ways: linear covariate; quadratic covariate; cubic covariate.

Covariates Check to be sure your covariate is significant
If the linear is significant, test the quadratic If the linear and quadratic are significant sources of variation test the cubic How do you do that?

Covariates How do you do that?
Linear: include the variable name in the model without listing it in the CLASS statement. Example: weight. Quadratic: include the variable as weight*weight. Cubic: include the variable as weight*weight*weight.
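Put together, a model with linear, quadratic, and cubic covariate terms might look like the sketch below. The data set pigs and the variables breed and backfat are hypothetical names chosen for illustration; only the weight terms come from the lecture.

```sas
proc glm data=pigs;      /* pigs: hypothetical data set */
  class breed;           /* breed is a classification variable */
  /* weight is NOT listed in the CLASS statement, so it is fitted as a
     continuous covariate; SS1 requests sequential (Type I) tests */
  model backfat = breed weight weight*weight weight*weight*weight / ss1 solution;
run; quit;
```

With Type I sums of squares, each polynomial term is tested after the terms listed before it, which matches the strategy of testing the linear term first, then the quadratic, then the cubic.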

Covariates A covariate may influence the dependent variable in the following ways. Linear covariate: the independent covariate affects the dependent variable in a linear manner. Quadratic covariate: the independent covariate affects the dependent variable in a linear and quadratic manner, which indicates there is a turning point (and only one). Cubic covariate: the independent covariate affects the dependent variable in a linear, quadratic, and cubic manner, which indicates there can be two turning points.

Covariates A covariate may influence the dependent variable in the following ways. Linear covariate: the independent covariate affects the dependent variable in a linear manner. The dependent variable increases or decreases at a constant rate.

Covariates A covariate may influence the dependent variable in the following ways. Quadratic covariate: the independent covariate affects the dependent variable in a linear and quadratic manner, which indicates there is a turning point (and only one). The rate of change is not constant: the dependent variable may increase at an increasing or a decreasing rate, or decrease at an increasing or a decreasing rate. There can also be a directional change: up to some point the dependent variable increases and after that point it decreases, or vice versa.

Covariates Cubic covariate
The independent covariate affects the dependent variable in a linear, quadratic, and cubic manner, which indicates there can be two turning points. This is essentially the same as the quadratic case, but the changes can occur at an additional point. The dependent variable may increase at an increasing or a decreasing rate, or decrease at an increasing or a decreasing rate, and the direction of the response can change twice: for example, the dependent variable increases to some point, decreases to a second point, and then increases again, or vice versa.
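The three covariate forms can be written as polynomials in the covariate x. The coefficient names b0 through b3 are notation assumed here for illustration.

```latex
\begin{align*}
\text{linear:}\quad & y = b_0 + b_1 x \\
\text{quadratic:}\quad & y = b_0 + b_1 x + b_2 x^2,
  \qquad \frac{dy}{dx} = 0 \ \text{at}\ x = -\frac{b_1}{2 b_2} \\
\text{cubic:}\quad & y = b_0 + b_1 x + b_2 x^2 + b_3 x^3,
  \qquad \frac{dy}{dx} = b_1 + 2 b_2 x + 3 b_3 x^2 = 0 \ \text{has at most two real roots}
\end{align*}
```

The slope is constant for the linear form, changes sign at most once for the quadratic form, and changes sign at most twice for the cubic form.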

Model Development and Selection of Variables
Example: The general problem addressed is to identify important soil characteristics influencing aerial biomass production of marsh grass, Spartina alterniflora.

Example Data Origination (Dr. P. J. Berger)
Data: The data were published as an exercise by Rawlings (1988) and originally appeared as a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass, Spartina alterniflora in the Cape Fear Estuary of North Carolina. The design for collecting data was such that there were three types of Spartina vegetation, in each of three locations, and five random sites within each location vegetation type.

Example Data Or, Objective:
Find the substrate variable, or combination of variables, showing the strongest relationship to biomass. Or, From the list of five independent variables of salinity, acidity, potassium, sodium, and zinc, find the combination of one or more variables that has the strongest relationship with aerial biomass. Find the independent variables that can be used to predict aerial biomass.

Example Data Class vegetative_type location site; Model
Recall that 3 vegetative types were evaluated, tests occurred at 3 locations, and there were 5 sites within each location. Model Biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc;

Example Data Model Biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc; Assuming each linear effect was significant, you would then need to examine the higher-order terms: salinity*salinity, salinity*salinity*salinity, acidity*acidity, acidity*acidity*acidity, etc.
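As a sketch, the model above could be submitted as a complete step like this. The data set name marsh is hypothetical, and the single quadratic term is shown only as an example of adding one higher-order term at a time.

```sas
proc glm data=marsh;     /* marsh: hypothetical data set name */
  class vegetative_type location site;
  model biomass = vegetative_type location site(location)
                  vegetative_type*location
                  salinity acidity potassium sodium zinc
                  salinity*salinity;   /* quadratic term for salinity */
run; quit;
```

The five substrate measurements are left out of the CLASS statement, so each is fitted as a continuous covariate.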

PROC GLM Example Strawberry yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
  LSMEANS fertiliz variety;
Run; Quit;
The SOLUTION option is useful for showing the relative effect sizes.

PROC GLM Example Output
General Linear Models Procedure
Class Level Information
Class FERTILIZ: values K N
Class VARIETY: values Red Sweet
Number of observations in data set = 24
This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.

Dependent Variable: YIELD
Columns: Source; DF; Sum of Squares; Mean Square; F Value; Pr > F
Rows: Model; Error; Corrected Total
Summary statistics: R-Square; C.V.; Root MSE; YIELD Mean
This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value that tests whether any of the terms in the model are significant. The C.V. (coefficient of variation) is (Root MSE / mean yield) x 100%. R-Square is the model sum of squares divided by the total sum of squares. It is commonly used to evaluate how well the model fits the data, but it should not be the only criterion of fit that you examine.
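The summary statistics in this part of the output follow from the ANOVA table entries. The subscripted SS and DF symbols below are shorthand assumed here for the table columns.

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Corrected Total}}} \\
\text{Root MSE} &= \sqrt{\frac{SS_{\text{Error}}}{DF_{\text{Error}}}} \\
\text{C.V.} &= 100\% \times \frac{\text{Root MSE}}{\overline{\text{YIELD}}} \\
F &= \frac{SS_{\text{Model}} / DF_{\text{Model}}}{SS_{\text{Error}} / DF_{\text{Error}}}
\end{align*}
```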

Columns: Source; DF; Type I SS; Mean Square; F Value; Pr > F
Rows: FERTILIZ; VARIETY; FERT*VAR
Columns: Source; DF; Type III SS; Mean Square; F Value; Pr > F
Rows: FERTILIZ; VARIETY; FERT*VAR
SAS presents Type I and Type III sums of squares and F statistics for their significance under a particular set of assumptions; namely, that fertilizer and variety should be modeled with fixed effects, and that the random error terms satisfy their requirements. The F test statistics shown here are not always the proper results to interpret! This depends on the design of the experiment.

The Type I sums of squares are also called sequential sums of squares. Here, they test:
1. Whether fertilizer is a significant predictor.
2. Whether variety is significant when considered in addition to fertilizer.
3. Whether the interaction is significant when considered in addition to both fertilizer and variety.
The Type III sums of squares are also called partial sums of squares. Here, they test:
1. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for fertilizers to be different from each other?
2. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for varieties to be different from each other?
3. Knowing that fertilizers and varieties could be different from each other, is the difference between fertilizers the same for both varieties?

Because the experiment is balanced, both Type I and Type III sums of squares are identical. Usually, the Type III sums of squares are used for inference, although the Type I sums of squares are used in specific situations. SAS can calculate Type II and Type IV sums of squares as well.
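All four types can be requested with MODEL statement options. This is a minimal sketch using the berry data set from the example above.

```sas
proc glm data=berry;
  class fertiliz variety;
  /* SS1 to SS4 request Type I, II, III, and IV sums of squares */
  model yield = fertiliz variety fertiliz*variety / ss1 ss2 ss3 ss4;
run; quit;
```

When none of these options is given, PROC GLM prints Type I and Type III by default, as in the output shown above.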

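Solution option used after the model statement (i.e. / solution;)
Columns: Parameter; Estimate; T for H0: Parameter=0; Pr > |T|; Std Error of Estimate
INTERCEPT: estimate 9.13 B; T = 66.75; Pr > |T| = 0.001; std error 0.137
FERTILIZ K: estimate -0.30 B; T = -1.55; std error 0.194
FERTILIZ N: estimate 0.00 B; T = .
VARIETY Red: B
VARIETY Sweet
FERT*VAR K Red: estimate 0.10 B; T = 0.37; Pr > |T| = 0.719; std error 0.274
FERT*VAR K Sweet
FERT*VAR N Red
(The remaining values of this table did not survive in the transcript.)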

There are many ways to estimate effects in a linear model with categorical predictors (fixed effects). SAS does so by alphabetizing the levels of each factor, then assigning an effect size of zero to the last alphabetically ordered level of each factor and its interactions. To predict the response for, say, Fertilizer K for the Red variety, use the equation (Intercept) + (K effect) + (Red effect) + (K*Red interaction effect), which here gives 8.60. The t-test values listed on the right can be used to test whether certain parameters are significantly different from zero; in this case, they compare the levels of each factor to the last alphabetically ordered level (which is forced to be zero). The SOLUTION option is useful for determining how treatment effects can be contrasted or estimated within PROC GLM.
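The same prediction can be requested from PROC GLM instead of being added up by hand. This sketch assumes the berry data set and the level ordering described above (K before N, Red before Sweet); the ESTIMATE label text is an arbitrary choice.

```sas
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / solution;
  /* Interaction cells are ordered K Red, K Sweet, N Red, N Sweet */
  estimate 'K, Red cell mean' intercept 1
                              fertiliz 1 0
                              variety 1 0
                              fertiliz*variety 1 0 0 0;
  /* Least squares means for every fertilizer-by-variety combination */
  lsmeans fertiliz*variety;
run; quit;
```

The ESTIMATE line spells out the four-term sum explicitly, while LSMEANS on the interaction reports the corresponding value for all four cells at once.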

PROC GLM Example Examining the Error values
An analysis of a general linear model should include a check of the assumptions about the random error terms. To do this in PROC GLM, you must use an OUTPUT statement. The following statements show how to produce a residual plot for the model above.

PROC GLM Example Examining the Error values
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
  OUTPUT OUT=results P=pred R=resid;
RUN; QUIT;
PROC PLOT DATA=results;
  PLOT resid*pred;
RUN; QUIT;
