Single Factor Analysis of Variance

Topic 8 – One-Way ANOVA
Reading: 17.1, 17.2, & 17.5
Skim: 12.3, 17.3, 17.4

Overview
- Categorical Variables (Factors)
- Fixed vs. Random Effects
- Review: Two-sample T-test
- ANOVA as a generalization of the two-sample T-test
- Cell-Means and Factor-Effects ANOVA Models (same model, different form)

Terminology: Factors & Levels
The term factor is generally used to refer to a categorical predictor variable.
- Blood Type
- Gender
- Drug Treatment
- Other Examples?
The term levels is used to refer to the specific categories for a factor.
- A / B / AB / O (could also consider +/-)
- Male / Female

Factors: Fixed or Random?
A factor is fixed if the levels under consideration are the only ones of interest. The levels of the factor are selected by a non-random process AND are the only levels of interest. For the time being, all factors that we will consider will be fixed. Examples?

Factors: Fixed or Random? (2)
A factor is random if the levels under consideration may be regarded as a sample from a larger population. Not all levels of interest are included in the study; only a random sample is. We want inferences to be applicable to the entire (larger) population of levels. Examples? Analysis is a little more complicated; we’ll save this topic for near the end of the course.

Example: Random or Fixed?
To study the effect of diet on cattle, an experimenter randomly (and equally) allocates 50 cows to 5 diets (a control and 4 experimental diets). After 1 year, the cows are butchered and the amount of good meat (in pounds) is measured.
Response = ______________
Cow = _______ Factor
Diet = _______ Factor

Notation
In general, we label our factors A, B, C, etc.
- Factor A has levels i = 1, 2, 3, ..., a
- Factor B has levels j = 1, 2, 3, ..., b
- Factor C has levels k = 1, 2, 3, ..., c
More on notation later; remember that for now we are considering single factor ANOVA, so we will have only a “Factor A”.

Comparing Groups
Suppose I want to compare heights between men and women. How would I do this?

Notation for Two-Sample Settings
Suppose an SRS (simple random sample) of size n1 is selected from the 1st population, and another SRS of size n2 is selected from the 2nd population.

Population   Sample size   Sample mean   Sample standard deviation
1            n1            $\bar{Y}_1$   s1
2            n2            $\bar{Y}_2$   s2

Estimating Differences
A natural estimator of the difference $\mu_1 - \mu_2$ is the difference between the sample means, $\bar{Y}_1 - \bar{Y}_2$. If we assume that both populations are normally distributed (or the CLT applies), then both sample means and their difference will be normally distributed as well. Because we are estimating standard deviations, a confidence interval for the difference in means uses the T-distribution.

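CI for Difference
If variances are unknown (and assumed equal, as the degrees of freedom below imply), then a 95% confidence interval for the difference in means is given by
$$(\bar{Y}_1 - \bar{Y}_2) \pm t^* \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$$
where $s_p$ is the pooled standard deviation. The critical value is $t^* = t_{0.975,\, n_1 + n_2 - 2}$. The degrees of freedom is n1 + n2 – 2.

Test for Difference = 0
Can also be viewed as a hypothesis test of $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$.
Test statistic for testing whether the difference is zero:
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Compare to the critical value used in the CI.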

Conclusions
- If the test statistic is of larger magnitude (ignore sign) than the critical value, we reject the hypothesis.
  - There is a significant difference between the two groups.
  - The same conclusion results if the CI doesn’t contain zero.
- If the statistic is smaller (CI does contain zero), we fail to reject the hypothesis.
  - We fail to show a difference between the two groups.
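The two-sample review can be sketched in a few lines of Python. This is a minimal illustration, assuming the pooled-variance form that goes with n1 + n2 – 2 degrees of freedom; the function name `pooled_two_sample`, the height values, and the choice to pass the critical value in from a t table are all assumptions made for the example, not part of the course material.

```python
from math import sqrt
from statistics import mean, variance

def pooled_two_sample(y1, y2, t_crit):
    """Pooled-variance comparison of two independent samples.

    t_crit is the two-sided critical value for n1 + n2 - 2 degrees of
    freedom, looked up from a t table by the caller.
    """
    n1, n2 = len(y1), len(y2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / df
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = mean(y1) - mean(y2)
    t_stat = diff / se
    ci = (diff - t_crit * se, diff + t_crit * se)
    return {"diff": diff, "df": df, "t": t_stat, "ci": ci,
            "reject": abs(t_stat) > t_crit}

# Hypothetical heights in inches (made up for illustration only).
men = [70, 72, 68, 71, 69]
women = [64, 66, 65, 63, 67]
result = pooled_two_sample(men, women, t_crit=2.306)  # t(0.975, df = 8)
print(result)
```

Rejecting when |t| exceeds the critical value and finding that the interval excludes zero are the same decision, which is why the function reports both from one standard error.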

Comparison of Several Groups
Suppose instead of two groups, we have “a” groups that we wish to compare (where a > 2). Note: In Chapter 17, the textbook defines the number of groups as “k”. Remember this is just a letter, and the letter we use really has nothing to do with anything in particular. So I’m using a to correspond (consistently) to Factor A.

Multiple treatment model
With a groups (treatments), we could do a(a – 1)/2 pairwise two-sample t-tests. But...
- This does not test the equality of all means at once.
- Multiple tests mean we have a greater chance of making Type I errors (a Bonferroni correction can get expensive because of the large number of tests).
- We usually expect variances to be the same across groups, but it isn’t clear how we should estimate variance with more than two samples.
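To see how quickly the pairwise approach gets expensive, the short sketch below counts the two-sample tests needed for a groups and the per-test level a Bonferroni correction would require. The function name `pairwise_burden` is made up for the example, and the uncorrected familywise rate is computed as if the tests were independent, which is only a rough guide since pairwise tests share data.

```python
from math import comb

def pairwise_burden(a, alpha=0.05):
    """Count pairwise tests among a groups and summarize error rates."""
    m = comb(a, 2)                      # number of two-sample t-tests
    bonferroni_alpha = alpha / m        # per-test level under Bonferroni
    # Familywise error rate if the m tests were independent and each run
    # at level alpha (they are not independent here; rough guide only).
    familywise = 1 - (1 - alpha) ** m
    return m, bonferroni_alpha, familywise

for a in (3, 5, 10):
    m, per_test, fwer = pairwise_burden(a)
    print(f"a={a:2d}: {m:2d} tests, Bonferroni level {per_test:.4f}, "
          f"uncorrected familywise rate about {fwer:.2f}")
```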

Multiple treatment model (2)
Analysis of Variance (ANOVA) models provide a more efficient way to compare multiple groups. For example, in a single factor ANOVA:
- The Model (or ANOVA) F-test will test the equality of all group means at the same time.
- There are methods of doing pairwise comparisons that are much more efficient than Bonferroni.
- All observations (from all groups) are used to estimate the overall variance (by MSE).

Three Ways to View ANOVA
1. Views observations in terms of their group means: the cell means model.
2. Views observations as the sum of an overall mean and a deviation from that mean related to the particular group to which the observation belongs: the factor effects model.
3. As regression, using indicator variables.

ANOVA Model: Cell Means Model

ANOVA
ANOVA is generally viewed as an extension of the T-test, but used for comparisons of three or more population means.
- These populations are denoted by the levels of our factor.
- There is only one variable, but it has 3+ levels or groups.
- Hence we call the means of these levels factor level means or simply cell means.

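Cell Means Model
Basic ANOVA Model is:
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \quad \text{where } \varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$$
Notation:
- “i” subscript indicates the level of the factor (i = 1, ..., a)
- “j” subscript indicates observation number within the group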

Cell Sizes
For the time being, we will assume that all the cell sizes are the same: $n_1 = n_2 = \dots = n_a = n$. The total sample size will be denoted $N = an$.

Assumptions for fixed effects
- Random samples have been selected for each level of the factor.
- All observations are independent.
- Response variable is normally distributed for each population (level) and the population variances are the same.
Hence: independence, normality, and constant variance. What happened to linearity?

Robustness
ANOVA procedures are generally robust to minor departures from the assumptions (i.e., minor deviations from the assumptions will not affect the performance of the procedure). For major departures, transformations of the response variable [e.g., Log(Y)] may help. Transforming the factor (i.e., the predictor) in ANOVA doesn’t help because it’s categorical.

Components of Variation
- Variation between groups gets “explained” by allowing the groups to have different means. We know this as SSM, SSR, or now SSA!
- Variation within groups is unexplained. We know this as SSE (it stays the same).
- The ratio F = MSM / MSE (or F = MSA / MSE) forms the basis for testing the hypothesis that all group means are the same.

Variation: Between vs. Within
A convenient way to view the SS:
- SSA is called the “between” SS because it represents variation between the different groups. It is determined by the squared differences between group means and the grand (overall) mean.
- SSE is called the “within” SS because it represents variation within groups. It is determined by the squared differences of observations from their group means.

Quick Comment on Notation
- DOT indicates “sum”
- BAR indicates “average” or “divide by cell/sample size”
- $\bar{Y}_{\cdot\cdot}$ is the mean for all observations
- $\bar{Y}_{i\cdot}$ is the mean for the observations in Level i of Factor A.

Pictorial Representation
[Figure: observations shown by group; panels Group 1, Group 2, Group 3]

SS Breakdown (Algebraic)
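Break down the difference between an observation and the grand mean into two parts:
$$Y_{ij} - \bar{Y}_{\cdot\cdot} = \underbrace{(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})}_{\text{between groups}} + \underbrace{(Y_{ij} - \bar{Y}_{i\cdot})}_{\text{within groups}}$$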

Components of Variation (2)
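Of course the individual components would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have:
$$\underbrace{\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2}_{\text{SST}} = \underbrace{\sum_{i}\sum_{j} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}_{\text{between groups (SSA)}} + \underbrace{\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_{i\cdot})^2}_{\text{within groups (SSE)}}$$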
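The between/within breakdown is easy to check numerically. The sketch below computes the three sums of squares directly from their definitions and confirms that the total equals between plus within. The helper name `sums_of_squares` and the data are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean

def sums_of_squares(groups):
    """Return (SSA, SSE, SST) for a dict of {level: [observations]}."""
    all_obs = [y for obs in groups.values() for y in obs]
    grand_mean = mean(all_obs)
    # Between groups: squared distance of each group mean from the grand
    # mean, counted once per observation in the group.
    ssa = sum(len(obs) * (mean(obs) - grand_mean) ** 2
              for obs in groups.values())
    # Within groups: squared distance of each observation from its own
    # group mean.
    sse = sum((y - mean(obs)) ** 2
              for obs in groups.values() for y in obs)
    # Total: squared distance of each observation from the grand mean.
    sst = sum((y - grand_mean) ** 2 for y in all_obs)
    return ssa, sse, sst

# Hypothetical balanced data (made up; not the data from these slides).
data = {"g1": [4, 5, 6], "g2": [7, 8, 9], "g3": [10, 12, 11]}
ssa, sse, sst = sums_of_squares(data)
print(ssa, sse, sst)
```

Because the function weights each group by its own size, it also works for unbalanced data, although the slides only consider the balanced case for now.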

ANOVA Table

Source     SS    df      MS    F
Factor A   SSA   a – 1   MSA   MSA / MSE
Error      SSE   N – a   MSE
Total      SST   N – 1

Model F Test (Cell Means)
Null Hypothesis: $H_0: \mu_1 = \mu_2 = \dots = \mu_a$
Alternative Hypothesis: $H_a$: not all of the $\mu_i$ are equal

Conclusion
- If we reject the null hypothesis, we have shown differences between groups (levels).
  - Remember it does not tell us which groups are different, only that at least one group is different from at least one other group!
- If we fail to reject the null hypothesis, we have failed to show any significant differences with the ANOVA F test.
  - Unfortunately, sometimes if we look a little closer (we’ll do this later) we still might find some differences!

Calculations: A Brief Look
We’ll consider these for only a balanced design (cell sizes all the same n). The purpose in doing this is not that you memorize formulas, but that you further your conceptual understanding of the sums of squares.

SS Calculations (Balanced)
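For reference, the usual balanced-design forms of the sums of squares are written out below in dot/bar notation (Y-bar-dot-dot for the grand mean, Y-bar-i-dot for the mean of level i). These are the standard textbook expressions with n observations in each of the a cells; the exact layout of the original slide is an assumption.

```latex
\begin{aligned}
\mathrm{SSA} &= n \sum_{i=1}^{a} \left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2 \\
\mathrm{SSE} &= \sum_{i=1}^{a} \sum_{j=1}^{n} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2 \\
\mathrm{SST} &= \sum_{i=1}^{a} \sum_{j=1}^{n} \left(Y_{ij} - \bar{Y}_{\cdot\cdot}\right)^2
             = \mathrm{SSA} + \mathrm{SSE}
\end{aligned}
```

The factor n in SSA appears because each group mean stands in for n observations, so its squared deviation is counted n times.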

Blood Type Example (1)
Suppose we have 3 observations of a certain response variable for each blood type. We want to construct the ANOVA table.

Blood Type Example (2)
We can compute the sample means using SAS:

Blood Type Example (3)
SSA (Between)
At this point, we have a choice: to calculate SSE or SST.

Blood Type Example (4)

Blood Type Example (5)
DF: 4 – 1 = 3 for Factor A
DF: N – 1 = 11 for Total
DF: 11 – 3 = 8 for Error
Mean Squares: MSA = 231.00 / 3 = 77.00; MSE = 16.67 / 8 = 2.084

Blood Type Example (6)
ANOVA Table

Source    SS       df   MS      F
Between   231.00   3    77.00   36.95
Within    16.67    8    2.084
Total     247.67   11

F-test is significant, and so we conclude that there is some difference among the means (we just don’t know exactly which means are different).
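The remaining columns of the table follow mechanically from the two sums of squares. As a quick check, the sketch below recomputes the degrees of freedom, mean squares, and F from the SS values in the blood type table; the function name `anova_table` is made up for the example.

```python
def anova_table(ssa, sse, a, n_total):
    """Fill in df, mean squares, and F from the two sums of squares."""
    df_a, df_e = a - 1, n_total - a
    msa, mse = ssa / df_a, sse / df_e
    return {"df_A": df_a, "df_E": df_e, "df_T": n_total - 1,
            "SST": ssa + sse, "MSA": msa, "MSE": mse, "F": msa / mse}

# SS values taken from the blood type ANOVA table (a = 4 levels, N = 12).
table = anova_table(ssa=231.00, sse=16.67, a=4, n_total=12)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

Judging significance still requires comparing F to an F distribution with 3 and 8 degrees of freedom, which SAS reports as the p-value.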

SAS Coding
Will use PROC GLM with an important addition: the CLASS statement.
- The CLASS statement identifies categorical variables for SAS.
- Note that failure to use the CLASS statement for a categorical variable will result in:
  - SYNTAX ERROR if it is a character variable
  - INAPPROPRIATE ANALYSIS if the class levels are numeric

Blood Type Example (SAS)

Residual Diagnostics
Very similar to what we did in regression:
- Normality plot is the same; keep in mind that most of the tests in ANOVA are robust to minor violations of normality (thanks to the CLT).
- In the constant variance plot, we still may see a megaphone shape in RESID vs. PRED if non-constant variance is a problem.
- In plots against the factor levels (commonly used), we would simply see differing vertical spreads (not a megaphone, because generally the labels on the horizontal axis are not “ordered”).
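In a one-way ANOVA the predicted value for every observation is its group mean, so the residuals are just deviations from the group means. The sketch below computes them and prints a per-level standard deviation as a rough numerical stand-in for the "differing vertical spreads" one would look for in a plot. The helper name and the data are hypothetical.

```python
from statistics import mean, stdev

def residuals_by_group(groups):
    """Residuals e_ij = Y_ij - (group mean) for {level: [observations]}."""
    return {level: [y - mean(obs) for y in obs]
            for level, obs in groups.items()}

# Hypothetical data (made up for illustration).
data = {"g1": [4, 5, 6], "g2": [7, 8, 9], "g3": [10, 14, 12]}
resid = residuals_by_group(data)
for level, e in resid.items():
    # Similar spreads across levels are consistent with constant variance.
    print(level, e, f"sd={stdev(e):.2f}")
```

With only three observations per level, these standard deviations are very noisy, so they complement the plots rather than replace them.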

Blood Type (QQ Plot)

Blood Type (Residual Plot)

Model Estimates
In SAS, using /solution as an option in the MODEL statement of PROC GLM, we can get the parameter estimates for our model. Unfortunately these are not the cell means!

Cell or Group Means
To get each cell mean, compute the sample means directly or just add the intercept to each parameter estimate.

Model Estimates
The reason for this is that there are infinitely many ways to write down the model for ANOVA. SAS tells us this by saying ALL estimates are “biased”. So what is SAS actually doing?

ANOVA Model: Factor Effects Model (Another convenient view)

A simple example Three groups: Grand Mean  

Factor Effects Model
As an alternative to viewing each observation as a deviation from the cell mean, we may consider observations as deviations from the grand (or overall) mean. Part of that deviation is explained by the cell (or group). We call that part $\alpha_i$, or factor level effects. We essentially break $\mu_i$ from the cell-means model into two pieces: $\mu_i = \mu + \alpha_i$.

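Factor Effects Model
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
- $\mu$ is the grand (or overall) mean.
- $\alpha_i$ is the ith treatment effect (difference between group mean and $\mu$).
- $\varepsilon_{ij}$ is the error component.
- $\mu_i = \mu + \alpha_i$ is the ith treatment mean.
- Restriction $\sum_{i=1}^{a} \alpha_i = 0$ is made.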

Why the Restriction?
- Note that estimating $\mu, \alpha_1, \dots, \alpha_a$ would require one more estimate than in the cell means model.
- So for the models to be identical, we must add a constraint.
- Convenient: $\sum_{i=1}^{a} \alpha_i = 0$ makes $\mu$ the grand (or overall) mean.
- What exactly does SAS do?

Restriction made by SAS
Last level (alphabetically!!!) is set to ZERO. This means the intercept (estimate for $\mu$) will represent the mean for the “last” group. So they are not exactly the factor effects, but can we recover factor effects from this?

Estimating Factor Effects
We previously calculated the cell means (this is the first step):

Estimating Factor Effects (2)
The overall mean will be the weighted average of the group means (in this case, it’s a straightforward average since the cell sizes are identical):

Estimating Factor Effects (3)
The factor effects are the differences between the group and overall means: $\hat{\alpha}_i = \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$. Note: Sum of these is ZERO always.
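The two parameterizations can be compared side by side. The sketch below starts from a set of cell means and produces (1) the sum-to-zero factor effects and (2) estimates with the alphabetically last level set to zero, in the style described for SAS. The cell means are hypothetical and the function names are made up; a balanced design is assumed so that the overall mean is the plain average of the cell means.

```python
from statistics import mean

def factor_effects(cell_means):
    """Sum-to-zero effects: alpha_i = (cell mean) - (overall mean).

    Assumes a balanced design, so the overall mean is the plain average
    of the cell means.
    """
    mu = mean(list(cell_means.values()))
    return mu, {level: m - mu for level, m in cell_means.items()}

def last_level_zero(cell_means):
    """Parameterization with the alphabetically last level set to zero.

    The intercept is that level's mean; every other estimate is a
    difference from it.
    """
    base = max(cell_means)              # alphabetically last label
    intercept = cell_means[base]
    return intercept, {level: m - intercept
                       for level, m in cell_means.items()}

# Hypothetical cell means (made up; not the values from these slides).
cell_means = {"A": 10.0, "AB": 14.0, "B": 12.0, "O": 16.0}
mu, alpha = factor_effects(cell_means)
intercept, est = last_level_zero(cell_means)
print(mu, alpha)        # the effects sum to zero
print(intercept, est)   # level "O" has estimate 0
```

Either set of numbers reproduces the same cell means (`mu + alpha[i]` or `intercept + est[i]`), which is the sense in which the two are the same model written in different forms.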

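Estimates / Tests
Alphas are estimated by $\hat{\alpha}_i = \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$.
For the model F test: Testing the hypothesis that all the means are the same is equivalent to testing $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_a = 0$ against the alternative $H_a$: at least one $\alpha_i \neq 0$.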

ANOVA as REGRESSION
We’ll look at this only briefly, as in practice we don’t generally view ANOVA in this way. But SAS does! So part of the context here is to help us understand (eventually) how ANOVA models work in SAS.

Dummy Variables
When we view ANOVA as a regression model, we do so using dummy variables. We’ve already seen such a variable and even used it in some examples where we had only two possible categories:
- Smoking Status (Yes = 1, No = 0)
- Gender (Male = 1, Female = 0)

What is a Dummy Variable?
The most important thing about dummy variables is that the numeric value has no meaning beyond defining the category. We could, for example, take (No = 1, Yes = 0) or (Female = 1, Male = 0) on the previous slide. Additionally, we could use (Yes = 1, No = -1) without changing the flavor of the results. (the meaning of your parameter estimates would change, but the final interpretations would remain the same)

Extension to Many Groups
If my categorical factor has a levels, then I will need a – 1 dummy variables to represent the factor. Example: Blood Type (A, B, AB, O)
- X1 = 1 if blood type = A; else X1 = 0
- X2 = 1 if blood type = B; else X2 = 0
- X3 = 1 if blood type = AB; else X3 = 0

Degrees of Freedom
Recall our ANOVA model used a – 1 DF in the model (one fewer than the number of levels for the factor). Why? Because of these indicator variables. It takes a – 1 indicator variables to encompass our categorical variable. That’s a – 1 slope estimates, and hence a – 1 DF. In general, any categorical variable in your model will cost DF equal to the number of levels minus one.

Extension to Many Groups (2)
My “Regression” Model will be $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$. What do the parameters represent? What is being tested with the overall model F test?

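Blood Type Example
Model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
- $\beta_0$ is the true mean for blood type O.
- $\beta_0 + \beta_1$ is the true mean for type A.
- $\beta_0 + \beta_2$ is the true mean for type B.
- $\beta_0 + \beta_3$ is the true mean for type AB.
And here are some fairly natural estimates: $\hat{\beta}_0 = \bar{Y}_{O}$, $\hat{\beta}_1 = \bar{Y}_{A} - \bar{Y}_{O}$, $\hat{\beta}_2 = \bar{Y}_{B} - \bar{Y}_{O}$, $\hat{\beta}_3 = \bar{Y}_{AB} - \bar{Y}_{O}$.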

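Blood Type Example (2)
Standard errors for these estimates are also fairly intuitive, since in general the standard error for a mean is of the form $\sqrt{\mathrm{MSE}/n}$. For example, the standard error of the estimated mean for type O (the intercept) is $\sqrt{\mathrm{MSE}/3}$, because that mean is based on 3 observations.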

Blood Type Example (3)
How do we test hypotheses?
- H0: All means the same
- H0: Mean for Type AB = Mean for Type O
- H0: Mean for Type AB = Mean for Type A

Summary
- One level of our factor gets represented by the intercept.
- The slope estimates compare all other levels to that “base” level.
- We can compare any set of levels that we want using a general linear test.
- This is exactly what SAS does for any ANOVA! But the output in SAS will be in a different form to make the interpretations easier.
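The regression view can be checked with a small pure-Python least-squares fit. The sketch below codes blood type with three indicator variables (type O as the baseline), solves the normal equations, and shows that the intercept comes out as the baseline mean while each slope is a difference from it. The data, the helper names, and the use of a hand-written Gauss-Jordan solver are all assumptions made to keep the example self-contained; this is not how SAS is implemented.

```python
INDICATOR_LEVELS = ["A", "B", "AB"]   # X1, X2, X3; type O is the baseline

def design_row(blood_type):
    """Row of the design matrix: intercept plus a - 1 = 3 indicators."""
    return [1.0] + [1.0 if blood_type == lv else 0.0
                    for lv in INDICATOR_LEVELS]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def least_squares(types, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    X = [design_row(t) for t in types]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical responses (made up; not the data from these slides).
types = ["A", "A", "B", "B", "AB", "AB", "O", "O"]
y = [9.0, 11.0, 11.0, 13.0, 13.0, 15.0, 15.0, 17.0]
b0, b1, b2, b3 = least_squares(types, y)
print(b0, b1, b2, b3)
```

Here the group means are 10, 12, 14, and 16 for A, B, AB, and O, so the fit returns the type O mean as the intercept and the differences from it as slopes, up to floating-point rounding.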

CLG Activity

Questions?

Upcoming in Topic 9...
- Pairwise Comparisons (Sec. 17.7-17.8)
- Randomized Blocks (Chapter 18)
