
1 Modern Approaches Effect Size
Key source: Kline, Beyond Significance Testing

2 Always a difference
The null hypothesis is commonly defined as 'no difference'
Yet differences between groups always exist (at some level of precision)
Obtaining statistical significance can therefore be seen as just a matter of sample size
Furthermore, the p value does not reflect the importance or magnitude of an effect (because sample size drives the probability value attained)

3 What should we be doing?
Make sure we have looked hard enough for the difference – power analysis
Figure out how big the thing we are looking for is – effect size

4 Calculating effect size
Different statistical tests have different effect sizes developed for them
However, the general principle is the same:
Effect size refers to the magnitude of the impact of the independent variable(s) on the outcome variable

5 Types of effect size
Three basic classes of effect size are commonly used:
d family: standardized mean differences, which allow comparison across samples and variables with differing variance (e.g., Cohen's d); note that sometimes there is no need to standardize (when the units of the scale have inherent meaning)
r family: variance-accounted-for measures, i.e., the amount of variance explained versus the total
Case-level effect sizes
Note that in group-comparison situations, each is directly translatable to the others

6 Cohen's d
For the independent samples t-test, d = (M1 − M2) / s, where s is the standardizer
Cohen initially suggested that either sample standard deviation could be used, since the two should be equal under our assumptions; in practice, the pooled standard deviation is now used
Cohen's f is the ANOVA counterpart
Other variants are available for particular cases, e.g., control groups, dependent samples
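For concreteness, a minimal R sketch of d with a pooled standardizer (the data and group names are hypothetical):

# Cohen's d for two independent groups, standardized by the pooled SD
cohens_d <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))  # pooled SD
  (mean(x) - mean(y)) / sp
}
set.seed(1)
treatment <- rnorm(30, mean = 105, sd = 15)  # hypothetical scores
control   <- rnorm(30, mean = 100, sd = 15)
cohens_d(treatment, control)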

7 Association
A measure of association describes the amount of covariation between the independent and dependent variables
It is expressed in an unsquared metric or a squared metric—the former is usually a correlation, the latter a variance-accounted-for effect size (e.g., r, eta)

8 Characterizing effect size
Cohen emphasized that the interpretation of effects requires the researcher to consider them narrowly, in terms of the specific area of inquiry
Evaluation of effect sizes inherently requires a personal value judgment regarding the practical or clinical importance of the effects

9 Other effect size measures
Case-level effect sizes
Measures for non-continuous data:
Association: contingency coefficient, phi, Cramer's phi
d family: odds ratios
Agreement: kappa

10 Case-level effect sizes
Indexes such as Cohen's d and eta2 estimate effect size at the group or variable level only
However, it is often of interest to estimate differences at the case level
Case-level indexes of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point
Reference points can be relative (e.g., a certain number of standard deviations above or below the mean of the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test)

11 Case-level effect sizes
U1: proportion of nonoverlap; = 1 if there is no overlap, 0 if the distributions overlap completely
U2: proportion of scores in the lower group exceeded by the same proportion in the upper group; = .5 if the means are equal, 1.0 if all of group 2 exceeds group 1
U3: proportion of scores in the lower group exceeded by the typical score in the upper group; same range as U2
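Under the usual normality and equal-variance assumptions, all three can be obtained from d; a small R sketch (the d value is hypothetical):

d  <- 1.0                 # hypothetical standardized mean difference
U2 <- pnorm(d / 2)        # proportion of lower group exceeded by the same proportion of the upper group
U3 <- pnorm(d)            # proportion of lower group exceeded by the typical upper-group score
U1 <- (2 * U2 - 1) / U2   # proportion of nonoverlap (Cohen, 1988)
c(U1 = U1, U2 = U2, U3 = U3)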

12 Other case-level effect sizes
Tail ratios (Feingold, 1995): the relative proportion of scores from two different groups that fall in the upper extreme (i.e., either the left or right tail) of the combined frequency distribution
"Extreme" is usually defined relatively, in terms of the number of standard deviations away from the grand mean
A tail ratio > 1.0 indicates that one group has relatively more extreme scores
Here, tail ratio = p2/p1
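As an illustration, a hypothetical R sketch of a tail ratio under normality, defining "extreme" as 2 standard deviations above the grand mean:

m1 <- 0; m2 <- 0.5; s <- 1        # group means and common SD in z-score units (hypothetical)
cut <- mean(c(m1, m2)) + 2 * s    # cutoff: 2 SDs above the grand mean (equal n assumed)
p1 <- 1 - pnorm(cut, m1, s)       # proportion of group 1 beyond the cutoff
p2 <- 1 - pnorm(cut, m2, s)       # proportion of group 2 beyond the cutoff
p2 / p1                           # > 1: group 2 is overrepresented in the right tail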

13 Other case-level effect sizes
Common language effect size (McGraw & Wong, 1992) is the predicted probability that a random score from the upper group exceeds a random score from the lower group
Consider the distribution of the difference between two random scores (mean M1 − M2, variance s12 + s22) and find the area to the right of zero
Range: .5 – 1.0
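Under normality this is a one-liner; a sketch with hypothetical summary statistics:

m1 <- 105; m2 <- 100; s1 <- 15; s2 <- 15  # hypothetical means and SDs
z <- (m1 - m2) / sqrt(s1^2 + s2^2)        # standardized difference of two random scores
pnorm(z)                                  # P(random upper-group score > random lower-group score)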

14 Confidence Intervals for Effect Size
Effect size statistics such as Hedges' g and η2 have complex distributions
Traditional methods of interval estimation rely on approximate standard errors and assume large sample sizes
The general form of a CI for d is the same as that of any CI: the estimate plus or minus its standard error times the appropriate critical value

15 Problem
However, CIs formulated in this manner are only approximate, being based on the central t distribution centered on zero
The true (exact) CI depends on a noncentral distribution and an additional parameter, the noncentrality parameter (ncp)
The ncp is what the alternative-hypothesis distribution is centered on (the further from zero, the less belief in the null)
d is a function of this parameter, such that if ncp = 0 (i.e., the distribution is centered on the null-hypothesis value), then d = 0 (i.e., no effect)

16 Confidence Intervals for Effect Size
The situation is similar for the r and eta2 effect size measures
Gist: we'll need a computer program to find the correct upper and lower bounds for the ncp, then transform those bounds, based on their relation to d, to get exact confidence intervals for effect sizes
Statistica and R (the MBESS package) have the functionality built in, several standalone packages are available (Steiger), and others allow such intervals to be programmed (SPSS scripts are even available (Smithson))
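For example, a sketch using MBESS with a hypothetical t value and group sizes; ci.smd() finds the bounds on the ncp and converts them to the d metric:

library(MBESS)
ci.smd(ncp = 2.5, n.1 = 30, n.2 = 30, conf.level = .95)  # exact CI for d
conf.limits.nct(ncp = 2.5, df = 58, conf.level = .95)    # bounds for the ncp itself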

17 Limitations of effect size measures
Standardized mean differences: heterogeneity of within-conditions variances across studies can limit their usefulness—the unstandardized contrast may be better in this case
Measures of association: correlations can be affected by sample variances and by whether the samples are independent, the design is balanced, or the factors are fixed
They are also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections)
Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance

18 Limitations of effect size measures
How to fool yourself with effect size estimation:
1. Measure effect size only at the group level
2. Apply generic definitions of effect size magnitude without first looking to the literature in your area
3. Believe that an effect size judged as "large" according to generic definitions must be an important result and that a "small" effect is unimportant (see Prentice & Miller, 1992)
4. Ignore the question of how theoretical or practical significance should be gauged in your research area
5. Estimate effect size only for statistically significant results

19 Limitations of effect size measures
6. Believe that finding large effects somehow lessens the need for replication
7. Forget that effect sizes are subject to sampling error
8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study
9. Forget that standardized effect sizes encapsulate other quantities, such as the unstandardized effect size, error variance, and experimental design
10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published
11. Automatically equate effect size with 'cause size'

20 Recommendations
APA task force suggestions:
Report effect sizes
Report confidence intervals
Use graphics
Report and interpret effect sizes in the context of those seen in previous research rather than rules of thumb
Report and interpret confidence intervals (for effect sizes too), also within the context of prior research
In other words, don't be overly concerned with whether a CI for a mean difference contains zero; focus on where it matches up with previous CIs
Summarize prior and current research with displays of CIs in graphical form (e.g., with Tryon's reduction)
Report effect sizes even for non-statistically-significant results

21 Effect Size Estimation in One-Way ANOVA

22 Contrast Review
Concerns a design with a single factor A with at least 2 levels (conditions)
The omnibus comparison concerns all levels (i.e., dfA ≥ 2)
A focused comparison or contrast concerns just two levels (i.e., df = 1)
The omnibus effect is often relatively uninteresting compared with specific contrasts (e.g., treatment 1 vs. placebo control)
A large omnibus effect can also be misleading if due to a single discrepant mean that is not of substantive interest

23 Comparing Groups
The traditional approach is to analyze the omnibus effect, followed by analysis of all possible pairwise contrasts (i.e., compare each condition to every other condition)
However, this approach is typically incorrect (Wilkinson & TFSI, 1999)—for example, it is rare that all such contrasts are interesting
Also, use of traditional methods for post hoc comparisons reduces power for every contrast, and power may already be low in a typical psych data setting

24 Contrast specification and tests
A contrast is a directional effect that corresponds to a particular facet of the omnibus effect
It is often represented with the symbol ψ for a population or ψ̂ for a sample
It is a weighted sum of means; in a sample, a contrast is calculated as ψ̂ = ∑ aj Mj
a1, a2, ..., aj is the set of weights that specifies the contrast
Review: contrast weights must sum to zero, and the weights for at least two different means should not equal zero
Means assigned a weight of zero are excluded from the contrast
Means with positive weights are compared with means given negative weights
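A minimal R sketch of computing ψ̂ (the means and weights are hypothetical):

means   <- c(trt1 = 12, trt2 = 14, placebo = 9)  # hypothetical group means
weights <- c(0.5, 0.5, -1)                       # sum to zero: average of treatments vs. placebo
psi_hat <- sum(weights * means)                  # psi-hat = 13 - 9 = 4
psi_hat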

25 Contrast specification and tests
For effect size estimation with the d family, we generally want a standard set of contrast weights
In a one-way design, you want the sum of the absolute values of the weights to equal two (i.e., ∑ |aj| = 2.0)
This mean difference scaling permits the interpretation of a contrast as the difference between the averages of two subsets of means

26 Contrast specification and tests
An exception to the need for mean difference scaling is testing for trends (polynomials) specified for a quantitative factor (e.g., drug dosage)
There are default sets of weights that define trend components (e.g., linear, quadratic), and these are not typically based on mean difference scaling
This is not usually a problem, because effect size for trends is generally estimated with the r family (measures of association)
Measures of association for contrasts of any kind generally correct for the scale of the contrast weights

27 Orthogonal Contrasts
Two contrasts are orthogonal if they each reflect an independent aspect of the omnibus effect
For balanced designs this requires ∑ aj bj = 0; for unbalanced designs, ∑ aj bj / nj = 0
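A quick R sketch of the balanced-design check (the weight sets are hypothetical):

a <- c(0.5, 0.5, -1)  # average of treatments vs. placebo
b <- c(1, -1, 0)      # treatment 1 vs. treatment 2
sum(a * b)            # 0 => orthogonal; for unbalanced designs, weight each product by 1/nj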

28 Orthogonal Contrasts The maximum number of orthogonal contrasts is:
dfA = a − 1, where a = the number of levels (groups) of that factor
For a full set of orthogonal contrasts, the contrast sums of squares add up to SSA, and their eta-squared values sum to the omnibus (SSA) eta-squared
That is, the omnibus effect can be broken down into a − 1 independent directional effects
However, it is more important to analyze contrasts of substantive interest, even if they are not orthogonal

29 Contrast specification and tests
The t-test for a contrast against the nil hypothesis is t = ψ̂ / sqrt[MSW ∑(aj2/nj)], with df = dferror
The F is simply t2, with df = 1, dferror

30 Dependent Means
Test statistics for dependent mean contrasts usually have error terms based on only the two conditions compared—for example, a standard error of sD/√n, where s2D refers to the variance of the contrast difference scores
Unlike the omnibus error term in a repeated measures design, this error term does not assume sphericity

31 Confidence Intervals
Approximate confidence intervals for contrasts are generally fine (i.e., good enough)
The general form of an individual confidence interval for Ψ is ψ̂ ± tcrit(dferror) × SE(ψ̂), where dferror is specific to that contrast

32 Contrast specification and tests
There are also corrected confidence intervals for contrasts that adjust for multiple comparisons (i.e., inflated Type I error)
These are known as simultaneous or joint confidence intervals
Their widths are generally wider than those of individual confidence intervals because they are based on a more conservative critical value
Program for correcting: …earch/resources/psyprogram.html

33 Standardized contrasts
The general form for a standardized mean difference contrast (in terms of population parameters) is δ = ψ / σ, where σ is the standardizer

34 Standardized contrasts
There are three general ways to estimate σ (i.e., the standardizer) for contrasts between independent means:
1. Calculate d as Glass's Δ, i.e., use the standard deviation of the control group
2. Calculate d as Hedges' g, i.e., use the square root of the pooled within-conditions variance for just the two groups being compared
3. Calculate d as an extension of g, where the standardizer is the square root of MSW based on all groups (i.e., from the omnibus test)—generally recommended

35 Standardized contrasts
To calculate d from a contrast t for a paper that does not report effect size (as it should), use d = t × sqrt[∑(aj2/nj)]
Recall that the absolute values of the weights should sum to 2

36 CIs
Once d is calculated, one can easily obtain exact confidence intervals via the MBESS package in R or Steiger's standalone program
The latter provides the interval for the noncentrality parameter, which must then be converted to d; since the MBESS package used Steiger's work as a foundation, it is to be recommended

37 Cohen's f
Cohen's f can be interpreted as the average standardized mean difference across the groups in question
It has a direct relation to measures of association (eta-squared): f = sqrt[η2 / (1 − η2)]
As with Cohen's d, there are guidelines regarding Cohen's f: .10, .25, .40 for small, moderate, and large effect sizes
These correspond to eta-squared values of .01, .06, .14
Again, though, one should consult the relevant literature to determine what constitutes 'small', 'medium', and 'large'
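The correspondence is easy to verify with the conversion noted above:

eta2 <- c(.01, .06, .14)
round(sqrt(eta2 / (1 - eta2)), 2)  # returns .10, .25, .40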

38 Measures of Association
A measure of association describes the amount of covariation between the predictor and outcome variables
It is expressed in an unsquared metric or a squared metric—the former is usually a correlation, the latter a variance-accounted-for effect size
A squared multiple correlation (R2) calculated in ANOVA is called the correlation ratio or estimated eta-squared (η2)

39 Eta-squared
A measure of the degree to which variability among observations can be attributed to conditions
Computed as η2 = SSeffect / SStotal
Example: η2 = .50 means 50% of the variability seen in the scores is due to the independent variable

40 More than one factor
It is fairly common practice to calculate eta2 (the correlation ratio) for the omnibus effect but the partial correlation ratio for each contrast
As we have noted before, SPSS labels everything "partial eta-squared" in its output, but for a one-way design you would report it as eta-squared

41 Problem
Eta-squared (since it is an R-squared) is an upwardly biased measure of association (just as R-squared is)
As such, it is better used descriptively than inferentially

42 Omega-squared
Omega-squared is another effect size measure, one that is less biased and interpreted in the same way as eta-squared
So why do we not see omega-squared as much?
People don't like small values
Stat packages don't provide it by default

43 Omega-squared
Put differently: est. ω2 = (SSeffect − dfeffect × MSerror) / (SStotal + MSerror)

44 Omega-squared
Omega-squared assumes a balanced design; eta2 does not
Though omega values are generally lower than the corresponding correlation ratios for the same data, the values converge as sample size increases
Note that omega-squared can be negative—if so, interpret it as though the value were zero
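A sketch of the computation from one-way ANOVA summary quantities (all values hypothetical):

omega_sq <- function(ss_effect, df_effect, ss_total, ms_error) {
  # est. omega^2; can come out negative, in which case treat it as zero
  (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
}
omega_sq(ss_effect = 120, df_effect = 2, ss_total = 500, ms_error = 4)  # about .22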

45 Comparing effect size measures
Consider our previous example with item difficulty and arousal regarding performance

46 Comparing effect size measures
Effect        η2     ω2     partial η2    f
B/t groups    .67    .59    —             1.42
Difficulty    .33    .32    .50           .70
Arousal       .17    .14    —             .45
Interaction   —      —      —             —
Slight differences are due to rounding; f is based on eta-squared

47 No p-values
As before, programs are available to calculate confidence intervals for an effect size measure
Example using the MBESS package for the overall effect: 95% CI on ω2 of .20 to .69
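A sketch of one way to obtain such an interval with MBESS; ci.pvaf() returns a CI on the proportion of variance accounted for, and the F value, dfs, and N here are hypothetical:

library(MBESS)
ci.pvaf(F.value = 12.5, df.1 = 2, df.2 = 27, N = 30, conf.level = .95)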

48 No p-values
Ask yourself, as we have before: if the null hypothesis were true, what would our effect size be (standardized mean difference or proportion of variance accounted for)?
Rather than doing traditional hypothesis testing, one can simply see whether the CI for the effect size contains the value of zero (or, in the eta-squared case, gets really close to it); if not, reject H0
This is superior in that we retain the NHST decision, get a confidence interval reflecting the precision of our estimate, focus on effect size, and de-emphasize the p-value

49 Factorial Effect Size

50 Multiple factors
There are some special considerations in designs with multiple factors
Ignoring these issues can result in effect size variation across multiple-factor designs due more to artifacts than to real differences in effect size magnitude
These issues arise in part from the problem of what to do about 'other' factors when effects on a particular factor are estimated
Methods for effect size estimation in multiple-factor designs are also not as well developed as for single-factor designs
However, this is more true for standardized contrasts than for measures of association

51 Special Considerations
It is generally necessary to have a good understanding of ANOVA for multiple-factor designs (e.g., factorial ANOVA) in order to understand effect size estimation
However, this does not mean knowing only about the F test, which is just a small (and perhaps the least interesting) part of ANOVA
The most useful part of an ANOVA source table for effect size estimation is everything to the left of the usual columns for F and p
It is better to see ANOVA as a general tool for estimating variance components for different types of effects in different kinds of designs

52 Multiple factors
Multiple-factor designs arise out of a few basic distinctions, including whether:
1. the factors are between-subjects vs. within-subjects
2. the factors are experimental vs. nonexperimental
3. the relation between the factors or subjects is crossed vs. nested
The most common type of multiple-factor design is the factorial design, in which every pair of factors is crossed (levels of each factor are studied in all combinations with levels of all other factors)

53 Multiple factors Some common types of factorial designs:
Completely between-subjects factorial design: subjects are nested under all combinations of factor levels (i.e., all samples are independent)
Completely within-subjects factorial design: each case in a single sample is tested under every combination of two or more factors (i.e., subjects are crossed with all factors)
Mixed within-subjects factorial design (split-plot or mixed design): at least one factor is between-subjects and another is within-subjects (i.e., subjects are crossed with some factors but nested under others)

54 Multiple factors
While there are other distinctions, one that cannot be ignored is whether a factorial design is balanced or not
In a balanced (equal n among groups) design, the main and interaction effects are all independent (e.g., a true experiment with randomized assignment for all factors)
For this reason, balanced factorials are referred to as orthogonal designs

55 Multiple factors
However, factorial designs in applied research are often not balanced
We distinguish between unbalanced designs with:
1. unequal but proportional cell sizes
2. unequal and disproportional cell sizes
Designs with proportional cell sizes can be analyzed as orthogonal designs, since equal n is a special case of proportionality

56 Multiple factors
In nonorthogonal designs, the main effects overlap (i.e., they are not independent)
This overlap can be corrected in different ways, which means there may be no unique estimate of the sums of squares for a particular main effect
This ambiguity can affect both statistical tests and effect size estimates, especially measures of association
The choice among alternative sets of estimates for a particular nonorthogonal design is best based on rational considerations, not statistical ones

57 Factorial ANOVA
Just as in single-factor ANOVA, two basic sources of variability are estimated in factorial ANOVA: within-conditions and between-conditions
The total within-conditions variance, MSW, is estimated the same basic way—as the weighted average of the within-conditions variances—regardless of whether the design is balanced
However, estimation of the numerator of the total between-conditions variability in a factorial design depends on whether the design is balanced
The "standard" equations for effect sums of squares presented in many introductory statistics books are for balanced designs only
It is only for such designs that the sums of squares for the main and interaction effects are both additive and unique—i.e., we could add up the eta2 values for the main effects and interaction to get the overall eta2

58 Factorial ANOVA
Contrasts can be specified for main, simple, or interaction effects
A single-factor contrast involves the levels of just one factor while controlling for the other factors—there are two kinds:
1. A main comparison involves contrasts between subsets of marginal means for the same factor (i.e., it is conducted within a main effect)
2. A simple comparison involves contrasts between subsets of cell means in the same row or column (i.e., it is conducted within a simple effect)
A single-factor contrast is specified with weights just as a contrast in a one-way design—we assume here mean difference scaling (i.e., ∑ |aj| = 2.0)

59 Factorial ANOVA
An interaction contrast specifies a single-df interaction effect
It is specified with weights applied to the cells of the whole design (e.g., to all six cells of a 2×3 design)
These weights follow the same general rules as for one-way designs, but see the Kline text for specifics regarding effect size estimation
The weights should also be doubly centered, meaning they sum to zero in every row and column
If an interaction contrast in a two-way design is to be interpreted as the difference between a pair of simple comparisons (i.e., mean difference scaling), the sum of the absolute values of the weights must be 4.0
As an example, the weights (1, 0, −1) for row A1 and (−1, 0, 1) for row A2 compare the simple effect of A at B1 with the simple effect of A at B3

60 Factorial ANOVA
As is probably clear by now, there can be many possible effects to analyze in a factorial design
This is especially true for designs with three or more factors, for which there are a three-way interaction effect, two-way interaction effects, main effects, and contrasts for any of those
One can easily get lost estimating every possible effect, so it is important to have a plan that minimizes the number of analyses while still respecting the essential hypotheses
Some of the worst misuses of statistical tests are seen in factorial designs when this advice is ignored
E.g., all possible effects are tested and sorted into two categories: those statistically significant (subsequently discussed at length) vs. those not (subsequently ignored)
This misuse is compounded when power, which can vary from effect to effect in factorial designs, is ignored

61 Standardized Contrasts
There is no definitive method at present for calculating standardized contrasts in factorial designs
However, some general principles discussed by Cortina and Nouri (2000), Olejnik and Algina (2000), and others are that:
1. Estimates for effects of each independent variable in a factorial design should be comparable with effect sizes for the same factor studied in a one-way design
2. Changing the number of factors in the design should not necessarily change the effect size estimates for any one of them

62 Standardized Contrasts
Standardized contrasts may be preferred over measures of association as effect size indexes if contrasts are the main focus of the analysis This is most likely in designs with just two factors Because they are more efficient, measures of association may be preferred in larger designs

63 Standardized Contrasts
Standardized contrasts in factorial designs have the same general form as in one-way designs: d = Ψ/σ* The problem is figuring out which standardizer should go in the denominator Here we’ll present what is described in both Kline and Howell in a general way, though Kline has more specifics

64 Standardized Contrasts
Basically it comes down to putting the variance due to other effects not being considered back into the error variance, which then becomes our standardizer
So if we have SS for Therapy, Race, T×R, and error, and we used SSerr → MSerr → √MSerr here as the standardizer, it would be much smaller than in a one-way design comparing groups on the main effect of Therapy (for example)
Then we'd run around claiming a much larger effect than those who just looked at a main effect of therapy, when in fact that may not be the case at all

65 Standardized Contrasts
The solution, then, is to add those other effects back into the error term before standardizing
What of simple effects? Simple effects would use the same standardizer as the corresponding main effect
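A hedged sketch of this pooling idea for the Therapy effect in a Therapy × Race design, following the general approach described by Cortina and Nouri (2000); all SS and df values are hypothetical:

ss_race <- 40; ss_txr <- 25; ss_err <- 400  # off-factor effects pooled back with error
df_race <- 1;  df_txr <- 2;  df_err <- 54
standardizer <- sqrt((ss_race + ss_txr + ss_err) / (df_race + df_txr + df_err))
(14 - 10) / standardizer  # hypothetical Therapy mean contrast, in d units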

66 Measures of Association
If using measures of association, the process is the same as in the one-way design

67 Summary
As with the one-way design, we can look at standardized mean differences (d family) or variance-accounted-for assessments of effect size (r family)
Nothing changes for the r family when dealing with factorial designs
The goal for standardized contrasts is to come up with a measure that reflects what is seen for main effects and is consistent across studies

68 Effect Size Repeated Measures

69 Measures of association
Measures of association are computed in the same way for repeated measures designs
In general, partial η2 = SSeffect / (SSeffect + SSerror)
This holds whether the samples are independent or not, and for a design of any complexity

70 Standardized contrasts
There are three approaches one could use with dependent samples:
1. Treat them as you would contrasts between independent means
2. Standardize the dependent mean change against the standard deviation of the contrast difference scores, sDΨ
3. Standardize using the square root of MSerror
The first method makes a standardized mean change from a correlated design more directly comparable with a standardized contrast from a design with unrelated samples, though the second may be more appropriate for the change we are concerned with (see the sketch below)
The third is not recommended, as its metric is generally neither that of the original scores nor that of the change scores, and so may be difficult to interpret
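A sketch contrasting the first two options with hypothetical pre/post data:

set.seed(2)
pre  <- rnorm(25, 50, 10)      # hypothetical pre-test scores
post <- pre + rnorm(25, 5, 6)  # correlated post-test scores
(mean(post) - mean(pre)) / sqrt((var(pre) + var(post)) / 2)  # option 1: independent-style standardizer
mean(post - pre) / sd(post - pre)                            # option 2: SD of the difference scores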

71 Standardized contrasts
One thing we'd like to be able to do is compare situations that could have been either dependent or independent
Ex: we could test work performance in the morning and at night via random assignment or repeated measures
In that case we'd want a standardizer in the original metric, so the choice would be to use the d family measures we would use for simple pairwise contrasts between independent samples
For actual interval repeated measures (i.e., time), note that we are typically more interested in testing for trends and the r family of measures

72 Mixed design
r family measures of effect such as η2 will again be computed in the same fashion
For standardized mean differences, there are some considerations depending on which differences we're looking at

73 Mixed design
For between-groups differences, we can calculate as normal for comparisons at each measure/interval and for simple effects, i.e., use √MSwithin
This assumes you've met the homogeneity of variance assumption
This approach could be taken for all contrasts, but it would ignore the cross-condition correlations for repeated measures comparisons
However, the benefit of being able to compare across different (non-repeated) designs suggests it is probably the best approach
One could, if desired, look at differences in the metric of the difference scores and thus standardized mean changes
For more info, consult the Kline text, Olejnik and Algina (2000) on our class webpage, or Cortina and Nouri (2000)

74 ANCOVA
r family of effect sizes: same old story
d family effect sizes: while one might use adjusted means, in an experimental design (i.e., no correlation between covariate and grouping variable) the adjusted difference should be pretty much the same as that of the original means
However, current thinking is that the standardizer should come from the original metric, so run just the ANOVA and use the square root of the MSerror from that analysis
In other words, when thinking about standardized mean differences, it won't matter whether you ran an ANOVA on the post-test scores or an ANCOVA, provided you meet your assumptions
However, your (partial) r family effect sizes will be larger with the ANCOVA, as the variance due to the covariate is taken out of the error term

