1 Unit 15 Power Analysis and Statistical Validity

2 Reality vs. researcher's conclusion

Researcher concludes:      Reality: NO EFFECT        Reality: EFFECT EXISTS
FAIL TO REJECT NULL        CORRECT FTR (1 − α)       TYPE 2 ERROR (β)
                           True Negative             False Negative
REJECT NULL                TYPE 1 ERROR (α)          CORRECT REJECT (1 − β)
                           False Positive            True Positive

3 Power: The probability of rejecting the null hypothesis when it is false.
symbol is 1 - , where  is the probability of making a Type II error--failing to reject the null hypothesis when you should In everyday language, power is the probability of concluding that variables are related (or groups differ) on the basis of your sample, when the variables actually are related (or groups differ) in the population.

4 What Determines Power? Some statistical tests are more powerful (i.e., better at detecting real, non-zero population effects) than others. Parametric tests are often more powerful than non-parametric tests because they use more of the information in the data. The GLM provides minimum-variance unbiased estimates (MVUE) when its assumptions are met.

5 What Factors Affect Power in GLM?
1. Alpha (significance level): As alpha increases (e.g., goes from .01 to .05), power increases. With a less stringent alpha, it is easier to reject the null hypothesis of no relation/difference.
2. Sample size: As N increases, power increases. With larger N, the standard errors of the parameter estimates are reduced, so small, non-zero effects are less likely to be attributed to sampling error. The critical value also gets smaller. (See the numeric sketch below.)
3. Magnitude of effect in the population: As the magnitude of the effect in the population goes up, power increases. It is easier to determine that the population effect ≠ 0 when it is bigger, because the sample bs will be bigger on average.
4. Number of parameters in Model A (and effect): As either the model gets more complex (larger PA) or the effect gets more complex (larger PA − PC), power goes down because the standard error gets larger. The critical value also gets larger.
5. Error (unexplained variance in Y): As SSE(A) goes up (and Model A R2 goes down), power goes down. Larger error means larger standard errors of the parameter estimates, so it is more difficult to tell whether effects ≠ 0 because the bs vary more from sample to sample (and from zero) by chance.
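The sketch below illustrates the sample-size factor numerically. It is not the course's modelPower() function; it uses the pwr package instead, assuming the usual mapping u = PA − PC, v = N − PA, and Cohen's f2 = peta2 / (1 − peta2). The effect size and the N values are arbitrary choices for illustration.

library(pwr)

peta2 <- 0.09                      # illustrative population partial eta-squared
f2 <- peta2 / (1 - peta2)          # convert to Cohen's f2

for (N in c(25, 50, 100, 200)) {
  # one-parameter effect (u = 1) in a model with intercept + 1 predictor (PA = 2)
  p <- pwr.f2.test(u = 1, v = N - 2, f2 = f2, sig.level = 0.05)$power
  cat(sprintf("N = %3d  power = %.2f\n", N, p))
}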

6 Power Conventions Desired level of power depends on purpose, but usually the more the better. The value of .80 has become a minimum threshold standard (much like alpha = .05 for significance testing). Higher power also means more precision in estimating the magnitude of the effect (a tighter CI). Two strategies: (1) determine the number of subjects needed (N*) for a given level of power (e.g., .80); (2) determine power for a given design (e.g., a completed experiment with a fixed N).

7 Power can be calculated for tests of
the effect of a single regressor, a subset of regressors controlling for the other regressors in the model, or all regressors in the model.

A priori power analysis (for sample size planning): Must set the alpha level, PA, PC, and desired power. Specify (calculate) the effect size (e.g., partial eta2 or R2) you expect; be somewhat conservative. N will be a function of the above factors.

Post hoc power analysis (WHEN?): Must set the alpha level, PA, PC, and N. Specify the minimum effect size of interest; be VERY conservative. Power will be a function of the above factors.

8 Effect size: Partial Eta2
You can calculate partial eta-squared for any reported effect: partial eta-squared = (F × ndf) / (F × ndf + ddf), where ndf and ddf are the numerator and denominator degrees of freedom from the F statistic. (A minimal helper is sketched below.)
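A minimal R helper for this conversion; the example F and df values are made up for illustration.

peta2_from_F <- function(F, ndf, ddf) (F * ndf) / (F * ndf + ddf)

# e.g., a reported F(1, 64) = 4.30
peta2_from_F(4.30, 1, 64)   # ~0.063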

9 Cohen (1992) Rules of Thumb for partial Eta2
The values of partial eta2 that Cohen implicitly defines as small, medium, and large vary across research context/statistical test:

           MR               ANOVA           t-test          r
Small:     .02 (R2 = .02)   .01 (f = .10)   .01 (d = .20)   .01 (r = .10)
Medium:    .13 (R2 = .13)   .06 (f = .25)   .06 (d = .50)   .09 (r = .30)
Large:     .26 (R2 = .26)   .14 (f = .40)   .14 (d = .80)   .25 (r = .50)

Use the MR effect sizes if you are testing the model R2 or a set (> 1) of quantitative regressors. Use the ANOVA/t effect sizes if you are testing one categorical variable. Use the r effect sizes if you are testing one quantitative variable.
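These rules of thumb can be reproduced approximately with the standard conversions to partial eta-squared; a sketch is below. The d-based conversion assumes two groups of equal size.

peta2_from_f <- function(f) f^2 / (1 + f^2)      # ANOVA effect size f
peta2_from_d <- function(d) d^2 / (d^2 + 4)      # standardized mean difference (equal n)
peta2_from_r <- function(r) r^2                  # correlation

peta2_from_f(c(.10, .25, .40))   # ~ .01, .06, .14
peta2_from_d(c(.20, .50, .80))   # ~ .01, .06, .14
peta2_from_r(c(.10, .30, .50))   #   .01, .09, .25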

10 A Priori Power Analysis for One Parameter in MR
How many subjects are needed for 80% power to detect a partial eta-squared of .09 for one predictor in a model with 5 predictors at an alpha of 0.05?

> modelPower(pa=6, pc=5, power=.80, peta2=.09)
Results from Power Analysis
pEta2 = 0.090
pa = 6
pc = 5
alpha = 0.050
N =
power = 0.800
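If modelPower() is unavailable, the same a priori calculation can be sketched with the pwr package, assuming u = pa − pc, v = N − pa, and f2 = peta2 / (1 − peta2):

library(pwr)
pa <- 6; pc <- 5
f2 <- 0.09 / (1 - 0.09)
sol <- pwr.f2.test(u = pa - pc, f2 = f2, sig.level = 0.05, power = 0.80)
ceiling(sol$v) + pa    # required N under these assumptions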

11 A Priori Power Analysis for One Parameter in MR
How many subjects are needed for 85% power to detect a ΔR2 of .05 for one predictor in a model with 3 predictors and an R2 of .30 at an alpha of 0.05?

> modelPower(pa=4, pc=3, power=.85, dR2=.05, R2=.30)
Results from Power Analysis
dR2 = ; R2 =
pa = 4
pc = 3
alpha = 0.050
N =
power = 0.850

modelPower() also supports f2 as the effect size.
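The same pwr-based cross-check works for a ΔR2 effect, assuming Cohen's f2 = dR2 / (1 − R2), with R2 taken as the full (Model A) R-squared:

library(pwr)
pa <- 4; pc <- 3
f2 <- 0.05 / (1 - 0.30)
sol <- pwr.f2.test(u = pa - pc, f2 = f2, sig.level = 0.05, power = 0.85)
ceiling(sol$v) + pa    # required N under these assumptions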

12 Hefner et al (2013): Duration task

13 A Priori Power Analysis for One Parameter in MR
How many subjects are needed to detect the Beverage Group × Cue Type interaction, based on the effect size observed in Hefner et al., with 90% power and an alpha of .05?

pEta2 = (F × ndf) / (F × ndf + ddf) = (3.452 × 1) / (3.452 × 1 + ddf) = .157

> modelPower(pa=4, pc=3, power=.90, peta2=.157)
Results from Power Analysis
pEta2 = 0.157
pa = 4
pc = 3
alpha = 0.050
N =
power = 0.900

14 Post Hoc Power Analysis for One Parameter in MR
How much power did Hefner et al. have to detect a moderate effect size (partial eta-squared of .06), with alpha = .05?

> modelPower(pc=3, pa=4, N=68, peta2=.06)
Results from Power Analysis
pEta2 = 0.060
pa = 4
pc = 3
alpha = 0.050
N = 68
power = 0.525
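A post hoc cross-check with the pwr package, again assuming u = pa − pc, v = N − pa, and f2 = peta2 / (1 − peta2):

library(pwr)
pa <- 4; pc <- 3; N <- 68
f2 <- 0.06 / (1 - 0.06)
pwr.f2.test(u = pa - pc, v = N - pa, f2 = f2, sig.level = 0.05)$power   # ~0.52, close to the 0.525 above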

15 Problems from Low Power
1. Low probability of finding true effects
2. Low positive predictive value (PPV)
3. An exaggerated estimate of the magnitude of the effect when a true effect is discovered

16 Miss True Effects Low power, by definition, means that the chance of discovering effects that are genuinely true is low. If a study has 20% power and there really is an effect, you only have a 20% chance of finding it. Low-powered studies produce more false negatives (misses) than high-powered studies. We tend to focus on false alarms, but misses can be equally costly (e.g., for new treatments). You also waste resources (time, money) with low-powered studies: one high-powered study with N participants is far better than two low-powered studies with N/2 each.

17 Low Positive Predictive Value (PPV)
The lower the power of a study, the lower the probability that an observed “significant” effect (among the set of all significant effects) actually reflects a true non-zero effect in the population (vs. a false alarm). Called the Positive Predictive Value (PPV) of a claimed discovery. If alpha = .05, what is PPV? You need more information. Probably not 95%!!!

18 Low Positive Predictive Value (PPV)
PPV = ([1 − β] × OR) / ([1 − β] × OR + α)
where (1 − β) is the power, β is the Type II error rate, α is the Type I error rate, and OR is the pre-study odds ratio (that is, the odds that an effect is indeed non-null among the effects being tested in a field or other set). The formula can be rewritten as:
PPV = (Power × OR) / (Power × OR + α)
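A small helper makes the formula concrete; the three calls reproduce the scenarios on the next three slides.

ppv <- function(power, alpha, OR) (power * OR) / (power * OR + alpha)

ppv(power = 0.80, alpha = 0.05, OR = 1.00)   # .94
ppv(power = 0.20, alpha = 0.05, OR = 1.00)   # .80
ppv(power = 0.20, alpha = 0.05, OR = 0.25)   # .50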

19 A priori odds ratio that effect exists: 1 (1:1, i.e., probability 1/2)
Power = 80%. Simulate 200 studies (100 with no effect, 100 where an effect exists):

Researcher concludes:      Reality: NO EFFECT (100)    Reality: EFFECT EXISTS (100)
FAIL TO REJECT NULL        CORRECT FTR: 95             TYPE 2 ERROR (β): 20
REJECT NULL                TYPE 1 ERROR (α): 5         CORRECT REJECT (1 − β): 80

PPV = ([1 − β] × OR) / ([1 − β] × OR + α) = ([1 − .20] × 1) / ([1 − .20] × 1 + .05) = .80 / .85 = .94

20 A priori odds ratio that effect exists: 1 (1:1, i.e., probability 1/2)
Power = 20%. Simulate 200 studies (100 with no effect, 100 where an effect exists):

Researcher concludes:      Reality: NO EFFECT (100)    Reality: EFFECT EXISTS (100)
FAIL TO REJECT NULL        CORRECT FTR: 95             TYPE 2 ERROR (β): 80
REJECT NULL                TYPE 1 ERROR (α): 5         CORRECT REJECT (1 − β): 20

PPV = ([1 − β] × OR) / ([1 − β] × OR + α) = ([1 − .80] × 1) / ([1 − .80] × 1 + .05) = .20 / .25 = .80

21 A priori odds ratio that effect exists: .25 (1:4, i.e., probability 1/5)
Power = 20%. Simulate 200 studies (160 with no effect, 40 where an effect exists):

Researcher concludes:      Reality: NO EFFECT (160)    Reality: EFFECT EXISTS (40)
FAIL TO REJECT NULL        CORRECT FTR: 152            TYPE 2 ERROR (β): 32
REJECT NULL                TYPE 1 ERROR (α): 8         CORRECT REJECT (1 − β): 8

PPV = ([1 − β] × OR) / ([1 − β] × OR + α) = ([1 − .80] × .25) / ([1 − .80] × .25 + .05) = .05 / .10 = .50

22

23 Sample estimate of effect is too large
When an underpowered study discovers a true effect, it is likely that the estimate of the magnitude of that effect provided by that study will be exaggerated. Effect inflation is worst for small, low-powered studies, which can only detect effects whose sample parameter estimates happen to be large. Why does this make sense? Consider an example where β = 5 and SE = 4. Draw the sampling distribution and indicate the region where sample bs would be significant. What can we conclude about the estimates (b) of β in any one study, or across a set of studies? (The simulation sketched below makes this concrete.)
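A quick simulation using the slide's example values (β = 5, SE = 4) shows the inflation: only estimates in the upper tail of the sampling distribution reach significance, so the significant estimates are, on average, much larger than the true β. The normal sampling distribution and the 1.96 cutoff are simplifying assumptions for illustration.

set.seed(1)
beta <- 5; se <- 4
b <- rnorm(1e5, mean = beta, sd = se)   # sampling distribution of the estimate b
sig <- abs(b / se) > 1.96               # roughly: which samples reach p < .05
mean(sig)                               # ~ .24: power is low
mean(b[sig])                            # ~ 10: significant estimates average about twice the true beta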

24 Other Biases that often come with Low Power
Vibration of effects
Publication bias, selective data analysis, and selective reporting of outcomes
Lower quality in other aspects of study design

25 Vibration of Effects Vibration of effects refers to the situation in which a study obtains different estimates of the magnitude of the effect depending on the analytical options it implements. These options could include the statistical model, the definition of the variables of interest, the use (or not) of adjustments for certain potential confounders but not others, the use of filters to include or exclude specific observations, and so on. This is more often the case for small studies: when power is low because of a small N, results can vary markedly depending on the analysis strategy.

26 Publication Bias and Selective Reporting
Publication bias and selective reporting of outcomes and analyses are more likely to affect smaller, underpowered studies. Smaller studies "disappear" into a file drawer (larger studies are known and anticipated). Null results from larger studies are more likely to be published because low power isn't a viable explanation for the null effect. The protocols of large (clinical) studies are more likely to have been registered or otherwise made publicly available, so deviations from the analysis plans and choice of outcomes may be more obvious.

27 Smaller studies may have a worse design quality than larger studies
Smaller studies may have a worse design quality than larger studies. Small studies may be opportunistic, quick and dirty experiments. Data collection and analysis may have been conducted with little planning. Large studies often require more funding and personnel resources. Designs are examined more carefully before data collection, and analysis and reporting may be more structured.

28

29

30

31

32 Commitment to Research Transparency and Open Science

33 Method Transparency and Sharing
The 21 Word Solution Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2012). A 21 word solution. Retrieved from: We have adopted their recommendation to report 1) how we determined our sample size, 2) all data exclusions, 3) all manipulations, and 4) all study measures. We will include the following brief (21 word) statement in all papers we submit for publication: "We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study."

34 I have three additional requests that follow from the increased efforts directed at transparency and open science more generally in our field. First, I request that you add a statement to the paper confirming that you have reported all measures, conditions, data exclusions, and how you determined the sample size. You should, of course, add any additional text to ensure this statement is accurate. This is the standard disclosure request endorsed by the Center for Open Science [see

Second, as you likely know, substantial concerns have been raised in recent years about the impact of researcher degrees of freedom in data processing, analysis, and selective reporting of outcomes on the validity of our statistical inference based on p-values. I saw nothing to indicate that study hypotheses, data processing decisions, and proposed analyses were pre-registered to increase confidence that researcher degrees of freedom were reduced or removed. I would ask that you add a statement to the manuscript to confirm that the study was not pre-registered and provide the rationale for why it was not.

36 Third, the ability to evaluate (and associated confidence in) the integrity of the study design and analyses for any manuscript is substantially enhanced when authors make data, measures, and analysis code publicly available, online, hosted by a reliable third party (e.g., and provide a persistent link to these materials in the paper. Data and materials sharing also increases the impact of any study beyond its current findings because such sharing can stimulate and assist further research on the topic. The data and study materials from this study should be shared unless clear reasons (e.g., legal, ethical constraints, or severe impracticality) that prevent sharing are provided.

37 Method Transparency and Sharing
Pre-registration FAQ: Pre-registration examples

38 Important Research Degrees of Freedom
IVs and how they are "scored" (levels, processing, etc.)
DVs and how they are scored
Analytic model (additive, interactive, other model features)
Covariates and how they are selected
Transformations
Outliers and influence
N


