Unit 15 Power Analysis and Statistical Validity


                         Reality: NO EFFECT         Reality: EFFECT EXISTS
Researcher concludes:
  FAIL TO REJECT NULL    CORRECT FTR (1 − α)        TYPE 2 ERROR (β)
                         True Negative              False Negative
  REJECT NULL            TYPE 1 ERROR (α)           CORRECT REJECT (1 − β)
                         False Positive             True Positive

Power: The probability of rejecting the null hypothesis when it is false. Its symbol is 1 − β, where β is the probability of making a Type II error (failing to reject the null hypothesis when you should). In everyday language, power is the probability of concluding that variables are related (or that groups differ) on the basis of your sample when the variables actually are related (or the groups actually differ) in the population.

What Determines Power? Some statistical tests are more powerful (i.e., better at detecting real, non-zero population effects) than others. Parametric tests are often more powerful than non-parametric tests because they use more of the information in the data. The GLM yields minimum-variance unbiased estimates (MVUE) when its assumptions are met.

What Factors Affect Power in GLM?
1. Alpha (significance level): As alpha increases (e.g., from .01 to .05), power increases. With a less stringent alpha, it is easier to reject the null hypothesis of no relation/difference.
2. Sample size: As N increases, power increases. With larger N, the standard errors of the parameter estimates are reduced, so small, non-zero effects are less likely to be attributed to sampling error. The critical value also gets smaller.
3. Magnitude of the effect in the population: As the magnitude of the population effect goes up, power increases. It is easier to determine that the population effect ≠ 0 when it is bigger, because sample b's will be bigger on average.
4. Number of parameters in Model A (and in the effect): As the model gets more complex with more parameters (larger PA), or the effect gets more complex (larger PA − PC), power goes down because the standard error gets larger. The critical value also gets larger.
5. Error (unexplained variance in Y): As SSE(A) goes up (and Model A R² goes down), power is lower. Larger error means larger standard errors of the parameter estimates, so it is more difficult to tell whether effects ≠ 0 because b's vary more from sample to sample (and from zero) by chance.
A brief illustration of these factors appears below.
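To make these factors concrete, here is a minimal sketch using base R's power.t.test() (an illustration of the same principles, not part of the original slides): power rises with N and with effect size, and falls with a stricter alpha.

# Two-group t-test power (Cohen's d = delta / sd), all else held fixed:
power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = .05)$power  # ~.80
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = .05)$power  # ~.34  (smaller N)
power.t.test(n = 64, delta = 0.8, sd = 1, sig.level = .05)$power  # ~.99  (bigger effect)
power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = .01)$power  # ~.59  (stricter alpha)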

Power Conventions. The desired level of power depends on the purpose, but usually more is better. The value .80 has become a minimum threshold standard (much like alpha = .05 for significance testing). Higher power also means more precision in estimating the magnitude of the effect (a tighter CI).

Two strategies:
Determine the number of subjects needed (N*) for a given level of power (e.g., .80)
Determine power for a given design (e.g., a completed experiment with fixed N)

Power can be calculated for tests of the effect of a single regressor, of a subset of regressors controlling for the other regressors in the model, or of all regressors in the model.
A priori power analysis (for sample size planning):
Set the alpha level, PA, PC, and desired power.
Specify (calculate) the effect size (e.g., partial η² or ΔR²) you expect. Be somewhat conservative.
N will be a function of the above factors.
Post hoc power analysis (WHEN?):
Set the alpha level, PA, PC, and N.
Specify the minimum effect size of interest. Be VERY conservative.
Power will be a function of the above factors.

Effect Size: Partial η². You can calculate partial eta-squared for any reported effect:

partial η² = (F × ndf) / (F × ndf + ddf)

where ndf and ddf are the numerator and denominator degrees of freedom of the F statistic.
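This formula is one line of R (the helper name peta2 is mine, for illustration):

peta2 <- function(Fstat, ndf, ddf) (Fstat * ndf) / (Fstat * ndf + ddf)
peta2(Fstat = 4, ndf = 1, ddf = 100)  # ~.038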

Cohen (1992) Rules of Thumb for Partial η². The values of partial η² that Cohen implicitly defines as small, medium, and large vary across research context/statistical test:

          MR               ANOVA           t-test          r
Small     .02 (R² = .02)   .01 (f = .10)   .01 (d = .20)   .01 (r = .10)
Medium    .13 (R² = .13)   .06 (f = .25)   .06 (d = .50)   .09 (r = .30)
Large     .26 (R² = .26)   .14 (f = .40)   .14 (d = .80)   .25 (r = .50)

Use the MR effect sizes if you are testing the model R² or a set (> 1) of quantitative regressors. Use the ANOVA/t effect sizes if you are testing one categorical variable. Use the r effect sizes if you are testing one quantitative variable.
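These columns are linked by standard conversions; a short sketch (my illustration, assuming the equal-n two-group case for d) reproduces the partial η² values in the table:

eta2_from_f <- function(f) f^2 / (1 + f^2)
eta2_from_d <- function(d) d^2 / (d^2 + 4)  # equal group sizes
eta2_from_r <- function(r) r^2
round(eta2_from_f(c(.10, .25, .40)), 2)  # .01 .06 .14
round(eta2_from_d(c(.20, .50, .80)), 2)  # .01 .06 .14
round(eta2_from_r(c(.10, .30, .50)), 2)  # .01 .09 .25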

A Priori Power Analysis for One Parameter in MR. How many subjects are needed for 80% power to detect a partial eta-squared of .09 for one predictor in a model with 5 predictors, at an alpha of .05?

> modelPower(pa=6, pc=5, power=.80, peta2=.09)
Results from Power Analysis
pEta2 = 0.090
pa = 6
pc = 5
alpha = 0.050
N = 85.327
power = 0.800

A Priori Power Analysis for One Parameter in MR. How many subjects are needed for 85% power to detect a ΔR² of .05 for one predictor in a model with 3 predictors and an R² of .30, at an alpha of .05?

> modelPower(pa=4, pc=3, power=.85, dR2=.05, R2=.30)
Results from Power Analysis
dR2 = 0.050; R2 = 0.300
pa = 4
pc = 3
alpha = 0.050
N = 129.648
power = 0.850

modelPower() also supports f² as the effect size.
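For reference, Cohen's f² for an increment is the ΔR² divided by the model's unexplained variance; a minimal sketch for the example above (my assumption: the R² of .30 is the full-model R²):

f2 <- .05 / (1 - .30)  # f2 = dR2 / (1 - R2_full)
f2  # ~0.071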

Hefner et al. (2013): Duration task

A Priori Power Analysis for One Parameter in MR. How many subjects are needed to detect the Beverage Group × Cue Type interaction, based on the effect size observed in Hefner et al., with 90% power and an alpha of .05?

pEta2 = (F × ndf) / (F × ndf + ddf) = (11.92 × 1) / (11.92 × 1 + 64) = .157 (with F = t² = 3.452²)

> modelPower(pa=4, pc=3, power=.90, peta2=.157)
Results from Power Analysis
pEta2 = 0.157
pa = 4
pc = 3
alpha = 0.050
N = 60.408
power = 0.900

Post Hoc Power Analysis for One Parameter in MR. How much power did Hefner et al. have to detect a moderate effect size (partial eta-squared of .06), with alpha = .05?

> modelPower(pc=3, pa=4, N=68, peta2=.06)
Results from Power Analysis
pEta2 = 0.060
pa = 4
pc = 3
alpha = 0.050
N = 68.000
power = 0.525

Problems from Low Power
1. A low probability of finding true effects
2. A low positive predictive value (PPV)
3. An exaggerated estimate of the magnitude of the effect when a true effect is discovered

Miss True Effects. Low power, by definition, means that the chance of discovering genuinely true effects is low. If a study has 20% power and there really is an effect, you have only a 20% chance of finding it. Low-powered studies produce more false negatives (misses) than high-powered studies. We tend to focus on false alarms, but misses can be equally costly (e.g., missed new treatments). You also waste resources (time, money) with low-powered studies: one high-power study with N participants is worth far more than two low-power studies with N/2 each.

Low Positive Predictive Value (PPV). The lower the power of a study, the lower the probability that an observed "significant" effect (among the set of all significant effects) actually reflects a true non-zero effect in the population (vs. a false alarm). This probability is called the positive predictive value (PPV) of a claimed discovery. If alpha = .05, what is the PPV? You need more information. Probably not 95%!

Low Positive Predictive Value (PPV)

PPV = ([1 − β] × OR) / ([1 − β] × OR + α)

where (1 − β) is the power, β is the Type II error rate, α is the Type I error rate, and OR is the pre-study odds ratio (the odds that an effect is indeed non-null among the effects being tested in a field or other set). The formula can be rewritten as:

PPV = (Power × OR) / (Power × OR + α)
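This formula is one line of R (the function name ppv is mine); it reproduces the three simulated scenarios on the following slides:

ppv <- function(power, alpha, odds) (power * odds) / (power * odds + alpha)
ppv(power = .80, alpha = .05, odds = 1)    # .94
ppv(power = .20, alpha = .05, odds = 1)    # .80
ppv(power = .20, alpha = .05, odds = .25)  # .50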

A priori odds that the effect exists: 1 (1:1, i.e., 100 of 200 studies test a real effect). Power = 80%. Simulate 200 studies:

                       Reality: NO EFFECT (100)   EFFECT EXISTS (100)
FAIL TO REJECT NULL    CORRECT FTR: 95            TYPE 2 ERROR (β): 20
REJECT NULL            TYPE 1 ERROR (α): 5        CORRECT REJECT (1 − β): 80

PPV = ([1 − β] × OR) / ([1 − β] × OR + α) = ([1 − .20] × 1) / ([1 − .20] × 1 + .05) = .80 / .85 = .94

A priori odds that the effect exists: 1 (1:1). Power = 20%. Simulate 200 studies:

                       Reality: NO EFFECT (100)   EFFECT EXISTS (100)
FAIL TO REJECT NULL    CORRECT FTR: 95            TYPE 2 ERROR (β): 80
REJECT NULL            TYPE 1 ERROR (α): 5        CORRECT REJECT (1 − β): 20

PPV = ([1 − β] × OR) / ([1 − β] × OR + α) = ([1 − .80] × 1) / ([1 − .80] × 1 + .05) = .20 / .25 = .80

A priori odds that the effect exists: .25 (1:4, i.e., 40 of 200 studies test a real effect). Power = 20%. Simulate 200 studies:

                       Reality: NO EFFECT (160)   EFFECT EXISTS (40)
FAIL TO REJECT NULL    CORRECT FTR: 152           TYPE 2 ERROR (β): 32
REJECT NULL            TYPE 1 ERROR (α): 8        CORRECT REJECT (1 − β): 8

PPV = ([1 − β] × OR) / ([1 − β] × OR + α) = ([1 − .80] × .25) / ([1 − .80] × .25 + .05) = .05 / .10 = .50

Sample Estimates of the Effect Are Too Large. When an underpowered study discovers a true effect, the estimate of the magnitude of that effect it provides is likely to be exaggerated. Effect inflation is worst for small, low-powered studies, which can only detect effects whose sample parameter estimates happen to be large. Why does this make sense? Consider an example where β = 5 and SE = 4. Draw the sampling distribution and indicate the region where sample b's would be significant. What can we conclude about the estimates (b) of β in any one study, or across a set of studies? A simulation sketch follows below.
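Here is that example as a quick R simulation (my sketch, using the slide's β = 5, SE = 4): among the b's that reach significance, the average estimate is roughly double the true β.

set.seed(1)
b <- rnorm(1e5, mean = 5, sd = 4)  # sampling distribution of b (true beta = 5, SE = 4)
sig <- abs(b / 4) > 1.96           # samples where b is statistically significant
mean(sig)                          # power: ~.24
mean(b[sig])                       # ~10.1: significant b's roughly double the true beta of 5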

Other Biases That Often Accompany Low Power
Vibration of effects
Publication bias, selective data analysis, and selective reporting of outcomes
Lower quality in other aspects of study design

Vibration of Effects. Vibration of effects refers to a study obtaining different estimates of the magnitude of the effect depending on the analytical options it implements. These options can include the statistical model, the definition of the variables of interest, the use (or not) of adjustments for certain potential confounders but not others, the use of filters to include or exclude specific observations, and so on. This is more often the case for small studies: results can vary markedly with the analysis strategy when power is low because of small N.

Publication Bias and Selective Reporting. Publication bias and selective reporting of outcomes and analyses are more likely to affect smaller, underpowered studies. Smaller studies "disappear" into the file drawer (larger studies are known and anticipated), and larger null results may be published because low power isn't a viable explanation for the null effect. The protocols of large (clinical) studies are also more likely to have been registered or otherwise made publicly available, so deviations from the analysis plans and choice of outcomes may be more obvious.

Smaller Studies May Have Worse Design Quality Than Larger Studies. Small studies may be opportunistic, quick-and-dirty experiments, with data collection and analysis conducted with little planning. Large studies often require more funding and personnel resources: designs are examined more carefully before data collection, and analysis and reporting may be more structured.

Commitment to Research Transparency and Open Science http://www.researchtransparency.org/

Method Transparency and Sharing: The 21 Word Solution. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2012). A 21 word solution. Retrieved from http://dx.doi.org/10.2139/ssrn.2160588. We have adopted their recommendation to report (1) how we determined our sample size, (2) all data exclusions, (3) all manipulations, and (4) all study measures. We will include the following brief (21-word) statement in all papers we submit for publication: "We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study."

I have three additional requests that follow from the increased efforts directed at transparency and open science more generally in our field. First, I request that you add a statement to the paper confirming that you have reported all measures, conditions, data exclusions, and how you determined the sample size. You should, of course, add any additional text to ensure this statement is accurate. This is the standard disclosure request endorsed by the Center for Open Science [see http://osf.io/hadz3].

Second, as you likely know, substantial concerns have been raised in recent years about the impact of researcher degrees of freedom in data processing, analysis, and selective reporting of outcomes on the validity of statistical inference based on p-values. I saw nothing to indicate that the study hypotheses, data processing decisions, and proposed analyses were pre-registered to increase confidence that researcher degrees of freedom were reduced or removed. I would ask that you add a statement to the manuscript confirming that the study was not pre-registered and provide the rationale for why it was not.

Third, the ability to evaluate the integrity of the study design and analyses for any manuscript (and our confidence in it) is substantially enhanced when authors make data, measures, and analysis code publicly available online, hosted by a reliable third party (e.g., https://osf.io/), and provide a persistent link to these materials in the paper. Data and materials sharing also increases the impact of any study beyond its current findings, because such sharing can stimulate and assist further research on the topic. The data and study materials from this study should be shared unless clear reasons (e.g., legal or ethical constraints, or severe impracticality) prevent sharing.

Method Transparency and Sharing: https://osf.io/ykmuh/
Pre-registration FAQ: https://www.psychologicalscience.org/observer/research-preregistration-101
Pre-registration examples:
https://osf.io/ukhcf/register/564d31db8c5e4a7c9694b2be
https://osf.io/m8jmp/register/564d31db8c5e4a7c9694b2c0

Important Researcher Degrees of Freedom
IVs and how they are "scored" (levels, processing, etc.)
DVs and how they are scored
Analytic model (additive, interactive, other model features)
Covariates and how they are selected
Transformations
Outliers and influence
N
