Effect Sizes and Power Review


Statistical Power

Statistical power is the probability of detecting an effect of a particular size.
Specifically, it is 1 − the type II error rate: the probability of rejecting the null hypothesis when it is false.
It is a function of the type I error rate, the sample size, and the effect size.
Its utility lies in helping us determine the sample size needed to detect an effect of a given magnitude.
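As a rough illustration of how these quantities fit together, here is a sketch (not from the slides; the function name is my own) of the power of a two-sided one-sample z-test under a normal approximation:

```python
# Sketch: power of a two-sided one-sample z-test, normal approximation.
# Power = P(reject H0 | H0 false) = 1 - beta.
from statistics import NormalDist

def ztest_power(d, n, alpha=0.05):
    """d = true standardized effect size, n = sample size, alpha = type I rate."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    ncp = d * n ** 0.5                  # noncentrality: center of the sampling
                                        # distribution when H0 is false
    # probability of landing beyond either critical value
    return 1 - z.cdf(z_crit - ncp) + z.cdf(-z_crit - ncp)

# Power grows with n: the same d = 0.33 at n = 30 versus n = 120
print(round(ztest_power(0.33, 30), 3))    # ~ 0.44
print(round(ztest_power(0.33, 120), 3))   # ~ 0.95
```

The same function also shows power rising with d or with a more lenient alpha, matching the three determinants listed above.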

Two kinds of power analysis

A priori: used when planning your study. What sample size is needed to obtain a certain level of power?
Post hoc: used when evaluating a study. What chance did you have of obtaining significant results?
The post hoc version is not really useful. If you did the power analysis and conducted your study accordingly, then you did what you could. Saying afterward, "I would have found a difference but didn't have enough power," isn't going to impress anyone.

A priori power

We can use the relationship among n, d, and δ (the noncentrality parameter, i.e., what the sampling distribution is centered on if H0 is false), plus our specified α, to calculate how many subjects we need to run:
Decide on your α level.
Decide on an acceptable level of power (equivalently, a type II error rate).
Figure out the effect size you are looking for.
Calculate n.

A priori effect size?

Figure out an effect size before running the experiment? There are several ways to do this:
Base it on substantive knowledge: what you know about the situation and the scale of measurement.
Base it on previous research.
Use conventions.

An acceptable level of power?

Why not set power at .99? Practicalities. Howell shows that for a one-sample t test with an effect size d of 0.33:
Power = .80 requires n = 72
Power = .95 requires n = 119
Power = .99 requires n = 162
The cost of increasing power (usually done by increasing n) can be high.
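The figures above can be approximated with the textbook formula n = ((z_{1−α/2} + z_power) / d)². The sketch below (mine, not Howell's) uses exact normal quantiles, so it lands close to, but not exactly on, the quoted values, which came from rounded δ tables:

```python
# Sketch: a priori sample size via the normal approximation
# n = ((z_{1-alpha/2} + z_power) / d)^2, rounded up.
from math import ceil
from statistics import NormalDist

def required_n(d, power, alpha=0.05):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return ceil(((z_alpha + z_power) / d) ** 2)

for power in (0.80, 0.95, 0.99):
    print(power, required_n(0.33, power))
```

Note how steeply n climbs between power .80 and .99, which is the practical point of the slide.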

Howell's general rule

Look for big effects, or use big samples.
You may now start to understand how little power many studies in psychology have, considering they are often looking for small effects.
Many seem to think that if they satisfy the central limit theorem rule of thumb (n = 30), which doesn't even hold that often, then power is solved too. This is clearly not the case.

Post hoc power: the power of the actual study

If you fail to reject the null hypothesis, you might want to know what chance you had of finding a significant result, to defend the failure. As many point out, this is a little dubious.
One thing we can say about the power of a particular study is that it can be affected by a number of issues, such as:
Reliability of measurement (an increase in reliability can actually result in power increasing or decreasing, as we will see later, though here the stress is on the decrease due to unreliable measures)
Outliers
Skewness
Unequal n for group comparisons
The analysis chosen

Something to consider

Doing a sample size calculation is nice in that it gives a sense of what to shoot for, but rarely if ever do the data or circumstances bear out such that it provides a perfect estimate of our needs.
Mike's sample size calculation for all studies: the sample size needed is the largest N you can obtain given practical considerations (e.g., time, money).
Also, even the useful form of power analysis (sample size calculation) takes statistical significance as its focus. While it gives you something to shoot for, our real interest is the effect size itself and how comfortable we are with its estimation.
Emphasizing effect size over statistical significance in a sense de-emphasizes the power problem.

Always a relationship

We commonly define the null hypothesis as 'no difference' or 'no relationship'. But there is always a non-zero relationship (to some decimal place) in sample data.
As such, obtaining statistical significance can be seen as just a matter of sample size.
Furthermore, the importance and magnitude of an effect are not reflected in the p-value attained, because of the role sample size plays in it.
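A quick illustrative calculation (mine, not the slide's) makes the point concrete: hold a tiny effect fixed and watch the p-value fall as n grows, under a one-sample normal approximation.

```python
# Sketch: with the effect size held fixed, significance is driven by n.
from statistics import NormalDist

def two_sided_p(d, n):
    z_stat = d * n ** 0.5                       # test statistic under the normal model
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

# A trivially small d = 0.05 becomes "significant" once n is large enough
for n in (25, 400, 10000):
    print(n, round(two_sided_p(0.05, n), 4))
```

The effect never changed; only the sample size did, which is why the p-value alone says nothing about importance or magnitude.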

What should we be doing?

Make sure we have looked hard enough for the difference: power analysis.
Figure out how big the thing we are looking for is: effect size.

Effect Size

There are different ways to speak about the relationship between variables, but in general, effect size refers to practical, rather than statistical, significance.
This is what we are really interested in: no one cares about the statistical particulars if the effect is real and will change the way we think about things and how we act.
However, the effect size, like our other measures, varies from sample to sample. If we ran a study 5 times, we would get 5 different effect sizes.
So while we are primarily interested in effect size, we need to be cautious in our interpretation there too, and use the other available evidence to come to our final conclusions.

Calculating effect size

Different statistical tests have different effect sizes developed for them, but the general principle is the same: effect size refers to the magnitude of the impact of the independent variable (factor) on the outcome variable.

Thinking about effect size again: the d family and the r family

d family: focused on standardized mean differences.
Standardization allows comparison across samples and variables with differing variances (the logic of z scores).
Note that sometimes there is no need to standardize, when the units of the scale have inherent meaning.
r family: variance-accounted-for measures.
The amount of variance explained versus the total.

Example: Cohen's d – Differences Between Means

Used with the independent-samples t test.
Cohen initially suggested one could use either sample standard deviation, since both should equal the population value under our assumptions. In practice, people now use the pooled standard deviation.
Variations exist for control-group settings, dependent samples, more than two groups, and so on, but the notion of a standardized mean difference is the same.

Cohen's d – Differences Between Means

Relationship to t: d = t · √(1/n1 + 1/n2).
Relationship to r_pb: r_pb = d / √(d² + 1/(pq)), where p and q are the proportions of the total N that each group makes up. For equal groups, p = q = .5.
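A minimal sketch of these computations (function names are my own), using the pooled-SD form of d and the standard conversions to t and the point-biserial r:

```python
# Sketch: Cohen's d from two independent samples, with its textbook
# links to t and the point-biserial correlation.
from math import sqrt

def cohen_d(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # pooled SD: combined sums of squares over pooled df
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp = sqrt((ssx + ssy) / (nx + ny - 2))
    return (mx - my) / sp

def d_from_t(t, n1, n2):
    # d = t * sqrt(1/n1 + 1/n2)
    return t * sqrt(1 / n1 + 1 / n2)

def r_pb_from_d(d, p, q):
    # r_pb = d / sqrt(d^2 + 1/(p*q)); p, q are each group's share of N
    return d / sqrt(d ** 2 + 1 / (p * q))

a = [5, 6, 7, 8, 9]
b = [3, 4, 5, 6, 7]
print(round(cohen_d(a, b), 3))   # ~ 1.265
```

For equal groups (p = q = .5) the conversion reduces to the familiar r = d / √(d² + 4).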

Characterizing effect size

Cohen emphasized that interpreting effects requires the researcher to think narrowly, in terms of the specific area of inquiry.
Evaluating effect sizes inherently requires a personal value judgment regarding the practical or clinical importance of the effects.
Even though rules of thumb exist, use them only as a last resort, and be wary of "mindlessly invoking" such criteria.

Association

A measure of association describes the amount of covariation between the independent and dependent variables.
It is expressed in an unsquared metric or a squared metric: the former is usually a correlation, the latter a variance-accounted-for effect size.
Such measures apply to continuous data (r and R²), categorical predictors with a continuous DV (η²), and strictly categorical settings (e.g., φ).
Again the notion is the same: a measure of linear association which, when squared, gives the proportion of variance in the DV that can be accounted for by the predictor.
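As an illustration of the variance-accounted-for idea, here is a small sketch (my own) computing η² = SS_between / SS_total for a one-way layout:

```python
# Sketch: eta-squared as the between-groups share of total variability.
def eta_squared(groups):
    all_scores = [v for g in groups for v in g]
    grand = sum(all_scores) / len(all_scores)
    ss_total = sum((v - grand) ** 2 for v in all_scores)
    # each group's mean deviation from the grand mean, weighted by its n
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2
                     for g in groups)
    return ss_between / ss_total

groups = [[2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(round(eta_squared(groups), 3))   # 0.9: group membership accounts
                                       # for 90% of the variance in the DV
```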

Case-level effect sizes for group differences

Indexes such as Cohen's d and η² estimate effect size at the group or variable level only. However, it is often of interest to estimate differences at the case level.
Case-level indexes of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point.
Examples: Cohen's U statistics, the common language effect size, tail ratios.
Reference points can be relative (e.g., a certain number of standard deviations above or below the mean of the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test).
Note that all three effect size types applicable to the group-difference setting are interconvertible; which one we use for communication is a matter of preference.
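A sketch (assuming normal distributions with equal variance; function names are mine) of two such case-level indexes, the common language effect size and a tail ratio:

```python
# Sketch: case-level indexes under a normal model with equal variances.
from statistics import NormalDist

def common_language_es(d):
    """P(a randomly drawn score from group 1 exceeds one from group 2),
    given a standardized mean gap of d."""
    return NormalDist().cdf(d / 2 ** 0.5)

def tail_ratio(d, z_cut):
    """Ratio of the groups' proportions beyond a cutoff placed z_cut
    SDs above the lower group's mean."""
    z = NormalDist()
    upper = 1 - z.cdf(z_cut - d)   # higher group's tail beyond the cutoff
    lower = 1 - z.cdf(z_cut)       # lower group's tail beyond the cutoff
    return upper / lower

print(round(common_language_es(0.5), 3))   # ~ 0.638 for a "medium" d
```

So a d of 0.5, abstract at the group level, becomes "a randomly chosen member of group 1 beats a randomly chosen member of group 2 about 64% of the time" at the case level.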

Confidence Intervals for Effect Size

Effect size statistics such as Cohen's d and η² have complex (noncentral) distributions.
The general form, however, is the same as for any CI: a point estimate plus or minus a margin of error.

Confidence Intervals for Effect Size

Traditional methods of interval estimation rely on approximate standard errors and assume large sample sizes.
We need a computer program to find the correct noncentrality parameters for calculating exact confidence intervals for effect sizes.
Both standalone programs (Steiger's) and statistical packages (R) can do this for us, and thus provide a measure of effect while acknowledging the uncertainty in its estimate.
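Exact noncentral-t intervals require the specialized software just mentioned, but as an illustrative stand-in, a percentile bootstrap interval for d can be sketched with nothing beyond the standard library (function names and data are hypothetical):

```python
# Sketch: percentile bootstrap CI for Cohen's d -- a simple alternative
# to the exact noncentral-t intervals described in the slides.
import random
from math import sqrt

def cohen_d(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sp = sqrt((sum((v - mx) ** 2 for v in x) +
               sum((v - my) ** 2 for v in y)) / (nx + ny - 2))
    return (mx - my) / sp

def bootstrap_ci_d(x, y, reps=5000, level=0.95, seed=1):
    rng = random.Random(seed)          # fixed seed for reproducibility
    # recompute d on resamples drawn with replacement from each group
    ds = sorted(cohen_d([rng.choice(x) for _ in x],
                        [rng.choice(y) for _ in y]) for _ in range(reps))
    lo = ds[int(reps * (1 - level) / 2)]
    hi = ds[int(reps * (1 + level) / 2)]
    return lo, hi

x = [12, 14, 11, 15, 13, 16, 12, 14, 15, 13]
y = [10, 11, 9, 12, 10, 13, 11, 9, 12, 10]
print(bootstrap_ci_d(x, y))
```

The width of the interval is the point of the slide: it reports the effect while making the uncertainty of its estimate explicit.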

Limitations of effect size measures

Variability across samples: no more a limitation than for other statistics, but one needs to be fully aware of it. Just because you found a moderate effect doesn't mean there is one.
Standardized mean differences: heterogeneity of within-conditions variances across studies can limit their usefulness; the unstandardized contrast may be better in this case.
Measures of association: correlations can be affected by sample variances and by whether the samples are independent, the design is balanced, or the factors are fixed.
They are also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections).
Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance.

Limitations of effect size measures

How to fool yourself with effect size estimation:
1. Measure effect size only at the group level.
2. Apply generic definitions of effect size magnitude without first looking to the literature in your area.
3. Believe that an effect judged "large" by generic definitions must be an important result, and that a "small" effect is unimportant.
4. Ignore the question of how theoretical or practical significance should be gauged in your research area.
5. Estimate effect size only for statistically significant results.

Limitations of effect size measures

6. Believe that finding large effects somehow lessens the need for replication.
7. Forget that effect sizes are subject to sampling error.
8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study.
9. Forget that standardized effect sizes encapsulate other quantities, such as the unstandardized effect size, error variance, and experimental design.
10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published.

Recommendations

Report effect sizes along with statistical significance.
Report confidence intervals.
Use graphics.
Use common sense combined with theoretical considerations.
Do not rely on any one result to support your conclusions.