Statistical power
Learning outcomes:
- Distinguish between different types of error in statistical inference
- Define effect sizes for common statistical tests
- Describe the principles of statistical power analysis

Hypothesis-testing
- Hypothesis-testing is a process designed to evaluate competing explanations for a phenomenon
- Based on the work of Fisher and Neyman-Pearson
- Influenced by Karl Popper

Fisher
- Gave us ANOVA and various other statistical tests
- Statistician in the field of agronomy; for example, the effect of fertiliser on wheat
- Argued for NHST
- Constructed testable and specific null hypotheses
- Suggested a 1 in 20 (p < 0.05) threshold for rejecting the null hypothesis

Neyman-Pearson
- Fisher had a long-standing feud with Karl Pearson, who worked in the same field
- Pearson coined the term "standard deviation" and introduced tests of correlation and the chi-square test
- Pearson's son Egon collaborated with Jerzy Neyman on hypothesis-testing
- They proposed a decision between the null and the alternative (experimental or research) hypothesis
- Appropriate in process-control settings

Null-hypothesis significance testing (NHST)
- We wish to test an experimental hypothesis; for example, cognitive-behavioural treatment for chronic pain is more effective than a control
- We obtain a random sample of participants
- We construct a null hypothesis that the two population means are identical (Note: this is just an example)
- We measure the difference between the experimental and control groups
- We calculate the probability of finding the measured difference between groups if the null hypothesis were true
- If this probability falls below a set threshold (usually 1 in 20, p < .05), we reject the null hypothesis
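To make the decision procedure concrete, here is a minimal sketch in Python. The group names, means, SDs and sample sizes are invented purely for illustration, not taken from any real study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: pain scores after treatment (lower = better).
# The population parameters here are invented for illustration.
cbt_group = rng.normal(loc=45, scale=10, size=30)      # cognitive-behavioural treatment
control_group = rng.normal(loc=50, scale=10, size=30)  # control condition

# Independent-samples t test of the null hypothesis of equal population means
t_stat, p_value = stats.ttest_ind(cbt_group, control_group)

# Decision rule: reject the null hypothesis if p falls below the 1-in-20 threshold
alpha = 0.05
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject the null hypothesis")
```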

How do we interpret p?
You run a t test, which gives you a p value of 0.43. Which is the correct interpretation?
(a) Given the data, the probability that the null hypothesis is true is 0.57
(b) Given the data, the probability that the experimental hypothesis is true is 0.57
(c) Given the data, the probability that the experimental hypothesis is true is 0.43
(d) If the null hypothesis is true, then the probability of getting these results is 0.43
(e) If the null hypothesis is false, then the probability of getting these results is 0.57
(f) If the null hypothesis is false, then the probability that the experimental hypothesis is true is 0.43

The correct answer is: …..
Misinterpretation of hypothesis-testing is widespread and pervasive

Non-rejection of the null hypothesis
We now know about the rejection of the null hypothesis:
- if the null hypothesis is true and the probability of getting these results is ≤ .05, the null hypothesis is rejected
But what happens if p > .05? Do we ...
- accept the null hypothesis?
- neither accept nor reject the null hypothesis?
A statistically non-significant result does not allow us to make a decision

Hypothesis-testing in applied psychology
- IQ scores linked to height
- 14,000 children tested
- Controlled for age and sex, socio-economic status, birth order and family size
- Statistically significant result
- Authors speculate that smaller children may be treated as less mature
- What additional information do we need in order to evaluate the importance of this claim?

Cohen (1990)
- With 14,000 cases a correlation of only 0.0278 is statistically significant (p < 0.001)
- This accounts for only 0.077% of the variance in the sample
- We need to understand the effect size
- Effect size (ES) is a measure of the magnitude of an effect, often expressed as the amount of variance explained by a test result
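Cohen's point is easy to verify: the t statistic for testing a correlation is t = r√((n − 2)/(1 − r²)), so with n = 14,000 even r = 0.0278 clears the significance threshold while explaining almost nothing. A quick check in Python (a sketch reproducing the slide's numbers, not the original analysis):

```python
import math
from scipy import stats

r, n = 0.0278, 14000

# t statistic for testing H0: population correlation = 0
t = r * math.sqrt((n - 2) / (1 - r**2))
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p value

print(f"t = {t:.2f}, p = {p:.4f}")     # t is about 3.3, p is about .001
print(f"variance explained: r^2 = {r**2:.5f} ({100 * r**2:.3f}%)")  # about 0.077%
```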

Close your eyes and ... Imagine that you are a counseling psychologist. You are interested in the effectiveness of an intervention aimed at raising the morale of carers working with older adults. You compare morale scores of a control group with a group which has received the intervention. What possible outcomes are there?

Possible outcomes (1)
There is a statistically significant difference between the groups and we reject the null hypothesis:
(i) This conclusion is correct because in reality your intervention is beneficial
(ii) This conclusion is incorrect and your intervention is no more effective than the control treatment

Possible outcomes (2)
There is no statistically significant difference between the groups and we do not reject the null hypothesis:
(i) This conclusion is correct because in reality the intervention is no more effective than the control treatment
(ii) This conclusion is incorrect because your intervention is more effective than the control, but your study did not detect it

What kinds of error are 1(ii) and 2(ii)? Which kind of error is the more important one to avoid in psychological research?
- 1(ii) is termed a Type I error
- 2(ii) is termed a Type II error

Type I and Type II errors
- Type I and Type II errors are mutually exclusive
- The probability of making a Type I error is referred to as alpha (α)
- The probability of making a Type II error is referred to as beta (β)
- A goal of good research is finding a way to reduce the probability of making a Type I and a Type II error
- How can we reduce these errors?

Reducing Type I errors
- Reduce the rejection level (alpha), for example from .05 to .01
- There are cases when this is justified (e.g. when you are running a large number of tests)
- But generally it is not a good idea

Reducing Type II errors
Beta (β) depends on a number of factors:
- Alpha (α): if we reduce the probability of making a Type I error, the probability of making a Type II error increases
- Sample size: an increase in sample size reduces the probability of a Type II error
- Effect size: the greater the experimental effect, the lower the probability of making a Type II error

Power
- The probability of rejecting the null hypothesis when it is false is 1 − β, and this is referred to as the power of the study: the probability that you will detect an experimental effect
- If our study has a large effect size then this will increase its power
- Standardised effect sizes are independent of sample size; example: the number of standard deviations by which two means differ
- What measures of effect size are there?
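The sketch below (Python, using the statsmodels power routines) illustrates how the factors from the previous two slides play out for an independent-samples t test: power rises with sample size and effect size, and falls when alpha is made stricter. The specific values of d, n and alpha are illustrative:

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# Power of an independent-samples t test as a function of effect size
# (Cohen's d), per-group sample size (nobs1) and alpha
for d in (0.2, 0.5, 0.8):          # small, medium, large effects
    for n in (20, 50, 100):        # participants per group
        for alpha in (0.05, 0.01):
            power = power_calc.power(effect_size=d, nobs1=n, alpha=alpha)
            print(f"d={d}, n={n} per group, alpha={alpha}: power={power:.2f}")
```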

Measures of effect size
(i) t test for difference between two means: Cohen's d, r
(ii) Pearson's correlation: r² or r
(iii) chi-square: w
(iv) ANOVA: η², f, ω², ε²
(v) multiple regression: R²
(vi) logistic regression: odds ratio, R²L

How do you calculate an effect size?
Difference between means: d = (mean(A) − mean(B)) / s
So imagine that you have compared the means of two groups on IQ scores. One group scores 90 and the other group 100. What is the value of d? (Remember that IQ has a SD of 15.) What does this value actually mean?
From a t-test result: r = √(t² / (t² + df)), with t and df taken from the t-test output
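A worked version of the two formulas in Python (the IQ numbers are those given on the slide; the t and df values are invented to illustrate the second formula):

```python
import math

# Cohen's d for the IQ example: two group means 10 points apart, s = 15
mean_a, mean_b, s = 100, 90, 15
d = (mean_a - mean_b) / s
print(f"d = {d:.2f}")   # 0.67: the means differ by two thirds of a standard deviation

# r from a t-test result: r = sqrt(t^2 / (t^2 + df))
t, df = 2.5, 38         # illustrative values, not from the slide
r = math.sqrt(t**2 / (t**2 + df))
print(f"r = {r:.2f}")
```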

Correlation and multiple regression
For correlation, effect size is given by r² or r. r² (×100) is very informative because it tells you the amount of variability in one variable that is attributable to the other variable. E.g. suppose we correlate height and weight and find r = +0.6; therefore r² = 0.36, which is to say that for our sample 36% of the variability in our participants' weight is attributable to their height. A similar conclusion can be drawn about R² in multiple regression.

ANOVA 2 = SSFactor 1/SSTotal = 61.3/118.0 df SS MS F p Factor 1 1 61.3 37.354 37.354 .0001 Factor 2 2 5.6 2.8 1.697 NS F1 x F2 2 11.467 5.733 3.475 .0473 Error 24 39.6 1.65 Total 29 118.0 2 = SSFactor 1/SSTotal = 61.3/118.0   = 0.52 or (x100) 52% of the variability in scores  Cohen uses f, a measure in terms of SDs (like d) (not to be confused with F); f = sqrt(2 /(1 - 2 )) and 2 = f2 / (1 + f2) 2 is less biased, but conservative (and a more accurate measure of effect size for ANOVA): 2 = (SSFactor - (k - 1)MSerror)/(SSTotal + MSerror), k = number of levels of the independent variable 2 is an approximately unbiased estimate of the (population) parameter for proportion of explained variance: 2 = (SSFactor - (k - 1)MSerror)/SSTotal

Statistical power
- Cohen (1962) reviewed published research and found that the average power in reported studies was only .48 (i.e. only a 48% chance of successfully rejecting the null hypothesis), given a medium effect size
- Sedlmeier and Gigerenzer (1989), in an updated review, claimed that the situation was becoming worse, not better
- Maxwell (2004) found that psychological research studies remain underpowered
- Cohen (1992) offered some simple guidelines for calculating sample size, aimed at producing studies with a power of .80

Prospective statistical power (a priori power analysis)
- We can use our calculation of effect size to predict how many participants we need to test
- Suppose that we are interested in examining the IQ of two different groups (A and B) and we predict that the difference will be 10 IQ points. We know that σ = 15. How many participants do we need in each group?
- First we need to calculate the effect size

What is the effect size?
Cohen (1992) helps us out here:

Effect size   small   medium   large
d             .20     .50      .80

so an effect size of d = 10/15 = .67 is between a medium and a large effect

We can now use this to work out how many participants we need in our sample (α = .05):

Effect size     small   medium   large
t test, n       393     64       26

so we need 64 in each group to have a power of .80 for a medium effect. More accurate is to conduct a power analysis using (a) power tables (e.g. Cohen, 1988; Clark-Carter, 1997, 2004, 2010) or (b) power-analysis software (SamplePower, G*Power); see the sketch below.
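The same numbers can be obtained programmatically; a minimal sketch using the statsmodels power routines (G*Power gives equivalent results):

```python
import math
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# Per-group n for a medium effect (d = .50), alpha = .05, power = .80
n_medium = power_calc.solve_power(effect_size=0.50, alpha=0.05, power=0.80)
print(f"d = .50: {math.ceil(n_medium)} participants per group")  # 64, matching Cohen's table

# Per-group n for the IQ example (d = .67); a larger effect needs fewer participants
n_iq = power_calc.solve_power(effect_size=10 / 15, alpha=0.05, power=0.80)
print(f"d = .67: {math.ceil(n_iq)} participants per group")      # roughly half as many
```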

But how do we know what our effect size will be for a study that has not yet been conducted?
- examine previous research as a guide to effect sizes
- calculate an effect size from a pilot study
- if all else fails, decide beforehand what effect size you want to detect, based on conventions for effect sizes (Cohen, 1988)

Retrospective statistical power (post-hoc power analysis)
- Sometimes our studies do not yield any significant results
- We can determine how likely a significant effect would have been, given our number of participants and a hypothesised psychologically meaningful effect size (i.e. a difference which researchers/practitioners would be interested in)
- This will tell us whether it is worthwhile to run an additional study
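A sketch of the same calculation run in the other direction: given the sample size actually tested and a hypothesised meaningful effect size, compute the power the study had. The scenario (20 participants per group, smallest interesting effect d = .50) is invented for illustration:

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical scenario: 20 participants per group were tested, and
# d = .50 is the smallest difference practitioners would care about
achieved_power = TTestIndPower().power(effect_size=0.50, nobs1=20, alpha=0.05)

print(f"power = {achieved_power:.2f}")  # roughly .34: a real medium-sized effect
                                        # would be missed about two times in three
```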

Alternatives
- Counter-null statistic (Rosenthal & Rubin, 1994): the non-null magnitude of effect size that has the same p value as the null value of the effect size; related to effect size
- p_rep (Killeen, 2005): the probability of replicating the obtained result; related to NHST
- p intervals (Cumming, 2008): show the unreliability of p values and provide another justification for the use of confidence intervals

Alternatives (2)
- (Power analysis for) minimum-effect tests (Murphy & Myors, 1999; Murphy et al., 2009): the nil hypothesis is almost always wrong; minimum-effect tests are alternatives to traditional hypothesis-testing and test the hypothesis that treatment effects are negligible, using one-stop tables or a one-stop calculator
- Magnitude-based inference (Batterham & Hopkins, 2006): takes into account the smallest important effect in making inferences; uses qualitative descriptors in inference; mechanistic and clinical (practical) inference

Preparation for next week
Study statistical power:
- Lecture notes
- Further reading (see module guide)
- Practical exercises

Summary
- Stated the characteristics of null-hypothesis significance testing (NHST)
- Defined Type I and Type II errors
- Stated the factors affecting statistical power
- Summarised the major measures of effect size
- Illustrated prospective power analysis
- Summarised alternatives to NHST

Bibliography
Alternatives
Batterham, A. M., & Hopkins, W. G. (2006). Making meaningful inferences about magnitudes. International Journal of Sports Physiology and Performance, 1(1), 50-57.
Buchheit, M. (2016). The numbers will love you back in return—I promise. International Journal of Sports Physiology and Performance, 11, 551-554.
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300.
Hopkins, W. G., & Batterham, A. M. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Medicine, 46(10), 1563-1573. doi:10.1007/s40279-016-0517-x
Killeen, P. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345-353.
Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84(2), 234-248.
Murphy, K. R., Myors, B., & Wolach, A. H. (2009). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (3rd ed.). London: Routledge.
Rosenthal, R., & Rubin, D. (1994). The counternull value of an effect size. Psychological Science, 5, 329-334.
van Schaik, P., & Weston, M. (2016). Magnitude-based inference and its application in user research. International Journal of Human-Computer Studies, 88, 38-50. doi:10.1016/j.ijhcs.2016.01.002

Bibliography (continued)
Power analysis and effect size
Clark-Carter, D. (2009). Quantitative psychological research: The complete student's companion. Hove: Psychology Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.
Hopkins, W. (2006). Estimating sample size for magnitude-based inferences. Sportscience, 10, 63-70.
Jaccard, J. (1998). Interaction effects in factorial analysis of variance. Thousand Oaks, CA: Sage.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163.
Murphy, K., Myors, B., & Wolach, A. (2009). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (3rd ed.). London: Routledge.