1
Null Hypothesis Significance Testing
Consider the general approach and associated problems
2
Some thoughts
“Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution” (Schmidt & Hunter, 1997, p. 37).
“The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology” (Meehl, 1978, p. 817).
“A potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring” (Meehl again).
Cohen (1994) suggested that Statistical Hypothesis Inference Testing produces a more appropriate acronym.
What is NHST? What isn’t it? And what is the problem?
3
What is hypothesis testing about?
Using an inferential procedure to examine the credibility of a hypothesis about a population.
We start with a research question, develop specific hypotheses to test, collect the data, and then use statistical analysis to test them.
But what exactly is this analysis we use?
4
How is NHST made possible?
The sampling distribution tells us the degree of variability to expect with regard to some statistic.
We can then see whether our sample statistic differs by more than the random error we would expect when sampling from a population with a particular value (point estimate) for that statistic.
Example: is the mean SAT score of students from this school all that different from the national average? [1]
[1] The answer is yes. Which direction do you think that difference goes?
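To make the idea of a sampling distribution concrete, here is a small R sketch. The numbers (national mean 1050, SD 200, a school sample of 40 students averaging 1130) are invented for illustration and are not from the slides; the point is only to show how far sample means wander by chance alone.

```r
# Hypothetical values, assumed for illustration only.
set.seed(42)
national_mean <- 1050
national_sd   <- 200
n             <- 40
school_mean   <- 1130

# Simulate the sampling distribution of the mean for samples of size n
# drawn from a population whose mean equals the national average.
sim_means <- replicate(10000, mean(rnorm(n, national_mean, national_sd)))

# The spread of that distribution should be close to sd / sqrt(n).
se_theory <- national_sd / sqrt(n)
se_sim    <- sd(sim_means)

# How often does sampling error alone produce a mean at least this far
# from the national average (in either direction)?
prop_as_extreme <- mean(abs(sim_means - national_mean) >=
                          abs(school_mean - national_mean))

c(se_theory = se_theory, se_sim = se_sim, prop_as_extreme = prop_as_extreme)
```

With these made-up numbers the school mean sits well out in the tail of the simulated distribution, which is the sense in which it "varies greatly" from what sampling error would produce.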
5
Two approaches, many problems
Psychology and other social and behavioral sciences have actually taken two approaches that were developed independently and put them together in a way that leads to many problems in interpreting results.
One approach is data driven (Fisher).
The other is design driven (Neyman & Pearson).
6
Fisher vs. Neyman: for heavyweight stats champion thingy
7
Fisher’s approach
1) State the research question
2) State the null hypothesis
3) Construct a sampling distribution based on the null hypothesis and calculate the test statistic
4) Note the associated probability of obtaining that test statistic, i.e. p(D|H0)
5) Use the p-value to ascertain whether you will reject the null or come to no conclusion
Example:
1. Do guys and gals differ in their assessment of John McCain’s ‘electability’?
2. The difference in mean electability ratings is zero: μmale - μfemale = 0
3. Depends on sample size, but the t-distribution would be appropriate
4. You found a difference. What is the probability of coming up with a difference at least that size if you were expecting no difference?
5. Based on that observed p-value, do you reject or fail to reject?
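The five steps can be sketched in R as follows. The ratings are simulated and entirely hypothetical (group sizes, means and SDs are assumptions, not data from the slides); the sketch only shows where the test statistic and p(D|H0) come from.

```r
# Hypothetical electability ratings; all values are assumed for illustration.
set.seed(1)
male   <- rnorm(30, mean = 5.0, sd = 1.5)
female <- rnorm(30, mean = 5.6, sd = 1.5)

# Steps 2-3: null of no mean difference, t-distribution as the
# sampling distribution, pooled-variance t statistic.
fit   <- t.test(male, female, var.equal = TRUE)
t_obs <- unname(fit$statistic)
df    <- unname(fit$parameter)

# Step 4: p(D|H0), the two-tailed probability of a t at least this extreme
# if the null were true. Computed by hand to show what t.test reports.
p_obs    <- fit$p.value
p_manual <- 2 * pt(-abs(t_obs), df)

c(t = t_obs, df = df, p = p_obs, p_manual = p_manual)
```

Step 5 is then a judgment on `p_obs`: reject, or come to no conclusion.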
8
P-value
Note that the p-value in this sense is used as a measure of disbelief in the null hypothesis, [1] despite the fact that it is not the probability of the null hypothesis
  It is p(D|H0), not p(H0|D)
However, it is confounded with sample size and so cannot be considered sufficient evidence against the null by itself
  All else being equal (the same nonzero observed effect), increasing N decreases the p-value
  Central Limit Theorem
Unfortunately it is often interpreted without regard to sample size, unless of course the sample size is small
  “It would have been significant, I swear!”
[1] If it sounds like a Bayesianesque interpretation to you, you are not alone.
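A quick way to see the sample-size confound is to hold the observed effect fixed and vary N. The sketch below assumes a standardized mean difference of 0.3 between two equal-sized groups (an arbitrary value chosen for illustration) and computes the two-sample t and its p-value directly from the formulas.

```r
# Assumed standardized difference between two groups (illustrative only).
d  <- 0.3
ns <- c(10, 50, 200, 1000)          # per-group sample sizes

# For equal groups with unit SD, t = d / sqrt(2 / n), df = 2n - 2.
t_stat <- d / sqrt(2 / ns)
p_val  <- 2 * pt(-abs(t_stat), df = 2 * ns - 2)

data.frame(n_per_group = ns, t = round(t_stat, 2), p = signif(p_val, 2))
```

The effect never changes, yet the verdict flips from "not significant" to "significant" somewhere between 50 and 200 per group.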
9
Why “fail to reject”? Why not just accept the null hypothesis?
Null hypothesis value is precise (in this case zero), but also arbitrary
Lots of data might be apparently consistent with H0
For example, if you obtained a difference of 2 and it was not large enough to reject no difference, how do you know that the true population difference isn’t 1, or any of the infinitely many other values near zero? [1]
However, you do see researchers accepting the null hypothesis and then drawing conclusions from this
  Not the way to do things according to Fisher or logic
[1] As Tukey has pointed out, it’s never zero.
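The point about arbitrariness can be checked numerically: a result that fails to reject zero usually fails to reject plenty of other values too. The standard error and degrees of freedom below are assumptions picked to match the "difference of 2, not significant" scenario.

```r
# Observed difference from the slide's example; se and df are assumed.
obs_diff <- 2
se       <- 1.5
df       <- 48

# Test the same data against several candidate null values, not just zero.
null_values <- c(0, 1, 2, 3, 4)
t_stat <- (obs_diff - null_values) / se
p_val  <- 2 * pt(-abs(t_stat), df)

data.frame(null_value = null_values, t = round(t_stat, 2), p = round(p_val, 3))
```

None of these nulls is rejected at .05, so "accepting" zero in particular is no better supported than accepting 1, 3 or 4.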
10
Neyman and Pearson’s approach
1) State the research question
2) State the null hypothesis and the alternative hypothesis [1]
3) Construct a sampling distribution based on the null hypothesis and locate the region of rejection (i.e. find the critical value on your table)
4) Calculate the test statistic and see where it falls along the distribution
5) Reach a decision to reject the null or retain the null based on whether the test statistic falls in the region of rejection [2]
Example:
1. Do guys and gals differ in their assessment of John McCain’s ‘electability’?
2. H0: μmale = μfemale; H1: not H0 [3]
3. Design the study
   A. Now we have to design our study (i.e. determine sample size) based on a set α level and desired power, e.g. α = .05, power = .80 (i.e. β = .20), that reflect the error rates we want to maintain. Note that this is done before data is collected!
   B. What is the associated critical value/region of rejection?
4. You found a difference and in this case have an observed t-statistic
5. Is your observed t more extreme than the critical t value (i.e. does it fall into the region of rejection)? YES? Reject the null. NO? Act as if the null were true.
[1] Recall that this was not present in the Fisher approach.
[2] It doesn’t matter what the actual p-value is.
[3] Note that this is a two-tailed or non-directional alternative hypothesis.
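A minimal sketch of the design-first workflow in R, using the slide's α = .05 and power = .80. The effect size to detect (a standardized difference of 0.5) is an assumption added here, since a power calculation needs one; `decide` is a hypothetical helper name.

```r
# Step 3A: fix the error rates and the smallest effect worth detecting
# BEFORE collecting data. delta = 0.5 is an assumed value.
alpha  <- 0.05
design <- power.t.test(delta = 0.5, sd = 1, sig.level = alpha, power = 0.80,
                       type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(design$n)

# Step 3B: the region of rejection for a two-tailed test.
df     <- 2 * n_per_group - 2
t_crit <- qt(1 - alpha / 2, df)

# Step 5: the decision depends only on which side of t_crit we land.
decide <- function(t_obs) {
  if (abs(t_obs) > t_crit) "reject H0" else "act as if H0 is true"
}

n_per_group
t_crit
decide(2.5)
decide(1.0)
```

Note that no p-value appears anywhere in the decision rule; everything is fixed by the design.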
11
Key differences
Fisher
  No alternative hypothesis, no talk of alpha/beta, decision based on the observed p
  Level of significance: the p-value
  How determined
    Early Fisher: set some acceptable standard, say .05
    Later: state the exact level as a communication to researchers [1]
  Epistemic interpretation about the likelihood of the null hypothesis (how much do we believe the null is false); p is a property of the data
  Non-significant result: do nothing; you can’t prove the null, only falsify it (to some extent)
N-P
  Alternative hypothesis, alpha/beta/power concerns that determine the design of the study; decision based on whether the observed t-statistic falls into the region of rejection or doesn’t (if not, accept the null) [2]
  ‘Level of significance’: α
    Must be set before the experiment to interpret it as a long-run frequency of error (type I)
    So now that we have this new sort of thing to worry about (α), how do we make it more confusing? Set the standard level at… .05.
  Behavioristic interpretation (reject or don’t) that refers to repeated experimentation; p is a property of the test/design
    The actual observed p-value doesn’t matter; our statistic either falls in our region of rejection or doesn’t
  Non-significant result: accept the null
[1] Anyone else find it odd that people don’t follow the directions of the manufacturer? Fisher: “Here’s a car, drive it, but make sure it’s got gas”. Psych research: “Yeah, whatever.”
[2] The t-statistic falling into the region of rejection is the same as saying the observed p-value is < α. However, again, it doesn’t matter what that p-value is: whether it’s .049 or far smaller, you reject, and if you play by the rules, at .051 you don’t. And if you don’t play by these rules, you defeat the entire purpose of using this approach.
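The claim that "t falls in the region of rejection" and "p < α" are the same decision is easy to verify. The sketch below uses an arbitrary set of observed t values and an assumed 58 degrees of freedom.

```r
alpha  <- 0.05
df     <- 58                               # assumed for illustration
t_crit <- qt(1 - alpha / 2, df)            # two-tailed critical value

# A handful of made-up observed t statistics, some near the boundary.
t_obs <- c(0.5, 1.9, 2.0, 2.1, 4.0)
p_obs <- 2 * pt(-abs(t_obs), df)

# N-P rule (region of rejection) vs. comparing the p-value with alpha.
reject_by_region <- abs(t_obs) > t_crit
reject_by_p      <- p_obs < alpha

data.frame(t = t_obs, p = signif(p_obs, 3), reject_by_region, reject_by_p)
```

Both columns agree row by row. What differs between the two schools is not the arithmetic but what the number is taken to mean: under N-P, a t of 2.1 and a t of 4.0 lead to exactly the same outcome.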
12
Probability: Fisher
What the probability obtained tells us:
  If the null hypothesis were true, this is the probability of obtaining a sample statistic of the kind observed: P(D|H0)
What we want it to mean: P(H0|D)
  We want the p-value to be a probability about a hypothesis
  Some probability of H0 conditional on the data
13
Probability: N-P

                      State of the World
Research Decision     H0 true               H0 false
Reject H0             Type I error          Correct rejection
Accept H0             Correct acceptance    Type II error

                      State of the World
Research Decision     H0 true               H0 false
Reject H0             p = α                 p = 1 - β
Accept H0             p = 1 - α             p = β
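The four cells are long-run frequencies, which a simulation can make tangible. The sketch below assumes two groups of 20, unit SD, and a true standardized difference of 0.5 when H0 is false; these values are illustrative only.

```r
set.seed(123)
alpha <- 0.05
n     <- 20      # per group, assumed
reps  <- 2000

# Column "H0 true": both groups come from the same population.
p_null     <- replicate(reps, t.test(rnorm(n), rnorm(n), var.equal = TRUE)$p.value)
type1_rate <- mean(p_null < alpha)           # should be close to alpha

# Column "H0 false": true difference of 0.5 SD (assumed effect).
p_alt      <- replicate(reps, t.test(rnorm(n, 0.5), rnorm(n), var.equal = TRUE)$p.value)
power_sim  <- mean(p_alt < alpha)            # estimates 1 - beta
type2_rate <- 1 - power_sim                  # estimates beta

# Analytic power for the same design, for comparison.
power_theory <- power.t.test(n = n, delta = 0.5, sd = 1, sig.level = alpha)$power

c(type1_rate = type1_rate, power_sim = power_sim,
  type2_rate = type2_rate, power_theory = power_theory)
```

With a design this small the type II error rate dwarfs the type I rate, which is the usual situation the slides complain about later.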
14
Psychology today- the hybrid
“A mishmash of Fisher and Neyman-Pearson, with invalid Bayesian interpretation”
Fisher and N-P interpretations of the p-value, incorrect inferences about the probabilities of hypotheses or error rates, dogmatic approach to scientific investigation
Luckily this is changing, though slowly
15
Cohen’s take: What’s wrong with NHST?
Problems!
  P(D|H0) ≠ P(H0|D)
  Belief in replicability
  The Nil Hypothesis
16
The ‘permanent illusion’
The logic of NHST can be difficult to grasp
Deductive side (conditional reasoning):
  If the null hypothesis is true, this data would not occur
  The data has occurred
  The null hypothesis is false
This is valid by denying the consequent (modus tollens)
Unfortunately this is not how hypothesis testing takes place
Gives us the illusion of probabilistic “proof”
17
Hypothesis Testing In Its True Form
  If the null hypothesis is true, this data would be unlikely
  The data has occurred
  The null hypothesis is false
The problem is that we make the first statement probabilistic, and that changes everything
18
Hypothesis Testing
  If a person is an American, then he is not a member of Congress (FALSE)
  This person is a member of Congress
  Therefore, he is not an American
This is a valid argument, but its conclusion is untrue because the first premise is false
  If a person is an American, then he is probably not a member of Congress (TRUE)
  This person is a member of Congress
  Therefore, he is not an American
This is the form of hypothesis testing we undertake, and though the first statement is now true, the argument is logically invalid
19
The permanent illusion
This illusion of probabilistic proof by contradiction reflects our desire to have a probabilistic statement regarding a hypothesis arising from the data
Unfortunately we only have the probability of the data given the truth of a null hypothesis
P(D|H0) ≠ P(H0|D)
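Bayes' theorem shows how far apart the two conditional probabilities can be. As a simplification, the sketch conditions on the event "the test came out significant" rather than on the exact data, so P(significant | H0) = α and P(significant | H1) = power. The prior probabilities of H0 are assumed values, and `posterior_h0` is a hypothetical helper name.

```r
# P(H0 | significant result) via Bayes' theorem.
# prior_h0, alpha and power are all assumed inputs.
posterior_h0 <- function(prior_h0, alpha = 0.05, power = 0.80) {
  p_sig_given_h0 <- alpha      # chance of a significant result if H0 is true
  p_sig_given_h1 <- power      # chance of a significant result if H0 is false
  (p_sig_given_h0 * prior_h0) /
    (p_sig_given_h0 * prior_h0 + p_sig_given_h1 * (1 - prior_h0))
}

posterior_h0(0.5)   # about 0.059 when H0 and H1 start out equally plausible
posterior_h0(0.9)   # 0.36 when H0 was very plausible to begin with
```

The same α = .05 is compatible with very different probabilities for the null, which is why the p-value alone cannot stand in for P(H0|D).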
20
The Bayesian Approach
Recall that we’ve described research in psychology as evidence gathering and presenting a case for others to take into consideration
Part of the evidence gathering should include reliance on the results of previous research
We can choose to do this explicitly in our analyses
21
The Bayesian Approach
Bayesian inference actually includes prior information to inform the present situation [1]
Although the priors may be largely “subjective”, the procedure can provide a probability for a hypothesis
And in fact it has been shown that we use this regularly in our daily lives
[1] I will provide supplemental notes for an example after we do some regular hype testing.
22
The Bayesian Approach
The gist of the Bayesian approach is that you end up with a probability regarding a hypothesis, one centered more on prior beliefs than on the frequency of outcomes
The key is that you have to think hard about what a viable alternative hypothesis would be, as well as the likelihood of either hypothesis
While this might be too subjective an approach for some (it was for Fisher), others feel it doesn’t make sense to go about things otherwise. [1]
[1] Thomas Bayes came up with his probability in the 18th century, and others were in that same neck of the woods of thinking even earlier. So while the approach may seem new to psychologists, it’s very old, and they are very typically behind the times.
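As a toy illustration of weighing two specific hypotheses, the sketch below compares H0: μ = 0 against one assumed alternative, H1: μ = 0.5, for a hypothetical observed mean of 0.3 from 25 cases with known SD 1. Every number here is an assumption chosen for the example, including the priors.

```r
# Two point hypotheses about a population mean (both values assumed).
mu    <- c(H0 = 0, H1 = 0.5)
prior <- c(H0 = 0.5, H1 = 0.5)       # equal prior belief, an assumption

# Hypothetical data summary.
obs_mean <- 0.3
n        <- 25
sigma    <- 1
se       <- sigma / sqrt(n)

# Likelihood of the observed mean under each hypothesis.
likelihood <- dnorm(obs_mean, mean = mu, sd = se)

# Posterior = prior x likelihood, renormalized to sum to 1.
posterior <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)
```

Changing `prior` (for instance to a sceptical `c(H0 = 0.9, H1 = 0.1)`) changes the answer, which is exactly the subjectivity the slide mentions.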
23
Replication
Another misinterpretation that comes from the confusion regarding the probability of a hypothesis: some think the probability of replication can be estimated with an NHST p-value
P(D|H0) does not imply anything about p(Replication) either
24
P(D|H0) Belief in the Law of Small Numbers (Kahneman & Tversky)
Some believe that significant results arising from presumably representative (though relatively small) samples are automatically strong findings that will most likely replicate [1]
The fact is that the significant result may be very unlikely to replicate
Also, just because we reject the null does not imply a theory is correct
  The statistical test rarely reflects the actual research idea [2]
[1] This holds for effect sizes also, which, just like statistical significance and everything else, are subject to sampling variability.
[2] I’ve always found it a bit odd to see theses and dissertations with a specific hypotheses section go on to test none of those hypotheses.
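How likely an exact replication is to reach significance depends on the power of the design, not on the original p-value. The sketch below assumes a small study (20 per group) and a few possible true effect sizes; all values are illustrative.

```r
# Assumed design of the original study and its exact replication.
n      <- 20                      # per group
true_d <- c(0.2, 0.5, 0.8)        # candidate true standardized effects

# Probability that a same-sized replication comes out significant
# is simply the power of the design at the true effect.
replication_prob <- sapply(true_d, function(d) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05)$power
})

data.frame(true_d, replication_prob = round(replication_prob, 2))
```

Even when the true effect is a respectable 0.5, a significant finding from a study this size would fail to replicate more often than not.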
25
The nil hypothesis
Another criticism of current NHST methodology is the rejection of a nil hypothesis
  Nil hypothesis: no effect
Essentially this is setting up a straw man: we go after a weak target, refute it, pretend we have something to show, and pat ourselves on the back for doing so
In a sample there is always a difference to some extent; there is always some relation between variables [1]
Can you tell which of the following correlations came from two random (i.e. independent) variables (N = 25)? [2]
  .107  -.325  .261  .013  -.130  .114  -.107  -.118  -.069  .043
[1] Unless independence is enforced via random assignment or other methods.
[2] Don’t take my word for it, pull up R and run the following a few times:
    x <- rnorm(25)   # 25 draws from a standard normal
    y <- rnorm(25)   # 25 more, generated independently of x
    cor(x, y)        # sample correlation; rarely exactly zero
26
Summary of Problems with NHST
Misinterpretations of p-values
  Probability that the result is a result of sampling error, i.e. p = .05 means only a 5% chance it’s due to chance (sampling error)
    Since p is computed assuming the null is true, under that assumption there is a 100% chance the result is due to sampling error
  Probability of the null hypothesis
    Is attainable through a Bayesian approach but not NHST
  Probability that if the null hypothesis is rejected, we’d be wrong
    The long-run rate of wrongly rejecting a true null is α, not p
  That 1 - p = the probability of the alternative hypothesis
    Again, attainable through a Bayesian approach but not NHST
  That 1 - p = probability of replication
27
Summary of Problems with NHST
Mistaken conclusions
  That the p suggests something about the magnitude of effect by itself
  That failure to reject the null means the population effect size is zero
    Absence of evidence is not evidence of absence
  Rejection of the null means the alternative hypothesis is true
  Failure to reject implies equivalence
    More on that later
  Rejecting the null means the design was sound
  If no rejection, the study is a failure
  Rejection of the null identifies causality
  Failure to replicate refers to conclusions about the H0
    The effect size may be the same but, due to sample size, one rejects and one doesn’t
28
Is NHST that bad? What’s right with NHST?
As Krantz and many others (even later Meehl) have pointed out, statisticians have no real issues with NHST in and of itself, so why do the psychologists?
The problem lies in interpretation, not necessarily in the method
The problem with p-values
  Don’t tell us the probability of a hypothesis
  Not an estimate of practical importance
The problem with α
  Can change when assumptions don’t hold or data is incorrectly collected
  Never know if we are actually making the error
Controlling β is probably a more pressing issue in most situations in the social sciences, yet there is far less concern with it
29
Is NHST that bad?
It does address sampling error
Some research questions require a dichotomous answer
  Is this better than that?
So NHST does have utility and is itself not to blame for its misuse and misinterpretation
30
Psychologist’s view of statistics
Quantitative psychologists typically have a much different take on statistical analysis than the more applied researcher
  Focus is more on strong theoretical reasoning, sound design and measurement, and looking for large or obvious effects
The average joe psychologist, however, tends to focus more on statistical significance to back up their idea
31
The abuse of hypothesis testing
As Cohen noted, NHST has largely been abused in psychological research
  Over-reliance on arbitrary ‘significance’ levels rather than effect sizes
  General misinterpretation of results
    For example, smaller ps representing ‘more significant’ [1] results
  Inadequate reporting of information
[1] I beg you not to use terms like ‘marginally’ significant or ‘highly’ significant. Would you ever use phrases like ‘maybe’ significant or ‘significantly’ significant in the real world? Why do we here? Something is judged significant or not by the researcher: it either is meaningful to them or it isn’t, or it’s vague, and that may be meaningful in and of itself, not to mention the fact that a non-significant result may be very significant in and of itself. Is it any wonder researchers are confused?
32
What’s the alternative?
Although much knowledge has been gained using NHST over the years, the problems with interpretation cannot be ignored
So what should we do?
“A magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn’t exist” (Cohen, 1997, p. 31)
  Bayesian inference (which does give P(H|D), but has its own problems)
  More focus on confidence intervals
  Effect sizes
  Better NHST
  Graphs and more descriptives
33
Solutions: Bayesian
Benefits: probabilities and intervals that make sense
  Gives P(H|D)
Problems: we’re substituting one framework for another
  Too subjective?
  Standards for priors?
  Would conclusions drawn be that different?
34
Solutions: Confidence intervals
Provide a way to test hypotheses while also giving more information to use in the assessment of results
  Point estimate still provided
  NHST is still conducted
  Confidence in results is reflected in the width of the interval
    Wider = more ignorant
  More overlap suggests less difference in estimates
May lead to more thinking about the size of the differences rather than statistical significance
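A short R sketch of how one interval carries the point estimate, the uncertainty, and the significance test at once. The two groups are simulated with assumed means and SDs, purely for illustration.

```r
# Simulated scores for two hypothetical groups.
set.seed(7)
group_a <- rnorm(25, mean = 10, sd = 3)
group_b <- rnorm(25, mean = 11, sd = 3)

fit <- t.test(group_b, group_a, var.equal = TRUE)

# Point estimate of the mean difference and its 95% confidence interval.
estimate <- unname(fit$estimate[1] - fit$estimate[2])
ci       <- fit$conf.int
ci_width <- diff(ci)               # wider = more ignorant

# The NHST decision is recoverable from the interval:
# reject the nil hypothesis exactly when 0 lies outside the CI.
rejects_zero <- !(ci[1] <= 0 && 0 <= ci[2])

list(estimate = estimate, ci = as.numeric(ci),
     ci_width = ci_width, rejects_zero = rejects_zero)
```

Reporting `estimate` and `ci` tells the reader everything `rejects_zero` does, plus how big the difference plausibly is.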
35
Solutions: Effect Sizes
Should be standard reporting at this point (getting there); it should also be standard that they are the focus of interpretation (still a long way off)
Problem: effect sizes are subject to sampling variability just like anything else
  Provide CIs for effect sizes
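One way to attach an interval to an effect size is sketched below for Cohen's d. The standard error used is the common large-sample approximation, not an exact noncentral-t interval, so treat the bounds as rough; `cohens_d` is a hypothetical helper and the data are simulated.

```r
# Cohen's d with an approximate 95% CI (large-sample normal approximation).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  d  <- (mean(x) - mean(y)) / pooled_sd
  # Approximate standard error of d.
  se <- sqrt((nx + ny) / (nx * ny) + d^2 / (2 * (nx + ny)))
  c(d = d, lower = d - qnorm(0.975) * se, upper = d + qnorm(0.975) * se)
}

# Simulated groups with assumed means, for illustration only.
set.seed(11)
treatment <- rnorm(30, mean = 0.5)
control   <- rnorm(30, mean = 0)
round(cohens_d(treatment, control), 2)
```

With 30 per group the interval is wide, a reminder that an effect size from a small sample is itself only an estimate.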
36
Solutions: Other NHST approaches
Jones & Tukey, 2000 (Kaiser, 1960 [1])
  Dueling alternatives
  Problem: still a dichotomous decision based on a lone p-value
Equivalence testing for group comparisons
  Shifts focus to meaningful effect size
  Shifts focus to uncertainty of results
  Can claim difference, equivalence, or not enough info
  More later
[1] Kaiser also proposed a two-alternative approach but still kept the null hypothesis as a possibility
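Equivalence testing is often carried out as two one-sided tests (TOST); that specific procedure is an assumption here, since the slide does not name one. The sketch works from summary statistics, and the observed difference, standard error, degrees of freedom, and the ±0.5 equivalence bound are all invented for illustration.

```r
# Assumed summary statistics for a two-group comparison.
obs_diff <- 0.1
se       <- 0.2
df       <- 98
bound    <- 0.5      # smallest difference considered meaningful (assumed)
alpha    <- 0.05

# Usual nil-hypothesis test: is the difference distinguishable from zero?
p_nil <- 2 * pt(-abs(obs_diff / se), df)

# TOST: is the difference above -bound AND below +bound?
p_lower <- pt((obs_diff + bound) / se, df, lower.tail = FALSE)  # H0: diff <= -bound
p_upper <- pt((obs_diff - bound) / se, df)                      # H0: diff >= +bound
p_tost  <- max(p_lower, p_upper)

conclusion <- if (p_nil < alpha) {
  "difference"
} else if (p_tost < alpha) {
  "equivalence"
} else {
  "not enough info"
}
conclusion
```

This simplified three-way rule mirrors the slide's "difference, equivalence, or not enough info"; a fuller treatment would also handle results that are significantly different from zero yet inside the bounds.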
37
Solutions: Visual and descriptive interpretation
Never underestimate the power of a good graph, or the trends you can spot in descriptive information that might otherwise be clouded by multiple hypothesis tests
Basic question: What seems to be going on?
38
Solutions: General
Don’t forget to use your noggin when conducting analyses; don’t let SPSS or textbooks tell you what it means. [1]
There are other ways to analyze data without using NHST. But don’t fall into the same trap of rigid thinking with those either.
Focus on effect sizes and interval estimation, report as much information as possible, and let others know exactly why you came to your conclusions
Collect good data (not as easy as it sounds) and have good theories and clear ideas driving the motivation for your research.
  Let the data tell its story
Replicate whenever possible, validate your own data
[1] This is why part of the class is designed to instill confidence in your ability to make your own decisions. While this might be difficult for beginning researchers, letting heuristics and programs tell you what is significant starts one on a path of poor research that one may never leave. Better to make mistakes in thinking for yourself than to ignore the flawed logic of rule-based decision-making.
39
Hypothesis testing should include…
Checking the adequacy of a model
  Check assumptions and do something if problems arise
Use valid and reliable measures of the construct under investigation
Use the test of a nil hypothesis as a preliminary step at most
  Test approximate null hypotheses
Make appropriate decisions based on the situation
  It’s not just about type I error at .05
40
Resources
Gigerenzer, G. (1993). The Superego, the Ego and the Id in Statistical Reasoning. In Keren & Lewis (Eds.), Data Analysis in the Behavioral Sciences.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49.
Hubbard, R. & Bayarri, M. J. (2003). Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing. The American Statistician, 57(3), 171–178.
Oakes, M. Statistical Inference: A Commentary for the Social and Behavioral Sciences. Chichester: John Wiley & Sons.
Abelson, R. (1995). Statistics as Principled Argument. Mahwah, NJ: Erlbaum.