Issues concerning the interpretation of statistical significance tests.

1 Issues concerning the interpretation of statistical significance tests

2 20 hypothetical studies, 18 showing an effect, none "statistically significant" [figure: 20 RR point estimates with confidence intervals, plotted on an RR axis; every interval crosses RR = 1.0]

3 Failing to reject the null hypothesis
In the previous slide, none of the studies were statistically significant, since all of the confidence intervals included a RR of 1.0. Therefore, none of the studies would reject the null hypothesis.
However, failure to reject the null hypothesis does not prove that the null hypothesis is true! (In fact, 18 of the 20 studies indicated an increased RR.)

4 Failing to reject the null hypothesis
Sometimes a study will fail to reject the null hypothesis even though the null hypothesis is false (i.e., there really is an effect, but the result is not statistically significant).
When a study fails to reject the null hypothesis even though the null hypothesis is false, a "type II error" has occurred.

5 Four Components of statistical significance testing
Sample size
Minimum meaningful effect size
– Minimum meaningful effect size is based on:
  – Public health considerations
  – Scientific considerations
  – Clinically meaningful effects
  – Results of previous studies
Specification of Type I error (α error)
Specification of Type II error (β error)

6 Four Components of statistical significance testing
As part of designing the study, values for three of the four components are chosen and are used to compute the value of the fourth component.
Usually, values for the type I error, the type II error, and the minimum meaningful effect size are chosen, and the value for the sample size is computed.
– The computed sample size is the number of study subjects necessary to detect an effect equal to (or larger than) the chosen minimum meaningful effect size with the chosen type I and type II error rates.
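The calculation described above can be sketched in a few lines. This is a minimal illustration, assuming a two-group comparison of proportions with a normal approximation; the baseline risk (0.10), the minimum meaningful RR of 1.5, and the error rates are illustrative choices, not values from the slides.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p0, p1, alpha=0.05, beta=0.20):
    """Subjects needed per group to detect a difference between
    proportions p0 and p1 with two-sided type I error alpha and
    type II error beta (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # critical value for power
    variance = p0 * (1 - p0) + p1 * (1 - p1)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2)

# Illustrative choices: baseline risk 0.10, minimum meaningful RR = 1.5
# (so exposed risk 0.15), alpha = .05 two-sided, power = .80 (beta = .20).
n = sample_size_per_group(0.10, 0.15)
print(n)  # 683 subjects per group
```

Note how the three chosen components (α, β, and the minimum meaningful effect size) jointly determine the fourth, the sample size: a larger minimum effect (say, p1 = 0.30) would require far fewer subjects.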

7 Statistical Power
What is "power"?
– Probability of not making a type II error
– Probability of correctly rejecting the null hypothesis
– Power = 1 - β error = 1 - Type II error

8 Statistical Power
The level of statistical power in a study depends on the choices for the three other components:
– Type I error (α error)
  – Related consideration: whether the alternative hypothesis is one-sided or two-sided
– Sample size
– Minimum meaningful effect size
Power is also affected by the presence of biases.

9 Ways to increase statistical power of a study
Choose a higher value for type I error (α error)
– e.g., choose .10 rather than .05, and a one-sided rather than a two-sided alternative
Increase the sample size
– increases the precision of the study by reducing the variance
– may not be possible if the number of subjects exposed and/or willing to participate is not large
– resource limitations may prevent increasing the sample size

10 Ways to increase statistical power of a study
Choose a larger minimum meaningful effect size
– Problem: important, although small, effect sizes may be ignored
Minimize bias
– e.g., bias towards the null due to non-differential exposure misclassification makes it harder for a study to achieve statistically significant results
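The effect of each of these choices can be checked numerically. A sketch under the same normal-approximation assumptions as standard power formulas; the baseline risk, sample sizes, and α values below are illustrative, not taken from the slides.

```python
import math
from statistics import NormalDist

def power(p0, p1, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power to detect a difference between proportions
    p0 and p1 with n_per_group subjects in each group."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2) if two_sided else nd.inv_cdf(1 - alpha)
    se = math.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / n_per_group)
    return nd.cdf(abs(p1 - p0) / se - z_crit)

base          = power(0.10, 0.15, 400)                        # alpha = .05, two-sided
looser_alpha  = power(0.10, 0.15, 400, alpha=0.10, two_sided=False)
bigger_n      = power(0.10, 0.15, 800)
bigger_effect = power(0.10, 0.20, 400)                        # larger minimum effect
print(base, looser_alpha, bigger_n, bigger_effect)
```

Each change (a looser one-sided α, a doubled sample size, or a larger minimum meaningful effect size) raises the power above the baseline design, matching the list on this and the previous slide.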

11 Relationship between Type I error and Type II error (and power)

12 Summary of Statistical Power Considerations
Many studies are under-powered, especially given the likely presence of biases towards the null (e.g., exposure misclassification, healthy worker effect biases).
Routine selection of .05 as the value for type I error (α error) can lead to under-powered studies.
– The choice of values for the type I and type II errors should be based on the public health or clinical costs of each error.

13 Summary of Statistical Power Considerations
The level of statistical power is a function of the values chosen for the other three components of statistical significance testing as part of the study design process.
However, once the study is conducted and a result is obtained, a post-study power calculation is relatively unimportant for interpreting the study findings.
– The level of statistical power, like the levels of type I and type II error and the sample size, is a consideration that goes into the design of a study, not a consideration useful in interpreting the results of a study.
To interpret a study finding, calculate a confidence interval to indicate the range of effect values (e.g., range of RRs) that are compatible with the effect estimate obtained in the study.

14 Interpretation of study findings: confidence intervals and p-values

15 Limitations of p-values and confidence intervals
Both assume no systematic bias is present
– A certainly false assumption!
The confidence interval indicates the precision of the effect estimate but may tell us nothing about the true value of the effect:
– the confidence interval may not contain the true effect value, because of chance or because of systematic bias
A 2-sided p-value does not provide a clear indication of the direction, magnitude, or precision of the association.

16 Other limitations of p-values
The p-value is mostly a function of sample size
– if the sample size is very large, even trivial departures from the null will be statistically significant
– a tiny effect in a very large study can have the same p-value as a huge effect in a small study
Focuses attention on an arbitrary cutoff (e.g., .05)
– The interpretation of a p-value of .049 is different from the interpretation of a p-value of .051
– Focuses attention on the lower boundary of a confidence interval and ignores the rest of the interval
– It is a qualitative assessment (statistically significant: yes/no) when a quantitative assessment would be more appropriate
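The first point can be illustrated with a pooled two-proportion z-test applied to the same trivial departure from the null at two sample sizes; the proportions and group sizes here are illustrative assumptions, not slide values.

```python
import math
from statistics import NormalDist

def two_sided_p(p1, p2, n):
    """Two-sided p-value from a pooled two-proportion z-test,
    with n subjects per group (normal approximation)."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))

# The same trivial difference in risks (0.11 vs 0.10):
print(two_sided_p(0.11, 0.10, 500))     # far from "significant"
print(two_sided_p(0.11, 0.10, 50000))   # highly "significant"
```

Identical effect size, wildly different p-values: the p-value is tracking the sample size, not the public health importance of the effect.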

17 Other limitations of p-values
The value of the p-value depends on the statistical test:
– exact vs asymptotic
– trend test vs testing for any differences, disregarding the ordering of exposure
– one-tail vs two-tail tests: same data, but two different interpretations
The alternative hypothesis may be rejected (when the null is not rejected) even though the data do not support the null hypothesis very much (e.g., p = .06)
– the data may be more likely under the alternative hypothesis
The null hypothesis may be rejected even when the data are much less likely under the alternative hypotheses.

18 Other limitations of p-values
The p-value is simply a measure of rarity with respect to one hypothesis – the null hypothesis ("chance").
But life is full of rare events which we often ignore. We take notice of a rare event if there is a plausible competing hypothesis under which the probability of the event is relatively higher.
A proper assessment of the plausibility of the null hypothesis requires the simultaneous consideration of the relative plausibility of alternative hypotheses.

19 Advantages of the confidence interval
A good indicator of the precision of the effect estimate
– the p-value is affected by both the precision and the magnitude of the effect estimate, whereas the width of a specific (1 - α) confidence interval is affected solely by the precision of the effect estimate
– the width of the confidence interval is a much better indicator of the impact of random error than the p-value

20 Advantages of the confidence interval
A good indicator of the likely magnitude of the effect
– provides a range of values for the association, under the assumption that the difference between the true value and the observed value for the association is due only to random variation
– provides a range of values for the effect that are compatible with the data obtained
– values located centrally in the interval, i.e., near the point estimate, are more compatible with the data than values near the boundaries of the interval
  – focus should therefore be on the entire interval, especially the values around the center of the interval, as well as the upper and lower boundaries

21 Advantages of the confidence interval
A confidence interval and the point estimate provide sufficient information to construct a graph of all possible p-values and confidence intervals.
This graph is called the P-value function.
The P-value function gives the p-values for the null hypothesis, as well as for every alternative hypothesis, for the parameter of interest.

22 [figure: P-value function for RR = 1.5, 95% CI: 0.5–4.5; one-sided p for the null = 0.23]

23 P-value function
The red line is the 95% confidence interval; the green line marks the null hypothesis.
– the one-sided p-value for the null hypothesis is where the green line, vertical from RR = 1.0, hits the graph
The 95% confidence interval represents only one possible horizontal slice through the P-value function.
– e.g., the 80% confidence interval can be read off the graph by following where the horizontal line through the one-sided p of 0.1 hits the graph: 0.7, 3.1
The p-value for every possible hypothesis for the true RR can be read from this graph.

24 P-value function
By comparing p-values for different hypotheses (e.g., RR = 1.0 vs RR = 2.0 vs RR = 3.0), one can get an indication of the relative strength of the evidence that the obtained data provide for each hypothesis.
– In this example, the one-sided p-value for RR = 2 is larger than for RR = 1.0 (the null hypothesis) and for RR = 3.
– This indicates that the obtained data are more probable under the hypothesis that RR = 2 than under the null (RR = 1) hypothesis or the hypothesis that RR = 3.
– In particular, this means that the hypothesis that RR = 2 is better supported by the observed data than the null hypothesis.
– The RR most supported by the observed data is always the point estimate: RR = 1.5, with a one-sided p-value of 0.50.

Hypothesis       1-sided p-value
RR = 1 (null)    0.24
RR = 2           0.30
RR = 3           0.11

25 P-value function
The P-value function provides all the information about the data.
– The precision of the study is indicated by the width of the graph, and the magnitude of the association corresponds to the peak of the graph.
The P-value function:
– avoids the arbitrariness of choosing a specific (1 - α) confidence interval (e.g., a 95% confidence interval)
– avoids the arbitrariness of choosing a specific p-value cutoff (e.g., p = .05)
– focuses attention on the weight of the evidence for other possible hypotheses besides the null hypothesis

26 P-value function
Since it is cumbersome to provide a P-value function for every estimate of RR, a confidence interval along with the point estimate can provide enough information that an approximate P-value function graph can be drawn.
More information on the P-value function curve can be provided if additional confidence intervals are presented.
– e.g., "nested" confidence intervals can be provided: 50% interval, 80% interval, 95% interval
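That reconstruction can be sketched directly. Assuming the RR estimate is approximately normal on the log scale, the standard error can be recovered from the reported 95% CI, and the one-sided p-value for any hypothesized RR follows; with the example from the earlier slides (RR = 1.5, 95% CI 0.5–4.5) this reproduces the tabulated values.

```python
import math
from statistics import NormalDist

def p_value_function(rr_hat, ci_low, ci_high, rr_null):
    """One-sided p-value for the hypothesis RR = rr_null, given a
    point estimate and 95% CI (assumes normality on the log-RR scale)."""
    z95 = NormalDist().inv_cdf(0.975)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z95)  # recover SE from the CI
    z = abs(math.log(rr_hat) - math.log(rr_null)) / se
    return 1 - NormalDist().cdf(z)

for rr0 in (1.0, 1.5, 2.0, 3.0):
    print(rr0, round(p_value_function(1.5, 0.5, 4.5, rr0), 2))
# RR = 1 gives about 0.23, RR = 2 about 0.30, RR = 3 about 0.11,
# and the point estimate RR = 1.5 gives the maximum, 0.50.
```

Evaluating this function over a grid of RR values and plotting the result yields the full P-value function curve shown on slide 22.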

27 Summary
Many studies are under-powered. The issue of statistical power must be addressed in the design of the study.
Statistical power is affected by sample size and by the choice of values for the Type I error and the minimum meaningful effect size. It is also affected by biases.
To maximize power:
– Increase the sample size
– Increase the type I error
– Increase the minimum meaningful effect size
– Minimize biases

28 Summary
The p-value, the confidence interval, and the P-value function all assume that only random variation is present. They do not address systematic biases (e.g., non-differential exposure misclassification, healthy worker effect biases, selection bias, confounding).
The confidence interval should not be interpreted as if it were simply another way to determine statistical significance (i.e., it should not be interpreted in the same way as the p-value for the null hypothesis).

29 Summary
Properly interpreting a confidence interval requires taking into account the values in the center of the range as well as the values at both boundaries.
A confidence interval and the point estimate together give an approximate indication of the P-value function curve.
It is not true that a 95% confidence interval computed from a study contains the true parameter with 95% probability. For any particular interval, that probability is undefined: the interval either contains the true parameter or it does not.
– the "95%" refers to the frequency with which a very large number of intervals constructed in this manner would contain the true parameter, assuming that only random variability is present
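The frequency interpretation in the last bullet can be demonstrated by simulation. A minimal sketch, assuming normally distributed data with a known standard deviation (a simplification chosen purely for illustration):

```python
import math
import random
from statistics import NormalDist

random.seed(1)
z95 = NormalDist().inv_cdf(0.975)
true_mean, sigma, n, trials = 10.0, 2.0, 50, 2000

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    mean = sum(sample) / n
    se = sigma / math.sqrt(n)           # known sigma, for simplicity
    if mean - z95 * se <= true_mean <= mean + z95 * se:
        covered += 1

print(covered / trials)  # close to 0.95: the long-run coverage frequency
```

Any single interval from one of these trials either contains 10.0 or it does not; the "95%" describes only the long-run behavior of the procedure across repeated studies.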

30 Summary
One should be aware that the choice of a particular confidence interval (e.g., a 95% interval), and the choice of a specific p-value cutoff for statistical significance, are arbitrary decisions, with no scientific or public health justification.

31 Berkson (1942)
"If an event has occurred, the definitive question is not, 'Is this an event which would be rare if the null is true?' but 'Is there an alternative hypothesis under which the event would be relatively frequent?' If there is no plausible alternative at all, the rarity is quite irrelevant to a decision…"

