NY, 14 December 2007 Emmanuel Lesaffre Biostatistical Centre, K.U.Leuven, Leuven, Belgium Dept of Biostatistics, Erasmus MC, Rotterdam, the Netherlands.

Use and abuse of P values
Clinical Research Methodology Course: Randomized Clinical Trials and the “REAL WORLD”

Contents
1. P-value: What is it?
2. Type I error
3. Multiple testing
4. Type II error
5. Sample size calculation
6. Negative studies
7. Testing at baseline
8. Statistical significance versus clinical relevance
9. Confidence interval versus P-value
10. P-value of clinical trial versus of epidemiological study
11. Take home messages

1. P-value: What is it?

Etoricoxib versus placebo
– WOMAC Pain Subscale: difference in means = -15.07
– What does this result mean?
– What do you expect if etoricoxib = placebo? → difference ≈ 0
– But even if etoricoxib = placebo, the result will vary around 0
– What is a large/small difference?
– What is the play of chance?
The same questions apply to the other scores & comparisons

1. P-value: What is it?
Etoricoxib versus placebo
– Suppose H0: E = P
– P = 0.05 → result belongs to the 5% extreme results that could happen under H0 (if H0 is true)
– P = 0.01 → result belongs to the 5% extreme results that could happen under H0 (if H0 is true) and only 1% is MORE EXTREME
– P < 0.0001 → result belongs to the 5% extreme results that could happen under H0 (if H0 is true) and IS VERY EXTREME

1. P-value: What is it?
GENERAL RULE
– When P < 0.05 (= significance level α):
  – Result is considered to be TOO EXTREME to believe that H0 is true
  – H0 is rejected → we do NOT believe that E = P
  – Significant at 0.05 (*, **, ***)
– When P ≥ 0.05:
  – Result could have happened when H0 is true
  – H0 is NOT rejected → it is possible that E = P
  – Result is ≠ 0, but this may be due to PLAY OF CHANCE
  – NOT significant at 0.05 (NS)

1. P-value: What is it?
Results: E versus C versus P
– E versus P, WOMAC Pain → P < 0.0001 → Significant at 0.05 (***) → We do NOT believe that E = P
– E versus C, WOMAC Physical Function → P = 0.367 → NS → It could be that E = C, result is PLAY OF CHANCE
– E versus C, Patient Global Assessment → P = 0.051 → NS → It could be that E = C, result is PLAY OF CHANCE

1. P-value: What is it?
Previous decision rule = hypothesis testing
– Test H0: E = P versus HA: E ≠ P
– Using a statistical test (t-test, χ²-test, etc.)
– With 2-sided significance level = α = 0.05
– In clinical trial setting:
  – Above test is interpreted as: H0: E ≤ P versus HA: E > P → 1-sided
  – And at 1-sided significance level = α/2 = 0.05/2 = 0.025 (2.5%)
When the result is on the wrong side (E < P) with P < 0.05, then efficacy of E over P is not demonstrated

1. P-value: What is it?
What if H0: E = P is true & P = 0.023?
– We will reject H0
– We will make an ERROR = Type I error
P(Type I error) = False-positive rate = Probability that result belongs to the 5% extreme results if H0 is true = 0.05

2. Type I error
Type I error: Practical implications
– Suppose H0 is TRUE
– Risk = 5% implications:
  – 100 studies → on average 5 studies with a wrong conclusion
  – Prob(at least 1 study with a wrong conclusion) ≈ 1
Regulatory agencies mandate a strict control of the overall false-positive rate
False positive trial findings could lead to approval of inefficacious drugs

3. Multiple testing
Multiple testing: Definition
– Suppose H0 is TRUE
– Test 1 (WOMAC Pain Subscale): risk = 5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5%
– Test 1 & Test 2: risk ≈ 5% + 5% = 10% of claiming that the 2 treatments are different (on one of the tests) when they are not
– If no adjustment: multiple testing problem
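
The 5% + 5% arithmetic above can be checked with a small sketch. It assumes the tests are independent and that every null hypothesis is true; the function name is illustrative and not taken from the slides. Under those assumptions the exact risk for 2 tests is slightly below 10%, and for 100 tests (or 100 studies) it is close to 1.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives among n_tests tests.

    Assumption: the tests are independent and every H0 is true.
    """
    return 1 - (1 - alpha) ** n_tests


# Two endpoints tested at 5% each: close to, but slightly below, 5% + 5%.
two_tests = prob_at_least_one_false_positive(2)

# 100 studies at 5% each: on average 5 wrong conclusions,
# and at least one wrong conclusion is almost certain.
expected_wrong = 100 * 0.05
hundred_studies = prob_at_least_one_false_positive(100)

for k in (1, 2, 5, 10, 20, 100):
    risk = prob_at_least_one_false_positive(k)
    print(f"{k:3d} tests: risk of at least one false positive = {risk:.3f}")
```

The simple sum k × 5% is an upper bound on this risk, which is why it works as a quick approximation for a small number of tests.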

3. Multiple testing
Multiple testing: Typical cases
– 2 treatments are compared for several endpoints
– More than 2 treatments are compared
– 2 treatments are compared in several subgroups
– 2 treatments are compared at several time points

3. Multiple testing: example
2 treatments are compared for several endpoints

3. Multiple testing: example
More than 2 treatments are compared

3. Multiple testing: example
2 treatments are compared in several subgroups
– Treatments were not significantly different overall
– Then, treatments were compared in subgroups:
  – Males & Females
  – < 60 yrs & ≥ 60 yrs
  – Diabetes & no diabetes
  – ....
– Suppose in 1 subgroup: P < 0.05, meaning????
  – The significant result is likely to be a play of chance

3. Multiple testing: example
2 treatments are compared at several time points
Comparison at each time point: PLAY OF CHANCE!

3. Multiple testing: example
Protocol specified:
2.2 Administration of visits
Patients will be examined at baseline (day 0), day 7, day 14 and day 28. At each visit the systolic BP, etc... will be measured.
9.4 Primary endpoint
The primary endpoint for the comparison of treatment A versus B is systolic BP.

3. Multiple testing: example
This “scientific finding” was printed in the Belgian newspapers! It was even stated that those who awake before 7.21 AM have a statistically significantly higher stress level during the day than those who awake after 7.21 AM!

3. Multiple testing: example
Signs of the times: Feb 22nd 2007 | SAN FRANCISCO
From The Economist print edition
Interesting finding?
PEOPLE born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs. Sagittarians are 38% more likely than others to land up there because of a broken arm. Those are the conclusions that many medical researchers would be forced to make from a set of data presented to the American Association for the Advancement of Science by Peter Austin of the Institute for Clinical Evaluative Sciences in Toronto. At least, they would be forced to draw them if they applied the lax statistical methods of their own work to the records of hospital admissions in Ontario, Canada, used by Dr Austin.

3. Multiple testing
Multiple testing: Solution??
– Choose 1 primary endpoint → risk = 5%
– What if more than one endpoint is needed?
  – Construct combined endpoint based on clinical/statistical reasoning
  – Correct for multiple testing
– What for other (secondary + tertiary) endpoints?
  – Call analyses EXPLORATORY
  – Correct for multiple testing

3. Multiple testing
Multiple testing: Solution??
– Test 1 (WOMAC Pain Subscale): risk = 5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5%
– Test 1 & Test 2: risk ≈ 10%
– Without adjustment, both tests claim significance if P < 0.05
– Bonferroni adjustment: significance if P < 0.05/2 = 0.025 → Family-wise error rate ≤ 0.05
– More sophisticated approaches: Simes, Holm, Hochberg and Hommel, closed testing procedures, ...
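
As a rough illustration of the Bonferroni rule above, and of one of the "more sophisticated approaches" (Holm's step-down procedure), here is a minimal sketch. The two P-values are hypothetical and are not results from the etoricoxib trial; the function names are illustrative as well. The strict "<" follows the convention used on the slide.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for a test when its P-value is below alpha / (number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the sorted P-values with alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger P-values fail too
    return reject


# Hypothetical P-values for two endpoints (not taken from the trial).
p_values = [0.030, 0.020]
bonferroni_result = bonferroni_reject(p_values)
holm_result = holm_reject(p_values)
print("Bonferroni:", bonferroni_result)
print("Holm:      ", holm_result)
```

With these made-up values Bonferroni rejects only the second hypothesis, while Holm rejects both; Holm never rejects fewer hypotheses than Bonferroni and also keeps the family-wise error rate at or below 0.05.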

3. Multiple testing
CPMP guidance document
“Points to consider on multiplicity issues in clinical trials” (Sept 19, 2002)
“A clinical study that requires no adjustment of the Type I error is one that consists of two treatment groups, that uses a single primary variable, and has a confirmatory statistical strategy that pre-specifies just one single null hypothesis relating to the primary variable and no interim analysis”

4. Type II error
Type I error:
– Result is statistically significant (P < 0.05)
– Risk of making an error when H0 is true = 5%
– (We do NOT know if H0 is true)
Type II error:
– Result is NOT statistically significant (P ≥ 0.05)
– Risk of making an error when H0 is NOT true = ???
– (We do NOT know if H0 is NOT true)

5. Sample size calculation
P(Type II error) = β = 1 − Power
– LARGE(R) in small studies
– Can be controlled by adapting study (sample) size
– Calculation of sample size:
  – Determine clinically important difference Δ
  – Search for information
    – % rate in control group
    – SD of measurements
  – Fix P(Type II) ≤ 0.20 → Power ≥ 0.80 (80%)
  – Look for a statistician ((s)he will look for a computer program)
  – Pray
  – Let computer work → sample size
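
The "let computer work" step can be sketched for the simplest case: comparing two means with a common SD, using the normal approximation n per group = 2 (z₁₋α/₂ + z₁₋β)² SD² / Δ². This is one standard textbook formula and not necessarily the program referred to on the slide; the values Δ = 10 and SD = 20 below are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a 2-sided comparison of two means.

    Assumptions: equal group sizes, common SD, normal approximation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # 2-sided significance level
    z_beta = std_normal.inv_cdf(power)           # power = 1 - P(Type II error)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical inputs: clinically important difference 10, SD 20.
n_80 = n_per_group(delta=10, sd=20, power=0.80)
n_95 = n_per_group(delta=10, sd=20, power=0.95)
print(f"80% power: {n_80} per group; 95% power: {n_95} per group")
```

Halving Δ multiplies the required sample size by about four, which is why the choice of the clinically important difference dominates the calculation.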

5. Sample size calculation: example
α = 0.05, power = 0.95, Δ = 20%, n = 2 × 300

5. Sample size calculation: example??

6. Negative studies
Negative study: Not significant study
– Sample size calculation done (power at least 80%)?
– Yes:
  – Difference between treatments is probably smaller than Δ
– No:
  – Message ????
  – DOES NOT imply: NO difference between treatments

6. Negative studies: example
Sample size calculation???? Message????

6. Negative studies: “Trend”
Trend in the data:
– P > 0.05, but difference is in the good direction
– One speaks of a “trend in the data”
– OK?
  – No, for confirmatory study
  – Perhaps, for pilot study or exploratory studies

7. Testing at baseline
Why no P-values? How many significant (at 0.05) tests would you expect?

8. Statistical significance versus clinical relevance
Statistical significance:
– P < 0.05
– Message: two treatments are (probably/possibly) different
Clinical relevance:
– Difference is clinically relevant

8. Statistical significance versus clinical relevance: Example
Compare two treatments
– Response = 10-year mortality
– 2 × 200 patients
– A: 2%, B: 10%
– Chi-square test: P < 0.001
Measures of effect
– ar = 10% − 2% = 8% (absolute risk reduction)
– rr = 10%/2% = 5 (risk ratio)
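
The numbers on this slide can be reproduced with a short sketch. It assumes that 2% and 10% of 200 patients correspond to 4 and 20 deaths, and it uses the Pearson chi-square test without continuity correction (the slide does not say which variant was used). For 1 degree of freedom the P-value can be obtained from the complementary error function.

```python
from math import erfc, sqrt


def chi_square_2x2(events_a, n_a, events_b, n_b):
    """Pearson chi-square test (1 df, no continuity correction) for two proportions."""
    total_events = events_a + events_b
    total_n = n_a + n_b
    chi2 = 0.0
    for events, n in ((events_a, n_a), (events_b, n_b)):
        expected_events = n * total_events / total_n
        expected_non_events = n - expected_events
        chi2 += (events - expected_events) ** 2 / expected_events
        chi2 += ((n - events) - expected_non_events) ** 2 / expected_non_events
    # For 1 df: P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts: 2% of 200 = 4 deaths on A, 10% of 200 = 20 deaths on B.
chi2, p_value = chi_square_2x2(4, 200, 20, 200)
risk_a, risk_b = 4 / 200, 20 / 200
abs_risk_reduction = risk_b - risk_a
risk_ratio = risk_b / risk_a
print(f"chi-square = {chi2:.2f}, P = {p_value:.5f}")
print(f"ar = {abs_risk_reduction:.2%}, rr = {risk_ratio:.1f}")
```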

8. Statistical significance versus clinical relevance: Example
Compare two treatments
– Response = 10-year mortality
– 2 × 100,000 patients
– A: 0.002%, B: 0.010%
– Chi-square test: P < 0.001
Measures of effect
– ar = 0.010% − 0.002% = 0.008% (absolute risk reduction)
– rr = 0.010%/0.002% = 5 (risk ratio)

8. Statistical significance versus clinical relevance: Conclusion
Conclusion
– For each (small) Δ (≠ 0), there is a sample size such that H0 is rejected with high probability
Implications
– Clinical trials are often too small to detect rare safety issues
– When registered and on the market, after several years a safety issue appears (Vioxx story)
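
The conclusion above can be illustrated numerically. The sketch below assumes a 2-sided comparison of two means with known common SD (a z-test) and a hypothetical, clinically negligible true difference of 0.1 SD; none of these values come from the slides.

```python
from statistics import NormalDist


def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Power of a 2-sided z-test comparing two means (equal group sizes, known SD)."""
    std_normal = NormalDist()
    std_error = sd * (2 / n_per_group) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / std_error
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical tiny true difference: 0.1 SD.
sizes = (100, 1_000, 10_000, 100_000)
powers = [power_two_means(delta=0.1, sd=1.0, n_per_group=n) for n in sizes]
for n, pw in zip(sizes, powers):
    print(f"n = {n:>7} per group: power = {pw:.3f}")
```

The power climbs towards 1 as n grows even though the difference stays the same, so with a large enough trial a clinically irrelevant difference becomes statistically significant.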

8. Statistical significance versus clinical relevance: Further reflections
Practical conclusions
– Even if the result is not significant, we will NOT conclude that H0 is true
– Why do the significance test, if we don’t believe in it?
– Better: estimate the difference in treatment effect + uncertainty
Classical table indicating the two types of errors (decision-theoretic approach of Neyman-Pearson): it suggests that we can conclude in practice that the 2 treatments are equally good.
It is not possible in statistics to show that 2 treatments are equally good (non-inferiority talk).
We even DO NOT BELIEVE that H0 is TRUE in practice!

9. Confidence interval versus P-value

9. Confidence interval versus P-value
95% confidence interval
– Expresses uncertainty about the true difference
– When narrow → good idea about the true treatment effect
Examples
– WOMAC Pain Subscale:
  – E versus C: 95% CI = [-7.02, 0.77] → 0 is possible
  – E versus P: 95% CI = [-19.72, -10.41] → E is better
  – C versus P: 95% CI = [-16.57, -7.32] → C is better
GENERAL RESULT: P < 0.05 ⇔ 95% CI does not contain 0
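
The link between the 95% CI and the P-value can be made concrete with a sketch that recovers an approximate P-value from each interval listed above. It assumes the intervals are symmetric and based on a normal approximation (estimate ± 1.96 × SE); the original analysis may have used a t-distribution, so the P-values are indicative only. The function name is illustrative.

```python
from statistics import NormalDist


def p_value_from_ci(lower, upper, level=0.95):
    """Approximate 2-sided P-value for H0: difference = 0, recovered from a CI.

    Assumption: the CI is estimate +/- z * SE with a normal critical value z.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    estimate = (lower + upper) / 2
    std_error = (upper - lower) / (2 * z_crit)
    z = abs(estimate) / std_error
    return estimate, 2 * (1 - std_normal.cdf(z))


# 95% CIs for the WOMAC Pain Subscale, as listed on the slide.
intervals = {
    "E versus C": (-7.02, 0.77),
    "E versus P": (-19.72, -10.41),
    "C versus P": (-16.57, -7.32),
}
results = {}
for label, (lower, upper) in intervals.items():
    estimate, p = p_value_from_ci(lower, upper)
    contains_zero = lower <= 0 <= upper
    results[label] = (estimate, p, contains_zero)
    print(f"{label}: estimate = {estimate:.2f}, approx. P = {p:.4f}, "
          f"CI contains 0: {contains_zero}")
```

The midpoint of the E versus P interval is about -15.07, in line with the difference in means quoted for the WOMAC Pain Subscale, and in every comparison the approximate P-value is below 0.05 exactly when the interval excludes 0.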

9. Confidence interval versus P-value
Example: medication study comparing two anti-hypertensive drugs
The 95% confidence interval gives a clearer message

10. P-value: clinical trial versus epi study
Clinical trial
– Randomized
– No confounding
– P < 0.05 → causal effect of treatment on patient’s condition
Epidemiological study
– Observational
– Possible confounding
– P < 0.05 → at most association; correction for confounding is needed

10. P-value: clinical trial versus epi study

11. Biased setup & reporting

11. Biased setup & reporting
– Bias in setup of studies, e.g. inappropriate doses of competing drug
– Choice of patient populations, e.g. exclusion of patients who were previously nonresponders to treatment
– Noninferiority designs with different thresholds
– Biased reporting, e.g. minimal information on negative aspects of the sponsor's drug

12. Take home messages
– If possible, take 1 primary endpoint
– Always determine necessary sample size
– Always WATCH OUT for problem of multiple testing
– Always and ONLY interpret NS as NOT possible to show “difference”
– Always be careful when talking about “trend”
– Always determine 95% confidence intervals

Thank you for your attention
