Biostatistics in Practice Peter D. Christenson Biostatistician LABioMed.org /Biostat Session 3: Testing Hypotheses.

Biostatistics in Practice Peter D. Christenson Biostatistician http://gcrc.LABioMed.org/Biostat Session 3: Testing Hypotheses

Readings for Session 3
Select sections from www.StatisticalPractice.com entitled:
Significance test / hypothesis testing
Significance tests simplified

Session 3 Outline
Mechanics of statistical testing
  TBI study example
  t-Test
  p-values
Conceptual understanding
  Key issues
  Comparison to diagnostic testing
  Not covered in detail; needed in session 4 on power.

Goal: Do Groups Differ By More than is Expected By Chance? Cohan (2005) Crit Care Med;33:2358-66.

Goal: Do Groups Differ By More than is Expected By Chance?
First, need to:
1. Specify experimental units (Persons? Blood draws?).
2. Specify a single outcome for each unit (e.g., Yes/No, or the mean or minimum of several measurements?).
3. Examine raw data, e.g., with a histogram, to check that test requirements are met.
4. Specify the group summary measure to be used (e.g., %, mean, or median over units).
5. Choose a particular statistical test for the outcome.

Outcome Type → Statistical Test
Cohan (2005) Crit Care Med;33:2358-66.
Medians → Wilcoxon test
Percentages → Chi-square test
Means → t-test

Minimal MAP: Group Distributions of Individual Units
AI Group (N=42)
Stem  Leaf          #
   7  6             1
   7  11334         5
   6  555           3
   6  01112344      8
   5  5566778       7
   5  01222234      8
   4  57788         5
   4  23            2
   3  6             1
   3  13            2
Multiply Stem.Leaf by 10**+1
Non-AI Group (N=38)
Stem  Leaf          #
   7  79            2
   7  00111234      8
   6  5556777888   10
   6  00112234      8
   5  67999         5
   5  3             1
   4  79            2
   4  04            2
Multiply Stem.Leaf by 10**+1
→ Approximately normally distributed
→ Use means to summarize groups.
→ Use t-test to compare means.

Goal: Do Groups Differ By More than is Expected By Chance?
Next, need to:
1. Calculate a standardized quantity for the particular test, a “test statistic”. Often: t = (observed diff - expected diff)/SE(obs diff)
2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ. Often: t is approximately a standard normal bell curve.
3. Declare groups to differ if the test statistic is too deviant. Often: absolute value of t > ~2.

t-Test for Minimal MAP: Step 1
1. Calculate a standardized quantity for the particular test, a “test statistic”. Often: t = (observed diff - expected diff)/SE(obs diff)
Group    N    Mean         Std Dev      SE(Mean)
AI       42   56.1666667   10.7824634   1.66 = 10.78/√42
Non-AI   38   63.4122807    8.7141575   1.41 = 8.71/√38
Observed difference = diff in means = 63.4 - 56.2 = 7.2
Expected difference = 0 if groups do not differ.
SE(Obs Diff) ≈ sqrt[SEM1^2 + SEM2^2] = sqrt(1.66^2 + 1.41^2) ≈ 2.2
→ Test statistic = t = (7.2 - 0)/2.2 ≈ 3.28
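The Step 1 arithmetic can be checked with a short sketch. The function names are illustrative, not from the slides. Two versions are shown: the unpooled SE used on the slide, and the classical pooled-SD t-test, which is an assumption here but appears to be the version that reproduces the reported 3.28 from the unrounded summaries.

```python
import math

# Summary statistics as reported on the slide.
n_ai, mean_ai, sd_ai = 42, 56.1666667, 10.7824634
n_non, mean_non, sd_non = 38, 63.4122807, 8.7141575

def sem(sd, n):
    """Standard error of a single group mean."""
    return sd / math.sqrt(n)

def t_unpooled(m1, s1, n1, m2, s2, n2):
    """t using SE(diff) = sqrt(SEM1^2 + SEM2^2), as on the slide."""
    se_diff = math.sqrt(sem(s1, n1) ** 2 + sem(s2, n2) ** 2)
    return (m1 - m2) / se_diff, se_diff

def t_pooled(m1, s1, n1, m2, s2, n2):
    """Equal-variance t with a pooled SD (an assumption, not stated on the slide)."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, se_diff

t_u, se_u = t_unpooled(mean_non, sd_non, n_non, mean_ai, sd_ai, n_ai)
t_p, se_p = t_pooled(mean_non, sd_non, n_non, mean_ai, sd_ai, n_ai)
print(f"unpooled: SE = {se_u:.2f}, t = {t_u:.2f}")
print(f"pooled:   SE = {se_p:.2f}, t = {t_p:.2f}")
```

Both versions give an SE of about 2.2 and a t a little above 3, so the conclusion does not depend on which one is used.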

t-Test for Minimal MAP: Step 2
2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ. Often: t is approximately a standard normal bell curve.
[Figure: bell curve of the expected values for the test statistic if groups do not differ. The central region, from about -2 to +2, is labeled "Expect 0.95 Chance"; the observed value 3.28 is marked in the right tail.]
Area under a section of the curve = probability of those values (the total area, from -∞ to ∞, is 1). For example, Prob(-2 to -1) is an area of 0.14.
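The areas quoted for the bell curve can be reproduced with the standard library, assuming the standard normal approximation named on the slide. The helper name `area` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def area(lo, hi):
    """Probability that a standard normal value falls between lo and hi."""
    return z.cdf(hi) - z.cdf(lo)

p_minus2_to_minus1 = area(-2, -1)   # the slide's "Prob (-2 to -1)"
p_within_2 = area(-2, 2)            # the central "95% chance" region

print(f"P(-2 < t < -1) = {p_minus2_to_minus1:.3f}")
print(f"P(-2 < t <  2) = {p_within_2:.3f}")
```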

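t-Test for Minimal MAP: Step 3
3. Declare groups to differ if the test statistic is too deviant. Often: absolute value of t > ~2.
[Figure: the same bell curve; the central region has 95% chance, each tail beyond about ±2 has 2.5%, and the observed value 3.28 is marked in the right tail.]
Convention: “Too deviant” is beyond ~2. “Two-tailed” = the 5% is allocated equally, 2.5% to each tail, so that either group could be superior.
Conclude: The groups differ, since a test statistic of 3.28 or more extreme has <5% chance if there is no difference in the entire populations.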

t-Test for Minimal MAP: Step 3 - p value
3. Declare groups to differ if the test statistic is too deviant. Often: absolute value of t > ~2.
[Figure: the same bell curve; the area in the tail beyond the observed value 3.28 is 0.0007.]
p-value: Probability of a test statistic at least as deviant as observed, if populations really do not differ. Smaller values ↔ more evidence of group differences.
Tail area = 0.0007, so the two-tailed p-value = 2(0.0007) = 0.0014 << 0.05.
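A two-tailed p-value can be sketched from the test statistic as follows. This uses the standard normal approximation, so it gives a somewhat smaller value than the slide's 0.0014, which presumably comes from a t distribution with finite degrees of freedom. The function name is illustrative.

```python
from statistics import NormalDist

def two_tailed_p_normal(t):
    """Two-tailed p-value for a test statistic, using the standard
    normal approximation (not the exact t distribution)."""
    one_tail = 1 - NormalDist().cdf(abs(t))
    return 2 * one_tail

p_approx = two_tailed_p_normal(3.28)
print(f"approximate two-tailed p = {p_approx:.4f}")
```

Either way, the p-value is far below 0.05.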

Back to Paper: Minimal MAP
Δ = 63.4 - 56.2 = 7.2 is the best guess for the MAP difference between a randomly chosen AI patient and a randomly chosen non-AI patient, without other patient information.
Δ = 7.2 is also the best guess for the MAP difference between the means of “all” AI and non-AI patients. We are 95% sure that this difference is within ≈ 7.2 ± 2SE(Diff) = 7.2 ± 2(2.2) = 2.8 to 11.6.
Δ = 7.2 is statistically significant (p=0.0014); i.e., only 14 of 10,000 sets of 80 patients would differ so much, if AI and non-AI really don’t differ in MAP.
Is Δ = 7.2 clinically significant? … significant for basic science?

Confidence Intervals ↔ Tests
[Figure: 95% confidence intervals.]
Non-overlapping 95% confidence intervals, as here, are sufficient for significant (p<0.05) group differences. However, non-overlap is not necessary: the intervals can overlap and the groups can still differ significantly. If the single 95% CI for the difference (2.8 to 11.6 here) does not contain 0, then the groups differ with p<0.05.
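The interval arithmetic can be sketched using the group summaries reported for minimal MAP and the slides' approximate multiplier of 2. The variable names are illustrative.

```python
import math

# Group summaries reported earlier for minimal MAP.
n_ai, mean_ai, sd_ai = 42, 56.1666667, 10.7824634
n_non, mean_non, sd_non = 38, 63.4122807, 8.7141575

def approx_ci(estimate, se, multiplier=2.0):
    """Approximate 95% CI: estimate +/- 2 SE (the slides' rule of thumb)."""
    return estimate - multiplier * se, estimate + multiplier * se

sem_ai = sd_ai / math.sqrt(n_ai)
sem_non = sd_non / math.sqrt(n_non)
se_diff = math.sqrt(sem_ai ** 2 + sem_non ** 2)

ci_ai = approx_ci(mean_ai, sem_ai)
ci_non = approx_ci(mean_non, sem_non)
ci_diff = approx_ci(mean_non - mean_ai, se_diff)

groups_do_not_overlap = ci_ai[1] < ci_non[0]
diff_excludes_zero = ci_diff[0] > 0 or ci_diff[1] < 0

print("AI CI:", ci_ai, " Non-AI CI:", ci_non)
print("CI for difference:", ci_diff)
print("non-overlapping:", groups_do_not_overlap, " excludes 0:", diff_excludes_zero)
```

With unrounded inputs the interval for the difference comes out close to the 2.8 to 11.6 quoted above; small discrepancies are due to rounding.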

Back to Paper: Experimental Units Cannot use t-test for comparing lab data for multiple blood draws per subject.

Tests on Percentages
Is 26.3% vs. 61.9% statistically significant (p<0.05), i.e., is the difference so large that it would have a <5% chance of occurring if the groups do not really differ?
Solution: same theme as for means. Find a test statistic and compare it to its expected values if groups do not differ. See next slide.

Tests on Percentages
[Figure: chi-square distribution of the test statistic if groups do not differ. The expected value is 1; 95% of the area lies below 5.99; the observed value 10.2 is marked, with area 0.002 beyond it.]
Here, the test statistic is a ratio, expected to be 1, rather than a difference, expected to be 0. Test statistic = 10.2 >> 5.99, so p<0.05. In fact, p=0.002.
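A Pearson chi-square statistic for two percentages can be sketched as below. The counts are an assumption: 10 of 38 and 26 of 42 are inferred from 26.3% and 61.9% together with the group sizes given earlier, and are not stated on the slides. Under that assumption the table has 1 degree of freedom, whose 5% critical value is 3.84; the 5.99 marked on the slide is the critical value for 2 degrees of freedom. The conclusion is the same with either cutoff.

```python
import math

# ASSUMPTION: counts inferred from 26.3% of 38 and 61.9% of 42.
# Rows are the two groups; columns are outcome yes / no.
table = [[10, 28],
         [26, 16]]

def pearson_chi2(t):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in t]
    col_totals = [sum(col) for col in zip(*t)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(t):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

chi2 = pearson_chi2(table)
# For 1 degree of freedom, chi-square is a squared standard normal,
# so its upper tail area is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.1f}, p = {p_value:.4f}")
```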

Example for Conceptual Approach
Consider a parallel study:
1. Randomize an equal number of subjects to treatment A or treatment B.
2. Follow all subjects for a specified period of time.
3. Measure X = post-pre change in an outcome, such as cholesterol.
Primary Aim: Do treatments A and B differ in mean effectiveness?
Restated aim: If μ_A and μ_B are the true, unknown, mean post-pre changes that would occur if all potential subjects received treatment A or treatment B, do we have evidence from our limited sample that μ_A ≠ μ_B?

Extreme Outcome #1 Suppose results from the study are plotted as: [Plot of X for groups A and B; each point is a separate subject.] Obviously, B is more effective than A.

Extreme Outcome #2 Suppose results from the study are plotted as: [Plot of X for groups A and B; each point is a separate subject.] Obviously, A and B are equally effective.

More Realistic Possible Outcome I Suppose results from the study are plotted as: [Plot of X for groups A and B; each point is a separate subject.] Is the overlap small enough to claim that B is more effective?

More Realistic Possible Outcome II Suppose the ranges are narrower, with the same group mean difference: [Plot of X for groups A and B; each point is a separate subject.] Now, is this minor overlap sufficient to come to a conclusion?

More Realistic Possible Outcome III Suppose the ranges are wider, but so is the group difference: [Plot of X for groups A and B; each point is a separate subject.] Is the overlap small enough to claim that B is more effective?

More Realistic Possible Outcome IV Here, the ranges for X are the same as on the last slide, but there are many more subjects: [Plot of X for groups A and B; each point is a separate subject.] So, just examining the overlap isn’t sufficient to come to a conclusion, since intuitively the larger N should affect the results.

Our Goal
Goal: We need a rule that can be consistently applied to most studies to make the decision whether or not μ_A ≠ μ_B.
From the previous 4 slides, relevant measures that will go into our decision rule are:
1. Number of subjects, N; could be different for the groups.
2. Difference between groups in observed means (X-bar for A and for B subjects).
3. Variability among subjects (SD for A and B subjects).
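One way to see how these three ingredients combine is a t-type statistic. The sketch below makes simplifying assumptions that are not from the slides: equal group sizes, a common SD, and made-up numbers. It shows that the same mean difference and SD give a larger statistic when N is larger.

```python
import math

def t_statistic(mean_a, mean_b, sd, n_per_group):
    """t for two equal-sized groups sharing one SD (simplifying assumptions)."""
    se_diff = sd * math.sqrt(2 / n_per_group)
    return (mean_b - mean_a) / se_diff

# Hypothetical values: same means and SD, different numbers of subjects.
t_small = t_statistic(mean_a=10.0, mean_b=13.0, sd=5.0, n_per_group=10)
t_large = t_statistic(mean_a=10.0, mean_b=13.0, sd=5.0, n_per_group=40)

print(f"N = 10 per group: t = {t_small:.2f}")
print(f"N = 40 per group: t = {t_large:.2f}")
```

Quadrupling N doubles the statistic here, because the SE of the difference shrinks with the square root of N.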

Goal, Continued
Goal: We need a rule that can be consistently applied to most studies to make the decision whether or not μ_A ≠ μ_B.
Other relevant issues:
1. Our conclusion could be wrong. We need to incorporate a mechanism for minimizing that possibility.
2. Small differences are probably unimportant. Can we incorporate that as well?

A Graphical Look at All of the Issues
The figure on the following slide shows most of the issues that are involved in testing hypotheses. It is complicated, but we will go through each of the factors that it addresses, on slides after the figure:
1. Null hypothesis H_0 vs. alternative hypothesis H_A.
2. Decision rule: Choose H_A if … [involves Ns, means and SDs].
3. α = Probability(Type I error) = Prob(choosing H_A when H_0 is true).
4. β = Probability(Type II error) = Prob(choosing H_0 when H_A is true).
5. What changes if N were larger?

Graphical Representation of Hypothesis Tests

1: Null hypothesis H_0 vs. alternative hypothesis H_A.
All statistical tests have two hypotheses to choose from:
The null hypothesis states a negative conclusion, that there is “no effect”, which could mean various specific outcomes in different studies. It always includes at least one mathematical expression that is 0. Here, the null hypothesis is H_0: μ_A - μ_B = 0. This states that the post-pre changes are, on the average, the same for A as for B. The left (red) curve has its peak at this 0.
The alternative hypothesis includes every possibility other than 0, i.e., H_A: μ_A - μ_B ≠ 0. In the figure, we chose just one alternative for illustration, namely that μ_A - μ_B = 3. The right (blue) curve has its peak at this value of 3.
For each curve, the height represents the relative frequency of values of the observed mean difference x-bar, so x-bar is most likely to fall near the peak.

2: Decision Rule for Choosing H_0 or H_A. A poor, but reasonable, rule.
First suppose that we only consider choosing between H_0 and the particular H_A: μ_A - μ_B = 3, as in the figure. Common sense might say that we calculate x-bar (which is the mean of changes for A subjects, minus the mean of changes for B subjects), and then choose H_0 if x-bar is closer to 0, the hypothesized value under H_0, or choose H_A if it is closer to 3, the hypothesized value for H_A. The green line in the figure is at the x-bar from the sample, which is 1.128, and so H_0 would be chosen with this rule, since 1.128 is closer to 0 than to 3.
A problem with this rule is that we cannot state how certain we are about our decision. It seems like the reasonable choice between the 2 possibilities, but if we used the rule in many studies, we could not say that most (90%?, 95%?) were correct.

2: Decision Rule for Choosing H_0 or H_A. The correct rule.
To start to quantify the certainty of some conclusions we will make, recall the reasoning for confidence intervals. If H_0 is true, we expect that x-bar will not only be close to 0, but that with 95% probability, it will be within about* ±2SE of 0, i.e., between about -2.8 and +2.8. This is the region under the H_0 (red) curve that is not hatched with \\\.
Thus, the decision rule is: Choose H_A if x-bar is outside 0±2SE, the critical region.
The reason for using this rule is that if H_0 is really true, then there is only a 5% chance we would get an x-bar in the critical region. Thus, when H_0 is true, there is only a 5% chance that we wrongly decide on H_A in any particular test. Roughly, if the rule is applied consistently to studies in which H_0 is true, then only 5% of those tests will give false positive conclusions, although which ones are wrong is unknown.
*See a textbook for exact calculations. The multiplier is slightly larger than 2.
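The rule, and its long-run false positive rate, can be sketched as follows. SE = 1.4 is an assumption inferred from 2SE = 2.8; the function name, the seed, and the number of simulated studies are illustrative. The simulation draws x-bar from a normal curve centered at 0, i.e., it assumes H_0 is true in every simulated study.

```python
import random

SE = 1.4  # ASSUMPTION: inferred from the critical region 0 +/- 2.8 = 0 +/- 2 SE

def decide(xbar, se, multiplier=2.0):
    """Choose H_A if xbar falls outside 0 +/- multiplier * SE, else H_0."""
    return "H_A" if abs(xbar) > multiplier * se else "H_0"

decision = decide(1.128, SE)  # the sample x-bar from the figure
print("decision for x-bar = 1.128:", decision)

# Simulate many studies in which H_0 is true and count false positives.
rng = random.Random(0)
n_studies = 100_000
false_positives = sum(
    decide(rng.gauss(0.0, SE), SE) == "H_A" for _ in range(n_studies)
)
false_positive_rate = false_positives / n_studies
print(f"false positive rate under H_0: {false_positive_rate:.3f}")
```

With a multiplier of exactly 2 the simulated rate comes out slightly under 5%, since the multiplier that gives exactly 5% under a normal curve is about 1.96.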

3: Probabilities of False Positive Conclusions
A false positive conclusion, i.e., choosing H_A (positive conclusion) when H_0 is really true (so the conclusion is false), is considered the more serious error, denoted “Type I”. We have guaranteed (previous slide) that the rate for this error, denoted α = level of significance, is 0.05, i.e., that there is a 5% chance of it occurring when H_0 is true.
The 0.05 or 5% value is just the conventional level of risk for positive conclusions that scientists have decided is acceptable. The FDA also requires this level in most clinical studies.
The concept carries over to other levels of risk, though, and statistical tables can determine the critical region for other levels, e.g., approximately 0±1.65SE for α=0.10, where we would choose H_A more often, and make twice as many false positive mistakes in the long run in so doing.
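The multipliers that a statistical table would give for different α can be sketched with the standard library, assuming a two-tailed test and the normal approximation. The function name is illustrative.

```python
from statistics import NormalDist

def critical_multiplier(alpha):
    """Two-tailed critical multiplier: reject H_0 if |x-bar| > multiplier * SE."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: critical region is outside 0 +/- "
          f"{critical_multiplier(alpha):.2f} SE")
```

A larger α gives a smaller multiplier, so H_A is chosen more often.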

4: Probabilities of False Negative Conclusions
In our figure example, we choose H_0, i.e., no treatment difference, i.e., a negative conclusion, since x-bar = 1.128 is between -2.8 and +2.8. If we had chosen H_A, we would have done so with a rule that wrongly chooses H_A only 5% of the time when H_0 is true. Can we also quantify the chances of a false negative conclusion, which we might be making here?
Yes, but it will depend on what really constitutes a “false negative”. I.e., we conclude μ_A - μ_B = 0, but if really μ_A - μ_B = 0.0001, are we wrong in a practical sense?
Often, a value for a clinically relevant effect is specified, such as 3 in the figure example. Then, if H_A: μ_A - μ_B = 3 is really true, but we choose H_0, we have made a Type II error. Its probability, β, is the area under the correct (now H_A, blue) curve in the region where H_0 is chosen (///). The computer needs to calculate this, and it is 0.41 here.
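The area that defines β can be sketched under a normal approximation. SE = 1.4 is an assumption inferred from the ±2.8 cutoff, so the result is close to, but does not exactly reproduce, the 0.41 quoted above, which presumably comes from an exact calculation.

```python
from statistics import NormalDist

SE = 1.4         # ASSUMPTION: inferred from the cutoff 2.8 = 2 * SE
CUTOFF = 2.8     # H_0 is chosen when x-bar is between -2.8 and +2.8
TRUE_DIFF = 3.0  # the clinically relevant difference under H_A

def type2_error(true_diff, se, cutoff):
    """Area under the H_A curve (centered at true_diff) where H_0 is chosen."""
    h_a_curve = NormalDist(mu=true_diff, sigma=se)
    return h_a_curve.cdf(cutoff) - h_a_curve.cdf(-cutoff)

beta = type2_error(TRUE_DIFF, SE, CUTOFF)
print(f"approximate beta = {beta:.2f}")
```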

3 and 4: Tradeoffs Between Risks of Two Errors
In our figure example, if μ_A - μ_B = 3 is the smallest difference that we care about (smaller differences are 0 in a practical sense), then we have an α=0.05 chance of wrongly declaring that treatments differ when in fact they are identical, and a β=0.41 chance of declaring them the same when they really differ by 3.
If we try to decrease the risk of one of the errors, the risk of the other error increases, i.e., α↑ as β↓. [This is the same as the tradeoff between sensitivity and specificity of diagnostic tests.]
To visualize it on our figure, imagine shifting the ///\\\ demarcation at 2.8 to the left, to say 2.7. That increases α. Then the /// area, i.e., β, decreases.
Practical application: If A is a current treatment, and B is a potential new one, then smaller αs mean that we are more concerned about marketing a non-superior new drug. Smaller βs mean we are more concerned about missing a superior new drug.
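The tradeoff can be sketched numerically by moving the cutoff and recomputing both error rates. As before, SE = 1.4 and the normal curves are assumptions, so the values are approximate rather than the slides' exact figures.

```python
from statistics import NormalDist

SE = 1.4         # ASSUMPTION: inferred from the cutoff 2.8 = 2 * SE
TRUE_DIFF = 3.0  # difference under the specific H_A in the figure

def alpha_for(cutoff, se=SE):
    """Chance that x-bar falls outside +/- cutoff when H_0 (mean 0) is true."""
    h_0_curve = NormalDist(mu=0.0, sigma=se)
    return 2 * (1 - h_0_curve.cdf(cutoff))

def beta_for(cutoff, se=SE, true_diff=TRUE_DIFF):
    """Chance that x-bar falls inside +/- cutoff when H_A (mean 3) is true."""
    h_a_curve = NormalDist(mu=true_diff, sigma=se)
    return h_a_curve.cdf(cutoff) - h_a_curve.cdf(-cutoff)

for cutoff in (2.9, 2.8, 2.7):
    print(f"cutoff {cutoff}: alpha = {alpha_for(cutoff):.3f}, "
          f"beta = {beta_for(cutoff):.3f}")
```

Moving the cutoff left raises α and lowers β; moving it right does the opposite.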

5: Effect of Study Size on Risks of Error
In the previous slide, the FDA may want a small α, and the drug company might want a small β. To achieve both, a larger study could be performed. We can verify this with our graph.
In our figure example, suppose we had had a larger study, say twice as many subjects in each group. Then, both curves will be narrowed, since their widths depend on SE, which has √N in the denominator.
If we maintain α=0.05, the ///\\\ demarcation will shift to the left due to the narrowed left curve, and β will be much smaller, due to both the narrower right curve and the demarcation shift. The demarcation could then be shifted to the right to lower α, which increases the current β, but still keeps it small.
There are algorithms to choose the right N to achieve any desired α and β.
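The effect of a larger study can be sketched by scaling N and recomputing β with the cutoff kept at 2 SE. The starting SE = 1.4 is an assumption inferred from the ±2.8 cutoff, and the normal approximation is used throughout; the function name is illustrative.

```python
import math
from statistics import NormalDist

BASE_SE = 1.4    # ASSUMPTION: SE of x-bar in the original study (2.8 / 2)
TRUE_DIFF = 3.0  # difference under the specific H_A in the figure

def beta_with_scaled_n(n_factor, multiplier=2.0):
    """Beta when each group has n_factor times as many subjects.

    SE shrinks by sqrt(n_factor); the cutoff stays at multiplier * SE,
    so alpha is unchanged.
    """
    se = BASE_SE / math.sqrt(n_factor)
    cutoff = multiplier * se
    h_a_curve = NormalDist(mu=TRUE_DIFF, sigma=se)
    return h_a_curve.cdf(cutoff) - h_a_curve.cdf(-cutoff)

beta_original = beta_with_scaled_n(1)
beta_doubled = beta_with_scaled_n(2)

for factor in (1, 2, 4):
    print(f"{factor} x N: beta = {beta_with_scaled_n(factor):.2f}")
```

Under these assumptions, doubling N cuts β to roughly a third of its original value while α stays the same.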

Power of a Study
Statistical power = 1 – β.
Power is thus the probability of correctly detecting an effect in a study, if the effect really exists. In our example, the drug company is really thinking not in terms of β, but in terms of the ability of the study to detect that the new drug is actually more effective, if in fact it is.
Since the FDA requires α=0.05, a major component of designing a study is the determination of its size so that it has sufficient power. This is the topic for the next session, #4.
