Biostatistics in Practice Peter D. Christenson Biostatistician Session 4: Study Size and Power
Readings for Session 4 from StatisticalPractice.com Sample Size Calculations Some underlying theory and some practical advice. Controlled trials
Outline for this Session 1. Example from a current local protocol. 2. Review statistical hypothesis testing. 3. Formulate example as hypothesis test. 4. Software for study size and power. 5. Other issues.
Local Protocol Example Brief study outline: Subjects arrive at ER with TBI (traumatic brain injury). Those with low cortisol, indicating possible adrenal insufficiency and pituitary damage, may or may not recover better if given hydrocortisone (HC) injections. Subjects who consent are randomized to receive HC or placebo for 4 days. Changes in recovery status from pre to post injection periods are compared between HC and placebo groups. Project #10038: Dan Kelly & Pejman Cohan Hypopituitarism after Moderate and Severe Head Injury
Local Protocol Example, Cont’d “The primary outcomes for the hydrocortisone trial are changes in mean MAP and vasopressor use from the 12 hours prior to initiation of randomized treatment to the 96 hours after initiation.” Mean changes in placebo subjects will be compared with those in hydrocortisone subjects using a two-sample t-test. Before examining the study size, let’s first discuss how the results will be analyzed.
Recall Statistical (t) Test From Last Session Suppose results from the study are plotted as: [Figure: change in MAP for each subject, plotted separately for the HC and placebo groups; Δ marks the difference between the two group means.] Each point is the change in MAP for an individual subject. [Of course, the real study will have many more subjects.] Is Δ large enough to claim that HC is more effective? Use a t-test.
Local Protocol Example: Analysis with t-test We are testing: H_0: μ_HC − μ_Placebo = 0 vs. H_A: μ_HC − μ_Placebo ≠ 0, where μ_HC is the expected post-minus-pre change in “all potential TBI patients” if HC therapy is applied as in this study, and μ_Placebo is the corresponding expected change under placebo. Our decision rule is: Choose H_A if the estimate of μ_HC − μ_Placebo from our limited sample, i.e., the observed mean change under HC minus the observed mean change under placebo, call it Δ, is too far from 0 (the value specified by H_0). “Too far” means Δ > t_c*SE or Δ < −t_c*SE, where t_c is usually about 2. SE is SE(Δ), calculated from the data; it decreases for larger N and smaller SD. In other words, choose H_A if |Δ| > t_c*SE, or equivalently |t| = |Δ/SE| > t_c. By following this rule, there is only a 5% probability of choosing H_A if in fact H_0 is true.
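The decision rule above can be sketched in a few lines of Python. This is only an illustration: the data values are invented, the function name `two_sample_t` is hypothetical, a pooled-variance SE is assumed, and the cutoff uses the slide's rough "about 2" rather than an exact critical value from the t distribution.

```python
import math
import statistics

def two_sample_t(hc_changes, placebo_changes):
    """Return (delta, se, t) for a pooled-variance two-sample t-test."""
    n1, n2 = len(hc_changes), len(placebo_changes)
    # Delta: observed mean change under HC minus observed mean change under placebo.
    delta = statistics.mean(hc_changes) - statistics.mean(placebo_changes)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * statistics.variance(hc_changes)
                  + (n2 - 1) * statistics.variance(placebo_changes)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return delta, se, delta / se

# Hypothetical changes in MAP (mmHg), invented for illustration only.
hc = [6.0, 9.5, 2.0, 11.0, 7.5, 4.0]
placebo = [1.0, -3.0, 4.5, 0.0, 2.5, -1.0]

T_C = 2.0  # the slide's rough critical value; the exact value depends on N

delta, se, t = two_sample_t(hc, placebo)
decision = "H_A" if abs(t) > T_C else "H_0"
print(f"delta={delta:.2f}  SE={se:.2f}  t={t:.2f}  -> choose {decision}")
```

With samples this small the exact critical value is somewhat larger than 2, so statistical software should be used for a real analysis.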
Potentially Underpowered Studies From the previous slide: By following this rule, there is only a 5% probability of choosing H_A if in fact H_0 is true. So, the probability is small (5%) that our study will (incorrectly) recommend that TBI subjects receive HC if it is worthless. But, is it able to correctly recommend that TBI subjects receive HC if it is effective? The probability of this is called the power of the study. Actually, there is not a single value for power. The study may have, say, 59% power if the true mean HC effect is 3 mmHg in MAP, but will have more power if the true effect is 4 mmHg, since the subjects are more likely to reflect this greater effectiveness. Let’s go back to last session’s graph to see this.
Graphical Representation of Power [Figure: two distributions of the effect (HC change − placebo change), one under H_0 (true effect = 0) and one under H_A (true effect = 3), with the effect seen in the study, 1.13, marked by a vertical line. Area shaded \\\ (5%) = probability of concluding H_A if H_0 is true. Area shaded /// (41%) = probability of concluding H_0 if H_A is true.] Power = 100 − 41 = 59%. Note greater power if larger N, and/or if true effect > 3.
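The dependence of power on the true effect and on N can be explored with a rough sketch. It assumes a two-sided test at the 5% level, a normal approximation to the t distribution, and the pilot SD (8.16) and group size (40) quoted later in the protocol calculation slide. The SE behind the figure is not given, so these numbers are not meant to reproduce its 59%. The function name `power_two_sample` is hypothetical.

```python
import math
from statistics import NormalDist

def power_two_sample(true_effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample comparison of means."""
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)   # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    shift = true_effect / se               # true effect in SE units
    # Probability that the estimate falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

SD = 8.16   # pilot SD from the protocol slides
N = 40      # subjects per group from the protocol slides

for effect in (0.0, 3.0, 4.0, 5.2):
    print(f"true effect {effect:>4}: power = {power_two_sample(effect, SD, N):.2f}")

# Doubling N raises the power for the same true effect.
print(f"true effect 3.0, N=80/group: power = {power_two_sample(3.0, SD, 80):.2f}")
```

At a true effect of 0 the "power" is just the 5% false-positive rate, which matches the \\\ area in the figure.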
P-Value Recall that our decision rule is: Choose H_A if |Δ| > t_c*SE, or |t| = |Δ/SE| > t_c. By following this rule, there is only a 5% probability of choosing H_A if in fact H_0 is true. In practice, though, we do not just report our decision as H_A or H_0. The p-value is the probability, if H_0 is correct, that we would observe a Δ at least as far from 0 as the one that actually occurred in the study. Here, p=Prob(Δ>1.13), which is the area under the H_0 curve to the right of the green line in the previous figure. Small p-values support H_A. Choosing H_A is equivalent to p<0.05, so the study result is reported as the p-value, and HC is declared to have an effect if p<0.05.
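A small sketch can make the link between the p-value and the decision rule concrete. It uses the standard normal distribution as a large-sample stand-in for the t distribution, and the t statistics are hypothetical. Both a one-tail area (as drawn in the figure) and the two-tail version that corresponds to the |t| > t_c rule are shown; the function names are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, used here in place of the t distribution

def p_one_sided(t):
    """Area under the H_0 curve in the upper tail beyond t."""
    return 1 - Z.cdf(t)

def p_two_sided(t):
    """Area under the H_0 curve in both tails beyond |t|."""
    return 2 * (1 - Z.cdf(abs(t)))

Z_CRIT = Z.inv_cdf(0.975)  # about 1.96, the large-sample version of t_c

for t in (0.5, 1.5, 2.0, 3.0):  # hypothetical t statistics
    p = p_two_sided(t)
    choose_ha = abs(t) > Z_CRIT
    # The two criteria always agree: p < 0.05 exactly when |t| exceeds the cutoff.
    print(f"t={t:.1f}  two-sided p={p:.3f}  p<0.05: {p < 0.05}  |t|>cutoff: {choose_ha}")
```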
Summary: Factors that Determine Study Size Five factors including power are inter-related. Fixing four of these specifies the fifth: 1. Study size, N. 2. Power (often 80% is desirable). 3. Significance level (the p-value cutoff, e.g., 0.05). 4. Magnitude of treatment effect to be detected. 5. Heterogeneity among subjects (standard deviation, SD). The next slide shows how these factors (except SD) are typically presented in a study protocol.
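As an illustration of "fixing four specifies the fifth", the sketch below solves for N per group given the other four factors. It assumes a two-sided comparison of two means with equal group sizes and uses the common normal-approximation formula, so dedicated software based on the t distribution will give slightly larger answers. The SD and effect values are the ones quoted in the protocol slides; the function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect `effect` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # cutoff for the significance level
    z_beta = z.inv_cdf(power)            # extra margin needed for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)

SD = 8.16  # pilot SD from the protocol slides

for effect in (5.2, 4.0, 2.6):
    print(f"detectable difference {effect}: about {n_per_group(effect, SD)} per group")
```

Halving the detectable difference roughly quadruples the required N, which is the tradeoff between N and detectable difference that usually dominates study planning.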
Quote from Local Protocol Example Thus, with a total of the planned 80 subjects, we are 80% sure to detect (p<0.05) group differences if treatments actually differ by at least 5.2 mm Hg in MAP change, or by a mean 0.34 change in number of vasopressors.
Comments on Table on Previous Slide Typically power=80% and almost always p<0.05 are fixed. SD was not mentioned. If available, several estimates of SD may be used (different populations, intervention characteristics such as dosage, time, etc). Here, a pilot study exactly like the trial was performed by the investigators. Detectable difference refers to the unknown true difference, μ HC - μ Placebo, not the difference that will eventually be seen in the study. N ↑ as detectable difference ↓. So, the major consideration is usually a tradeoff between N and the detectable difference.
Software for Study Size Calculations Calculations depend on the specific statistical method. We are using the t-test as an example, but the same concepts apply for, say, comparing % subjects who respond to treatment using another method such as a chi-square test. In software, you specify the method, and 4 of the 5 factors. The value of the fifth factor is calculated. Two free sites for calculations:
A Software Site for Study Size Calculations
Local Protocol Example, Calculations Pilot data: SD=8.16 for ΔMAP in 36 subjects. For p-value<0.05, power=80%, N=40/group, the detectable Δ of 5.2 in the previous table is found as:
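A minimal sketch of this kind of calculation, using a normal approximation rather than whatever software produced the protocol's figure (the function name `detectable_difference` is hypothetical):

```python
import math
from statistics import NormalDist

def detectable_difference(sd, n_per_group, alpha=0.05, power=0.80):
    """Approximate true difference in means detectable with the given power."""
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)   # SE of the difference between group means
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se

# Inputs from the slide: pilot SD = 8.16, N = 40 per group, p < 0.05, 80% power.
delta = detectable_difference(sd=8.16, n_per_group=40)
print(f"detectable difference: about {delta:.1f} mmHg")
```

The normal approximation lands a little below the protocol's 5.2; a calculation based on the t distribution, as most study-size software uses, gives a slightly larger value.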
Summary: Study Size and Power 1. Power analysis helps ensure that the study is large enough to detect effects of a specified magnitude. 2. Five factors including power are inter-related. Fixing four of these specifies the fifth. 3. For comparing means, need pilot data or data from other studies on variability of subjects for the outcome measure. [E.g., std dev from a previous study.] Comparing rates (%s) does not require pilot variability data, so a rate-based outcome can be used if no pilot data are available for means. 4. Helps support the believability of (superiority) studies if the conclusions turn out to be negative. 5. To demonstrate no effect (e.g., that a less invasive therapy is as effective as standard care), use an equivalence study design.
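The point that comparing rates needs no pilot variability data can be seen in a sketch of a study-size formula for two proportions: the variability is determined by the assumed rates themselves. The formula is a common normal-approximation one for a two-sided test with equal group sizes, the rates are hypothetical, and the function name `n_per_group_proportions` is illustrative only.

```python
import math
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to distinguish rates p1 and p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    # Variability terms come straight from the rates; no separate SD is supplied.
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical response rates, invented for illustration only.
print(n_per_group_proportions(0.30, 0.15))
print(n_per_group_proportions(0.30, 0.20))  # smaller difference, larger N
```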
Self-Test Exercise #1 A study was powered to detect a 10 point mean reduction in LDL cholesterol. A colleague claims that this means that if the subjects decrease LDL cholesterol by a mean 10 points, then p<0.05 and this will be a significant reduction. Explain.
Self-Test Exercise #2 True story: A protocol was designed with 80% power to detect (p<0.05) a 10% disease incidence in subjects receiving placebo vs. a 3.5% incidence in subjects receiving a new drug. This corresponds to a 65% reduction in disease incidence. A comment on the study was: “… there may not be a large enough sample to see the effect size required for a successful outcome. Power calculations indicate that the study is looking for a 65% reduction in incidence of … [disease]. Wouldn’t it also be of interest if there were only a 50% or 40% reduction, thus requiring smaller numbers and making the trial more feasible?” What is your comment on the comment?