Biostatistics in Practice Peter D. Christenson Biostatistician Session 4: Study Size for Precision or Power
Session 4 Issue How many subjects?
Session 4 Preparation We have been using a recent study on hyperactivity in children under diets with various amounts of food additives for the concepts in this course. The questions below, based on this paper, are intended to prepare you for session 4, which is on determining the size of a study. 1. How many children were deemed necessary to complete the entire study? Use the second column on the 4th page of the paper.
Session 4 Preparation #1
Session 4 Preparation #2 2. The authors accounted for some children to start, but not complete the study. What percentage of "dropouts" did they build into their calculations? The statistical requirements are for 80 “evaluable” subjects. They decided on a study size of 120, so they were allowing up to 40/120 = 33% of subjects to not complete.
Session 4 Preparation #3 3. The authors will perform a test similar to the t-test we discussed last week, to conclude whether there is evidence that hyperactivity differs under Mix A compared with placebo. There are two mistakes that they may make in this decision. What are they? I. Conclude Mix A ≠ Placebo, when in truth Mix A = Placebo (false positive). II. Conclude Mix A = Placebo, when in truth Mix A ≠ Placebo (false negative).
Session 4 Preparation #4 and #5 4. How large a difference between Mix A and placebo do they want to detect? 5. Does the value of 0.32 in the study size description (second column on the 4th page) refer to a difference? They seem to imply it is a SD. Based on what we have said about tests comparing "signal" to "noise", do you think both a difference and SD are relevant for determining the study size?
Session 4 Preparation: #4 and #5
Session 4 Preparation #4 and #5 They want to detect a difference Δ of 0.32 in GHA. [ Smallest clinically relevant Δ? ] Both the Δ and SD need to be accounted for. Effect size = Δ / SD = “# of SDs”. Remember, reference range = 4 to 6 SDs. For this study (unusual) GHA is scaled to have a SD of 1, so Δ = effect size =0.32.
Session 4 Goals
Review estimating and testing
Δ, SD and N in estimating and testing
False positive and false negative conclusions from tests
What is needed to determine study size
Software for study size
Review Estimation Typically:
1. Have sample of N representing “all”.
2. Find mean and SD from the N units.
3. Expect new unit to be within mean ± 2SD.
4. Confident (95%) that mean of all is in mean ± 2SD/√N.
May have this info for one or multiple groups.
Study Size to Achieve Precision Precision refers to how well a measure is estimated. Margin of error = the ± value (half-width) of the 95% confidence interval. Lower margin of error ↔ greater precision. To achieve a specified margin of error, say d, solve the CI formula for N: For a mean, d = 2SD/√N, so N = (2SD/d)². For a proportion p, d = 2√[p(1−p)/N] ≤ 1/√N. Most polls use N ≈ 1000, so margin of error on % ≈ 3%.
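The two margin-of-error formulas on this slide can be sketched in a few lines of Python. The function names and the SD=10, d=2 inputs are made up for illustration; only the N ≈ 1000 poll example comes from the slide.

```python
import math

def n_for_mean(sd, d):
    """Smallest N whose 95% margin of error, 2*SD/sqrt(N), is at most d."""
    return math.ceil((2 * sd / d) ** 2)

def margin_for_proportion(n, p=0.5):
    """Approximate 95% margin of error for a proportion.
    p=0.5 is the worst case, where the margin equals 1/sqrt(N)."""
    return 2 * math.sqrt(p * (1 - p) / n)

# Hypothetical inputs: SD of 10 units, target margin of 2 units.
print(n_for_mean(sd=10, d=2))

# Poll with N = 1000: worst-case margin, as a proportion.
print(round(margin_for_proportion(1000), 3))
```

Halving the target margin d quadruples N, because d enters the formula squared.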
Review Statistical Tests 1. Calculate a standardized quantity for the particular test, a “test statistic”: Often: t = (Mean – Expected) / SE(Mean) If 1 group, Mean may be a change score. If 2 groups, Mean may be the difference between means for two groups. Expected = 0 if no effect. Looking for evidence to contradict “no effect”. Rarely: Mean is not a Δ and Expected ≠ 0.
Review Statistical Tests 2. Compare the test statistic to the range of values it should have if expectations are correct. Often: The range follows an approximately normal bell curve. 3. Declare “effect” if test statistic is too extreme, relative to this range. Often: |test statistic| > ~2 → Declare effect.
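The three steps above can be sketched for the one-group case. The change scores are made-up numbers, and the cutoff of 2 is the rough rule from the slide; software would use the exact t distribution, which gives a somewhat larger cutoff for a sample this small.

```python
import math
import statistics

def t_statistic(values, expected=0.0):
    """t = (Mean - Expected) / SE(Mean), with SE(Mean) = SD / sqrt(N)."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return (mean - expected) / se

# Hypothetical change scores for 8 subjects (not from any study).
changes = [1.2, -0.4, 0.9, 1.5, 0.3, 0.8, -0.1, 1.1]

t = t_statistic(changes)  # Expected = 0 if no effect
decision = "declare effect" if abs(t) > 2 else "do not declare effect"
print(round(t, 2), decision)
```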
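t-Test Declare effect if test statistic is “too extreme”. How extreme? Convention: “Too extreme” means < 5% chance of wrongly declaring an effect. t = (mean – expected) / (SD/√N). [Figure: bell curve of the test statistic when there is no effect. Expect a 95% chance of falling in the central region, labeled “Declare: No Effect”; each 2.5% tail is labeled “Declare: Effect”.]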
t-Test Declare effect if test statistic is “too extreme”. Convention: “Too extreme” means < 5% chance of wrongly declaring an effect. But, what are the chances of wrongly declaring no effect? [Figure: same bell curve as on the previous slide, with a 95% central region (“Declare: No Effect”) and two 2.5% tails (“Declare: Effect”).]
t-Test Declare effect if test statistic is “too extreme”. But, what are the chances of wrongly declaring no effect? To answer, we need a similar curve for the range of values expected when there is an effect. [Figure: same bell curve as on the previous slide, with a 95% central region (“Declare: No Effect”) and two 2.5% tails (“Declare: Effect”).]
Two Possible Errors from t-test Consider just one possible real effect, the value 3. [Figure: two bell curves on the axis Δ = Effect (Difference Between Group Means); the axis shows just Δ, not t = Δ/SE(Δ). One curve is centered at no real effect (0), the other at real effect = 3. Effect in study = 1.13. Values beyond the cutoff lead to “Conclude effect”.] \\\ = Probability: Conclude Effect, But no Real Effect (5%). /// = Probability: Conclude No Effect, But Real Effect (41%).
Graphical Representation of t-test Suppose we need stronger proof; i.e., shift cutoff to right. Then, chance of false positive is reduced to ~1%, but false negative is increased to ~60%. [Figure: the same two bell curves, centered at no real effect (0) and real effect = 3, on the axis Δ = Effect (Difference Between Group Means), with the “Conclude effect” cutoff moved to the right.]
Power of a Study Statistical power is the sensitivity of a study to detect real effects, if they exist. Power = 100% − (chance of a false negative), so it is 100% − 41% = 59% two slides back.
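A minimal sketch of how power depends on the cutoff, using a normal approximation. The standard error of 1.37 is a hypothetical value chosen only for illustration; it is not given on the slides, so the printed numbers need not match the slide figures exactly.

```python
from statistics import NormalDist

def power(delta, se, alpha=0.05):
    """Approximate power of a two-sided test when the real effect is delta.
    Normal approximation; ignores the tiny chance of declaring an effect
    in the wrong direction."""
    z = NormalDist()
    cutoff = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z.cdf(delta / se - cutoff)

se = 1.37  # hypothetical SE of the difference (assumption, not from the slides)
for alpha in (0.05, 0.01):
    p = power(delta=3, se=se, alpha=alpha)
    print(f"alpha={alpha}: power={p:.2f}, false negative={1 - p:.2f}")
```

Demanding stronger proof (smaller alpha) lowers the false positive rate but also lowers power, which is the tradeoff shown on the slide with the shifted cutoff.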
Two Possible Errors in a Diagnostic Test
                          Truth: No Disease    Truth: Disease
Diagnosis: No Disease     Correct              Error
Diagnosis: Disease        Error                Correct
                          Specificity          Sensitivity
Sensitivity: want high for a screening test. Specificity: need high in follow-up test. Specificity ↓ as Sensitivity ↑.
Analogy with Diagnostic Testing
                            Truth: No Effect     Truth: Effect
Study Claims: No Effect     Correct              Error (Type II)
Study Claims: Effect        Error (Type I)       Correct
                            Specificity          Sensitivity (Power)
Typical: Set α=0.05, so Specificity=95%. Power: Maximize. Choose N for 80%.
Summary: Factors Related to Study Size Five factors are inter-related. Fixing four of these specifies the fifth: 1. Study size, N. 2. Power (often 80% is desirable). 3. p-value cutoff (level of significance, e.g., 0.05). 4. Magnitude of the effect to be detected (Δ). 5. Heterogeneity among subjects (SD). The next slide shows how these factors (except SD) are typically presented in a study protocol.
Quote from Local Protocol Example Thus, with a total of the planned 80 subjects, we are 80% sure to detect (p<0.05) group differences if treatments actually differ by at least 5.2 mm Hg in MAP change, or by a mean 0.34 change in number of vasopressors.
Comments on the Previous Table Typically power=80% and almost always p<0.05. SD was not mentioned. There may be several estimates from other studies (different populations, intervention characteristics such as dosage, time, etc). Here, a pilot study exactly like the trial was performed by the same investigators. Detectable difference refers to the unknown true difference for “all”, not the difference that will be seen eventually in the N study subjects. N ↑ as detectable difference ↓. So, the major consideration is usually a tradeoff between N and the detectable difference.
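The tradeoff between N and the detectable difference can be sketched for a two-group comparison of means. This uses a normal approximation, so dedicated software based on the t distribution will give slightly larger N. The effect sizes (Δ/SD) in the loop are arbitrary illustrative values.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect effect_size = delta / SD
    in a two-group comparison of means (normal approximation)."""
    z = NormalDist()
    k = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * (k / effect_size) ** 2)

# Illustrative effect sizes: N rises sharply as the detectable difference shrinks.
for es in (1.0, 0.75, 0.5, 0.25):
    print(f"effect size {es}: about {n_per_group(es)} per group")
```

Halving the detectable difference roughly quadruples the required N.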
Free Study Size Software
Local Protocol Example: Calculations Pilot data: SD=8.16 for ΔMAP in 36 subjects. For p-value<0.05, power=80%, N=40/group, the detectable Δ of 5.2 in the previous table is found as:
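The calculation shown on the original slide did not survive in this text. A minimal sketch using a normal approximation, with the slide's inputs (SD=8.16, N=40/group, p<0.05, power=80%), is given below. The function name is made up for illustration. The approximation gives about 5.1; software based on the t distribution gives a slightly larger value, in line with the 5.2 quoted.

```python
import math
from statistics import NormalDist

def detectable_difference(sd, n_per_group, alpha=0.05, power=0.80):
    """Approximate detectable difference between two group means
    (normal approximation): (z_alpha/2 + z_power) * SD * sqrt(2 / n)."""
    z = NormalDist()
    k = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return k * sd * math.sqrt(2 / n_per_group)

# Inputs from the slide: pilot SD = 8.16, 40 subjects per group.
delta = detectable_difference(sd=8.16, n_per_group=40)
print(round(delta, 2))
```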
Hyperactivity Study Size Study is 1-sample or paired (for each age group). SD=1, Δ=0.32. Use p-value<0.05. Want power=80%. Solve for N in software to get N=79.
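A rough check of N=79 without dedicated software, as a sketch under stated assumptions: the normal approximation alone gives 77, and adding a commonly used small-sample correction of z²/2 for a one-sample t-test brings it to 79. The function name and the `small_sample_correction` flag are made up for illustration; the software used in the course may compute N differently.

```python
import math
from statistics import NormalDist

def n_one_sample(delta, sd=1.0, alpha=0.05, power=0.80,
                 small_sample_correction=True):
    """Approximate N for a 1-sample or paired comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = ((z_alpha + z_power) * sd / delta) ** 2
    if small_sample_correction:
        # Approximate adjustment for using t rather than normal cutoffs.
        n += z_alpha ** 2 / 2
    return math.ceil(n)

# Inputs from the slide: SD = 1, delta = 0.32, p < 0.05, power = 80%.
print(n_one_sample(0.32))
print(n_one_sample(0.32, small_sample_correction=False))
```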
Study Size for Some Other Study Types
1. Phase I: Dose escalation. Safety, not efficacy. No power. Use N=3 at low dose; if safe, N=3 at higher dose, etc.
2. Phase II: Small, primarily safety; look for enough evidence of efficacy to go on to Phase III. Often staged: e.g., if 3/10 respond, test 10 more, etc.
3. Mortality studies: Patterns of deaths over time can be used in sample size calculations. Software not in the online package.
Summary: Study Size and Power
1. Power analysis assures that effects of a specified magnitude can be detected.
2. Five factors including power are inter-related. Fixing four of these specifies the fifth.
3. For comparing means, need pilot data or data from other studies to estimate SD for the outcome measure. Comparing %s does not require SD.
4. A power analysis helps support the believability of a study if its conclusions turn out to be negative.