Sample size and analytical issues for cluster trials David Torgerson Director, York Trials Unit
Background For any trial we want a sample large enough that, if there were a ‘true’ difference between the groups, this difference would be statistically significant. A Type II error occurs when we wrongly conclude there is no difference when there actually is.
Sample size calculations “Most hand calculations diabolically strain human limits, even for the easiest formula,..” (Schulz & Grimes, Lancet 2005)
Sample size formulae Usually a computer is needed to calculate sample sizes. However, a simple approximation for a two-armed randomised trial with 1:1 allocation and a continuous outcome (e.g., blood pressure) is: total n = 32/d², where d = effect size (difference/standard deviation) and 32 gives 80% power at 5% two-sided significance (use 42 for 90% power).
Example We want to investigate a treatment for back pain. The measure is the Roland and Morris back pain scale with a standard deviation of 4. If we want to detect a 2 point difference how many do we need? 2/4 = 0.5 = effect size (d); d² = 0.5 × 0.5 = 0.25; 32/0.25 = 128 in total for 80% power, 5% significance (use 42 instead of 32 for 90% power). NB using computer software the answer = 126.
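A minimal sketch of this hand approximation in Python (the function name is illustrative, not from the slides); it reproduces the back pain example.

```python
# Hand approximation for a 1:1 two-arm trial with a continuous outcome:
# total n ~ 32 / d^2 for 80% power (two-sided 5%), or 42 / d^2 for 90% power.

def approx_total_n(difference, sd, constant=32):
    """Approximate total sample size; constant = 32 for 80% power, 42 for 90%."""
    d = difference / sd          # standardised effect size
    return constant / d ** 2

print(approx_total_n(2, 4))      # 128.0 (software gives 126)
print(approx_total_n(2, 4, 42))  # 168.0 for 90% power
```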
Binary variables For a dichotomous variable (cured/not cured) the following is useful: total n = 32 × a(1 − a)/d², where d = difference between the two proportions and a = average of the two proportions.
Example Breast feeding rates are only 50% and we have an educational intervention which we think will increase this to 60%; how many do we need? d = 0.10, so d² = 0.01; a = (0.50 + 0.60)/2 = 0.55; a(1 − a) = 0.55 × 0.45 = 0.2475; 32 × 0.2475/0.01 = 792. We need 792 to have 80% power to show a 10% difference in breast feeding rates if it were present (use 42 for 90% power). NB using computer software the answer is 774.
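The same sketch for a binary outcome, again with an illustrative function name; it reproduces the breast feeding example.

```python
# Hand approximation for a binary outcome:
# total n ~ 32 * a * (1 - a) / d^2, where d is the difference in proportions
# and a is their average.

def approx_total_n_binary(p_control, p_intervention, constant=32):
    """Approximate total sample size; constant = 32 for 80% power, 42 for 90%."""
    d = p_intervention - p_control
    a = (p_control + p_intervention) / 2
    return constant * a * (1 - a) / d ** 2

print(round(approx_total_n_binary(0.50, 0.60)))  # 792 (software gives 774)
```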
Approximations The formulae slightly overestimate the true sample size needed. But they can be done on a hand calculator and you can impress the statisticians. What about cluster trials?
Cluster Sample Size Usual sample size estimates assume independence of observations. When people are members of the same cluster (e.g., classroom, GP surgery) they are more alike than we would expect by chance. The degree of this similarity is measured by the intra-cluster correlation coefficient (ICC).
ICC The ICC needs to be incorporated into the sample size calculation. The formula is: design effect = 1 + (m − 1) × ICC, where m is the number of people in the cluster. The design effect is the factor by which the sample size needs to be inflated.
Sample size example. Let’s assume for an individually randomised trial we need 128 people to detect an effect size of 0.5 with 80% power (2p = 0.05). Now assume we have 24 groups with 7 members each. The ICC is 0.05, which is quite high. 1 + (7 − 1) × 0.05 = 1.3, so we need to increase the sample size by 30%. Therefore, we will need 166 participants.
What happens if the cluster gets bigger? If our cluster size is twice as big (14), things begin to get really interesting: 1 + (14 − 1) × 0.05 = 1.65 (i.e., 211 participants). What about 30? 1 + (30 − 1) × 0.05 = 2.45 (i.e., 314 participants). Say we randomise a larger cluster, such as a school (n = 500): 1 + (500 − 1) × 0.05 = 25.95 (i.e., 3,322 participants).
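A short sketch of how the design effect inflates the individually randomised sample size of 128 as cluster size grows (the helper function is illustrative); it reproduces the figures above and the class-based example on the next slide.

```python
# Inflate an individually randomised sample size by the design effect
# 1 + (m - 1) * ICC for increasing cluster sizes m.

def inflated_n(base_n, m, icc):
    """Base sample size multiplied by the design effect."""
    return base_n * (1 + (m - 1) * icc)

for m in (7, 14, 30, 500):
    print(m, round(inflated_n(128, m, 0.05)))   # 166, 211, 314, 3322

print(round(inflated_n(128, 30, 0.4)))          # 1613 (the slide quotes 1,612, i.e. ~54 classes of 30)
```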
ICC size ICCs can be large for some things. ICCs for educational outcomes, for example, are often around 0.4 to 0.5. A class-based RCT with n = 30 per class and an ICC of 0.4 would need 1,612 participants, or 54 classes with 30 children in each class.
What makes the ICC large? If the treatment is applied to the health care provider (e.g., guidelines), this will increase ICCs for patients. If the cluster relates to the outcome variable (e.g., smoking cessation and schools). If members of the cluster are expected to influence each other (e.g., households).
Reviews of Cluster Trials (clustering allowed for in sample size / in analysis):
Donner et al. (1990), 16 non-therapeutic intervention trials, 1979–1989: <20% / <50%
Simpson et al. (1995), 21 trials from American Journal of Public Health and Preventive Medicine, 1990–…: …% / 57%
Isaakidis and Ioannidis (2003), 51 trials in Sub-Saharan Africa, 1973–2001 (half post 1995): 20% / 37%
Puffer et al. (2003), 36 trials in British Medical Journal, Lancet, and New England Journal of Medicine, 1997–…: …% / 92%
Eldridge et al. (Clinical Trials 2004), 152 trials in primary health care: …% / 59%
Sample Size Problems Cluster Trials Demand Larger Sample Sizes
Conditional ICC The key ICC is the conditional ICC; usually we only have access to estimates of the unconditional ICC. If we know, and can measure, characteristics that drive the ICC, we can adjust for them and lower the ICC. Cook claims that using covariates allows a school-based RCT to reduce the number of schools from about 50 to around 22.
Summary of sample size The KEY thing is the size of the cluster. It is nearly always better to have lots of small clusters than a few large ones (e.g., a trial with small hospital wards, GP practices or classrooms will, ceteris paribus, be better than one with large clusters). BUT if the ICC is tiny it may not affect the sample size too much.
Cluster Trials: Should I do one? If possible avoid like the plague. BUT although they are difficult to do properly, when done properly they WILL give more robust answers than other methods (e.g., observational data). Is it possible to avoid them and do an individually randomised trial instead?
Contamination An important justification for their use is SUPPOSED ‘contamination’ between participants allocated to the intervention and people allocated to the control.
Spurious Contamination? A trial proposal to cluster randomise practices for a breast feeding study: new mothers might talk to each other! A trial for reducing cardiac risk factors: again, patients might talk to each other. A trial for removing allergens from the homes of asthmatic children.
Contamination Contamination occurs when some of the control patients receive the novel intervention. It is a problem because it reduces the effect size, which increases the risk of a Type II error (concluding there is no effect when there actually is).
Patient level contamination In a trial of counselling adults to reduce their risk of cardiovascular disease general practices were randomised to avoid contamination of control participants by intervention patients. Steptoe. BMJ 1999;319:943.
Accepting Contamination We should accept some contamination and deal with it through individual randomisation and by boosting the sample size, rather than going for cluster randomisation. Torgerson BMJ 2001;322:355.
Counselling Trial Steptoe et al. wanted to detect a 9% reduction in smoking prevalence with a health promotion intervention. They needed 2,000 participants (rather than 1,282) because of clustering. Had they randomised 2,000 individuals, they would have been able to detect a 7% reduction allowing for 20% CONTAMINATION. Steptoe. BMJ 1999;319:943.
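A sketch of one common adjustment for contamination under individual randomisation: divide the sample size by (1 − c)², where c is the expected contamination rate. This is an assumption about the method used, but it roughly reproduces the figures quoted above.

```python
# Inflate an individually randomised sample size to allow for contamination
# by dividing by (1 - c)^2, where c is the contamination rate.

def inflate_for_contamination(n, contamination):
    return n / (1 - contamination) ** 2

print(round(inflate_for_contamination(1282, 0.20)))  # 2003, close to the 2,000 quoted above
```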
Comparison of Sample Sizes NB: Assuming an ICC of 0.02.
Misplaced contamination The ONLY health study I’m aware of to date that directly compared an individually randomised design with a cluster design showed no evidence of contamination. In an RCT of nurse-led cardiovascular risk factor screening, some ‘intervention’ clusters had participants allocated to no treatment. NO contamination was observed.
What about dilution bias? If, in the presence of contamination, we use individual allocation we might observe a difference that is statistically significant but not clinically or economically significant. Dilution has biased the estimate towards the null.
Dealing with contamination Sometimes there may be substantial contamination, and this will dilute the treatment effects; it may, however, still be best to individually randomise if you can measure the contamination.
Per-protocol analysis? We cannot adjust for contamination using either per-protocol or on-treatment analysis: these popular analytical methods are plainly wrong because they break the random allocation.
CACE analysis: a solution? If we can measure contamination we can use a statistical approach known as Complier Average Causal Effect (CACE) analysis.
Assumptions of CACE Assumption 1: if the control group had been offered treatment, the same proportion would have complied with it; this must be true, as random allocation ensures it. Assumption 2: merely being offered treatment has no effect on outcomes.
Example CRC screening In an RCT of bowel cancer screening only 53% of people invited for screening attended. The ITT relative risk was 0.85. BUT what happened to those who were screened? The per-protocol RR was 0.62: THIS IS WRONG. What is the true estimate?
Randomisation
Intervention group (n = 75,253): observed adherers n = 40,214 (53%), outcome = 138 (0.34%); observed non-adherers n = 35,039 (47%), outcome = 222 (0.63%).
Control group (n = 74,998): potential adherers n = 40,078 (53%), unobserved outcome = 199 (0.50%); potential non-adherers n = 34,920 (47%), unobserved outcome = 221 (0.63%).
True differences For ITT, the policy of offering screening to the whole community, the RR = 0.85, that is a 15% reduction in CRC deaths. For those who accepted screening the RR was 0.68, a 32% reduction in deaths, NOT a 38% reduction.
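A minimal sketch of the CACE logic in Python using the numbers from the flow diagram above (variable names are illustrative). It assumes, per the CACE assumptions, that non-adherers are unaffected by the offer, so the control arm’s would-be non-adherers are given the same event rate as the observed intervention non-adherers; the remaining control events are attributed to the would-be adherers.

```python
# Intervention arm (observed)
n_int, adh_n, adh_events = 75_253, 40_214, 138
nonadh_n, nonadh_events = 35_039, 222
# Control arm (total observed; complier/non-complier split is unobserved)
n_con, con_events = 74_998, 420

compliance = adh_n / n_int                                  # about 0.53

# Intention-to-treat relative risk
itt_rr = ((adh_events + nonadh_events) / n_int) / (con_events / n_con)

# CACE: estimate events among control "potential adherers"
con_adh_n = compliance * n_con
con_nonadh_events = (nonadh_events / nonadh_n) * (n_con - con_adh_n)
con_adh_events = con_events - con_nonadh_events
cace_rr = (adh_events / adh_n) / (con_adh_events / con_adh_n)

print(round(itt_rr, 2), round(cace_rr, 2))  # ~0.85 and ~0.69 (0.68 on the slide, which uses rounded percentages)
```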
Individuals are best Using CACE we can get the best of both worlds: retain individual randomisation and get unbiased estimates.
Sample size simulation CACE analysis generally produces wider confidence intervals as there are two sources of variance. Therefore, it is possible that cluster allocation may actually have a lower standard error in some circumstances. To assess whether this is true we undertook a simulation exercise.
Sample size: trade-off between cluster and individual allocation [figure: sample size for a cluster trial (ICC = 0.04) versus an individually randomised trial with CACE analysis, plotted against cluster size and contamination (%); 80% power to detect an effect size of 0.2. Source: Hewitt PhD thesis.]
Sample size CACE performs better than cluster allocation in a range of sample size scenarios. Because of the difficulties of doing a cluster trial, an individual trial design with CACE analysis might be best.
Limitations The assumption that being offered treatment has no effect is a weakness as some may appear not to comply but actually access some of the treatment.
Still need to do a cluster trial? If a cluster trial is to be undertaken it is important, once the trial has been completed, that it is analysed correctly and that the effect of the clustering is accounted for. This has been known since 1940, when Lindquist advocated that educational trials should use the class as the natural unit of allocation.
What did Lindquist propose? Each class should be treated as both the unit of allocation and the unit of analysis. Put simply, a trial with 20 classes of 30 children is NOT a trial of 600 children; it is a trial of 20 classes. The simplest approach is to calculate the mean score of each cluster and do a t-test comparing the cluster means between the two arms, as in the sketch below.
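A minimal sketch of this cluster-means approach in Python with made-up data (three classes per arm rather than fourteen): collapse each class to its mean, then run an ordinary two-sample t-test on the class means.

```python
import numpy as np
from scipy import stats

# Hypothetical outcomes (e.g., sessions attended) for each class in each arm
classes_incentive = [np.array([8, 10, 12]), np.array([6, 7, 9, 11]), np.array([9, 10])]
classes_control   = [np.array([5, 6, 8]),   np.array([4, 7, 7]),     np.array([6, 5, 9])]

# One summary value per class: the class mean
means_incentive = [c.mean() for c in classes_incentive]
means_control   = [c.mean() for c in classes_control]

# Degrees of freedom = number of classes - 2, not number of students
t, p = stats.ttest_ind(means_incentive, means_control)
print(t, p)
```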
Example A randomised trial of 28 adult literacy classes sought to ascertain whether paying participants an incentive to attend would improve adherence. 14 classes were randomised for students to receive an incentive and 14 were controls. Students were paid £5 per class attended. There were 150 students in total; the ICC was … See Martin Bland’s website for a worked example: http://www-users.york.ac.uk/~mb55/
[Stata output: two-sample t test with equal variances comparing Group X and Group Y at the individual student level; 150 degrees of freedom; numerical results not reproduced here.]
Wrong This analysis is wrong: it treats all of the students as individuals and ignores the clustering of outcomes within classes. Let us try Lindquist’s approach to the analysis.
[Stata output: two-sample t test with equal variances comparing the 28 class means (14 per arm); 26 degrees of freedom; numerical results not reproduced here.]
T-test method This is correct in the sense that it takes clustering into account; however, it does not take into account chance differences in cluster size or powerful predictors of outcome. We have information on cluster size and pre-test literacy score that we can use to improve the precision of our estimate (i.e., reduce the width of the confidence intervals). We can use summary statistics in a regression approach, as in the output and sketch below.
[Stata output: regression of the class mean number of sessions attended on group and mean pre-test literacy score (midscl), plus a constant; 28 observations, F(2, 25); numerical results not reproduced here.]
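A sketch of the same summary-statistics regression in Python with hypothetical class-level data; the variable names mirror the output above (group = treatment indicator, midscl = class mean pre-test score), but the numbers are simulated, not the trial’s.

```python
import numpy as np
import statsmodels.api as sm

# One row per class: simulated class-level summaries for 28 classes
rng = np.random.default_rng(1)
n_classes = 28
group = np.repeat([1, 0], 14)                       # 14 incentive classes, 14 control classes
midscl = rng.normal(50, 10, n_classes)              # class mean pre-test literacy score
sessions = 5 + 2 * group + 0.05 * midscl + rng.normal(0, 1, n_classes)

# Ordinary least squares on the class means, adjusting for the pre-test score
X = sm.add_constant(np.column_stack([group, midscl]))
fit = sm.OLS(sessions, X).fit()
print(fit.summary())                                # 25 residual df, as in the output above
```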
Other methods There are other statistical methods, that are more complex, and may yield slightly different results. However, simple methods are approximately correct and easier to do.
Summary Cluster trials need larger sample sizes than individually randomised studies. Clustering needs to be taken into account both in the sample size and the analysis. There are simple methods that can do this.