Statistical Issues in Contraceptive Trials Daniel L. Gillen, PhD Department of Statistics University of California, Irvine FDA Reproductive Drugs Advisory Committee Meeting, Jan 23-24 D. Gillen, FDA Repro, Jan 23-24
Minimum requirements of a clinical trial
- Appropriate target population
- Use of appropriate comparison groups
- Use of appropriate outcome measure
- Ability to maintain statistical criteria for evidence
  - Controlling type I and II errors in the frequentist setting

Outline
- Outcome measures
  - Pearl Index vs. life-table methods
- Comparison populations
  - Historical vs. active control trials
- Defining statistical evidence
  - Testing for superiority vs. non-inferiority

Outcome Measures: Pearl Index vs. Life Table Methods
The Pearl Index
- The Pearl Index (number of pregnancies per 100 woman-years) is a common measure used to summarize contraceptive effectiveness
- However, a drawback of the Pearl Index is that in most situations it depends on the duration of follow-up and must be interpreted accordingly
- Such dependence occurs because the baseline risk of pregnancy within the study sample changes as follow-up proceeds

Ex: Sensitivity of Pearl Index to duration of follow-up
- Suppose our study population consists of two groups
- “Low risk” group (90% of population):
  - Constant risk of pregnancy
  - 1 year probability of pregnancy is 5%
- “High risk” group (10% of population):
  - 1 year probability of pregnancy is 50%

Ex (cont’d): One-year Pearl Index
- Now consider the Pearl Index calculated over the first year, for a cohort of 5000 women
- Expected number of pregnancies
  - 5000*(0.90*0.05 + 0.10*0.50) = 475
- Expected person-years at risk with censoring for pregnancy (each pregnancy contributes half a year on average)
  - 4525*1 + 475*.5 = 4762.5
- Pearl Index
  - (475 / 4762.5)*100 = 9.97 pregnancies per 100 per year

Ex (cont’d): Two-year Pearl Index
- For the Pearl Index calculated over 2 years, we need to consider the impact of censoring the “high risk” group at pregnancy
- By the end of one year
  - Number left in low risk group: 5000*0.90*(1-0.05) = 4275
  - Number left in high risk group: 5000*0.10*(1-0.50) = 250
  - Percent of the remaining population in the high risk group at one year is 250/(4275 + 250) = 250/4525 = 5.5%

Ex (cont’d): Two-year Pearl Index
- Now consider the Pearl Index calculated between years 1 and 2
- Expected number of pregnancies occurring between 1 and 2 years of follow-up
  - 4275*0.05 + 250*0.50 = 338.75
- Expected person-years at risk between year 1 and year 2
  - 4186.25*1 + 338.75*.5 = 4355.6 person-years
- Pearl Index calculated between years 1 and 2
  - (338.75 / 4355.6)*100 = 7.78 pregnancies per 100 per year

Ex (cont’d): Two-year Pearl Index
- Now consider the Pearl Index calculated over 2 years
- Expected number of pregnancies observed over 2 years
  - 475 + 338.75 = 813.75
- Expected person-years at risk over 2 years
  - 4762.5 + 4355.6 = 9118.1 person-years
- Pearl Index calculated over 2 years
  - (813.75 / 9118.1)*100 = 8.92 pregnancies per 100 per year

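The two-group example can be reproduced with a short calculation. The following is a minimal Python sketch, not part of the original slides. It assumes a cohort of 5000 women, a constant annual pregnancy probability within each group, and that each pregnancy occurs on average mid-year (so it contributes half a woman-year before censoring). The function name `pearl_index` and its parameters are illustrative.

```python
def pearl_index(years, n=5000.0, groups=((0.90, 0.05), (0.10, 0.50))):
    """Expected Pearl Index (pregnancies per 100 woman-years) over `years`.

    groups: (population share, constant annual pregnancy probability) pairs.
    Assumption: a woman who becomes pregnant contributes 0.5 woman-years
    in that year and is censored afterwards.
    """
    at_risk = [n * share for share, _prob in groups]
    pregnancies = 0.0
    woman_years = 0.0
    for _year in range(years):
        # Expected pregnancies in each group during this year.
        events = [a * prob for a, (_share, prob) in zip(at_risk, groups)]
        pregnancies += sum(events)
        # Survivors contribute a full year, pregnancies half a year.
        woman_years += sum(a - 0.5 * e for a, e in zip(at_risk, events))
        # Censor women at pregnancy before the next year starts.
        at_risk = [a - e for a, e in zip(at_risk, events)]
    return 100.0 * pregnancies / woman_years


for follow_up in (1, 2, 5):
    print(follow_up, round(pearl_index(follow_up), 2))
```

Calling the function with longer follow-up shows the index falling as the high-risk group is depleted, even though neither group's risk changes.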
When is the Pearl Index independent of study support?
- The Pearl Index will change with the length of follow-up (the study support) unless both of the following hold:
  - The rate of pregnancies is homogeneous across all possible subgroups
  - This rate remains constant with time

When is the Pearl Index independent of study support?
- In the previous example, it should be noted that even if we allow participants with failures to re-enter the risk set, the Pearl Index will still depend upon time
- This is because a failure results in less at-risk time, so total years of follow-up will be proportionately lower in the “high risk” group as the maximal duration of follow-up increases

A further issue in quantifying the Pearl Index…
- Most confidence intervals for the Pearl Index assume a Poisson distribution
  - This distribution is defined as having variance equal to the mean (or rate)
- However, count or rate data are typically characterized as stemming from an overdispersed Poisson distribution
  - That is, the true variance in the rate that we observe is greater than we assume under the Poisson distribution
- Overdispersion in Poisson rates typically arises from heterogeneity of patient populations

Computation of confidence intervals for the Pearl Index
- Consider our previous example with a “low risk” and a “high risk” group
- Low risk group (90% of population):
  - Constant risk of pregnancy
  - 1 year probability of pregnancy is 5%
- High risk group (10% of population):
  - 1 year probability of pregnancy is 50%

Computation of confidence intervals for the Pearl Index
- We previously calculated the (true) 1 year Pearl Index to be 9.97 pregnancies per 100 per year
- Suppose that in reality, we observed 457 pregnancies over 1 year with a total of 4763 years of follow-up, resulting in a Pearl Index of 9.60 per 100 per year
- Assuming a Poisson distribution, the corresponding 95% confidence interval for the 1 year Pearl Index would be (8.73, 10.51)

Computation of confidence intervals for the Pearl Index
- However, because the Pearl Index is really composed of a mixture of Poisson distributions (from the high and low risk groups), the true variance is actually 19.2% larger than assumed by the usual (single) Poisson model
- This means that we have underestimated the variance, i.e., our confidence interval is narrower than it should be!
- In this case, a 95% confidence interval accounting for the heterogeneity of groups is (8.63, 10.55). This is approximately 8% wider than the previous interval

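The effect of overdispersion on the interval can be sketched as follows. This is an illustration under stated assumptions, not the computation used for the slides: it uses a normal (Wald) approximation rather than exact Poisson limits, and it simply multiplies the Poisson variance by an inflation factor (1.192 for the 19.2% figure above). The function `pearl_ci` and its arguments are hypothetical names.

```python
import math


def pearl_ci(pregnancies, woman_years, variance_inflation=1.0, z=1.96):
    """Approximate 95% CI for the Pearl Index (per 100 woman-years).

    Under a Poisson model the variance of the pregnancy count equals the
    count itself; `variance_inflation` > 1 widens the interval to allow
    for overdispersion.
    """
    rate = pregnancies / woman_years
    se = math.sqrt(variance_inflation * pregnancies) / woman_years
    return (100.0 * (rate - z * se), 100.0 * (rate + z * se))


poisson = pearl_ci(457, 4763)
inflated = pearl_ci(457, 4763, variance_inflation=1.192)
print("Poisson:  (%.2f, %.2f)" % poisson)
print("Inflated: (%.2f, %.2f)" % inflated)
```

Because of the normal approximation, the unadjusted limits differ slightly from the exact Poisson interval quoted above. The interval width scales with the square root of the inflation factor.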
How to deal with the changing composition of the risk set?
- We illustrated one way in our example
- Consider the probability of failure at specific time points by using conditional probability
- For example, if T is the time of failure, we can compute the probability of failure within two years as
  - Pr[T ≤ 2] = 1 - Pr[T > 2] = 1 - Pr[T > 2 | T > 1] Pr[T > 1]
- In our example, Pr[T > 1] = 0.90*0.95 + 0.10*0.50 = 0.905 and Pr[T > 2] = 0.90*0.95^2 + 0.10*0.50^2 = 0.837, so
  - Pr[T > 2 | T > 1] = 0.837/0.905 = 0.925
  - Pr[T ≤ 2] = 1 - 0.925*0.905 = 0.163

How to deal with the changing composition of the risk set?
- This is called a life-table estimate
- In the setting of contraceptive failure, these conditional probabilities are typically computed monthly to more accurately incorporate the risk set (see, e.g., Potter, 1966)
- When the life-table estimate is evaluated at all (distinct) failure times, this is called a Kaplan-Meier estimate

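The product of conditional survival probabilities described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming one follow-up time per woman and an indicator for pregnancy versus censoring. The function name `kaplan_meier` and the small data set are hypothetical, not taken from any trial.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct failure time.

    times:  follow-up time for each woman (e.g., in months)
    events: True if follow-up ended in pregnancy, False if censored
    Women censored at time t are counted as at risk at time t.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Gather everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            failures += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if failures:
            # Conditional probability of surviving past t, given at risk at t.
            survival *= 1.0 - failures / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical data: 10 women followed for up to 12 months.
months = [3, 5, 5, 8, 12, 12, 12, 12, 12, 12]
pregnant = [True, False, True, False, True, False, False, False, False, False]
for t, s in kaplan_meier(months, pregnant):
    print(f"month {t}: cumulative failure probability {1 - s:.3f}")
```

The cumulative failure probability at time t is one minus the survival estimate, which is the quantity interpreted as the probability of failure over t months of use.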
Are there any benefits to using the Pearl Index?
- Clearly, the Pearl Index has been in wide use
- The reasons for this are
  - Ease of interpretation
    - Although the Kaplan-Meier estimator also has a clinically relevant interpretation (probability of failure over T years of use)
  - For historically controlled trials, there is a great deal of data summarized in terms of the Pearl Index
    - This will, of course, change as the popularity of Kaplan-Meier estimates grows in the field

Can we incorporate changing treatment regimens?
- Patients may discontinue use or use additional contraceptives for some intervals of time
- Technically, the Kaplan-Meier estimator could incorporate such left and right censoring
- However, it is not clear when patients should re-enter the risk set

Can we incorporate changing treatment regimens?
- For example, consider the case where a participant uses back-up contraception during the interval (t1, t2)
- This individual could be considered at risk for the interval (0, t1), then re-entered into the risk set at time t2
- However, by doing this we are implicitly assuming that this person’s hazard (or risk of pregnancy) at time t2 is the same as that of all others who have been at risk from (0, t2)
- This is not a reasonable assumption to me and I would advise against it

Can we incorporate changing treatment regimens?
- Another option for incorporating changing treatment regimens would come from post-hoc analyses
- Stratified Kaplan-Meier estimates
  - Number of strata could become large
- Time-dependent covariates
  - E.g., consider a proportional hazards framework

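One common way to prepare data for a time-dependent covariate is to split each participant's follow-up into intervals over which the covariate is constant. The following is a sketch of that data layout only, under the assumption that back-up use is recorded as a list of (start, stop) intervals. The function `split_follow_up` and the row format are hypothetical, and no model is fitted here.

```python
def split_follow_up(end, event, backup_intervals):
    """Split one participant's follow-up into (start, stop, event, on_backup) rows.

    end:              time of pregnancy or censoring
    event:            True if follow-up ended in pregnancy
    backup_intervals: list of (start, stop) periods of back-up contraception
    The event flag is attached only to the final interval.
    """
    inner_cuts = {t for interval in backup_intervals for t in interval if 0 < t < end}
    cuts = sorted({0, end} | inner_cuts)
    rows = []
    for start, stop in zip(cuts, cuts[1:]):
        midpoint = (start + stop) / 2
        # Covariate value is constant within each interval by construction.
        on_backup = any(a <= midpoint < b for a, b in backup_intervals)
        rows.append((start, stop, event and stop == end, on_backup))
    return rows


# Hypothetical participant: pregnancy at month 12, back-up use in months 4 to 6.
for row in split_follow_up(12, True, [(4, 6)]):
    print(row)
```

Rows of this form keep the participant in the risk set throughout while letting the hazard differ during back-up use, rather than assuming the hazard at re-entry matches that of continuously exposed participants.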
Regardless of the measure, what defines a failure and who is at risk?
- For all new interventions we must consider:
  - Safety: Are there adverse effects that clearly outweigh any potential benefit?
  - Efficacy: Can the intervention reduce the probability of unintended pregnancy in a beneficial way?
  - Effectiveness: Would adoption of the intervention as a standard reduce the probability of unintended pregnancy in the population?

Regardless of the measure, what defines a failure and who is at risk?
- One difference between evaluation of efficacy and effectiveness is in what defines a failure and who should be included in the risk set
- In a clinical trial setting we can truly only evaluate efficacy because of possible selection bias of patients entering contraceptive trials
- However, even in the clinical trial setting it is useful to evaluate
  - Intervention failure rates during actual use (including inconsistent or incorrect use)
  - Intervention failure rates during perfect use (see, e.g., Trussell, Contraception, 2004)

Regardless of the measure, what defines a failure and who is at risk?
- To assess true method efficacy, counting only “method failures” during perfect use, we must include only perfect-use exposure in the risk set
- We also need to consider whether those who are lost to follow-up should be considered at risk all the way up to the time of drop-out
  - One reasonable approach is to censor patients three months prior to the time at which they become lost to follow-up (Trussell, SIM, 1991)

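The censoring rule for participants lost to follow-up can be written as a small exposure calculation. This is a minimal sketch assuming follow-up is recorded in whole months. The function `exposure_months`, the `lag_months` parameter, and the example participants are hypothetical.

```python
def exposure_months(months_observed, lost_to_follow_up, lag_months=3):
    """Months of exposure counted toward the risk set for one participant."""
    if lost_to_follow_up:
        # Stop counting exposure `lag_months` before the drop-out time.
        return max(0, months_observed - lag_months)
    return months_observed


# Hypothetical participants: (months observed, lost to follow-up?)
participants = [(12, False), (9, True), (2, True)]
total = sum(exposure_months(m, lost) for m, lost in participants)
print(total)
```

A participant lost before the lag has elapsed contributes no exposure under this rule, which is why the result is floored at zero.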
Historical vs. Active Control Trials
Historical control trials vs. active control trials
- In the past many methods have been assessed via a historical control trial
  - E.g., a Pearl Index of 1.5 (or more recently 2) or less has been used as an efficacy criterion
  - Such criteria stem from the experience of historical controls
- However, biases resulting from historical control studies can be numerous, particularly when study samples are not comparable with respect to baseline risk, evaluative measure of outcome, or duration of study

Criteria for superiority in historical control trials
- As noted, past studies have considered point estimates of the (one year) Pearl Index of less than 1.5 or 2 unintended pregnancies per 100 per year
- However, we must also acknowledge the uncertainty of these estimates
  - EMEA requires sufficient sample size to guarantee the width of the 95% CI for the Pearl Index to be no larger than 1
  - Better (in my opinion) to require that the upper bound of the CI is less than the chosen threshold
- In either case, if the Pearl Index is used, the previous notes on computation of the CI need to be considered

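The two criteria above can be compared numerically. The sketch below is an illustration under assumptions: it uses a normal approximation to the Poisson interval and ignores overdispersion, so real limits would be somewhat wider. The functions `pearl_upper_bound` and `woman_years_for_width`, and the trial counts, are hypothetical.

```python
import math

Z_95 = 1.96


def pearl_upper_bound(pregnancies, woman_years, z=Z_95):
    """Upper limit of an approximate 95% CI for the Pearl Index (per 100 woman-years)."""
    return 100.0 * (pregnancies + z * math.sqrt(pregnancies)) / woman_years


def woman_years_for_width(pearl_index, max_width=1.0, z=Z_95):
    """Woman-years needed so the approximate CI width is at most `max_width`.

    Under the Poisson model the width is 20 * z * sqrt(pearl_index / woman_years).
    """
    return pearl_index * (20.0 * z / max_width) ** 2


# Hypothetical trial: 30 pregnancies in 2500 woman-years (point estimate 1.2).
upper = pearl_upper_bound(30, 2500)
for threshold in (1.5, 2.0):
    verdict = "met" if upper < threshold else "not met"
    print(f"threshold {threshold}: upper bound {upper:.2f}, criterion {verdict}")
print(f"woman-years for width <= 1 at a Pearl Index of 2: {woman_years_for_width(2.0):.0f}")
```

In this hypothetical trial the point estimate falls below both thresholds, but the upper bound clears only the less stringent one, which is the distinction the upper-bound criterion is meant to capture.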
Historical control trials vs. active control trials
- Because it is impossible to guarantee comparability between historical controls and current study samples, it is almost always advantageous to employ randomization when ethically feasible
- Given the wide use of standard contraceptives, it is not feasible to consider a placebo controlled trial
- However, one can (and should) consider the use of an active control when comparable interventions are in use
  - This also allows for comparison of the entire survival curve (logrank test or proportional hazards model?)

Superiority vs. Non-Inferiority in Active Control Trials
Superiority vs. non-inferiority in active control trials
- Statistical criteria for evidence in a superiority trial
  - Evidence to rule out equality of effect as measured by the chosen parameter (e.g., Pearl Index, 1-year survival estimate, or a hazard ratio)
- Example: Contrast may be the difference in 1-year failure rates as measured by the Kaplan-Meier estimator
  - KM_Tx(1) - KM_AC(1)
- Test:
  - H0: KM_Tx(1) - KM_AC(1) ≥ 0 vs. H1: KM_Tx(1) - KM_AC(1) < 0
- Rejection of the null hypothesis corresponds to the upper bound of the CI for KM_Tx(1) - KM_AC(1) being less than 0

Superiority vs. non-inferiority in active control trials
- Statistical criteria for evidence in a non-inferiority trial
  - Evidence to rule out some margin of efficacy less than the active control
- Example: Contrast may be the difference in 1-year failure rates as measured by the Kaplan-Meier estimator
  - KM_Tx(1) - KM_AC(1)
- Test:
  - H0: KM_Tx(1) - KM_AC(1) ≥ δ vs. H1: KM_Tx(1) - KM_AC(1) < δ for some δ > 0
- Rejection of the null hypothesis corresponds to the upper bound of the CI for KM_Tx(1) - KM_AC(1) being less than δ

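Both decision rules reduce to comparing the upper confidence bound for the difference in 1-year failure probabilities against a cut-off: 0 for superiority, or a positive margin for non-inferiority. The sketch below assumes independent arms, a normal approximation, a two-sided 95% interval, and standard errors already available for each Kaplan-Meier estimate (for example from Greenwood's formula). All names and numbers are hypothetical.

```python
import math


def upper_bound_of_difference(km_tx, se_tx, km_ac, se_ac, z=1.96):
    """Upper CI bound for KM_Tx(1) - KM_AC(1), treating the arms as independent."""
    difference = km_tx - km_ac
    se_difference = math.sqrt(se_tx ** 2 + se_ac ** 2)
    return difference + z * se_difference


def rejects_null(upper_bound, cutoff):
    """True if the upper bound lies below the cut-off (0 or the margin)."""
    return upper_bound < cutoff


# Hypothetical 1-year failure probabilities and standard errors.
upper = upper_bound_of_difference(km_tx=0.020, se_tx=0.0040, km_ac=0.025, se_ac=0.0045)
superior = rejects_null(upper, 0.0)
non_inferior = rejects_null(upper, 0.01)  # margin of 1 percentage point, assumed
print(f"upper bound {upper:.4f}: superior={superior}, non-inferior={non_inferior}")
```

With these illustrative inputs the new treatment has a lower estimated failure probability, yet the interval does not exclude zero. The data would support non-inferiority at the assumed margin but not superiority.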
Superiority vs. non-inferiority in active control trials
- When is it reasonable to consider non-inferiority instead of superiority?
- ICH E10 guidelines
  - Active control treatment must truly be active in the study population
- If the active control is truly active in the study population
  - Can a margin to define non-inferiority be established?
  - If the active control is standard of care, is the new treatment also superior on secondary endpoints?

Superiority vs. non-inferiority in active control trials
- Issues in setting the non-inferiority “margin”
  - What measure compares distributions?
  - Is the treatment effect random?
  - How much of a decrease in effect is acceptable?
  - How to account for variability in the estimate(s) from historical trials?

Superiority vs. non-inferiority in active control trials
- Precedent for setting the non-inferiority “margin”
- Is the treatment effect random?
  - Ideally use meta-analysis of multiple trials
  - Careful! Do trials have the same duration of follow-up?
- How much of a decrease in effect is acceptable?
  - 10%, 20%, 50% of active control effect?
- How to account for variability in the estimate(s) from historical trials?
  - Use worst case from historical 95% CI?
  - Explicitly account for variability in historical trial

Summary
Summary
- Need to define appropriate target population, comparison group, outcome measure, and maintain statistical criteria for evidence
- Pearl Index is (usually) implicitly dependent on the length of follow-up, whereas Kaplan-Meier (life table) estimates make this dependence explicit
- In either case, we need to obtain correct inference (CIs), and the definition of the risk set must correspond to the definition of failure
- When ethically and logistically possible, active controls should be used
- If historical controls are used, uncertainty should be accounted for in defining superiority criteria