Detecting Treatment by Biomarker Interaction with Binary Endpoints
  Radha A. Railkar & Devan V. Mehrotra, Merck & Co., Inc.
  JSM – Denver, CO
  August 1st, 2019
Outline
  Precision Medicine
  Introduction
  TxB Interaction on Which Scale?
  Logistic Regression (1 and 2 df tests)
  New methods
    F* statistics for measuring the strength of the treatment by biomarker interaction on 5 common scales
    P-value combination methods: harmonic mean p-value and Aggregated Cauchy Association Test
  Simulation Study & Results
  Summary
Precision/Personalized/Stratified Medicine
  The objective of precision medicine is to optimize a specific preventive, diagnostic or therapeutic intervention in the subpopulation of patients most likely to benefit from it, by taking into account patient characteristics such as disease subtype, clinical features and/or biomarkers
  The precision medicine approach is the opposite of the “one-size-fits-all” approach, in which disease treatment or prevention strategies are designed for an average person, with little consideration for individual differences
  By taking individual differences into account, precision medicine makes prevention and therapy more efficient or better adapted to each person
Introduction
  Context: A phase II/III randomized clinical trial with a binary endpoint
    Treatment A [control] vs. Treatment B [new]
    Y = subject-level response to treatment (yes/no)
    p = E(Y) = true proportion of responders
    T = treatment (0=A, 1=B)
    B = biomarker subgroup (0=B-, 1=B+)
  Goal: to determine whether there is a treatment by biomarker (TxB) interaction
TxB Interaction on Which Scale?
  When the endpoint is binary, different treatment effect measures (e.g., risk difference, relative risk, odds ratio) yield different interaction tests
  A TxB interaction may exist on the risk difference (proportion) scale, relative risk (log) scale, odds ratio (logit) scale, or on other scales
  It is important to be able to detect interactions on these different scales, as any such interaction may suggest that the new treatment works better or produces harm in a particular biomarker subgroup
  We therefore need a test of the null hypothesis of no TxB interaction on any scale
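To make the scale dependence concrete, the following minimal sketch computes the treatment effect in each biomarker subgroup on the three scales named above. The response rates are hypothetical values chosen for illustration (they are not taken from the slides), and the function name `effects` is likewise an assumption. With these rates the relative risk is identical in both subgroups, yet the risk difference and odds ratio are not.

```python
import math

def effects(p_control, p_new):
    """Treatment effect (new minus control) on three common scales."""
    def logit(p):
        return math.log(p / (1 - p))
    return {
        "risk_difference": p_new - p_control,
        "log_relative_risk": math.log(p_new) - math.log(p_control),
        "log_odds_ratio": logit(p_new) - logit(p_control),
    }

# Hypothetical true response rates (illustration only, not from the slides).
b_minus = effects(0.20, 0.40)  # B- subgroup: control 20%, new 40%
b_plus = effects(0.40, 0.80)   # B+ subgroup: control 40%, new 80%

for scale in b_minus:
    print(f"{scale:>18}: B- = {b_minus[scale]:.3f}, B+ = {b_plus[scale]:.3f}")
```

Here there is no TxB interaction on the log scale, but there is one on both the proportion and logit scales, so a test tied to a single scale can miss it.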
Logistic Regression for Detecting TxB Interaction
  Analysis model: $\text{logit}(p) = \beta_0 + \beta_T T + \beta_B B + \beta_{TB} (T \times B)$ (other covariates can be added)
  Traditional Approach
    Test $H_{null}: \beta_{TB} = 0$ (= no TxB interaction on logit scale)
    1 df likelihood ratio test
  Method proposed by Mehrotra et al (JSM 2017 presentation)
    Test $H_{null}: \beta_B = \beta_{TB} = 0$ (= no TxB interaction on any scale)
    2 df likelihood ratio test
    Why? TxB interactions rarely exist in the absence of B main effects. The 2 df test is a much more powerful way to discover biomarkers with TxB interactions
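The two likelihood ratio tests can be sketched from the four cell counts alone, assuming no additional covariates. With only T and B in the model, the full model is saturated, and the null model of the 2 df test (treatment only) has closed-form fitted values; the additive null model of the 1 df test is fitted here by a small Newton-Raphson loop. All function names and the example counts are hypothetical, and this is an illustration under those assumptions rather than the authors' implementation.

```python
import math

def xlogy(x, y):
    return 0.0 if x == 0 else x * math.log(y)

def binom_loglik(counts, probs):
    """Binomial log-likelihood (up to a constant) for (responders, n) cells."""
    return sum(xlogy(r, p) + xlogy(n - r, 1 - p) for (r, n), p in zip(counts, probs))

def solve3(a, y):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [v] for row, v in zip(a, y)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda i: abs(m[i][c]))
        m[c], m[piv] = m[piv], m[c]
        for i in range(3):
            if i != c:
                f = m[i][c] / m[c][c]
                m[i] = [u - f * v for u, v in zip(m[i], m[c])]
    return [m[i][3] / m[i][i] for i in range(3)]

def fit_additive(data, iters=25):
    """Newton-Raphson fit of logit(p) = b0 + bT*T + bB*B (no interaction)."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iters):
        score = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for (t, b), (r, n) in data.items():
            x = (1.0, float(t), float(b))
            p = 1 / (1 + math.exp(-sum(bi * xi for bi, xi in zip(beta, x))))
            for i in range(3):
                score[i] += (r - n * p) * x[i]
                for j in range(3):
                    info[i][j] += n * p * (1 - p) * x[i] * x[j]
        beta = [bi + si for bi, si in zip(beta, solve3(info, score))]
    return beta

def txb_lrt(data):
    """Return (p_1df, p_2df); data[(T, B)] = (responders, n)."""
    keys = sorted(data)
    counts = [data[k] for k in keys]
    # Full model (T, B, TxB) is saturated: fitted values are the cell proportions.
    ll_full = binom_loglik(counts, [r / n for r, n in counts])
    # Additive model (T + B): null model of the 1 df test.
    b0, bt, bb = fit_additive(data)
    p_add = [1 / (1 + math.exp(-(b0 + bt * t + bb * b))) for t, b in keys]
    ll_add = binom_loglik(counts, p_add)
    # Treatment-only model: null model of the 2 df test (pool over B within arm).
    p_arm = {}
    for t in (0, 1):
        r_t = sum(data[(t, b)][0] for b in (0, 1))
        n_t = sum(data[(t, b)][1] for b in (0, 1))
        p_arm[t] = r_t / n_t
    ll_trt = binom_loglik(counts, [p_arm[t] for t, _ in keys])
    lr1 = max(0.0, 2 * (ll_full - ll_add))
    lr2 = max(0.0, 2 * (ll_full - ll_trt))
    # Chi-square upper tails: df=1 via erfc, df=2 via exp.
    return math.erfc(math.sqrt(lr1 / 2)), math.exp(-lr2 / 2)

# Hypothetical counts: 200 subjects per arm, split 140 (B-) / 60 (B+).
example = {(0, 0): (25, 140), (1, 0): (39, 140), (0, 1): (15, 60), (1, 1): (41, 60)}
p_1df, p_2df = txb_lrt(example)
print(f"1 df TxB test p = {p_1df:.4g}; 2 df (B, TxB) test p = {p_2df:.4g}")
```

In this made-up data set the biomarker has a strong effect within the new treatment arm, so the 2 df statistic picks up far more signal than the 1 df interaction statistic.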
F* Statistic for Measuring TxB Interaction on 5 Common Scales
  Use a statistic analogous to the Brown and Forsythe (1974) F* statistic to measure the TxB interaction on the following 5 common scales: proportion, logit, log, square root and arcsin
  $$F_g^* = \frac{\sum_{i=0}^{1} n_i \left(\Delta_i - \bar{\Delta}\right)^2}{\sum_{i=0}^{1} \left(1 - \frac{n_i}{N}\right) n_i V(\Delta_i)}$$
  where $n_i = \frac{n_{i0} n_{i1}}{n_{i0} + n_{i1}}$, $N = \sum_{i=0}^{1} n_i$, $\Delta_i = g(p_{i1}) - g(p_{i0})$, g(.) is the scale of interest, and $\bar{\Delta} = \frac{\sum_{i=0}^{1} n_i \Delta_i}{N}$; $V(\Delta_i)$ can be derived using a first-order Taylor series approximation
  Each F*g statistic measures whether the treatment effect on a given scale is consistent across the 2 strata
  Large values of F*g indicate strong evidence of TxB for scale g(.)
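The first-order Taylor series (delta method) variance can be written out as follows. This is a sketch under two assumptions not stated on the slide: the two treatment arms within stratum i are independent binomial samples of sizes n_{i0} and n_{i1}, and "arcsin" denotes the arcsine square-root transformation.

```latex
\[
V(\Delta_i) \approx \left[g'(p_{i1})\right]^2 \frac{p_{i1}(1-p_{i1})}{n_{i1}}
            + \left[g'(p_{i0})\right]^2 \frac{p_{i0}(1-p_{i0})}{n_{i0}}
\]
\[
\begin{array}{lll}
\text{Scale}       & g(p)                & g'(p) \\
\text{proportion}  & p                   & 1 \\
\text{logit}       & \log\frac{p}{1-p}   & \frac{1}{p(1-p)} \\
\text{log}         & \log p              & \frac{1}{p} \\
\text{square root} & \sqrt{p}            & \frac{1}{2\sqrt{p}} \\
\text{arcsin}      & \arcsin\sqrt{p}     & \frac{1}{2\sqrt{p(1-p)}}
\end{array}
\]
```

On the logit scale, for example, each term reduces to 1/(n p (1-p)), the familiar variance of a log odds.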
F* Statistic (cont’d)
  Under the null hypothesis of no TxB interaction on scale g(.), $F_g^*$ is approximately distributed as F(f1, f2), where
  $$f_1 = \frac{\left[\sum_{i=0}^{1} n_i V(\Delta_i) - \frac{\sum_{i=0}^{1} n_i^2 V(\Delta_i)}{N}\right]^2}{\sum_{i=0}^{1} \left[n_i V(\Delta_i)\right]^2 + \left[\frac{\sum_{i=0}^{1} n_i^2 V(\Delta_i)}{N}\right]^2 - \frac{2 \sum_{i=0}^{1} n_i \left[n_i V(\Delta_i)\right]^2}{N}}$$ (Mehrotra 1997)
  $$f_2 = \frac{\left[\sum_{i=0}^{1} \left(1 - \frac{n_i}{N}\right) n_i V(\Delta_i)\right]^2}{\sum_{i=0}^{1} \frac{\left(1 - \frac{n_i}{N}\right)^2 \left[n_i V(\Delta_i)\right]^2}{n_i - 1}}$$
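The F* statistic, its degrees of freedom f1 and f2, and the resulting p-value can be sketched as below. Several choices here are assumptions rather than details given on the slides: observed proportions are plugged in for the true ones, V(Δi) uses the delta-method variance for independent binomial arms, "arcsin" means arcsin(sqrt(p)), and the F tail probability is computed with a textbook continued-fraction incomplete beta so that the block needs only the standard library. The function names and example counts are hypothetical.

```python
import math

# Each scale g paired with its derivative g'.
SCALES = {
    "proportion": (lambda p: p, lambda p: 1.0),
    "logit": (lambda p: math.log(p / (1 - p)), lambda p: 1 / (p * (1 - p))),
    "log": (math.log, lambda p: 1 / p),
    "sqrt": (math.sqrt, lambda p: 0.5 / math.sqrt(p)),
    "arcsin": (lambda p: math.asin(math.sqrt(p)),
               lambda p: 0.5 / math.sqrt(p * (1 - p))),
}

def _betacf(a, b, x, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    def fix(v):
        return v if abs(v) > tiny else tiny
    c, d = 1.0, 1 / fix(1 - (a + b) * x / (a + 1))
    h = d
    for m in range(1, 500):
        for aa in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                   -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1 / fix(1 + aa * d)
            c = fix(1 + aa / c)
            h *= d * c
        if abs(d * c - 1) < eps:
            break
    return h

def f_sf(x, d1, d2):
    """Upper tail P(F > x) for an F(d1, d2) distribution."""
    a, b, z = d2 / 2, d1 / 2, d2 / (d2 + d1 * x)
    if z >= 1:
        return 1.0
    if z <= 0:
        return 0.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(z) + b * math.log(1 - z))
    if z < (a + 1) / (a + b + 2):
        return front * _betacf(a, b, z) / a
    return 1 - front * _betacf(b, a, 1 - z) / b

def f_star_test(cells, scale):
    """cells[i] = (r_i0, n_i0, r_i1, n_i1) for stratum i; returns (F*, f1, f2, p)."""
    g, dg = SCALES[scale]
    n, delta, s2 = [], [], []
    for r0, n0, r1, n1 in cells:
        p0, p1 = r0 / n0, r1 / n1
        n_i = n0 * n1 / (n0 + n1)
        v_i = dg(p1) ** 2 * p1 * (1 - p1) / n1 + dg(p0) ** 2 * p0 * (1 - p0) / n0
        n.append(n_i)
        delta.append(g(p1) - g(p0))
        s2.append(n_i * v_i)  # n_i * V(Delta_i)
    big_n = sum(n)
    dbar = sum(ni * di for ni, di in zip(n, delta)) / big_n
    num = sum(ni * (di - dbar) ** 2 for ni, di in zip(n, delta))
    den = sum((1 - ni / big_n) * si for ni, si in zip(n, s2))
    f = num / den
    mean_s = sum(ni * si for ni, si in zip(n, s2)) / big_n
    f1 = (sum(s2) - mean_s) ** 2 / (
        sum(si ** 2 for si in s2) + mean_s ** 2
        - 2 * sum(ni * si ** 2 for ni, si in zip(n, s2)) / big_n)
    f2 = den ** 2 / sum((1 - ni / big_n) ** 2 * si ** 2 / (ni - 1)
                        for ni, si in zip(n, s2))
    return f, f1, f2, f_sf(f, f1, f2)

# Hypothetical counts per stratum: (control responders, control n, new responders, new n).
cells = [(25, 140, 39, 140), (15, 60, 41, 60)]
for name in SCALES:
    f, f1, f2, p = f_star_test(cells, name)
    print(f"{name:>10}: F* = {f:.2f}, f1 = {f1:.2f}, f2 = {f2:.1f}, p = {p:.4f}")
```

With only two strata the f1 formula simplifies algebraically to 1, so each F*g behaves like a squared Welch-type t statistic with f2 denominator degrees of freedom. The logit and log scales are undefined when an observed proportion is 0 or 1, which this sketch does not handle.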
F* Statistic (cont’d)
  Each F*g statistic measures whether the treatment effect on a given scale g(.) is constant across the 2 levels of B
  Large values of F*g indicate strong evidence of TxB for scale g(.)
  Obtain 5 p-values corresponding to the F*g statistics on each of the 5 scales
  Testing Strategy: Combine the 5 p-values in order to test the null hypothesis of no TxB on any scale
P-value Combination Methods
  Harmonic Mean p-value (HMP) (Wilson 2019): The HMP test combines p-values and corrects for multiple testing while controlling the FWER in a way that is more powerful than common methods such as the Bonferroni and Simes procedures, more stringent than controlling the FDR, and robust to positive correlations between tests
  $$p_{HM} = \frac{\sum_i w_i}{\sum_i w_i / p_i}, \quad \text{where } \sum_i w_i = 1$$
  If $p_{HM} \le \alpha$ then reject the null hypothesis of no TxB on any scale
  Used in big data applications such as GWAS
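The HMP formula is short enough to sketch directly. Equal weights are assumed when none are supplied, and the five p-values in the example are hypothetical placeholders for the five F*g p-values; the function name is also an assumption.

```python
def harmonic_mean_p(pvalues, weights=None):
    """Weighted harmonic mean p-value: sum(w) / sum(w / p)."""
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights summing to 1
    return sum(weights) / sum(w / p for w, p in zip(weights, pvalues))

# Hypothetical p-values, one per scale (illustration only).
p_scales = [0.012, 0.030, 0.20, 0.015, 0.018]
p_hm = harmonic_mean_p(p_scales)
print(f"p_HM = {p_hm:.4f}")  # about 0.0205 for these inputs
```

Because the harmonic mean is dominated by the smallest p-values, a single strongly significant scale pulls the combined value down even when the other scales are unremarkable.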
P-value Combination Methods (cont’d)
  Aggregated Cauchy Association Test (ACAT) (Liu 2019): Defined as a weighted sum of Cauchy transformations of the individual p-values. It is a powerful and computationally efficient p-value combination method to boost power in sequencing studies
  $$T_{ACAT} = \sum_i w_i \tan\left((0.5 - p_i)\pi\right), \quad \text{where } \sum_i w_i = 1 = w$$
  $$p_{ACAT} \approx 0.5 - \frac{\arctan(T_{ACAT} / w)}{\pi}$$
  If $p_{ACAT} \le \alpha$ then reject the null hypothesis of no TxB on any scale
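A minimal sketch of the ACAT combination follows, again assuming equal weights by default and using hypothetical p-values. The function name `acat_p` is an assumption; p-values of exactly 0 or 1 would need special handling that is omitted here.

```python
import math

def acat_p(pvalues, weights=None):
    """ACAT: Cauchy-transform each p-value, average, and map back to a p-value."""
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights summing to 1
    w = sum(weights)
    t = sum(wi * math.tan((0.5 - p) * math.pi) for wi, p in zip(weights, pvalues))
    return 0.5 - math.atan(t / w) / math.pi

# Hypothetical p-values, one per scale (illustration only).
p_scales = [0.012, 0.030, 0.20, 0.015, 0.018]
print(f"p_ACAT = {acat_p(p_scales):.4f}")
```

A quick sanity check on the transformation is that combining identical p-values returns that same p-value, since tan and arctan cancel.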
Simulation Study
  True proportions of responders for A (Control) and B (New) within B- (70%) and B+ (30%), followed by the B-A treatment difference within B- and B+ on the proportion, logit and log scales
  Values by No TxB Scale:
    All (Null): 0.20 0.40 0.98 0.69
    None1: 0.18 0.28 0.25 0.68 0.10 0.43 0.57 1.85 0.44 1.00
    None2: 0.26 0.73 0.06 0.53 0.34 2.38 1.29
    Arcsin: 0.37 0.49 0.83 0.91 0.12 0.08 0.09
    Log: 0.41 0.92 0.04 0.17 0.86
    Logit: 0.19 0.75 0.90 0.21 0.15 1.04 1.09 0.74
    Proportion: 0.70 1.35
    Square root
  N=200 per treatment group
Simulation Results
  Columns: Logistic Regression (1 df TxB Test), Logistic Regression (2 df B, TxB Test), F* HMP, F* ACATP
  Type I Error Rate % (α=5%), by No TxB Scale:
    All (Null): 4.96 4.93 4.43 5.36
  Power %, by No TxB Scale:
    None1: 74.93 99.96 83.38 59.53
    None2: 98.43 100 99.30 29.91
    Arcsin: 7.66 12.15 12.01
    Log: 21.22 10.67 11.00
    Logit: 5.39 50.94 43.50
    Proportion: 11.03 30.63 27.06
    Square root: 17.75 8.36 8.74
  N=200 per treatment group; 10,000 simulations
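A simulation of this kind can be sketched for the 2 df test under the All (Null) scenario. This is an illustration under stated assumptions, not a reproduction of the study: the response rates 0.20 (control) and 0.40 (new) are assumed to apply in both biomarker subgroups, subgroup sizes are fixed at 140 and 60 per arm (70%/30% of N=200), only 2,000 replicates are run to keep it fast, and all function names are hypothetical. The closed-form likelihood ratio statistic relies on there being no covariates other than T and B.

```python
import math
import random

def xlogy(x, y):
    return 0.0 if x == 0 else x * math.log(y)

def lrt_2df_pvalue(cells):
    """cells[(T, B)] = (responders, n); joint LRT of beta_B = beta_TB = 0."""
    ll_full = ll_null = 0.0
    for t in (0, 1):
        r_t = sum(cells[(t, b)][0] for b in (0, 1))
        n_t = sum(cells[(t, b)][1] for b in (0, 1))
        p_t = r_t / n_t  # treatment-only fit pools over B within each arm
        for b in (0, 1):
            r, n = cells[(t, b)]
            p = r / n  # saturated fit uses the observed cell proportion
            ll_full += xlogy(r, p) + xlogy(n - r, 1 - p)
            ll_null += xlogy(r, p_t) + xlogy(n - r, 1 - p_t)
    # LR = 2 * (ll_full - ll_null); chi-square(2) upper tail is exp(-LR / 2).
    return math.exp(-max(0.0, ll_full - ll_null))

def rejection_rate(p_true, n_cell, n_sims=2000, alpha=0.05, seed=2019):
    """Fraction of simulated trials in which the 2 df test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cells = {}
        for (t, b), p in p_true.items():
            n = n_cell[b]
            r = sum(rng.random() < p for _ in range(n))  # binomial draw
            cells[(t, b)] = (r, n)
        if lrt_2df_pvalue(cells) <= alpha:
            rejections += 1
    return rejections / n_sims

# Assumed null scenario: same control and new-treatment rates in B- and B+.
null_rates = {(0, 0): 0.20, (1, 0): 0.40, (0, 1): 0.20, (1, 1): 0.40}
rate = rejection_rate(null_rates, n_cell={0: 140, 1: 60})
print(f"Estimated type I error of the 2 df test: {100 * rate:.2f}%")
```

Replacing `null_rates` with a scenario in which the rates differ by subgroup turns the same function into a power estimate.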
Summary
  For a binary endpoint, different treatment effect measures yield different interaction tests; we therefore need a test of the null hypothesis of no TxB interaction on any scale
  The logistic regression joint (B, TxB) 2 df test is the most powerful of the tests compared for testing the null of no TxB interaction on any scale
  Tests that use an F* statistic to measure the strength of the TxB interaction on each of the 5 common scales and then combine the resulting p-values are less powerful; the HM p-value combination method has good power for some of the simulated conditions
References
  Brown, M. B. and Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129-132.
  Liu, Y. et al. (2019). ACAT: A fast and powerful p-value combination method for rare-variant analysis in sequencing studies. American Journal of Human Genetics, 104 (3), 410-421.
  Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized Behrens-Fisher problem. Communications in Statistics, Simulation and Computation, 26, 1139-1145.
  Mehrotra, D. V. et al. (2017). A Powerful Learn-and-Confirm Pharmacogenomics Methodology for Randomized Clinical Trials. Presentation at JSM 2017.
  Wilson, D. J. (2019). The harmonic mean p-value and model averaging by mean maximum likelihood. Proceedings of the National Academy of Sciences, 116 (4), 1195-1200. DOI:10.1073/pnas.1814092116.