WHICH PSM METHOD TO USE? THE ASSOCIATION BETWEEN CHOSEN PROPENSITY SCORE METHOD AND OUTCOMES OF RETROSPECTIVE REAL-WORLD TREATMENT COMPARISONS: EVALUATION OF 18 DIFFERENT PSM METHODS

Antje Groth, Sabrina Mueller, Thomas Wilke (Ingress-Health HWM GmbH, Wismar, Germany)

BACKGROUND AND OBJECTIVES

Propensity score (PS) matching (PSM) is a widely used and accepted method for creating balanced sets of exposed and non-exposed patients to estimate treatment outcomes in retrospective real-world settings. A wide range of approaches can be applied when using PSM. We examined 18 matching methods to assess whether they differ in terms of matching quality or study results. We used an anonymized claims dataset of patients with type 2 diabetes mellitus (T2DM) who were initially treated with sulfonylureas (SU; n=904) or metformin (MET; n=7,874) in monotherapy (index date: first prescription of the respective agent).

METHODS

Three sets of baseline variables were used alternatively for PS calculation: (1) all available variables (age, gender, Charlson Comorbidity Index (CCI), adapted Diabetes Complications Severity Index (aDCSI), number of general practitioner visits, any previously observed micro-/macrovascular complications, and prescription of antithrombotic, antihypertensive or lipid-lowering medication); (2) only variables significantly associated with group exposition; (3) age, gender and CCI only. To these, we applied two matching algorithms: optimal 1:1 matching without replacement (O) and nearest neighbor 1:1 matching with replacement (NN). Caliper widths were either fixed (0.001) or determined by the PS (0.2 × standard deviation of log(PS)). In a further scenario regarding the PSM framework, PSM was done within 5-year age and gender classes. In sum, 18 PSM variation sets were derived (Table 1).
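To illustrate the nearest neighbor variant (NN) and the two caliper definitions described above, the following minimal Python sketch matches each exposed (SU) patient to the closest non-exposed (MET) patient, with replacement, and drops pairs outside the caliper. It is a sketch under stated assumptions, not the implementation used in this study: the propensity scores are assumed to be already estimated, the function names and example scores are hypothetical, and distances are taken on the PS scale because the scale on which the calipers were applied is not stated.

```python
import math
import statistics


def ps_dependent_caliper(ps_values, factor=0.2):
    """Caliper width as factor * SD of log(PS), following the wording of the methods."""
    return factor * statistics.stdev([math.log(p) for p in ps_values])


def nn_match_with_replacement(exposed, controls, caliper):
    """1:1 nearest neighbor matching with replacement.

    exposed, controls: dicts mapping a patient id to a propensity score.
    Returns (exposed_id, control_id) pairs whose score distance is <= caliper.
    """
    pairs = []
    for e_id, e_score in exposed.items():
        # With replacement: every control stays available for every exposed patient.
        c_id = min(controls, key=lambda cid: abs(controls[cid] - e_score))
        # Assumption: the caliper is compared with the distance on the PS scale.
        if abs(controls[c_id] - e_score) <= caliper:
            pairs.append((e_id, c_id))
    return pairs


# Hypothetical propensity scores, for illustration only.
su = {"su1": 0.30, "su2": 0.12, "su3": 0.55}
met = {"met1": 0.3005, "met2": 0.11, "met3": 0.29}

fixed_pairs = nn_match_with_replacement(su, met, caliper=0.001)
wide = ps_dependent_caliper(list(su.values()) + list(met.values()))
wide_pairs = nn_match_with_replacement(su, met, caliper=wide)
print(fixed_pairs)
print(round(wide, 3), wide_pairs)
```

In this toy example the narrow fixed caliper keeps fewer pairs than the PS-dependent one, which is the trade-off between closeness of matches and number of matched patients that the quality parameters are meant to capture. Optimal matching without replacement (O) minimizes the total distance over all pairs and is not covered by this sketch.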
Matching quality was assessed by comparing differences in: (1) the number of matched patients in relation to the number of unmatched patients; (2) similarity of baseline characteristics (Chi², t-test; number of variables with significant differences between the samples); (3) absolute standardized bias reduction in baseline characteristics (bias of the variables before vs. after matching); and (4) non-study-related outcomes: pneumonia/arm fracture/back pain rates (Chi²). The four quality parameters were quantified as percentage values for each PSM variation, and the calculated average per variation set resulted in a ranking from 1 (best) to 15 (worst).

Outcome analysis: Observation started on the date of the first observed MET/SU prescription. Follow-up time for each patient was at least 12 months (except in case of death) and lasted until the first observed event, death, therapy discontinuation (treatment gap >180 days or prescription of another agent) or the end of the observational period, whichever came first. All patients were followed with regard to the following events: all-cause death and major adverse cardiovascular events (MACE), defined as hospitalization with one of the following events: stroke (ICD-10 codes I60.-/I61.-/I62.-/I63.-/I64.-), acute myocardial infarction (ICD-10 code I21.-), congestive heart failure (ICD-10 code I50.-), coronary revascularization (OPS 5-361/5-362/5-363), percutaneous transluminal vascular intervention and stent implantation (OPS 8-836/8-837/8-84), peripheral vascular disease (ICD-10 code I73.9) or angina pectoris (ICD-10 code I20.-).
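The absolute standardized bias named among the matching quality parameters above is not defined further in this poster. The Python sketch below uses one common definition as an assumption (mean difference divided by the square root of the average of the two sample variances, expressed in percent) and checks whether the bias of a variable is smaller after matching. Function names and the example ages are hypothetical.

```python
import math
import statistics


def standardized_bias(exposed_values, control_values):
    """Standardized bias in percent: mean difference over the pooled SD.

    Assumption: the pooled SD is the square root of the mean of the two
    sample variances; the poster does not state which formula was used.
    """
    mean_diff = statistics.mean(exposed_values) - statistics.mean(control_values)
    pooled_sd = math.sqrt(
        (statistics.variance(exposed_values) + statistics.variance(control_values)) / 2
    )
    return 100 * mean_diff / pooled_sd


def bias_reduced(before, after):
    """True if the absolute standardized bias is smaller after matching."""
    return abs(after) < abs(before)


# Hypothetical ages, for illustration only.
age_su = [70, 72, 68, 75, 71]
age_met_unmatched = [60, 62, 58, 65, 61]
age_met_matched = [69, 72, 68, 74, 71]

before = standardized_bias(age_su, age_met_unmatched)
after = standardized_bias(age_su, age_met_matched)
print(round(before, 1), round(after, 1), bias_reduced(before, after))
```

Counting the variables for which `bias_reduced` is true and dividing by the number of baseline variables would give a percentage of the kind reported for this quality parameter.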
Table 1 – 18 different PSM methodology options
Matching algorithm: optimal matching without replacement (O); nearest neighbor matching with replacement (NN)
Variables included (PS): all available variables; variables significantly associated with group exposition (significant variables); age/gender/CCI
Defined caliper widths: 0.001; 0.2 × SD(log(PS))
Matching framework: PS; PS + age/gender classes
Option numbers: O1, O2, O3, O4, O5, O6, O7, O8, O9, O10, O11, O12, NN1, NN2, NN3, NN4, NN5, NN6

RESULTS

Matching quality: Within the 18 PS matchings, the number of matched pairs varied between 726 and 904. Six PSM sets reached the maximum of 904 matched pairs (#O3, O7, O11, NN1, NN3, NN5); two sets reached fewer than 750 matches (#O2, O6). The percentage of baseline characteristic variables that differed significantly between patients in the matched samples (p-value <0.05) ranged from 0% to 40%. In method set #O6, no baseline characteristic variable differed significantly between the PSM samples; in sets #O3 and O4, four out of 10 baseline characteristic variables still differed significantly. The percentage of baseline characteristics with bias reduction between unmatched and matched samples ranged from 60% to 80%. The best results, with only 20% bias deterioration, were achieved by sets #O1, O3, O4, O5, NN1 and NN2. In 13 of the 18 defined PSM sets, no significant difference in non-study-related outcomes was found; in the remaining sets, one out of five outcome variables showed a significant difference (Table 2).

Outcome analysis: Before matching, the relative risks of all-cause death (2.71) and MACE (1.66) were significantly higher in SU- vs. MET-treated T2DM patients. After matching, all-cause mortality differed significantly between SU and MET exposition in only 10 of the 18 comparisons. For MACE, only four of the PSM sets showed significant differences.
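As a worked illustration of the relative risks quoted above, the following Python sketch computes an event rate per patient-year and the SU/MET ratio of two such rates. Treating the relative risk as a simple ratio of rates is an assumption that is consistent with the values in Table 3 but is not stated explicitly; the function names are hypothetical. The inputs are the unmatched rates from Table 3.

```python
def event_rate(events, patient_years):
    """Events per observed patient-year."""
    return events / patient_years


def relative_risk(rate_su, rate_met):
    """Ratio of the SU rate to the MET rate (assumed definition)."""
    return rate_su / rate_met


# Unmatched rates per patient-year as printed in Table 3.
rr_death = relative_risk(0.131, 0.048)
rr_mace = relative_risk(0.094, 0.057)
print(round(rr_death, 2), round(rr_mace, 2))
```

The ratios obtained from the printed rates come out close to, but not exactly at, the reported 2.71 and 1.66, presumably because the rates in the table are rounded to three decimals.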
The outcome analysis for the three PSM sets with the best matching quality (#O5, O1 and NN5) showed significant differences in all-cause mortality and no difference in MACE. The mortality rate per observed patient-year ranged from 0.070 to 0.125 (MET) and from 0.092 to 0.131 (SU); the relative risk was 1.05-1.47. The MACE rate per patient-year ranged from 0.061 to 0.110 (MET) and from 0.086 to 0.097 (SU); the relative risk was between 0.85 and 1.45.

Table 3 – Outcomes comparison*
Columns: PSM variation set; ranking number; all-cause death: rate/person-year MET, rate/person-year SU, relative risk SU/MET (p-value); MACE: rate/person-year MET, rate/person-year SU, relative risk SU/MET (p-value)
unmatched 0.048 0.131 2.71 (<0.001) 0.057 0.094 1.66 (0.002)
O1 2. 0.096 0.126 1.31 (0.022) 0.073 0.097 1.33 (0.078)
O2 15. 0.07 0.092 1.32 (0.045) 0.067 1.37 (0.063)
O3 12. 1.37 (0.008) 0.065 1.44 (0.036)
O4 13. 0.107 0.127 1.18 (0.091) 0.066 0.095 1.45 (0.028)
O5 1. 0.084 0.121 1.44 (0.002) 0.074 1.27 (0.194)
O6 11. 0.077 1.26 (0.097) 0.061 0.086 1.41 (0.076)
O7 1.38 (0.012) 0.075 1.26 (0.206)
O8 9. 1.30 (0.024) 1.45 (0.036)
O9 6. 0.123 1.27 (0.050) 1.42 (0.052)
O10 7. 0.083 0.103 1.24 (0.114) 0.064 1.43 (0.053)
O11 0.104 1.26 (0.042) 1.41 (0.052)
O12 4. 0.102 1.25 (0.046) 0.069
NN1 8. 0.125 1.05 (0.359) 1.27 (0.097)
NN2 10. 0.118 1.08 (0.274) 1.37 (0.045)
NN3 0.12 1.09 (0.359) 1.13 (0.429)
NN4 14. 0.111 0.124 1.11 (0.304) 1.33 (0.134)
NN5 3. 0.093 1.41 (0.008) 0.11 0.85 (0.439)
NN6 5. 1.47 (0.005) 0.89 (0.548)

Table 2 – PSM quality assessment*
Quality parameters: unmatched O1 O2 O3 O4 O5 O6 O7 O8 O9 O10 O11 O12 NN1 NN2 NN3 NN4 NN5 NN6
Sample size MET 7,874 865 735 904 892 870 726 891 883 834 894 878 881 890 SU
Difference baseline characteristics (number of variables with p-value <0.05): 6/10 1/10 3/10 4/10 0/10 2/10
Number of variables with bias reduction (baseline characteristics): - 8/10 7/10
Difference non-study-related outcomes (number of variables with p-value >0.05): 10/10
Ranking: 2 15 12 13 1 11 9 6 7 4 8 10 14 3 5
*Green boxes indicate best results per parameter, red boxes worst results.
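The ranking in Table 2 is described as the result of averaging the four quality parameters, each expressed as a percentage, per variation set. A minimal Python sketch of that averaging and ranking step follows. It assumes that higher percentages mean better quality and that the four parameters are weighted equally; the poster does not state how ties were handled, and here they simply keep their input order. The set names and scores are invented for illustration and are not study results.

```python
def rank_variation_sets(quality):
    """Rank sets by the mean of their quality percentages (1 = best).

    quality: dict mapping a set name to a list of percentage scores,
    where a higher score is assumed to mean better matching quality.
    """
    averages = {name: sum(scores) / len(scores) for name, scores in quality.items()}
    # Sort from highest to lowest average; ties keep their input order.
    ordered = sorted(averages, key=averages.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical percentages for three made-up variation sets.
quality = {
    "set_a": [95.0, 100.0, 80.0, 100.0],
    "set_b": [80.0, 90.0, 70.0, 100.0],
    "set_c": [100.0, 60.0, 80.0, 90.0],
}
ranking = rank_variation_sets(quality)
print(ranking)
```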
Caliper definition had the highest impact on matching quality: the fixed PS caliper of 0.001 resulted, on average, in better matching quality (#O1/3/5/7/9/11, NN1/3/5). Matching within fixed age and gender classes did not lead to a quality improvement (#O2/4/6/8/10/12). The choice of variables for calculating the PS played no role or was masked by the effect of other parameters. Both matching algorithms produced very good as well as very poor matching quality, so we conclude that both can be used equally. The best matching quality was achieved using the optimal algorithm with a PS caliper of 0.001 without matching within predefined age and gender classes (#O5).

*Red boxes indicate statistical insignificance.

CONCLUSION AND RECOMMENDATIONS

Different PSM methods are associated with different matching quality and strongly affect the outcomes of retrospective comparative analyses. We recommend (1) carefully choosing the PSM method and (2) applying different PSM methods in scenario analyses to test the robustness of study results. Finally, we propose describing the methods used in detail to improve understanding and interpretation of published study results.

www.ingress-health.com www.twitter.com/ingressh info@ingress-health.com www.linkedin.com/company/ingress-health