Muqun (Rachel) Li Privacy Analytics, an IQVIA company #IS19

Efficient Active Learning for Electronic Medical Record De-identification
Privacy and Bias in Data Science, S45
Muqun (Rachel) Li, Privacy Analytics, an IQVIA company
#IS19

Disclosure
I disclose the following relevant relationship with commercial interests: I am an employee of Privacy Analytics, an IQVIA company, which builds and markets de-identification products.
2019 Informatics Summit | amia.org

Learning Objectives
After participating in this session, the learner should be better able to understand the challenge posed by the cost of human annotation in de-identifying natural language data, and to discuss several active learning algorithms that reduce this cost when de-identifying electronic medical records and clinical study reports.

What is De-identification?
Protected Health Information (PHI) removal
Direct identifiers (DI): name, SSN, medical record number, …
Quasi-identifiers (QI): DOB, zip code, gender, ethnicity, …

Challenge for natural language data

Original*
Patient – SAE (pneumonia)
Patient details: 49-year-old, female, Caucasian
The patient entered the study on JUN
Past medical history included thyroidectomy OCT …

De-identified
Patient [SUBJID] – SAE (pneumonia)
Patient details: [AGE]-year-old, [SEX], [RACE]
The patient entered the study on [DATE].
Past medical history included [MEDICAL_HISTORY] [DATE]. …
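The tag substitution in the example above can be sketched in a few lines of Python. This is only an illustration: it assumes the PHI spans have already been found (by a human or a model) and are given as non-overlapping character offsets. The `redact` helper and the `Span` alias are hypothetical names, not part of any tool mentioned in the talk.

```python
from typing import List, Tuple

# (start, end, tag) character offsets; assumed non-overlapping (hypothetical format).
Span = Tuple[int, int, str]

def redact(text: str, spans: List[Span]) -> str:
    """Replace each annotated span with its bracketed PHI tag."""
    pieces = []
    cursor = 0
    for start, end, tag in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{tag}]")          # placeholder for the PHI value
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last span
    return "".join(pieces)

text = "Patient details: 49-year-old, female, Caucasian"
# Offsets are looked up from the text to keep the example easy to verify.
spans = [
    (text.index(value), text.index(value) + len(value), tag)
    for value, tag in [("49", "AGE"), ("female", "SEX"), ("Caucasian", "RACE")]
]
redacted = redact(text, spans)
print(redacted)
```

The hard part in practice is producing the spans, which is what the rest of the talk addresses.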

Automated De-identification Tools
Rule based: local knowledge and hand-crafted rules, which are not always easy to gather.
Machine learning based: models inferred automatically from annotated training data, which requires a certain amount of human annotation.
Hybrid: integrates the best of both, but also needs both local knowledge and human annotation.

Machine Learning Based De-identification
Named Entity Recognition
Conditional Random Fields (CRFs): capture dependencies between type labels; this is what we use in this work (MIST).
Recurrent Neural Networks (RNNs): do not need handcrafted features or rules; can extract features automatically.

Machine Learning De-identification Workflow
Figure (workflow diagram): Unannotated Natural Language Data; Random Sample; Sampled data to annotate by human; Annotation (Human); Gold Standard Data; Model Training; Models; PHI Detection (Detection Tools); Protected Data.

Scalability Challenge
De-identification models should not be used off-the-shelf
  2016 CEGS N-GRID shared tasks: best F-measure around 0.8
Appropriately trained systems
  i2b2 challenge: best F-measure of 0.964
  Best F-measure of […] via a deep neural network
Can we engineer a system to learn faster?
  State-of-the-art: constantly needs sufficiently high-quality training data
  Poorly trained systems mean more time/cost in human correction
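For readers less familiar with the metric quoted above, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts in the usage line are made up for illustration and are not results from any of the challenges mentioned.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 80 PHI instances found, 20 spurious, 20 missed.
example_f1 = f_measure(tp=80, fp=20, fn=20)
print(round(example_f1, 3))
```

In de-identification a missed identifier (false negative) is usually the costlier error, which is why recall tends to get extra attention alongside the F-measure.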

Why Active Learning?
Figure (pool-based active learning cycle): Machine learning model; Oracle (human annotator); Unlabeled pool U; Labeled training set; Learn a model; Select queries.
Hypothesis
  More informative data actively requested
  Less training data needed
  Performance maintained or even improved

Active Learning De-identification Workflow
Figure (active learning workflow diagram): Unannotated Natural Language Data; Random Sample; Initial batch of data to annotate by human; Query (Heuristics, Selection criteria); Next batch of data to annotate by human; Annotation (Human); Gold Standard Data; Model Training; Models; PHI Detection (Detection Tools); Protected Data.
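The workflow above can be read as a loop: annotate a random initial batch, train, score the remaining documents, send the highest-scoring batch to the annotator, and retrain. The following is a minimal sketch of that loop, not the system used in this work. The function name and the `annotate`, `train` and `informativeness` callables are hypothetical stand-ins for the human annotator, the model trainer and the selection criterion.

```python
import random
from typing import Any, Callable, Iterable, List, Tuple

def active_learning_loop(
    pool: Iterable[Any],
    annotate: Callable[[Any], Any],                # stand-in for the human oracle
    train: Callable[[List[Tuple[Any, Any]]], Any], # returns a model
    informativeness: Callable[[Any, Any], float],  # (model, doc) -> score
    initial_size: int = 10,
    batch_size: int = 5,
    rounds: int = 3,
    seed: int = 0,
) -> Tuple[Any, List[Tuple[Any, Any]]]:
    rng = random.Random(seed)
    docs = list(pool)
    rng.shuffle(docs)  # the initial batch is a random sample
    labeled = [(d, annotate(d)) for d in docs[:initial_size]]
    unlabeled = docs[initial_size:]
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Query step: rank the remaining documents by the selection criterion.
        unlabeled.sort(key=lambda d: informativeness(model, d), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((d, annotate(d)) for d in batch)
        model = train(labeled)  # retrain on the enlarged gold standard
    return model, labeled

# Toy usage: documents are integers, the "model" is just the training-set size.
toy_model, toy_labeled = active_learning_loop(
    pool=range(30),
    annotate=lambda d: d % 2,
    train=len,
    informativeness=lambda model, d: float(d),
)
print(toy_model, len(toy_labeled))
```

Passive learning corresponds to replacing the ranking step with another random draw from the pool.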

Selection Criteria Heuristics
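Least Confidence with Upper Bound (LCUB)
\[ \sum_t f(x_t), \qquad f(x_t) = \begin{cases} 1 - P(y_t \mid x_t), & P(y_t \mid x_t) < \theta \\ 0, & P(y_t \mid x_t) \ge \theta \end{cases} \]

Entropy with Lower Bound (ELB)
\[ H(x_t) = -\sum_j P(y_{tj} \mid x_t) \log P(y_{tj} \mid x_t) \]
\[ \sum_t g(x_t), \qquad g(x_t) = \begin{cases} H(x_t), & H(x_t) > \rho \\ 0, & H(x_t) \le \rho \end{cases} \]

Return On Investment (ROI)
\[ \sum_t ROI(x_t) \]
Expected ROI of token x labeled as non-PHI:
\[ (cn_n - ct_n) \times P(y \mid x) \times \bigl(1 - P'(y \mid x)\bigr) + (cn_p - ct_p) \times \bigl(1 - P(y \mid x)\bigr) \times P'(y \mid x) - ct_r \]
where cn_n - ct_n is the net contribution of human correction per FN, cn_p - ct_p is the net contribution of human correction per FP, and ct_r is the average reading cost per token.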
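The three criteria above can be sketched as document-level scoring functions over per-token model outputs. This is an illustrative reading of the slide, not the implementation used in this work. It assumes each document score is a sum over tokens and that LCUB uses the probability of each token's most likely label. It takes P and P' as plain inputs because the slide does not define them further. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence

def lcub_score(top_probs: Sequence[float], theta: float) -> float:
    """Sum of (1 - p) over tokens whose top-label probability p is below theta."""
    return sum(1.0 - p for p in top_probs if p < theta)

def token_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy of one token's label distribution (natural log)."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def elb_score(label_dists: Sequence[Sequence[float]], rho: float) -> float:
    """Sum of token entropies, counting only tokens whose entropy exceeds rho."""
    total = 0.0
    for dist in label_dists:
        h = token_entropy(dist)
        if h > rho:
            total += h
    return total

def roi_score(p: Sequence[float], p_prime: Sequence[float],
              net_fn: float, net_fp: float, read_cost: float) -> float:
    """Sum of per-token expected ROI.

    net_fn    = cn_n - ct_n (net contribution of correcting one FN)
    net_fp    = cn_p - ct_p (net contribution of correcting one FP)
    read_cost = ct_r        (average reading cost per token)
    """
    return sum(
        net_fn * a * (1.0 - b) + net_fp * (1.0 - a) * b - read_cost
        for a, b in zip(p, p_prime)
    )

# Made-up numbers, for illustration only.
lcub_example = lcub_score([0.9, 0.6, 0.5], theta=0.7)
elb_example = elb_score([[0.5, 0.5], [1.0, 0.0]], rho=0.1)
roi_example = roi_score([0.8], [0.5], net_fn=2.0, net_fp=1.0, read_cost=0.1)
print(lcub_example, elb_example, roi_example)
```

The thresholds theta and rho filter out tokens the model is already sure about, so that long documents full of confident non-PHI tokens do not dominate the ranking.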

Real World Clinical Trials Dataset
370 documents; […] tokens; 12 PHI types; 7098 PHI instances

Preliminary Analysis

Design of Simulation Experiments
Figure (experiment design tree with levels Batch Size, Query Strategy, Query Setting): starting from an initial batch of 10 documents, batch sizes of 10, 5 and 1 are each combined with the query strategies LCUB, ELB, ROI and Random. LCUB is run for cases 1 to n, ELB for cases 1 to m and ROI for cases 1 to k, with a best case identified for each of LCUB, ELB and ROI.

Active learning surpasses passive learning
AL Learning Rate
The advantage of active learning becomes more apparent with smaller batch sizes.

Reduction in Training Time
Active learning needs less training data than passive learning.
Smaller batch sizes need less training data than bigger batch sizes, but more re-training time.
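The re-training cost of small batches follows from simple arithmetic: every batch triggers one more model update. The sketch below counts the updates needed if the whole pool were eventually annotated. Using all 370 documents as the pool is an assumption made only for illustration, since the slides do not state how the data was split, and `retraining_rounds` is a hypothetical helper.

```python
import math

def retraining_rounds(pool_size: int, initial_batch: int, batch_size: int) -> int:
    """Upper bound on model updates after the initial one, if the pool is exhausted."""
    remaining = max(pool_size - initial_batch, 0)
    return math.ceil(remaining / batch_size)

# Assumed pool of 370 documents and the initial batch of 10 from the design above.
rounds_by_batch = {b: retraining_rounds(370, 10, b) for b in (10, 5, 1)}
print(rounds_by_batch)
```

In practice the loop stops once performance plateaus, so the real number of updates is lower, but the ratio between batch sizes stays the same.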

i2b2 2006 Dataset
889 discharge summaries
Real identifiers replaced by synthetic information

Lessons Learned
Active learning can reach comparable or higher performance with less training data than passive learning.
Smaller batch sizes mean faster learning, but can also result in more re-training time.
ROI is usually the most stable criterion, but it does not always perform the best.

Summary And Future Work
Active learning applied to training data selection for natural language de-identification generally results in more efficient learning than passive learning.
Future work: collect data on actual human correction costs and contributions in real-world problems.
An adaptive batch sizing strategy might lead to better training.
Deep neural networks might be considered for the active learning system.


Thank you! Questions?
Email me at: rli@privacy-analytics.com
