Muqun (Rachel) Li Privacy Analytics, an IQVIA company #IS19

Presentation transcript:

Efficient Active Learning for Electronic Medical Record De-identification
Privacy and Bias in Data Science (S45)
Muqun (Rachel) Li, Privacy Analytics, an IQVIA company
#IS19

Disclosure
I disclose the following relevant relationship with commercial interests: I am an employee of Privacy Analytics, an IQVIA company, which builds and markets de-identification products.
2019 Informatics Summit | amia.org

Learning Objectives
After participating in this session, the learner should be better able to understand the challenge posed by the cost of human annotation in de-identifying natural language data, and to discuss several active learning algorithms that reduce this cost when de-identifying electronic medical records and clinical study reports.

What is De-identification?
Protected Health Information (PHI) removal
  Direct identifiers (DI): name, SSN, medical record number, …
  Quasi-identifiers (QI): DOB, zip code, gender, ethnicity, …
Challenge for natural language data
Original*
  Patient 4729-00012 – SAE (pneumonia)
  Patient details: 49-year-old, female, Caucasian
  The patient entered the study on JUN 03 2016. Past medical history included thyroidectomy OCT 11 2014. …
De-identified
  Patient [SUBJID] – SAE (pneumonia)
  Patient details: [AGE]-year-old, [SEX], [RACE]
  The patient entered the study on [DATE]. Past medical history included [MEDICAL_HISTORY] [DATE]. …
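The tagging in the example above can be illustrated with a very small rule-based sketch. The three regular expressions and the `redact` helper are hypothetical, written only to match the formats in the sample text. They are not the tool used in this work, and real de-identification needs far broader coverage.

```python
import re

# Hypothetical patterns, chosen only to match the sample text formats.
PATTERNS = [
    # Subject IDs such as 4729-00012 (four digits, hyphen, five digits).
    ("SUBJID", re.compile(r"\b\d{4}-\d{5}\b")),
    # Dates such as JUN 03 2016 (upper-case month, day, year).
    ("DATE", re.compile(
        r"\b(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{4}\b")),
    # Ages written as "<number>-year-old"; the lookahead keeps the suffix.
    ("AGE", re.compile(r"\b\d{1,3}(?=-year-old)")),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its bracketed tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(f"[{tag}]", text)
    return text


sample = ("Patient 4729-00012: 49-year-old, female. "
          "The patient entered the study on JUN 03 2016.")
print(redact(sample))
# Patient [SUBJID]: [AGE]-year-old, female. The patient entered the study on [DATE].
```

Note that "female" is left untouched: quasi-identifiers such as sex, race, or medical history rarely follow a fixed surface pattern, which is one reason hand-crafted rules alone tend to fall short on free text.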

Automated De-identification Tools
Rule based
  Local knowledge, hand-crafted rules
  Not always easy to gather
Machine learning based
  Models inferred automatically from annotated training data
  A certain amount of human annotation
Hybrid
  Integrate the best of both
  Also needs both local knowledge and human annotation

Machine Learning Based De-identification
Named Entity Recognition
  Conditional Random Fields (CRFs)
    Capture dependencies between type labels
    What we use in this work (MIST)
  Recurrent Neural Networks (RNNs)
    Do not need handcrafted features or rules; can automatically extract features

Machine Learning De-identification Workflow
[Workflow diagram labels: unannotated natural language data, random sample, sampled data to annotate by human, human annotation, gold standard data, model training, models, detection tools, PHI detection, protected data]

Scalability Challenge
De-identification models should not be used off-the-shelf
  2016 CEGS N-GRID shared tasks: best F-measure around 0.8
Appropriately trained systems
  i2b2 2014 challenge: best F-measure of 0.964
  Best F-measure of 0.979 via a deep neural network
Can we engineer a system to learn faster?
  State-of-the-art: constantly needs sufficiently high-quality training data
  Poorly trained systems mean more time/cost in human correction
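For readers less familiar with the metric quoted above, here is a minimal sketch of how an F-measure is computed from true positive, false positive, and false negative counts. The slides do not say whether the reported scores are token-level or entity-level, or which beta was used, so the balanced F1 default and the function name `f_measure` are assumptions.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from raw counts; beta=1 gives the balanced F1."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only, not taken from any of the cited challenges.
print(f_measure(tp=80, fp=20, fn=20))
```

Every missed PHI instance (a false negative) lowers recall, and every over-redacted token (a false positive) lowers precision. Both kinds of error are what a human reviewer has to correct.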

Why Active Learning?
Hypothesis
  More informative data actively requested
  Less training data needed
  Performance maintained or even improved
[Diagram labels: machine learning model, oracle (human annotator), unlabeled pool U, labeled training set, learn a model, select queries]

Active Learning De-identification Workflow
[Workflow diagram labels: unannotated natural language data, random sample, initial batch of data to annotate by human, heuristics (selection criteria), query, next batch of data to annotate by human, human annotation, gold standard data, model training, models, detection tools, PHI detection, protected data]
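The workflow above can be sketched as a generic pool-based loop: annotate a random initial batch, train, score the remaining documents with a selection heuristic, send the top-scoring batch to the human annotator, and retrain. The function name, the callables `annotate`, `train`, and `score`, and the default sizes are all assumptions made for illustration. This is not the MIST-based system described in the talk.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Doc = TypeVar("Doc")
Labeled = TypeVar("Labeled")
Model = TypeVar("Model")


def active_learning_loop(
    pool: Sequence[Doc],
    annotate: Callable[[Doc], Labeled],       # human oracle (hypothetical)
    train: Callable[[List[Labeled]], Model],  # model training (hypothetical)
    score: Callable[[Model, Doc], float],     # selection heuristic (hypothetical)
    initial_size: int = 10,
    batch_size: int = 5,
    rounds: int = 3,
    seed: int = 0,
) -> Tuple[Model, List[Labeled]]:
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Initial batch: a random sample annotated by the human oracle.
    labeled = [annotate(d) for d in unlabeled[:initial_size]]
    unlabeled = unlabeled[initial_size:]
    model = train(labeled)

    for _ in range(rounds):
        if not unlabeled:
            break
        # Query: rank remaining documents, most informative first.
        unlabeled.sort(key=lambda d: score(model, d), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend(annotate(d) for d in batch)
        model = train(labeled)  # retrain on the enlarged gold standard

    return model, labeled


# Toy usage: documents are integers and the "model" is just a count.
model, gold = active_learning_loop(
    pool=range(30),
    annotate=lambda d: (d, d % 2),
    train=lambda labeled: len(labeled),
    score=lambda m, d: float(d),
)
print(model, len(gold))
```

Replacing `score` with a function that returns a random number turns the same loop into the passive (random sampling) baseline, which makes the two approaches easy to compare under identical batch sizes.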

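Selection Criteria Heuristics
Least Confidence with Upper Bound (LCUB):
\[ \sum_t f(x_t), \qquad f(x_t) = \begin{cases} 1 - P(y_t \mid x_t), & P(y_t \mid x_t) < \theta \\ 0, & P(y_t \mid x_t) \ge \theta \end{cases} \]
Entropy with Lower Bound (ELB):
\[ H(x_t) = -\sum_j P(y_{tj} \mid x_t) \log P(y_{tj} \mid x_t) \]
\[ \sum_t g(x_t), \qquad g(x_t) = \begin{cases} H(x_t), & H(x_t) > \rho \\ 0, & H(x_t) \le \rho \end{cases} \]
Return On Investment (ROI):
\[ \sum_t \mathrm{ROI}(x_t) \]
Expected ROI of token x labeled as non-PHI:
\[ (\mathit{cn}_n - \mathit{ct}_n) \times P(y \mid x) \times \bigl(1 - P'(y \mid x)\bigr) + (\mathit{cn}_p - \mathit{ct}_p) \times \bigl(1 - P(y \mid x)\bigr) \times P'(y \mid x) - \mathit{ct}_r \]
where \(\mathit{cn}_n - \mathit{ct}_n\) is the net contribution of human correction per FN, \(\mathit{cn}_p - \mathit{ct}_p\) is the net contribution of human correction per FP, and \(\mathit{ct}_r\) is the average reading cost per token.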
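The three heuristics can be sketched as small scoring functions. Several points here are assumptions rather than details stated on the slide: that a document's score is the sum of its token scores, that P(y_t | x_t) is the model's probability for its predicted label, that the logarithm is natural, and all function and parameter names. The slide does not define P and P', so the ROI function simply takes them as inputs.

```python
import math
from typing import Sequence


def lcub_score(predicted_label_probs: Sequence[float], theta: float) -> float:
    """LCUB: sum 1 - P over tokens whose probability is below theta."""
    return sum(1.0 - p for p in predicted_label_probs if p < theta)


def token_entropy(label_probs: Sequence[float]) -> float:
    """Entropy of one token's label distribution (natural log assumed)."""
    return -sum(p * math.log(p) for p in label_probs if p > 0.0)


def elb_score(token_label_probs: Sequence[Sequence[float]], rho: float) -> float:
    """ELB: sum token entropies, counting only those above rho."""
    entropies = (token_entropy(probs) for probs in token_label_probs)
    return sum(h for h in entropies if h > rho)


def roi_token(p: float, p_prime: float,
              cn_n: float, ct_n: float,
              cn_p: float, ct_p: float,
              ct_r: float) -> float:
    """Expected ROI of one token, transcribed term by term from the slide."""
    return ((cn_n - ct_n) * p * (1.0 - p_prime)
            + (cn_p - ct_p) * (1.0 - p) * p_prime
            - ct_r)


# Illustrative numbers only; costs and contributions are made up.
print(lcub_score([0.9, 0.6, 0.4], theta=0.7))
print(elb_score([[0.5, 0.5], [1.0, 0.0]], rho=0.1))
print(roi_token(0.8, 0.5, cn_n=3, ct_n=1, cn_p=2, ct_p=1, ct_r=0.1))
```

The thresholds act as filters: with LCUB, tokens the model is already confident about (probability at or above theta) contribute nothing, and with ELB, low-entropy tokens (at or below rho) are ignored, so long documents full of easy tokens do not dominate the ranking.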

Real World Clinical Trials Dataset
370 documents
312,991 tokens
12 PHI types
7,098 PHI instances

Preliminary Analysis

Design of Simulation Experiments
Initial batch of 10 documents
Batch sizes: 10, 5, 1
Query strategies: LCUB, ELB, ROI, Random
Query settings: LCUB Case 1 … Case n, ELB Case 1 … Case m, ROI Case 1 … Case k, with a best case identified for each strategy

AL Learning Rate
Active learning surpasses passive learning
The advantage of active learning becomes more apparent with smaller batch sizes

Reduction in Training Time
Active learning needs less training than passive learning
Smaller batch sizes need less training than bigger batch sizes but more re-training time

i2b2 2006 Dataset
889 discharge summaries
Real identifiers replaced by synthetic information

Lessons Learned
Active learning could lead to comparable or higher performance with less training data than passive learning
Smaller batch sizes mean faster learning, but could also result in more re-training time
ROI is usually the most stable, but does not always perform the best

Summary and Future Work
Active learning adopted in training data selection for natural language de-identification could generally result in more efficient learning than passive learning
Collect data for actual human correction costs and contributions in real-world problems
An adaptive batch sizing strategy might lead to better training
Deep neural networks might be considered for the active learning system


Thank you! Questions?
Email me at: [rli@privacy-analytics.com]