Global Disclosure Risk for Microdata with Continuous Attributes

Global Disclosure Risk for Microdata with Continuous Attributes Traian Marius Truta Northern Kentucky University

HIPAA Privacy Rule
The Health Insurance Portability and Accountability Act (HIPAA) was enacted in 1996.
The Privacy Rule protects the privacy of individually identifiable health information by establishing conditions for its use and disclosure.
Privacy Rule effective date: 14 April 2003.
The rule defines 18 identifiers that must be removed in order to de-identify the data.
11/18/2018 Traian Truta - Northern Kentucky University

The Identifiers in the Privacy Rule
Names
Telephone #
Fax #
E-mail address
Social Security #
Medical record, prescription #
Health Plan beneficiary #
Account #
Certificates/license #
VIN and serial #, license plate #
Device identifiers, serial #
Web URLs
IP address
Biometric identifiers (fingerprints)
Full face photo images
Unique identifying #
Geographic info (including city, state, and zip)
Elements of dates

De-identification Process
Safe Harbor: remove all 18 defined identifiers, with no knowledge that the remaining information can identify the individual.
Statistical de-identification: a statistician certifies that there is a “very small” risk that the information could be used to identify the individual.

Disclosure Control Problem
[Diagram] Individuals submit data. The data owner collects the data, applies a masking process, and releases the masked data. The researcher receives the masked data and uses it for statistical analysis. The intruder uses the masked data together with external data to disclose confidential information.
The masking process has two goals: protect the confidentiality of individuals (assessed with measures of disclosure risk) and preserve data utility (assessed with measures of information loss).
One element of the diagram carries the label "This Presentation".

General Framework for Microdata
I – Identifier Attributes (Name, SSN, etc.)
K – Key Attributes (Zip Code, Age, Race, etc.)
S – Confidential Attributes (Income, Diagnosis, etc.)

Disclosure Control Techniques
Different disclosure control techniques are applied to the following initial microdata:
RecID  Name           SSN        Age  State  Diagnosis     Income  Billing
1      John Wayne     123456789  44   MI     AIDS          45,500  1,200
2      Mary Gore      323232323              Asthma        37,900  2,500
3      John Banks     232345656  55                        67,000  3,000
4      Jesse Casey    333333333                            21,000  1,000
5      Jack Stone     444444444                            90,000  900
6      Mike Kopi      666666666  45          Diabetes      48,000  750
7      Angela Simms   777777777  25   IN                   49,000
8      Nike Wood      888888888  35                        66,000  2,200
9      Mikhail Aaron  999999999                            69,000  4,200
10     Sam Pall       100000000              Tuberculosis  34,000  3,100

Remove Identifiers
Identifiers such as Name, SSN, etc. are removed.
RecID  Age  State  Diagnosis     Income  Billing
1      44   MI     AIDS          45,500  1,200
2                  Asthma        37,900  2,500
3      55                        67,000  3,000
4                                21,000  1,000
5                                90,000  900
6      45          Diabetes      48,000  750
7      25   IN                   49,000
8      35                        66,000  2,200
9                                69,000  4,200
10                 Tuberculosis  34,000  3,100

Sampling
Sampling is the disclosure control method in which only a subset of records is released.
If n is the number of elements in the initial microdata and t is the number of released elements, we call sf = t / n the sampling factor.
Simple random sampling is the most frequently used technique. Each individual is chosen entirely by chance, and each member of the population has an equal chance of being included in the sample.
RecID  Age  State  Diagnosis     Income  Billing
5      55   MI     Asthma        90,000  900
4      44                        21,000  1,000
8      35          AIDS          66,000  2,200
9                                69,000  4,200
7      25   IN     Diabetes      49,000  1,200
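
The sampling step can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the rounding of sf * n to the nearest integer, and the `seed` parameter are assumptions made here, not part of the presentation.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release round(sf * n) of the n records, each chosen with equal chance.

    sf is the sampling factor t / n from the slide; rounding t to the
    nearest integer is an assumption of this sketch.
    """
    if not 0 < sf <= 1:
        raise ValueError("sampling factor must be in (0, 1]")
    t = round(sf * len(records))
    rng = random.Random(seed)  # a seed only makes the sketch reproducible
    return rng.sample(records, t)  # sampling without replacement


# Ten record IDs, as in the initial microdata, with sf = 0.5.
rec_ids = list(range(1, 11))
sample = simple_random_sample(rec_ids, sf=0.5, seed=0)
```

With sf = 0.5 and n = 10 the sketch releases five records, the same sample size as the table on the slide; which records are selected depends on the random generator.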

Microaggregation
Order the records from the initial microdata by an attribute, create groups of consecutive values, and replace those values by the group average.
Example: microaggregation for attribute Income with minimum group size 3.
The total sum of all Income values remains the same.
RecID  Age  State  Diagnosis     Income  Billing
2      44   MI     Asthma        30,967  2,500
4                                        1,000
10     45          Tuberculosis          3,100
1                  AIDS          47,500  1,200
6                  Diabetes              750
7      25   IN
3      55                        73,000  3,000
5                                        900
8      35                                2,200
9                                        4,200
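
A minimal sketch of univariate microaggregation, using the Income values from the initial microdata. The function name `microaggregate` and the rule that leftover records join the last group are assumptions of this sketch; the slide only states the minimum group size.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of consecutive sorted values.

    Groups have size k, except the last group, which absorbs any
    remainder (an assumption of this sketch).
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values and k >= 1")
    # Indices of the records ordered by the attribute value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    masked = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked


# Income of records 1..10 from the initial microdata.
income = [45500, 37900, 67000, 21000, 90000,
          48000, 49000, 66000, 69000, 34000]
masked = microaggregate(income, 3)
```

For these ten values the groups have sizes 3, 3 and 4, and the group means round to 30,967, 47,500 and 73,000, as on the slide. The sum of the masked values equals the sum of the original values because each group keeps its own total.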

Global Disclosure Risk Measures
Assumptions
    The intruder does not know any confidential information.
    The intruder knows all the key and identifier values for the population.
Objectives
    DR measures for specific DC methods (Remove Identifiers, Sampling, Microaggregation, etc.)
    DR measures for any combination of DC methods
Proposed measures
    DRmin ≤ DRW ≤ DRmax

Notations for IM and IMM
n – the number of entities in the population.
F – the number of clusters with the same values for key attributes.
Ak – the set of elements from the k-th cluster, for all k, 1 ≤ k ≤ F.
Fi = | {Ak | |Ak| = i, k = 1, .., F} |, for all i, 1 ≤ i ≤ n. Fi represents the number of clusters of size i.
ni = | {x ∈ Ak | |Ak| = i, k = 1, .., F} |, for all i, 1 ≤ i ≤ n. ni represents the number of records in clusters of size i.

Disclosure Risk Measures for Remove Identifiers Method
RecID  Age  State  Diagnosis     Income  Billing
1      44   MI     AIDS          45,500  1,200
2                  Asthma        37,900  2,500
3      55                        67,000  3,000
4                                21,000  1,000
5                                90,000  900
6      45          Diabetes      48,000  750
7      25   IN                   49,000
8      35                        66,000  2,200
9                                69,000  4,200
10                 Tuberculosis  34,000  3,100
Clusters: {1, 2, 4}, {3, 5, 9}, {6, 10}, {7}, {8}
n = 10
n1 = 2, n2 = 2, n3 = 6
F = 5
F1 = 2, F2 = 1, F3 = 2
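
The counts on this slide follow directly from the cluster sizes. A small sketch, assuming the clusters are given as sets of record IDs (the function name `cluster_counts` is hypothetical):

```python
from collections import Counter


def cluster_counts(clusters):
    """Return (F, F_i, n_i) for a list of clusters of records.

    F   is the number of clusters,
    F_i maps a cluster size i to the number of clusters of that size,
    n_i maps a cluster size i to the number of records in such clusters.
    """
    sizes = [len(c) for c in clusters]
    F_i = Counter(sizes)
    n_i = {i: i * count for i, count in F_i.items()}
    return len(sizes), dict(F_i), n_i


# Clusters of record IDs as listed on the slide.
clusters = [{1, 2, 4}, {3, 5, 9}, {6, 10}, {7}, {8}]
F, F_i, n_i = cluster_counts(clusters)
n = sum(n_i.values())
```

Since every record belongs to exactly one cluster, ni = i * Fi, and the ni values add up to n.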

Disclosure Risk Measures for Remove Identifiers Method
DRmin – percentage of unique records
DRmax – considers probabilistic linkage
DRW – weights defined by the data owner
w = (w1, w2, …, wn) is the disclosure risk weight vector.
Properties
a) wi ∈ R+ for all i = 1, .., n;
b) wi ≥ wj for all i ≤ j, i, j = 1, .., n;

Disclosure Risk Measures for Remove Identifiers Method
RecID  Age  State  Diagnosis     Income  Billing
1      44   MI     AIDS          45,500  1,200
2                  Asthma        37,900  2,500
3      55                        67,000  3,000
4                                21,000  1,000
5                                90,000  900
6      45          Diabetes      48,000  750
7      25   IN                   49,000
8      35                        66,000  2,200
9                                69,000  4,200
10                 Tuberculosis  34,000  3,100
n = 10
n1 = 2, n2 = 2, n3 = 6
F = 5
F1 = 2, F2 = 1, F3 = 2
w1 = (5, 5, 0, 0, ..., 0)
w2 = (4, 3, 3, 0, ..., 0)
DRmin = 0.2, DRw1 = 0.3, DRw2 = 0.425, DRmax = 0.5
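
The formulas for the three measures did not survive in this transcript. The sketch below uses a reconstruction that reproduces all four values on the slide: DRmin = n1 / n, DRmax = (1/n) * sum of ni / i, and DRW = (1 / (n * w1)) * sum of (wi / i) * ni. Treat these formulas, the function names, and the convention that weights past the end of the vector are zero as assumptions of this sketch.

```python
def dr_min(n_i, n):
    """Share of records that are unique on their key values."""
    return n_i.get(1, 0) / n


def dr_max(n_i, n):
    """Each record in a cluster of size i contributes a 1/i linkage chance."""
    return sum(count / i for i, count in n_i.items()) / n


def dr_w(n_i, n, w):
    """Weighted measure; w[0] is w1 and weights beyond len(w) count as 0."""
    total = sum(w[i - 1] / i * count
                for i, count in n_i.items() if i <= len(w))
    return total / (w[0] * n)


# Values from the slide: n = 10, n1 = 2, n2 = 2, n3 = 6.
n = 10
n_i = {1: 2, 2: 2, 3: 6}
w1 = [5, 5, 0]
w2 = [4, 3, 3]

risk_min = dr_min(n_i, n)
risk_w1 = dr_w(n_i, n, w1)
risk_w2 = dr_w(n_i, n, w2)
risk_max = dr_max(n_i, n)
```

Under these assumed formulas DRmax equals F / n, and a weight vector with a single non-zero entry w1 gives DRW = DRmin, which is consistent with the ordering DRmin ≤ DRW ≤ DRmax for non-increasing weights.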

Disclosure Risk Measures for RI Method with Continuous Attribute
What if the intruder has only approximations of income?
RecID  Income  State  Diagnosis     Billing
1      23,001  MI     AIDS          1,200
2      23,005         Asthma        2,500
3      67,000                       3,000
4      22,998                       1,000
5      66,975                       900
6      49,001         Diabetes      750
7      49,000  IN
8      67,010                       2,200
9      67,006                       4,200
10     23,003         Tuberculosis  3,100
n = 10
n1 = 10, n2 = 0, n3 = 0
F = 10
F1 = 10, F2 = 0, F3 = 0
w1 = (5, 5, 0, 0, ..., 0)
w2 = (4, 3, 3, 0, ..., 0)
DRmin = DRw1 = DRw2 = DRmax = 1

Disclosure Risk Measures for RI Method with Continuous Attribute
We consider vicinity sets!
RecID  Income  State  Diagnosis     Billing
1      23,001  MI     AIDS          1,200
2      23,005         Asthma        2,500
3      67,000                       3,000
4      22,998                       1,000
5      66,975                       900
6      49,001         Diabetes      750
7      49,000  IN
8      67,010                       2,200
9      67,006                       4,200
10     23,003         Tuberculosis  3,100
n = 10
n1 = 2, n2 = n3 = 0, n4 = 8
F = 4
F1 = 2, F2 = F3 = 0, F4 = 2
w1 = (5, 5, 0, 0, ..., 0)
w2 = (4, 3, 3, 0, ..., 0)
DRmin  DRw1  DRw2  DRmax
0.2                0.4
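
The transcript does not define how vicinity sets are built. One possible reading, sketched below, groups records that share the same State and whose sorted Income values differ from their neighbor by at most a tolerance. The function name `vicinity_clusters`, the tolerance of 50, the gap-based grouping rule, and the State value MI for every record other than records 1 and 7 are all assumptions made for this illustration.

```python
from collections import defaultdict


def vicinity_clusters(records, tol):
    """Group (rec_id, state, income) records into vicinity clusters.

    Records fall in the same cluster when they share the state and
    each sorted income is within tol of the previous one (assumed rule).
    """
    by_state = defaultdict(list)
    for rec_id, state, income in records:
        by_state[state].append((income, rec_id))
    clusters = []
    for rows in by_state.values():
        rows.sort()
        current = [rows[0][1]]
        for (prev_income, _), (income, rec_id) in zip(rows, rows[1:]):
            if income - prev_income <= tol:
                current.append(rec_id)
            else:
                clusters.append(current)
                current = [rec_id]
        clusters.append(current)
    return clusters


# Income values from the slide; State values other than those of
# records 1 and 7 are assumed to be MI for this sketch.
records = [
    (1, "MI", 23001), (2, "MI", 23005), (3, "MI", 67000),
    (4, "MI", 22998), (5, "MI", 66975), (6, "MI", 49001),
    (7, "IN", 49000), (8, "MI", 67010), (9, "MI", 67006),
    (10, "MI", 23003),
]
clusters = vicinity_clusters(records, tol=50)
sizes = sorted(len(c) for c in clusters)
```

Under these assumptions the sketch yields two clusters of size 4 and two unique records, which matches F = 4, F1 = 2 and F4 = 2 on the slide.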

Notations for Masked Microdata
f – the number of clusters with the same values for key attributes in M. We cluster all records from M based on their key values.
Bk – the set of elements from the k-th cluster, for all k, 1 ≤ k ≤ f.
fi = | {Bk | |Bk| = i, k = 1, .., f} |, for all i, 1 ≤ i ≤ n. fi represents the number of clusters of size i.
ti = | {x ∈ Bk | |Bk| = i, k = 1, .., f} |, for all i, 1 ≤ i ≤ n. ti represents the number of records in clusters of size i.
C – the classification matrix. For all i, j = 1, .., n: cij = | {x ∈ Bk and x ∈ Ap | |Bk| = i, k = 1, .., f, and |Ap| = j, p = 1, .., F} |.
Each element cij of C represents the number of records that appear in clusters of size i in the masked microdata and appeared in clusters of size j in the initial microdata.

Algorithm for Creating Classification Matrix
Initialize each element of C with 0.
For each element s from masked microdata MM do
    Count the number of occurrences of the key values of s in masked microdata MM. Let i be this number.
    Count the number of occurrences of the key values of s in initial microdata IM. Let j be this number.
    Increment cij by 1.
End for.
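
The algorithm above can be sketched as follows. Two choices here are assumptions of the sketch rather than part of the slide: the matrix is stored sparsely as a dictionary keyed by (i, j), and the key values are hypothetical labels standing in for combinations of key attributes.

```python
from collections import Counter, defaultdict


def classification_matrix(masked_keys, initial_keys):
    """Build C as a sparse dict: C[(i, j)] counts masked records whose key
    occurs i times in the masked microdata and j times in the initial one."""
    mm_counts = Counter(masked_keys)   # occurrences of each key in MM
    im_counts = Counter(initial_keys)  # occurrences of each key in IM
    C = defaultdict(int)
    for key in masked_keys:
        i = mm_counts[key]
        j = im_counts[key]
        C[(i, j)] += 1
    return dict(C)


# Hypothetical key values: the initial microdata has clusters of size
# 3, 2 and 1; the masked microdata is a sample of four of its records.
initial_keys = ["a", "a", "a", "b", "b", "c"]
masked_keys = ["a", "a", "b", "c"]
C = classification_matrix(masked_keys, initial_keys)
```

Counting the keys once with `Counter` avoids rescanning both microdata sets for every record, so the sketch runs in time linear in the number of records. A sparse dictionary is used because most of the n × n entries are zero.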

Disclosure Risk Measures for Microaggregation Method
What if the data is continuous?

Disclosure Risk Measures for Microaggregation Method
Initial Microdata
RecID  Name           SSN        Income  Sex     Diagnosis
1      John Wayne     123456789  23,104  Male    AIDS
2      Pete Gore      323232323  23,100          Asthma
3      John Banks     232345656  22,991
4      Jessica Casey  333333333  64,999  Female
5      Mary Stone     444444444  65,001
6      Patricia Kopi  666666666  65,005          Diabetes
7      Stan Simms     777777777  22,989
8      Kim Wood       888888888  65,007

Disclosure Risk Measures for Microaggregation Method
Univariate microaggregation for attribute Income and size = 2, 4, 8.
Masked Microdata 1
RecID  Income    Sex     Diagnosis
1      23,102    Male    AIDS
2                        Asthma
3      22,990
4      65,000    Female
5
6      65,006            Diabetes
7
8
Masked Microdata 2
RecID  Income    Sex     Diagnosis
1      22,996    Male    AIDS
2                        Asthma
3
4      65,003    Female
5
6                        Diabetes
7
8
Masked Microdata 3
RecID  Income    Sex     Diagnosis
1      43,999.5  Male    AIDS
2                        Asthma
3
4                Female
5
6                        Diabetes
7
8

Disclosure Risk Measures for Microaggregation Method
Example: disclosure risk values, no vicinity
[Table of disclosure risk values for MM0, MM1, MM2 and MM3 under the weight vectors W1, W2, W3 and W4. Surviving entries: MM0: 1; MM1: 0.50, 0.25.]

Disclosure Risk Measures for Microaggregation Method
Example: disclosure risk values, with vicinity
[Table of disclosure risk values for MM0, MM1, MM2 and MM3 under the weight vectors W1, W2, W3 and W4. Surviving entry: MM0: 0.25.]

General Disclosure Risk Measures
icfk – inversion-change factor for attribute k
p – number of key attributes
v – binary vector associated to key attribute

Experimental Data
Simulated medical record billing data with attributes Age, Sex, Zip and Amount_Billed.
Three initial microdata:
    n = 1,000 (called IM1000)
    n = 5,000 (IM5000)
    n = 25,000 (IM25000)
Key attributes:
    KA1 = {Age, Sex, Zip}
    KA2 = {Age, Sex}

Results for Sampling and Microaggregation
Sampling, followed by microaggregation for Age, when IM5000 and KA1 are used.

Results for Sampling and Microaggregation
Sampling and microaggregation for Age when IM5000 and KA1 are used.

Conclusions
The data owner may customize its disclosure risk measure to better reflect the characteristics of the microdata.
Privacy requirements may help the data owner define the disclosure risk weight matrix.
Masking key attributes with small vicinity sets is important.

Future Work
Our experiments focused on healthcare microdata; experiments with other types of data, such as financial data, are needed.
Study disclosure control for microdata under the assumption that the initial microdata is frequently updated (Dynamic Disclosure Control).

Some Papers
Details about DR Measures
    “Disclosure Risk Measures for Sampling Disclosure Control Method,” to appear in the Proceedings of the ACM Symposium on Applied Computing (SAC2004), special track on Computer Applications in Health Care (COMPAHEC2004), Nicosia, Cyprus
    “Disclosure Risk Measures for Microdata,” Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM2003), Cambridge, MA, pp. 15 – 22, 2003
Information Loss Measures
    “Privacy and Confidentiality Management for the Microaggregation Disclosure Control Method,” Proceedings of the Workshop on Privacy and Electronic Society (WPES2003), in conjunction with the 10th ACM CCS, Washington DC, pp. 21 – 30, 2003
Automatic Masked Microdata Generator
    “Automatic Generation of Masked Microdata,” to appear in Acta Universitatis Apulensis, Alba Iulia, Romania

Acknowledgements
Dr. Farshad Fotouhi
Dr. Daniel Barth-Jones

Questions?