Thesis
Sumathie Sundaresan
Advisor: Dr. Huiping Guo

My topic
How to share medical records with third parties without compromising data privacy

Reported by AMRA
"...medical information is routinely shared with and viewed by third parties who are not involved in patient care.... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."

Privacy-preserving data publishing
Microdata:
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
Purposes:
–Allow researchers to effectively study the correlation between various attributes
–Protect the privacy of every patient

A naïve solution
Remove the Name column and publish the remaining attributes. It does not work. See next.
Microdata:
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
Published table:
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

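Inference attack
An adversary knows that Bob
–has been hospitalized before
–is 23 years old
–lives in an area with zipcode 11000
Published table (Age, Sex, and Zipcode are the quasi-identifier (QI) attributes):
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis
Only one published tuple has Age 23 and Zipcode 11000, so the adversary learns that Bob contracted pneumonia.
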
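The attack above amounts to matching known QI values against the published table. A minimal Python sketch of that matching, using the slide's data; the function name `link` and the dictionary layout are illustrative assumptions, not part of the original slides:

```python
# Published table from the slide: names removed, QI attributes kept.
published = [
    {"age": 23, "sex": "M", "zipcode": 11000, "disease": "pneumonia"},
    {"age": 27, "sex": "M", "zipcode": 13000, "disease": "dyspepsia"},
    {"age": 35, "sex": "M", "zipcode": 59000, "disease": "dyspepsia"},
    {"age": 59, "sex": "M", "zipcode": 12000, "disease": "pneumonia"},
    {"age": 61, "sex": "F", "zipcode": 54000, "disease": "flu"},
    {"age": 65, "sex": "F", "zipcode": 25000, "disease": "gastritis"},
    {"age": 65, "sex": "F", "zipcode": 25000, "disease": "flu"},
    {"age": 70, "sex": "F", "zipcode": 30000, "disease": "bronchitis"},
]


def link(table, known):
    """Return the published tuples whose QI values match what the adversary knows.

    `link` is a hypothetical helper written for this illustration.
    """
    return [
        row for row in table
        if all(row[attr] == value for attr, value in known.items())
    ]


# What the adversary knows about Bob (he is assumed to be in the table
# because he has been hospitalized before).
bob = {"age": 23, "zipcode": 11000}
matches = link(published, bob)

# A single match means Bob's disease is revealed.
print(matches)
```

When exactly one tuple matches, removing the Name column has protected nothing.
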
Background
–Generalization
–Anatomy

Generalization
Transform each QI value into a less specific form.
A generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
What the adversary knows about Bob:
Name  Age  Sex  Zipcode
Bob   23   M    11000
How much generalization do we need?

l-diversity
A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group.
A table is l-diverse iff all of its QI-groups are l-diverse.
Age, Sex, and Zipcode are the quasi-identifier (QI) attributes; Disease is the sensitive attribute.
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
The above table has 2 QI-groups and is 2-diverse.

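The definition above can be checked mechanically. A minimal Python sketch, assuming each QI-group is given simply as the list of its sensitive values; the function name `is_l_diverse` is an illustrative choice:

```python
from collections import Counter


def is_l_diverse(groups, l):
    """Check the slide's definition: in a QI-group with m tuples, every
    sensitive value may appear at most m / l times.

    `groups` is a list of QI-groups, each a list of sensitive values.
    """
    for sensitive_values in groups:
        m = len(sensitive_values)
        counts = Counter(sensitive_values)
        # count > m / l is written as count * l > m to stay in integers.
        if any(count * l > m for count in counts.values()):
            return False
    return True


# The two QI-groups of the generalized table on the slide.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]

print(is_l_diverse(groups, 2))  # the table is 2-diverse
print(is_l_diverse(groups, 3))  # but not 3-diverse
```
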
What l-diversity guarantees
From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l.
What the adversary knows about Bob:
Name  Age  Sex  Zipcode
Bob   23   M    11000
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

Defect of generalization
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode.

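Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia
p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the (Age, Zipcode) rectangle covered by the generalized values of these tuples and Q is the rectangle covered by the query conditions. Each tuple is assumed to be uniformly distributed within R1.
Estimated answer for query A: 2 * p = 0.1
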
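The value p = 0.05 can be reproduced with a short calculation. The sketch below is one way to arrive at it, under two assumptions that the slide does not state explicitly: Age and Zipcode are treated as integer-valued, and each tuple is uniformly distributed over its generalized ranges. The helper name `overlap_fraction` is hypothetical.

```python
from fractions import Fraction


def overlap_fraction(group_range, query_range):
    """Fraction of the integer values in group_range that also fall in query_range."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    lo, hi = max(g_lo, q_lo), min(g_hi, q_hi)
    covered = max(0, hi - lo + 1)
    return Fraction(covered, g_hi - g_lo + 1)


# Generalized ranges of the male QI-group (R1) and the query ranges (Q).
p_age = overlap_fraction((21, 60), (0, 30))                # 10 of 40 ages
p_zip = overlap_fraction((10001, 60000), (10001, 20000))   # 10000 of 50000 zipcodes

p = p_age * p_zip   # Area(R1 ∩ Q) / Area(R1)
estimate = 2 * p    # two pneumonia tuples in the group

print(float(p), float(estimate))
```

Under these assumptions the age overlap is 1/4 and the zipcode overlap is 1/5, which gives p = 1/20 and an estimate of 1/10.
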
Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Estimated answer from the generalized table: 0.1
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
The exact answer should be: 1

Basic Idea of Anatomy
For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).
Microdata:
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis
Quasi-identifier Table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive Table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples
A 2-diverse partition:
QI group 1
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
QI group 2
Age  Sex  Zipcode  Disease
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
Quasi-identifier table (QIT):
group 1
Age  Sex  Zipcode
23   M    11000
27   M    13000
35   M    59000
59   M    12000
group 2
Age  Sex  Zipcode
61   F    54000
65   F    25000
65   F    25000
70   F    30000
Sensitive table (ST):
group 1
Disease
pneumonia
dyspepsia
dyspepsia
pneumonia
group 2
Disease
flu
gastritis
flu
bronchitis

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
Quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive table (ST), before counting:
Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
Quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

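Step 2 can be sketched in a few lines of Python. This is only an illustration of how the two tables follow from an already chosen partition; it does not show how Anatomy selects the partition. The function name `anatomize` and the tuple layout are assumptions made for the example.

```python
from collections import Counter

# Microdata from the slides: (age, sex, zipcode, disease).
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from step 1, given as row indexes per QI group.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]


def anatomize(rows, groups):
    """Build the QIT and ST for a given partition (hypothetical helper)."""
    qit, st = [], []
    for group_id, members in enumerate(groups, start=1):
        # QIT: exact QI values plus the group ID, no sensitive value.
        for i in members:
            age, sex, zipcode, _ = rows[i]
            qit.append((age, sex, zipcode, group_id))
        # ST: per group, each sensitive value with its count.
        counts = Counter(rows[i][3] for i in members)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st


qit, st = anatomize(microdata, partition)
for row in qit:
    print(row)
for row in st:
    print(row)
```
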
Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.
What the adversary knows about Bob:
Name  Age  Sex  Zipcode
Bob   23   M    11000
Quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Accuracy of Data Analysis
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Accuracy of Data Analysis (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
QIT, group 1:
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
–2 patients in group 1 have contracted pneumonia
–2 out of the 4 patients in group 1 satisfy the query condition on Age and Zipcode
Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata.

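The estimate above can be reproduced from the published QIT and ST alone. A minimal Python sketch, with the tables hard-coded from the slides; the function name `estimate_count` and its parameters are illustrative assumptions:

```python
from fractions import Fraction

# Published QIT: (age, sex, zipcode, group_id).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]

# Published ST: (group_id, disease, count).
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age, and Zipcode.

    For each group holding the disease, multiply the disease count by the
    fraction of the group's QIT tuples that satisfy the QI conditions.
    """
    total = Fraction(0)
    for group_id, st_disease, count in st:
        if st_disease != disease:
            continue
        group = [row for row in qit if row[3] == group_id]
        hits = sum(
            1 for age, _sex, zipcode, _gid in group
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        total += count * Fraction(hits, len(group))
    return total


answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```

Because the QIT keeps exact Age and Zipcode values, the fraction 2/4 is computed from real tuples rather than from an assumed uniform distribution over a generalized range.
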
Anatomy vs. Generalization Revisit
Sometimes the adversary is not sure whether an individual appears in the microdata or not.
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

Anatomy vs. Generalization Revisit
From the adversary's perspective:
–Bob has 4 / 6 probability to be in the microdata: 6 males in the voter registration list fall within the generalized Age and Zipcode ranges, but the QI-group contains only 4 tuples
–If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia
–So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

Anatomy vs. Generalization Revisit
The adversary knows that
–Bob must appear in the microdata, because the QIT publishes his exact QI values
–There is 1/2 probability that Bob has contracted pneumonia
2-diverse QIT:
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…    …    …        …
2-diverse ST:
Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

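The two cases differ only in the probability that Bob is in the microdata at all. A small Python sketch of the arithmetic, using the numbers from the slides; the function name `breach_probability` is an illustrative assumption:

```python
from fractions import Fraction


def breach_probability(membership, sensitive_count, group_size):
    """P(individual has the sensitive value) =
    P(individual is in the microdata) * P(value | in the QI-group)."""
    return membership * Fraction(sensitive_count, group_size)


# Generalization: 6 voters fit the generalized group, 4 are in the microdata,
# and 2 of the group's 4 tuples are pneumonia.
generalization = breach_probability(Fraction(4, 6), 2, 4)

# Anatomy: the QIT shows Bob's exact QI values, so membership is certain.
anatomy = breach_probability(Fraction(1), 2, 4)

print(generalization, anatomy)
```
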
Anatomy vs. Generalization Revisit
For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does. But this is not always the case, since:
–the external database may not contain any irrelevant individuals
–the adversary may know that some individuals indeed appear in the microdata
A Voter Registration List:
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000