Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.


My topic
How to share medical records with third parties without compromising data privacy

Reported by AMRA
"...medical information is routinely shared with and viewed by third parties who are not involved in patient care.... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."

Privacy preserving data publishing
Microdata:
  Name   Age  Sex  Zipcode  Disease
  Bob    23   M    11000    pneumonia
  Ken    27   M    13000    dyspepsia
  Peter  35   M    59000    dyspepsia
  Sam    59   M    12000    pneumonia
  Jane   61   F    54000    flu
  Linda  65   F    25000    gastritis
  Alice  65   F    25000    flu
  Mandy  70   F    30000    bronchitis
Purposes:
– Allow researchers to effectively study the correlation between various attributes
– Protect the privacy of every patient

A naïve solution
Publish the microdata with the Name column removed. It does not work. See next.
Microdata:
  Name   Age  Sex  Zipcode  Disease
  Bob    23   M    11000    pneumonia
  Ken    27   M    13000    dyspepsia
  Peter  35   M    59000    dyspepsia
  Sam    59   M    12000    pneumonia
  Jane   61   F    54000    flu
  Linda  65   F    25000    gastritis
  Alice  65   F    25000    flu
  Mandy  70   F    30000    bronchitis
Published table:
  Age  Sex  Zipcode  Disease
  23   M    11000    pneumonia
  27   M    13000    dyspepsia
  35   M    59000    dyspepsia
  59   M    12000    pneumonia
  61   F    54000    flu
  65   F    25000    gastritis
  65   F    25000    flu
  70   F    30000    bronchitis

Inference attack
An adversary knows that Bob
– has been hospitalized before
– is 23 years old
– lives in an area with zipcode 11000
Published table (Age, Sex and Zipcode are the quasi-identifier (QI) attributes):
  Age  Sex  Zipcode  Disease
  23   M    11000    pneumonia
  27   M    13000    dyspepsia
  35   M    59000    dyspepsia
  59   M    12000    pneumonia
  61   F    54000    flu
  65   F    25000    gastritis
  65   F    25000    flu
  70   F    30000    bronchitis
Only the first row matches what the adversary knows, so Bob's disease is revealed.
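The inference attack amounts to a lookup on the quasi-identifier columns. The following is a minimal sketch of that lookup, not code from the thesis: the function name `link` is hypothetical, and it assumes the zipcode the adversary knows is 11000, which is Bob's value in the microdata.

```python
# Published table from the naive solution: names removed, QI attributes intact.
# Each row is (age, sex, zipcode, disease).
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]


def link(age, zipcode, table):
    """Return the diseases of all published rows matching the known QI values."""
    return [disease for (a, _sex, z, disease) in table if a == age and z == zipcode]


# Assumption: the adversary knows Bob is 23 and lives in zipcode 11000.
bob_candidates = link(23, 11000, published)
print(bob_candidates)
```

A single candidate means the sensitive value is disclosed with certainty. Two rows share the QI values (65, 25000), so the same lookup for either of those patients returns two candidate diseases.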

Background
– Generalization
– Anatomy

Generalization
Transform each QI value into a less specific form.
A generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  gastritis
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  bronchitis
Bob's QI values:
  Name  Age  Sex  Zipcode
  Bob   23   M    11000
How much generalization do we need?

l-diversity
A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group.
A table is l-diverse iff all of its QI-groups are l-diverse.
Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  gastritis
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  bronchitis
The above table has 2 QI-groups and is 2-diverse.
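The definition above can be checked mechanically. This is a small sketch under stated assumptions: the function name `is_l_diverse` is made up for illustration, and each QI-group is represented simply as the list of its sensitive values.

```python
from collections import Counter


def is_l_diverse(groups, l):
    """Check the slide's definition: in every QI-group with m tuples,
    no sensitive value may appear more than m / l times."""
    for sensitive_values in groups:
        if not sensitive_values:
            continue  # an empty group cannot violate the condition
        m = len(sensitive_values)
        most_frequent = Counter(sensitive_values).most_common(1)[0][1]
        # most_frequent > m / l, written without floating-point division
        if most_frequent * l > m:
            return False
    return True


# Sensitive values of the two QI-groups in the generalized table.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]

print(is_l_diverse(groups, 2))
print(is_l_diverse(groups, 3))
```

In both groups the most frequent disease appears 2 times out of 4, so the table passes for l = 2 and fails for l = 3.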

What l-diversity guarantees
From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l.
A 2-diverse generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  gastritis
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  bronchitis
Bob's QI values:
  Name  Age  Sex  Zipcode
  Bob   23   M    11000
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

Defect of generalization
Query A:
  SELECT COUNT(*) from Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  gastritis
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  bronchitis
Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions

Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) from Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
The two pneumonia tuples in the generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  pneumonia
R1 is the (Age, Zipcode) rectangle covered by these tuples and Q is the rectangle of the query conditions.
p = Area(R1 ∩ Q) / Area(R1) = 0.05
Estimated answer for query A: 2 * p = 0.1
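The value p = 0.05 can be reproduced with a short calculation. This is a sketch, not the thesis's code: `overlap_fraction` is a hypothetical helper, and it assumes each interval is a closed range of integers with tuples spread uniformly over it, which is the reading that yields the slide's 0.05.

```python
from fractions import Fraction


def overlap_fraction(cell, query):
    """Fraction of a generalized cell covered by a query rectangle.

    cell and query are lists of closed integer intervals (lo, hi),
    one per QI attribute, in the same attribute order.
    """
    frac = Fraction(1)
    for (c_lo, c_hi), (q_lo, q_hi) in zip(cell, query):
        lo, hi = max(c_lo, q_lo), min(c_hi, q_hi)
        covered = max(0, hi - lo + 1)           # integer values in the overlap
        frac *= Fraction(covered, c_hi - c_lo + 1)
    return frac


r1 = [(21, 60), (10001, 60000)]   # Age and Zipcode ranges of the pneumonia tuples
q = [(0, 30), (10001, 20000)]     # Age and Zipcode conditions of query A

p = overlap_fraction(r1, q)
estimate = 2 * p                   # two pneumonia tuples, each counted with weight p
print(float(p), float(estimate))
```

The Age overlap covers 10 of 40 values and the Zipcode overlap 10000 of 50000, giving p = 1/4 * 1/5 = 1/20.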

Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) from Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Estimated answer from the generalized table: 0.1
Microdata:
  Name   Age  Sex  Zipcode  Disease
  Bob    23   M    11000    pneumonia
  Ken    27   M    13000    dyspepsia
  Peter  35   M    59000    dyspepsia
  Sam    59   M    12000    pneumonia
  Jane   61   F    54000    flu
  Linda  65   F    25000    gastritis
  Alice  65   F    25000    flu
  Mandy  70   F    30000    bronchitis
The exact answer should be: 1

Basic Idea of Anatomy
For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).
Microdata:
  Age  Sex  Zipcode  Disease
  23   M    11000    pneumonia
  27   M    13000    dyspepsia
  35   M    59000    dyspepsia
  59   M    12000    pneumonia
  61   F    54000    flu
  65   F    25000    gastritis
  65   F    25000    flu
  70   F    30000    bronchitis
Quasi-identifier Table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
Sensitive Table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1

Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples
A 2-diverse partition:
  Age  Sex  Zipcode  Disease
  QI group 1
  23   M    11000    pneumonia
  27   M    13000    dyspepsia
  35   M    59000    dyspepsia
  59   M    12000    pneumonia
  QI group 2
  61   F    54000    flu
  65   F    25000    gastritis
  65   F    25000    flu
  70   F    30000    bronchitis

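Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
First, split the QI attributes from the sensitive attribute, keeping the two groups:
  quasi-identifier table (QIT)   sensitive table (ST)
  Age  Sex  Zipcode              Disease
  group 1
  23   M    11000                pneumonia
  27   M    13000                dyspepsia
  35   M    59000                dyspepsia
  59   M    12000                pneumonia
  group 2
  61   F    54000                flu
  65   F    25000                gastritis
  65   F    25000                flu
  70   F    30000                bronchitis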

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
Next, add a Group-ID column to both tables:
  quasi-identifier table (QIT)     sensitive table (ST)
  Age  Sex  Zipcode  Group-ID      Group-ID  Disease
  23   M    11000    1             1         pneumonia
  27   M    13000    1             1         dyspepsia
  35   M    59000    1             1         dyspepsia
  59   M    12000    1             1         pneumonia
  61   F    54000    2             2         flu
  65   F    25000    2             2         gastritis
  65   F    25000    2             2         flu
  70   F    30000    2             2         bronchitis

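Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition
Finally, replace the per-tuple rows of the ST with one count per disease and group:
quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
sensitive table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1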
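Step 2 above can be sketched in a few lines. This is an illustrative sketch only: the function name `anatomize` is hypothetical, the partition is taken as given (choosing an l-diverse partition is a separate problem that this code does not address), and ST rows are sorted by disease name purely to make the output deterministic.

```python
from collections import Counter

# Microdata rows as (age, sex, zipcode, disease).
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from step 1, as row indexes per QI group.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]


def anatomize(rows, partition):
    """Build the QIT and ST for a given partition of the rows."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        for i in members:
            age, sex, zipcode, _disease = rows[i]
            qit.append((age, sex, zipcode, group_id))      # QI values + group
        counts = Counter(rows[i][3] for i in members)       # diseases per group
        for disease, count in sorted(counts.items()):
            st.append((group_id, disease, count))
    return qit, st


qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

The QIT keeps every QI value exact, while the ST only says how often each disease occurs inside a group, so the link between a person and a disease is hidden within the group.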

Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.
quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
sensitive table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1
Bob's QI values:
  Name  Age  Sex  Zipcode
  Bob   23   M    11000

Accuracy of Data Analysis
Query A:
  SELECT COUNT(*) from Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
sensitive table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1

Accuracy of Data Analysis (cont.)
Query A:
  SELECT COUNT(*) from Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
– 2 patients have contracted pneumonia (both in group 1 of the ST)
– 2 out of the 4 patients in that group satisfy the query condition on Age and Zipcode
Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata
Group 1 of the QIT (tuples t1 to t4):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
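The estimate 2 * 2 / 4 = 1 follows a simple rule: for each ST row with the queried disease, multiply its count by the share of that group's QIT tuples that satisfy the QI conditions. A minimal sketch of that rule follows; the function name `estimate_count` and the tuple layouts are assumptions made for illustration.

```python
from fractions import Fraction

# QIT rows as (age, sex, zipcode, group_id); ST rows as (group_id, disease, count).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus range conditions on Age and Zipcode."""
    total = Fraction(0)
    for group_id, st_disease, count in st:
        if st_disease != disease:
            continue
        members = [row for row in qit if row[3] == group_id]
        matching = [
            row for row in members
            if age_range[0] <= row[0] <= age_range[1]
            and zip_range[0] <= row[2] <= zip_range[1]
        ]
        # Each of the `count` patients matches with probability matching/members.
        total += count * Fraction(len(matching), len(members))
    return total


answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```

Because the QIT stores exact QI values, the share of matching tuples is computed from real data rather than from an assumed uniform distribution over a generalized range.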

Anatomy vs. Generalization Revisit
Sometimes the adversary is not sure whether an individual appears in the microdata or not.
A 2-diverse generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  gastritis
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  bronchitis
A Voter Registration List:
  Name   Age  Sex  Zipcode
  Bob    23   M    11000
  Ken    27   M    13000
  Peter  35   M    59000
  Sam    59   M    12000
  …      …    …    …
  Ric    50   M    40000
  Mark   40   M    30000

Anatomy vs. Generalization Revisit
From the adversary's perspective:
– Bob has 4 / 6 probability to be in the microdata
– If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia
– So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia
A 2-diverse generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  …         …    …               …
A Voter Registration List:
  Name   Age  Sex  Zipcode
  Bob    23   M    11000
  Ken    27   M    13000
  Peter  35   M    59000
  Sam    59   M    12000
  …      …    …    …
  Ric    50   M    40000
  Mark   40   M    30000

Anatomy vs. Generalization Revisit
The adversary knows that
– Bob must appear in the microdata
– There is 1/2 probability that Bob has contracted pneumonia
2-diverse QIT:
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  …    …    …        …
2-diverse ST:
  Group-ID  Disease    Count
  1         dyspepsia  2
  1         pneumonia  2
  …         …          …
Voter registration list:
  Name   Age  Sex  Zipcode
  Bob    23   M    11000
  Ken    27   M    13000
  Peter  35   M    59000
  Sam    59   M    12000
  …      …    …    …
  Ric    50   M    40000
  Mark   40   M    30000
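The two adversary calculations in this comparison are plain probability arithmetic, sketched here with exact fractions. The variable names are made up for illustration; the counts (6 candidates in the voter list, 4 tuples in the group, 2 of them pneumonia) are the ones stated on the slides.

```python
from fractions import Fraction

voter_candidates = 6   # voter-list individuals who could be in Bob's generalized group
group_size = 4         # tuples in that QI group of the published data
pneumonia_count = 2    # pneumonia tuples in the group

# Generalization: the adversary is unsure whether Bob is in the microdata at all.
p_present = Fraction(group_size, voter_candidates)
p_pneumonia_if_present = Fraction(pneumonia_count, group_size)
p_generalization = p_present * p_pneumonia_if_present

# Anatomy: the QIT lists Bob's exact QI values, so his presence is certain.
p_anatomy = Fraction(1) * Fraction(pneumonia_count, group_size)

print(p_generalization, p_anatomy)
```

Under these counts generalization leaves the adversary at 1/3 while anatomy leaves the adversary at 1/2, still within the 1/l bound for l = 2.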

Anatomy vs. Generalization Revisit
For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
But this is not always the case, since:
– the external database may not contain any irrelevant individuals
– the adversary may know that some individuals indeed appear in the microdata
External database:
  Name   Age  Sex  Zipcode
  Bob    23   M    11000
  Ken    27   M    13000
  Peter  35   M    59000
  Sam    59   M    12000
  …      …    …    …
  Ric    50   M    40000
  Mark   40   M    30000