Sumathie Sundaresan Advisor : Dr. Huiping Guo Survey of Privacy Protection for Medical Data.


Abstract: Expanded scientific knowledge, combined with the development of the Internet and the widespread use of computers, has increased the need for strong privacy protection for medical records. We have all heard stories of harassment that resulted from the lack of adequate privacy protection of medical records. "...medical information is routinely shared with and viewed by third parties who are not involved in patient care.... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."

Methods: generalization, k-anonymity, l-diversity, t-closeness, m-invariance, personalized privacy preservation, Anatomy.

Privacy preserving data publishing
Microdata (columns: Name, Age, Zipcode, Disease). Name and Disease per tuple: Bob, dyspepsia; Alice, bronchitis; Andy, flu; David, gastritis; Gary, flu; Helen, gastritis; Jane, dyspepsia; Ken, flu; Linda, gastritis; Paul, dyspepsia; Steve, gastritis.

Classification of Attributes
Key attribute: name, address, cell phone number. These can uniquely identify an individual directly and are always removed before release.
Quasi-identifier: 5-digit ZIP code, birth date, gender. A set of attributes that can potentially be linked with external information to re-identify entities. According to Census summary data, 87% of the U.S. population can be uniquely identified based on these attributes. Quasi-identifiers are suppressed or generalized.

Classification of Attributes (Cont’d)
Sensitive attribute: medical record, wage, etc. Always released directly. These attributes are what the researchers need; which attributes are treated as sensitive depends on the requirements.

Inference attack
Published table (columns: Age, Zipcode, Disease), Disease column: dyspepsia, bronchitis, flu, gastritis, flu, gastritis, dyspepsia, flu, gastritis, dyspepsia, gastritis.
An adversary knows the quasi-identifier (QI) attributes of an individual such as Bob (Name, Age, Zipcode).

Generalization
Transform the QI values into less specific forms. The published table (columns: Age, Zipcode, Disease) is generalized to:
Age | Zipcode | Disease
[21, 22] | [12k, 14k] | dyspepsia
[21, 22] | [12k, 14k] | bronchitis
[23, 24] | [18k, 25k] | flu
[23, 24] | [18k, 25k] | gastritis
[36, 41] | [20k, 27k] | flu
[36, 41] | [20k, 27k] | gastritis
[37, 43] | [26k, 35k] | dyspepsia
[37, 43] | [26k, 35k] | flu
[37, 43] | [26k, 35k] | gastritis
[52, 56] | [33k, 34k] | dyspepsia
[52, 56] | [33k, 34k] | gastritis

Generalization
Transform each QI value into a less specific form. An adversary who knows Bob's QI values (Name, Age, Zipcode) now sees only the generalized table.

K-Anonymity
Sweeney proposed a formal protection model named k-anonymity.
What is k-anonymity? A release provides k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
Example: suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release match that birth date and gender, the release is k-anonymous with respect to these attributes.
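The definition can be checked mechanically by counting how many rows share each combination of quasi-identifier values. The following is a minimal sketch, not code from the presentation; the function name `is_k_anonymous`, the column names, and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """True if every combination of QI values is shared by at least k rows."""
    group_sizes = Counter(tuple(row[col] for col in qi_columns) for row in rows)
    return all(size >= k for size in group_sizes.values())

# Hypothetical generalized release; zipcode and age are the quasi-identifiers.
release = (
    [{"zipcode": "476**", "age": "2*"}] * 3
    + [{"zipcode": "4790*", "age": ">=40"}] * 3
    + [{"zipcode": "476**", "age": "3*"}] * 3
)

print(is_k_anonymous(release, ["zipcode", "age"], 3))  # True
print(is_k_anonymous(release, ["zipcode", "age"], 4))  # False
```

Note that the check says nothing about the sensitive values inside each group, which is exactly the gap the attacks on k-anonymity exploit.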

Attacks Against K-Anonymity Unsorted Matching Attack This attack is based on the order in which tuples appear in the released table. Solution: Randomly sort the tuples before releasing.

Attacks Against K-Anonymity (Cont’d)
A 3-anonymous patient table:
Zipcode | Age | Disease
476** | 2* | Heart Disease
476** | 2* | Heart Disease
476** | 2* | Heart Disease
4790* | ≥40 | Flu
4790* | ≥40 | Heart Disease
4790* | ≥40 | Cancer
476** | 3* | Heart Disease
476** | 3* | Cancer
476** | 3* | Cancer
The adversary knows the Zipcode and Age of Bob and of Carl.
k-Anonymity does not provide privacy if:
Sensitive values in an equivalence class lack diversity (homogeneity attack).
The attacker has background knowledge (background knowledge attack).
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

l-Diversity
Principle: each equivalence class has at least l well-represented sensitive values.
Distinct l-diversity: each equivalence class has at least l distinct sensitive values.
Limitation (example): one equivalence class contains ten tuples. In the “Disease” column, one of them is “Cancer”, one is “Heart Disease” and the remaining eight are “Flu”. This satisfies distinct 3-diversity, but the attacker can still infer that the target person’s disease is “Flu” with 80% confidence (8 of 10 tuples).
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
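The limitation can be reproduced in a few lines. This is an illustrative sketch only; the helper names `distinct_values` and `max_confidence` are assumptions, and the equivalence class is the ten-tuple example from the slide.

```python
from collections import Counter

def distinct_values(sensitive_values):
    """Number of distinct sensitive values in one equivalence class."""
    return len(set(sensitive_values))

def max_confidence(sensitive_values):
    """Best guess probability: share of the most frequent sensitive value."""
    counts = Counter(sensitive_values)
    return max(counts.values()) / len(sensitive_values)

# Ten tuples: one Cancer, one Heart Disease, eight Flu.
eq_class = ["Cancer", "Heart Disease"] + ["Flu"] * 8

print(distinct_values(eq_class))   # 3, so the class is distinct 3-diverse
print(max_confidence(eq_class))    # 0.8, the attacker's confidence in "Flu"
```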

l-Diversity (Cont’d)
Entropy l-diversity: each equivalence class must not only have enough different sensitive values, but the different sensitive values must also be distributed evenly enough. Formally, the entropy of the sensitive values in each equivalence class must be at least log(l). Sometimes this may be too restrictive: when some values are very common, the entropy of the entire table may be very low. This leads to a less conservative notion of l-diversity.
Recursive (c,l)-diversity: the most frequent value does not appear too frequently. With r_1 ≥ r_2 ≥ ... ≥ r_m the counts of the sensitive values in an equivalence class, the class must satisfy r_1 < c(r_l + r_{l+1} + ... + r_m).
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006
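Both variants reduce to simple tests on the value counts of one equivalence class. The sketch below assumes the standard formulations from the cited paper (entropy at least log(l), and r_1 < c(r_l + ... + r_m) on counts sorted in descending order); the function names and the small floating-point tolerance are choices made here for illustration.

```python
import math
from collections import Counter

def entropy(sensitive_values):
    """Shannon entropy (natural log) of the sensitive values in one class."""
    n = len(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in Counter(sensitive_values).values())

def is_entropy_l_diverse(sensitive_values, l):
    # Small tolerance guards against floating-point rounding at the boundary.
    return entropy(sensitive_values) >= math.log(l) - 1e-12

def is_recursive_cl_diverse(sensitive_values, c, l):
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(counts) < l:
        return False
    # r_1 < c * (r_l + r_{l+1} + ... + r_m), with 1-based indices.
    return counts[0] < c * sum(counts[l - 1:])

skewed = ["Cancer", "Heart Disease"] + ["Flu"] * 8
balanced = ["Flu", "Cancer"] * 5

print(is_entropy_l_diverse(skewed, 2))        # False
print(is_entropy_l_diverse(balanced, 2))      # True
print(is_recursive_cl_diverse(skewed, 3, 2))  # False: 8 is not below 3 * (1 + 1)
print(is_recursive_cl_diverse(skewed, 5, 2))  # True: 8 < 5 * (1 + 1)
```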

Limitations of l-Diversity
l-diversity may be difficult and unnecessary to achieve.
Consider a single sensitive attribute with two values, HIV positive (1%) and HIV negative (99%), which have very different degrees of sensitivity.
l-diversity is unnecessary to achieve: 2-diversity is unnecessary for an equivalence class that contains only negative records.
l-diversity is difficult to achieve: suppose there are 10000 records in total. To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes.

Limitations of l-Diversity (Cont’d)
l-diversity is insufficient to prevent attribute disclosure.
Skewness attack: l-diversity does not consider the overall distribution of sensitive values.
Two sensitive values: HIV positive (1%) and HIV negative (99%).
Serious privacy risk: consider an equivalence class that contains an equal number of positive and negative records. Anyone in that class would be considered 50% likely to be positive, compared with 1% in the overall population.
l-diversity does not differentiate between equivalence class 1 (49 positive + 1 negative) and equivalence class 2 (1 positive + 49 negative).

Limitations of l-Diversity (Cont’d)
l-diversity is insufficient to prevent attribute disclosure.
Similarity attack: the adversary knows Bob's Zip and Age. A 3-diverse patient table:
Zipcode | Age | Salary | Disease
476** | 2* | 3K | Gastric Ulcer
476** | 2* | 4K | Gastritis
476** | 2* | 5K | Stomach Cancer
4790* | ≥40 | 6K | Gastritis
4790* | ≥40 | 11K | Flu
4790* | ≥40 | 8K | Bronchitis
476** | 3* | 7K | Bronchitis
476** | 3* | 9K | Pneumonia
476** | 3* | 10K | Stomach Cancer
Conclusion:
1. Bob’s salary is in [3k,5k], which is relatively low.
2. Bob has some stomach-related disease.
l-diversity does not consider the semantic meanings of sensitive values.

t-Closeness: A New Privacy Measure
Rationale: with only external knowledge, the adversary holds belief B0. A completely generalized table (every QI value such as Age, Zipcode and Gender replaced by *; Disease values Flu, Heart Disease, Cancer, ..., Gastritis) reveals the overall distribution Q of sensitive values, which changes the belief to B1.

t-Closeness: A New Privacy Measure
Rationale (cont.): a released table (for example rows with Age 2*, Zipcode 479**, Gender Male and diseases Flu, Heart Disease, Cancer; ...; a row with Age ≥50, Zipcode 4766*, Gender *, disease Gastritis) additionally reveals the distribution P_i of sensitive values in each equivalence class, which changes the belief from B1 to B2.

t-Closeness: A New Privacy Measure
Belief and knowledge: B0 (external knowledge), B1 (overall distribution Q of sensitive values), B2 (distribution P_i of sensitive values in each equivalence class).
Observations: Q should be public. The knowledge gain comes in two parts: about the whole population (from B0 to B1) and about specific individuals (from B1 to B2). We bound the knowledge gain between B1 and B2 instead.
Principle: the distance between Q and P_i should be bounded by a threshold t.

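How to calculate EMD
EMD for numerical attributes: the ordered distance is a metric (non-negative, symmetric, and satisfying the triangle inequality). Let r_i = p_i - q_i for i = 1, ..., m; then D[P,Q] is calculated as:
$$D[P,Q] = \frac{1}{m-1} \sum_{i=1}^{m-1} \left| \sum_{j=1}^{i} r_j \right|$$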

Earth Mover’s Distance: Example
P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k}
Move 1/9 probability for each of the following pairs:
3k->5k, 3k->4k: cost 1/9*(2+1)/8
4k->8k, 4k->7k, 4k->6k: cost 1/9*(4+3+2)/8
5k->11k, 5k->10k, 5k->9k: cost 1/9*(6+5+4)/8
Total cost: 1/9*27/8 = 0.375
With P2 = {6k,8k,11k}, the total cost is lower than 0.375. This makes more sense than the other two distance calculation methods.
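The worked example can be verified with the ordered-distance form of EMD, which sums the absolute prefix sums of r_i = p_i - q_i and divides by m - 1. The following is a sketch under that assumption; the helper names `distribution` and `ordered_emd` are illustrative, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

def distribution(values, domain):
    """Probability of each domain value within the multiset `values`."""
    return [Fraction(values.count(v), len(values)) for v in domain]

def ordered_emd(p, q):
    """EMD for an ordered numerical domain with m equally spaced values."""
    m = len(p)
    running = Fraction(0)
    total = Fraction(0)
    for p_i, q_i in zip(p, q):
        running += p_i - q_i      # prefix sum r_1 + ... + r_i
        total += abs(running)     # the final prefix sum is always zero
    return total / (m - 1)

domain = [3, 4, 5, 6, 7, 8, 9, 10, 11]   # salaries in thousands
q = distribution(domain, domain)          # overall distribution Q
p1 = distribution([3, 4, 5], domain)
p2 = distribution([6, 8, 11], domain)

print(float(ordered_emd(p1, q)))  # 0.375
print(ordered_emd(p2, q))         # 1/6, roughly 0.167
```

The class {6k,8k,11k} is spread across the salary range, so its distance to Q is smaller than that of the clustered class {3k,4k,5k}.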

Motivating Example
A hospital keeps track of the medical records collected in the last three months. The microdata table T(1) and its generalization T*(1) are published in Apr.
Microdata T(1) (columns: Name, Age, Zipcode, Disease). Name and Disease per tuple: Bob, dyspepsia; Alice, bronchitis; Andy, flu; David, gastritis; Gary, flu; Helen, gastritis; Jane, dyspepsia; Ken, flu; Linda, gastritis; Paul, dyspepsia; Steve, gastritis.
2-diverse generalization T*(1):
G. ID | Age | Zipcode | Disease
1 | [21, 22] | [12k, 14k] | dyspepsia
1 | [21, 22] | [12k, 14k] | bronchitis
2 | [23, 24] | [18k, 25k] | flu
2 | [23, 24] | [18k, 25k] | gastritis
3 | [36, 41] | [20k, 27k] | flu
3 | [36, 41] | [20k, 27k] | gastritis
4 | [37, 43] | [26k, 35k] | dyspepsia
4 | [37, 43] | [26k, 35k] | flu
4 | [37, 43] | [26k, 35k] | gastritis
5 | [52, 56] | [33k, 34k] | dyspepsia
5 | [52, 56] | [33k, 34k] | gastritis

Motivating Example
Bob was hospitalized in Mar. An adversary who knows Bob's Name, Age and Zipcode examines the 2-diverse generalization T*(1).

Motivating Example
One month later, in May 2007, some obsolete tuples are deleted from the microdata T(1).

Motivating Example
Bob’s tuple stays. Remaining tuples of microdata T(1) (Name, Disease): Bob, dyspepsia; David, gastritis; Gary, flu; Jane, dyspepsia; Linda, gastritis; Steve, gastritis.

Motivating Example
Some new records are inserted. Microdata T(2) (Name, Disease): Bob, dyspepsia; David, gastritis; Emily, flu; Jane, dyspepsia; Linda, gastritis; Gary, flu; Mary, gastritis; Ray, dyspepsia; Steve, gastritis; Tom, gastritis; Vince, flu.

Motivating Example
The hospital published T*(2), a 2-diverse generalization of microdata T(2):
G. ID | Age | Zipcode | Disease
1 | [21, 23] | [12k, 25k] | dyspepsia
1 | [21, 23] | [12k, 25k] | gastritis
2 | [25, 43] | [21k, 33k] | flu
2 | [25, 43] | [21k, 33k] | dyspepsia
2 | [25, 43] | [21k, 33k] | gastritis
3 | [41, 46] | [20k, 30k] | flu
3 | [41, 46] | [20k, 30k] | gastritis
4 | [54, 56] | [31k, 34k] | dyspepsia
4 | [54, 56] | [31k, 34k] | gastritis
5 | [60, 65] | [36k, 44k] | gastritis
5 | [60, 65] | [36k, 44k] | flu

Motivating Example
Consider the previous adversary, who knows Bob's Name, Age and Zipcode and now also examines the 2-diverse generalization T*(2).

Motivating Example
What the adversary learns from T*(1): Bob is in QI-group 1 ([21, 22], [12k, 14k]), whose diseases are dyspepsia and bronchitis.
What the adversary learns from T*(2): Bob is in QI-group 1 ([21, 23], [12k, 25k]), whose diseases are dyspepsia and gastritis.
Dyspepsia is the only disease common to both groups, so Bob must have contracted dyspepsia! A new generalization principle is needed.

The critical absence phenomenon
From T*(1) the adversary learns that Bob's disease is dyspepsia or bronchitis (QI-group 1: [21, 22], [12k, 14k]). No tuple in microdata T(2) has bronchitis, so no generalization of T(2) can place bronchitis in Bob's group. We refer to this as the critical absence phenomenon. A new generalization method is needed.
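The adversary's reasoning across releases amounts to intersecting the candidate diseases of Bob's QI-group in each published table. A minimal sketch, with the hypothetical helper name `candidates_after_releases` and the two groups taken from the example:

```python
def candidates_after_releases(groups_per_release):
    """Intersect the sensitive-value sets of the target's group in each release."""
    remaining = set(groups_per_release[0])
    for group in groups_per_release[1:]:
        remaining &= set(group)
    return remaining

bob_group_t1 = {"dyspepsia", "bronchitis"}   # Bob's QI-group in T*(1)
bob_group_t2 = {"dyspepsia", "gastritis"}    # Bob's QI-group in T*(2)

print(candidates_after_releases([bob_group_t1, bob_group_t2]))  # {'dyspepsia'}
```

Each release is 2-diverse on its own, yet the intersection leaves a single candidate.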

Counterfeited generalization T*(2) of microdata T(2) (c1 and c2 are counterfeit tuples):
Name | Group-ID | Age | Zipcode | Disease
Bob | 1 | [21, 22] | [12k, 14k] | dyspepsia
c1 | 1 | [21, 22] | [12k, 14k] | bronchitis
David | 2 | [23, 25] | [21k, 25k] | gastritis
Emily | 2 | [23, 25] | [21k, 25k] | flu
Jane | 3 | [37, 43] | [26k, 33k] | dyspepsia
c2 | 3 | [37, 43] | [26k, 33k] | flu
Linda | 3 | [37, 43] | [26k, 33k] | gastritis
Gary | 4 | [41, 46] | [20k, 30k] | flu
Mary | 4 | [41, 46] | [20k, 30k] | gastritis
Ray | 5 | [54, 56] | [31k, 34k] | dyspepsia
Steve | 5 | [54, 56] | [31k, 34k] | gastritis
Tom | 6 | [60, 65] | [36k, 44k] | gastritis
Vince | 6 | [60, 65] | [36k, 44k] | flu
The auxiliary relation R(2) for T*(2) has columns Group-ID and Count.

The adversary, who knows Bob's Name, Age and Zipcode, compares the counterfeited generalization T*(2) and its auxiliary relation R(2) with generalization T*(1):
Name | G.ID | Age | Zipcode | Disease
Bob | 1 | [21, 22] | [12k, 14k] | dyspepsia
Alice | 1 | [21, 22] | [12k, 14k] | bronchitis
Andy | 2 | [23, 24] | [18k, 25k] | flu
David | 2 | [23, 24] | [18k, 25k] | gastritis
Gary | 3 | [36, 41] | [20k, 27k] | flu
Helen | 3 | [36, 41] | [20k, 27k] | gastritis
Jane | 4 | [37, 43] | [26k, 35k] | dyspepsia
Ken | 4 | [37, 43] | [26k, 35k] | flu
Linda | 4 | [37, 43] | [26k, 35k] | gastritis
Paul | 5 | [52, 56] | [33k, 34k] | dyspepsia
Steve | 5 | [52, 56] | [33k, 34k] | gastritis

m-uniqueness
A generalized table T*(j) is m-unique if and only if each QI-group in T*(j) contains at least m tuples, and all tuples in the same QI-group have different sensitive values.
Example: the generalization T*(1) is a 2-unique generalized table. Its five QI-groups hold {dyspepsia, bronchitis}, {flu, gastritis}, {flu, gastritis}, {dyspepsia, flu, gastritis} and {dyspepsia, gastritis}.

Signature
The signature of an individual in a generalized table is the set of sensitive values in that individual's QI-group.
In T*(1), QI-group 1 ([21, 22], [12k, 14k]) contains dyspepsia (Bob) and bronchitis (Alice), so the signature of Bob in T*(1) is {dyspepsia, bronchitis}.
QI-group 4 ([37, 43], [26k, 35k]) contains dyspepsia (Jane), flu (Ken) and gastritis (Linda), so the signature of Jane in T*(1) is {dyspepsia, flu, gastritis}.

The m-invariance principle
A sequence of generalized tables T*(1), …, T*(n) is m-invariant if and only if T*(1), …, T*(n) are all m-unique, and each individual has the same signature in every generalized table in which he or she appears.
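The three definitions (m-uniqueness, signature, m-invariance) translate directly into set operations. The sketch below is illustrative rather than the authors' algorithm: each table is represented as a hypothetical dict from group ID to (name, disease) pairs, with counterfeits c1 and c2 included as ordinary members, and the data is copied from T*(1), the plain T*(2) and the counterfeited T*(2) of the example.

```python
def is_m_unique(table, m):
    """Every QI-group has at least m tuples, all with different sensitive values."""
    for members in table.values():
        diseases = [disease for _, disease in members]
        if len(members) < m or len(set(diseases)) != len(diseases):
            return False
    return True

def signatures(table):
    """Map each individual to the set of sensitive values in their QI-group."""
    result = {}
    for members in table.values():
        signature = frozenset(disease for _, disease in members)
        for name, _ in members:
            result[name] = signature
    return result

def is_m_invariant(tables, m):
    if not all(is_m_unique(table, m) for table in tables):
        return False
    seen = {}
    for table in tables:
        for name, signature in signatures(table).items():
            if seen.setdefault(name, signature) != signature:
                return False
    return True

t1 = {1: [("Bob", "dyspepsia"), ("Alice", "bronchitis")],
      2: [("Andy", "flu"), ("David", "gastritis")],
      3: [("Gary", "flu"), ("Helen", "gastritis")],
      4: [("Jane", "dyspepsia"), ("Ken", "flu"), ("Linda", "gastritis")],
      5: [("Paul", "dyspepsia"), ("Steve", "gastritis")]}
t2_plain = {1: [("Bob", "dyspepsia"), ("David", "gastritis")],
            2: [("Emily", "flu"), ("Jane", "dyspepsia"), ("Linda", "gastritis")],
            3: [("Gary", "flu"), ("Mary", "gastritis")],
            4: [("Ray", "dyspepsia"), ("Steve", "gastritis")],
            5: [("Tom", "gastritis"), ("Vince", "flu")]}
t2_counterfeited = {1: [("Bob", "dyspepsia"), ("c1", "bronchitis")],
                    2: [("David", "gastritis"), ("Emily", "flu")],
                    3: [("Jane", "dyspepsia"), ("c2", "flu"), ("Linda", "gastritis")],
                    4: [("Gary", "flu"), ("Mary", "gastritis")],
                    5: [("Ray", "dyspepsia"), ("Steve", "gastritis")],
                    6: [("Tom", "gastritis"), ("Vince", "flu")]}

print(is_m_invariant([t1, t2_plain], 2))          # False: Bob's signature changes
print(is_m_invariant([t1, t2_counterfeited], 2))  # True
```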

Example: generalization T*(1) and the counterfeited generalization T*(2) are both 2-unique, and every individual who appears in both keeps the same signature. Bob's signature is {dyspepsia, bronchitis} in T*(1) and in T*(2), where counterfeit c1 supplies bronchitis. Jane's and Linda's signature is {dyspepsia, flu, gastritis} in both, where counterfeit c2 supplies flu. David's and Gary's signature is {flu, gastritis} in both, and Steve's is {dyspepsia, gastritis} in both. The sequence T*(1), T*(2) is therefore 2-invariant.


Motivation 1: Personalization
Andy does not want anyone to know that he had a stomach problem. Sarah does not mind at all if others find out that she had flu.
An external database:
Name | Age | Sex | Zipcode
Andy | 4 | M | 12000
Bill | 5 | M | 14000
Ken | 6 | M | 18000
Nash | 9 | M | 19000
Mike | 7 | M | 17000
Alice | 12 | F | 22000
Betty | 19 | F | 24000
Linda | 21 | F | 33000
Jane | 25 | F | 34000
Sarah | 28 | F | 37000
Mary | 56 | F | 58000
A 2-diverse table:
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | flu
[21, 60] | F | [30001, 60000] | flu

Motivation 2: SA generalization
How many female patients are there with age above 30? In the generalized table, the QI-group [21, 60], F, [30001, 60000] holds 4 tuples, so assuming ages are spread uniformly over the interval, the estimate is 4 ∙ (60 - 30) / (60 - 20) = 3. Real answer: 1 (in the external database, Mary, aged 56, is the only woman above 30).

Motivation 2: SA generalization (cont.)
Generalization of the sensitive attribute is beneficial in this case. A better generalized table:
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 30] | F | [30001, 40000] | gastritis
[21, 30] | F | [30001, 40000] | gastritis
[21, 30] | F | [30001, 40000] | flu
56 | F | 58000 | respiratory infection

Personalized anonymity
We propose (1) a mechanism to capture personalized privacy requirements, and (2) criteria for measuring the degree of security provided by a generalized table.

Guarding node
Andy does not want anyone to know that he had a stomach problem. He can specify “stomach disease” as the guarding node for his tuple. The data publisher should prevent an adversary from associating Andy with “stomach disease”.
Name | Age | Sex | Zipcode | Disease | guarding node
Andy | 4 | M | 12000 | gastric ulcer | stomach disease

Guarding node
Sarah is willing to disclose her exact symptom. She can specify Ø as the guarding node for her tuple.
Name | Age | Sex | Zipcode | Disease | guarding node
Sarah | 28 | F | 37000 | flu | Ø

Guarding node
Bill does not have any special preference. He can specify a guarding node that is the same as his sensitive value.
Name | Age | Sex | Zipcode | Disease | guarding node
Bill | 5 | M | 14000 | dyspepsia | dyspepsia

A personalized approach
Name | Age | Sex | Zipcode | Disease | guarding node
Andy | 4 | M | 12000 | gastric ulcer | stomach disease
Bill | 5 | M | 14000 | dyspepsia | dyspepsia
Ken | 6 | M | 18000 | pneumonia | respiratory infection
Nash | 9 | M | 19000 | bronchitis |
Alice | 12 | F | 22000 | flu |
Betty | 19 | F | 24000 | pneumonia |
Linda | 21 | F | 33000 | gastritis |
Jane | 25 | F | 34000 | gastritis | Ø
Sarah | 28 | F | 37000 | flu | Ø
Mary | 56 | F | 58000 | flu |

Personalized anonymity
A table satisfies personalized anonymity with a parameter p_breach if and only if no adversary can breach the privacy requirement of any tuple with a probability above p_breach.
If p_breach = 0.3, then any adversary should have no more than a 30% probability of finding out that Andy had a stomach disease, that Bill had dyspepsia, and so on for the guarding node of every tuple.

Personalized anonymity
Personalized anonymity with respect to a predefined parameter p_breach: an adversary can breach the privacy requirement of any tuple with a probability of at most p_breach.
Age | Sex | Zipcode | Disease
[1, 10] | M | [10001, 20000] | gastric ulcer
[1, 10] | M | [10001, 20000] | dyspepsia
[1, 10] | M | [10001, 20000] | pneumonia
[1, 10] | M | [10001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
21 | F | 33000 | stomach disease
25 | F | 34000 | gastritis
28 | F | 37000 | flu
56 | F | 58000 | respiratory infection
We need a method for calculating the breach probabilities. What is the probability that Andy had some stomach problem?

Combinatorial reconstruction
Assumptions: (1) the adversary has no prior knowledge about each individual; (2) every individual involved in the microdata also appears in the external database.

Combinatorial reconstruction
Andy does not want anyone to know that he had some stomach problem. What is the probability that the adversary can find out that “Andy had a stomach disease”, given the external database (Andy, Bill, Ken, Nash, Mike, Alice, Betty, Linda, Jane, Sarah, Mary) and the personalized generalized table?

Combinatorial reconstruction (cont.)
Can each individual appear more than once? No = the primary case; yes = the non-primary case.
Some possible reconstructions assign the four diseases of the first QI-group (gastric ulcer, dyspepsia, pneumonia, bronchitis) to the five candidate individuals (Andy, Bill, Ken, Nash, Mike). In the primary case each individual receives at most one disease; in the non-primary case one individual may receive several.

Breach probability (primary)
There are 120 possible reconstructions in total. If Andy is associated with a stomach disease in n_b reconstructions, the probability that the adversary associates Andy with some stomach problem is n_b / 120.
Andy is associated with gastric ulcer in 24 reconstructions, dyspepsia in 24 reconstructions, and gastritis in 0 reconstructions, so n_b = 48.
The breach probability for Andy’s tuple is 48 / 120 = 2 / 5.

Breach probability (non-primary)
There are 625 possible reconstructions in total. Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions, so n_b = 225.
The breach probability for Andy’s tuple is 225 / 625 = 9 / 25.
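Both counts can be confirmed by brute-force enumeration. In this sketch a reconstruction is modelled as a choice of owner for each of the four tuples in the first QI-group: without repetition in the primary case, with repetition in the non-primary case. The function name `count_breaches` and this modelling are assumptions made for illustration.

```python
from itertools import permutations, product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia"}   # covered by guarding node "stomach disease"

def count_breaches(reconstructions):
    """Return (n_b, total): reconstructions linking Andy to a stomach disease."""
    total = 0
    breaches = 0
    for owners in reconstructions:
        total += 1
        if any(owner == "Andy" and disease in stomach
               for owner, disease in zip(owners, diseases)):
            breaches += 1
    return breaches, total

primary = count_breaches(permutations(people, len(diseases)))
non_primary = count_breaches(product(people, repeat=len(diseases)))

print(primary)      # (48, 120)
print(non_primary)  # (225, 625)
```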

Defect of generalization
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Generalized table:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  dyspepsia
  [21, 60]  M    [10001, 60000]  pneumonia
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  gastritis
  [61, 70]  F    [10001, 60000]  flu
  [61, 70]  F    [10001, 60000]  bronchitis
Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions.

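Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
The two candidate tuples:
  Age       Sex  Zipcode         Disease
  [21, 60]  M    [10001, 60000]  pneumonia
  [21, 60]  M    [10001, 60000]  pneumonia
Let R_1 be the rectangle covered by the generalized Age and Zipcode ranges of these tuples, and Q the rectangle of the query conditions.
p = Area(R_1 ∩ Q) / Area(R_1) = 0.05
Estimated answer for query A: 2 * p = 0.1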

Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Estimated answer from the generalized table: 0.1
Original microdata:
  Name   Age  Sex  Zipcode  Disease
  Bob    23   M    11000    pneumonia
  Ken    27   M    13000    dyspepsia
  Peter  35   M    59000    dyspepsia
  Sam    59   M    12000    pneumonia
  Jane   61   F    54000    flu
  Linda  65   F    25000    gastritis
  Alice  65   F    25000    flu
  Mandy  70   F    30000    bronchitis
The exact answer should be: 1
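The gap between 0.1 and 1 can be reproduced with a few lines of Python. This is a rough sketch under two assumptions that the slides do not spell out: values are spread uniformly inside each generalized range, and ranges are measured by counting integer values (which is what yields exactly 0.05). The function name `covered_fraction` is invented for this example.

```python
# Generalized QI group holding the two pneumonia tuples.
GROUP_AGE = (21, 60)
GROUP_ZIP = (10001, 60000)
PNEUMONIA_TUPLES = 2

# Ranges requested by query A.
QUERY_AGE = (0, 30)
QUERY_ZIP = (10001, 20000)


def covered_fraction(group, query):
    """Fraction of the integer values in `group` that also lie in `query`."""
    g_lo, g_hi = group
    q_lo, q_hi = query
    overlap = max(0, min(g_hi, q_hi) - max(g_lo, q_lo) + 1)
    return overlap / (g_hi - g_lo + 1)


# p = Area(R_1 intersect Q) / Area(R_1), assuming a uniform distribution.
p = covered_fraction(GROUP_AGE, QUERY_AGE) * covered_fraction(GROUP_ZIP, QUERY_ZIP)
estimate = PNEUMONIA_TUPLES * p

# Original microdata (age, sex, zipcode, disease) for the exact answer.
microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
exact = sum(
    1
    for age, _sex, zipcode, disease in microdata
    if disease == "pneumonia" and 0 <= age <= 30 and 10001 <= zipcode <= 20000
)

print(round(p, 4), round(estimate, 4), exact)  # 0.05 0.1 1
```

The estimate is off by a factor of ten because generalization hides where the two pneumonia patients actually sit inside the wide Age and Zipcode ranges.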

Basic Idea of Anatomy
For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).
Microdata:
  Age  Sex  Zipcode  Disease
  23   M    11000    pneumonia
  27   M    13000    dyspepsia
  35   M    59000    dyspepsia
  59   M    12000    pneumonia
  61   F    54000    flu
  65   F    25000    gastritis
  65   F    25000    flu
  70   F    30000    bronchitis
Quasi-identifier Table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
Sensitive Table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1

Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples (here, a 2-diverse partition)
QI group 1:
  Age  Sex  Zipcode  Disease
  23   M    11000    pneumonia
  27   M    13000    dyspepsia
  35   M    59000    dyspepsia
  59   M    12000    pneumonia
QI group 2:
  Age  Sex  Zipcode  Disease
  61   F    54000    flu
  65   F    25000    gastritis
  65   F    25000    flu
  70   F    30000    bronchitis

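Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition. First, split each tuple into its quasi-identifier part and its sensitive part:
  Group    Age  Sex  Zipcode  |  Disease
  group 1  23   M    11000    |  pneumonia
  group 1  27   M    13000    |  dyspepsia
  group 1  35   M    59000    |  dyspepsia
  group 1  59   M    12000    |  pneumonia
  group 2  61   F    54000    |  flu
  group 2  65   F    25000    |  gastritis
  group 2  65   F    25000    |  flu
  group 2  70   F    30000    |  bronchitis
The left part becomes the quasi-identifier table (QIT); the right part becomes the sensitive table (ST).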

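Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition. Next, attach a Group-ID to both parts:
Quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
Sensitive table (ST), one row per tuple:
  Group-ID  Disease
  1         pneumonia
  1         dyspepsia
  1         dyspepsia
  1         pneumonia
  2         flu
  2         gastritis
  2         flu
  2         bronchitis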

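Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition. Finally, collapse the ST into per-group counts:
Quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
Sensitive table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1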
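Step 2 is mechanical once the partition is fixed. The sketch below is one possible way to derive both tables from the example microdata; the function name `anatomize` and the index-based representation of the partition are assumptions of this illustration, and choosing a good l-diverse partition (step 1) is not covered.

```python
from collections import Counter

# Microdata rows from the slides: (age, sex, zipcode, disease).
microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
# The 2-diverse partition from step 1, given as lists of row indices.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]


def anatomize(rows, groups):
    """Split rows into a QIT and an ST that share only a Group-ID."""
    qit, st = [], []
    for group_id, members in enumerate(groups, start=1):
        counts = Counter()
        for i in members:
            age, sex, zipcode, disease = rows[i]
            qit.append((age, sex, zipcode, group_id))  # exact QI values are kept
            counts[disease] += 1
        # One ST row per distinct sensitive value in the group.
        st.extend((group_id, d, c) for d, c in sorted(counts.items()))
    return qit, st


qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

Unlike generalization, the QIT keeps the exact Age and Zipcode values; only the link between a person and a disease is blurred, through the shared Group-ID.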

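Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.
Example: the adversary knows Bob's record.
  Name  Age  Sex  Zipcode
  Bob   23   M    11000
Quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
Sensitive table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1
Bob falls in group 1, where dyspepsia and pneumonia each account for 2 of the 4 tuples, so the adversary's confidence in either disease is 2/4 = 1/2.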
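The 1/l bound can be read directly off the ST. As a small illustrative check (the helper `max_confidence` is a made-up name, and the adversary is assumed to know only the QIT, the ST, and the victim's QI values):

```python
from fractions import Fraction

# Sensitive table from the slides: (group_id, disease, count).
ST = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def max_confidence(st, group_id):
    """Highest probability the adversary can assign to any one disease in a group."""
    counts = [c for g, _disease, c in st if g == group_id]
    return max(Fraction(c, sum(counts)) for c in counts)


l = 2  # the partition on the slides is 2-diverse
for group_id in (1, 2):
    confidence = max_confidence(ST, group_id)
    print(group_id, confidence, confidence <= Fraction(1, l))
```

Both groups give a maximum confidence of 1/2, which matches the bound for l = 2.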

Accuracy of Data Analysis
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Quasi-identifier table (QIT):
  Age  Sex  Zipcode  Group-ID
  23   M    11000    1
  27   M    13000    1
  35   M    59000    1
  59   M    12000    1
  61   F    54000    2
  65   F    25000    2
  65   F    25000    2
  70   F    30000    2
Sensitive table (ST):
  Group-ID  Disease     Count
  1         dyspepsia   2
  1         pneumonia   2
  2         bronchitis  1
  2         flu         2
  2         gastritis   1

Accuracy of Data Analysis (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Group 1 of the QIT:
  Tuple  Age  Sex  Zipcode  Group-ID
  t1     23   M    11000    1
  t2     27   M    13000    1
  t3     35   M    59000    1
  t4     59   M    12000    1
2 patients in group 1 have contracted pneumonia.
2 out of 4 patients (t1 and t2) satisfy the query conditions on Age and Zipcode.
Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata.
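The same estimate can be computed for every group and summed. The sketch below assumes, as the slide's arithmetic suggests, that within a group each sensitive value is equally likely to belong to any tuple; `estimate_count` is a hypothetical helper name, not part of any published implementation.

```python
# Published tables from the slides.
QIT = [  # (age, sex, zipcode, group_id)
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
ST = [  # (group_id, disease, count)
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus range conditions on Age and Zipcode."""
    total = 0.0
    for group_id in sorted({g for _a, _s, _z, g in qit}):
        members = [(a, z) for a, _s, z, g in qit if g == group_id]
        # Tuples in the group whose exact QI values satisfy the query.
        matching = sum(
            1
            for a, z in members
            if age_range[0] <= a <= age_range[1] and zip_range[0] <= z <= zip_range[1]
        )
        # Number of tuples in the group carrying the queried disease.
        with_disease = sum(c for g, d, c in st if g == group_id and d == disease)
        total += with_disease * matching / len(members)
    return total


answer = estimate_count(QIT, ST, "pneumonia", (0, 30), (10001, 20000))
print(answer)  # 1.0
```

Group 1 contributes 2 * 2 / 4 = 1 and group 2 contributes 0, so the estimate equals the exact answer for this query.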

Conclusion
Limitations of l-diversity
  l-diversity is difficult and unnecessary to achieve
  l-diversity is insufficient in preventing attribute disclosure
t-Closeness as a new privacy measure
  The overall distribution of sensitive values should be public information
  The separation of the knowledge gain
EMD to measure distance
  EMD captures semantic distance well
  Simple formulas for three ground distances

Conclusions
m-invariant tables support republication of dynamic datasets.
Guarding nodes allow individuals to describe their privacy requirements better.
Anatomy outperforms generalization by allowing much more accurate data analysis on the published data.

Thank you! Questions?