Anatomy: Simple and Effective Privacy Preservation Xiaokui Xiao, Yufei Tao Chinese University of Hong Kong.

Privacy preserving data publishing
Microdata:

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

Purposes:
– Allow researchers to effectively study the correlation between various attributes
– Protect the privacy of every patient

A naïve solution
Publish the microdata with the Name column removed:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

It does not work. See the next slide.

Inference attack
An adversary knows that Bob
– has been hospitalized before
– is 23 years old
– lives in an area with zipcode 11000

Published table (Age, Sex and Zipcode are the quasi-identifier (QI) attributes):

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Only the first tuple matches what the adversary knows, so Bob's disease (pneumonia) is revealed.
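The linking step behind this attack can be sketched in a few lines of Python. This is only an illustration on the eight-row example; the function name `link` and the tuple layout are assumptions made here, not part of the slides.

```python
# Published table from the slide: (Age, Sex, Zipcode, Disease), names removed.
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

def link(rows, age, zipcode):
    """Return the diseases of all rows whose QI values match the adversary's knowledge."""
    return [disease for (a, _sex, z, disease) in rows if a == age and z == zipcode]

# The adversary knows Bob is 23 and lives in zipcode 11000.
bob_matches = link(published, age=23, zipcode=11000)
print(bob_matches)
```

A single match means the adversary learns the sensitive value with certainty, even though no name was published.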

Generalization
Transform each QI value into a less specific form.

A generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

The adversary's knowledge:

Name  Age  Sex  Zipcode
Bob   23   M    11000

How much generalization do we need?

l-diversity
A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group.
A table is l-diverse iff all of its QI-groups are l-diverse.

Quasi-identifier (QI) attributes: Age, Sex, Zipcode. Sensitive attribute: Disease.

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

The above table has 2 QI-groups and is 2-diverse.
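The definition translates directly into a check. The sketch below assumes each QI-group is given simply as the list of its sensitive values; the function name `is_l_diverse` is chosen here for illustration.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """True iff, in every group of m tuples, no sensitive value occurs more than m / l times."""
    for sensitive_values in groups:
        m = len(sensitive_values)
        counts = Counter(sensitive_values)
        # count > m / l is tested as count * l > m to stay in integer arithmetic.
        if any(count * l > m for count in counts.values()):
            return False
    return True

# The two QI-groups of the running example.
qi_groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]

two_diverse = is_l_diverse(qi_groups, 2)
three_diverse = is_l_diverse(qi_groups, 3)
print(two_diverse, three_diverse)
```

On the running example the table passes for l = 2 but not for l = 3, because pneumonia, dyspepsia and flu each occur twice in a group of four.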

What l-diversity guarantees
From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l.

A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

The adversary's knowledge:

Name  Age  Sex  Zipcode
Bob   23   M    11000

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

Defect of generalization
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions.

Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]

The two pneumonia tuples in the generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the (Age, Zipcode) rectangle of the first QI-group and Q is the rectangle of the query.
Estimated answer for query A: 2 * p = 0.1
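The value p = 0.05 can be checked with a short calculation. The sketch below assumes that Age and Zipcode are integer-valued and uniformly distributed inside the generalized intervals, so the area ratio becomes a product of per-attribute overlap fractions; the helper name `overlap_fraction` is an assumption made here.

```python
from fractions import Fraction

def overlap_fraction(lo, hi, q_lo, q_hi):
    """Fraction of the integers in [lo, hi] that also fall in the query range [q_lo, q_hi]."""
    covered = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
    return Fraction(covered, hi - lo + 1)

# QI-group 1 is generalized to Age [21, 60] and Zipcode [10001, 60000].
# Query A asks for Age in [0, 30] and Zipcode in [10001, 20000].
p_age = overlap_fraction(21, 60, 0, 30)              # 10 of 40 ages
p_zip = overlap_fraction(10001, 60000, 10001, 20000)  # 10000 of 50000 zipcodes
p = p_age * p_zip

pneumonia_tuples = 2
estimate = pneumonia_tuples * p
print(p, estimate)
```

Under these assumptions p is 1/20 and the estimate is 1/10, matching the 0.05 and 0.1 on the slide.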

Defect of generalization (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]

Estimated answer from the generalized table: 0.1

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

The exact answer should be: 1

Research Works on Generalization
1. V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002.
2. K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004.
3. R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005.
4. B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005.
5. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005.
6. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006.
7. D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006.
8. X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006.
9. K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006.
10. K. LeFevre, D. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006.
11. J. Xu, W. Wang, J. Pei, et al. Utility-Based Anonymization Using Local Recodings. KDD 2006.
12. …

Contributions
1. We propose an alternative technique to generalization, called Anatomy, which allows much more accurate data analysis while still preserving privacy.
2. We develop an algorithm for computing anatomized tables that
– runs in linear I/Os
– (nearly) minimizes information loss

Outline
– Basic Idea of Anatomy
– Preserving Correlation
– Algorithm for Anatomy
– Experimental Results

Basic Idea of Anatomy
For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

Microdata:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier Table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples

A 2-diverse partition:

QI group  Age  Sex  Zipcode  Disease
1         23   M    11000    pneumonia
1         27   M    13000    dyspepsia
1         35   M    59000    dyspepsia
1         59   M    12000    pneumonia
2         61   F    54000    flu
2         65   F    25000    gastritis
2         65   F    25000    flu
2         70   F    30000    bronchitis

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

First, split the columns:

         quasi-identifier table (QIT)   sensitive table (ST)
         Age  Sex  Zipcode              Disease
group 1  23   M    11000                pneumonia
group 1  27   M    13000                dyspepsia
group 1  35   M    59000                dyspepsia
group 1  59   M    12000                pneumonia
group 2  61   F    54000                flu
group 2  65   F    25000                gastritis
group 2  65   F    25000                flu
group 2  70   F    30000                bronchitis

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

Next, add a Group-ID column to both tables:

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis

Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

Finally, collapse the ST into one row per (Group-ID, Disease) pair with a count:

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
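The construction of the two tables from a chosen partition can be sketched as follows. This is a minimal in-memory illustration; the function name `build_qit_st` and the representation of a partition as lists of row indices are assumptions made here.

```python
from collections import Counter

# Microdata of the running example: (Age, Sex, Zipcode, Disease).
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from the slides, as lists of row indices.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]

def build_qit_st(rows, groups):
    """Split the rows into a QIT (QI values + Group-ID) and an ST (Group-ID, Disease, Count)."""
    qit, st = [], []
    for group_id, members in enumerate(groups, start=1):
        for i in members:
            age, sex, zipcode, _disease = rows[i]
            qit.append((age, sex, zipcode, group_id))
        # One ST row per distinct sensitive value in the group.
        counts = Counter(rows[i][3] for i in members)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st

qit, st = build_qit_st(microdata, partition)
for row in st:
    print(row)
```

Note that the QI values are published exactly as they are; only the link between a row and its sensitive value is blurred, and only within its group.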

Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

The adversary's knowledge:

Name  Age  Sex  Zipcode
Bob   23   M    11000

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
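The 1/l bound can be read off the ST directly: within a group, the adversary's best guess is the most frequent sensitive value. The sketch below assumes an adversary with no prior knowledge who has located the individual's group through the QIT; the function name `max_confidence` is chosen here for illustration.

```python
from fractions import Fraction

# Sensitive table of the running example: (Group-ID, Disease, Count).
st = [
    (1, "dyspepsia", 2),
    (1, "pneumonia", 2),
    (2, "bronchitis", 1),
    (2, "flu", 2),
    (2, "gastritis", 1),
]

def max_confidence(st_rows):
    """Largest Count / group-size ratio over all ST rows."""
    group_sizes = {}
    for group_id, _disease, count in st_rows:
        group_sizes[group_id] = group_sizes.get(group_id, 0) + count
    return max(Fraction(count, group_sizes[group_id]) for group_id, _disease, count in st_rows)

worst_case = max_confidence(st)
print(worst_case)
```

For the 2-diverse example the worst case is 1/2, for instance the chance that Bob (in group 1) has pneumonia.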

Accuracy of Data Analysis
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Accuracy of Data Analysis (cont.)
Query A:
  SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]

The tuples of group 1 in the QIT:

Tuple  Age  Sex  Zipcode  Group-ID
t1     23   M    11000    1
t2     27   M    13000    1
t3     35   M    59000    1
t4     59   M    12000    1

– 2 patients in group 1 have contracted pneumonia
– 2 out of the 4 patients in group 1 (t1 and t2) satisfy the query condition on Age and Zipcode
– Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata
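The estimate 2 * 2 / 4 generalizes to any count query of this shape: for each group holding the queried disease, multiply its Count by the fraction of that group's QIT rows that satisfy the QI predicates. The following is a sketch under that reading; the function name `estimate_count` and the range-tuple parameters are assumptions made here.

```python
from fractions import Fraction

# QIT rows: (Age, Sex, Zipcode, Group-ID); ST rows: (Group-ID, Disease, Count).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit_rows, st_rows, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus inclusive range conditions on Age and Zipcode."""
    total = Fraction(0)
    for group_id, d, count in st_rows:
        if d != disease:
            continue
        members = [r for r in qit_rows if r[3] == group_id]
        hits = sum(
            1 for (age, _sex, zipcode, _gid) in members
            if age_range[0] <= age <= age_range[1] and zip_range[0] <= zipcode <= zip_range[1]
        )
        # Each of the `count` matching patients is equally likely to be any group member.
        total += Fraction(count * hits, len(members))
    return total

estimate_a = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(estimate_a)
```

Because the QIT keeps exact QI values, the fraction of qualifying rows is computed from real data rather than from a uniformity assumption over a generalized interval.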

Outline
– Rationale of Anatomy
– Preserving Correlation
– Algorithm for Anatomy
– Experimental Results

Preserving Correlation
Let us first examine the correlation between Age and Disease in our running example.
Each tuple in the microdata can be mapped to a point in the (Age, Disease) domain.

Tuple  Age  Sex  Zipcode  Disease
t1     23   M    11000    pneumonia
…

The above tuple t1 can be mapped to (23, pneumonia).

Preserving Correlation (cont.) We model this tuple using a probability density function (pdf):

Preserving Correlation (cont.)
In the generalized table, the tuple becomes:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
…

Its corresponding pdf becomes:

Preserving Correlation (cont.)
In the anatomized tables, the tuple becomes:

Age  Sex  Zipcode  Group-ID
23   M    11000    1
…

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…

Its corresponding pdf becomes:
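The three pdfs of tuple t1 can be written out explicitly. What follows is a reconstruction from the tables, not a copy of the slide formulas: it assumes Age takes integer values, so that [21, 60] covers 40 ages over which the generalized tuple spreads its mass uniformly, and the symbols G, G^gen and G^ana are notation chosen here.

```latex
% Original microdata: all mass on the true point.
G_{t_1}(A, D) =
\begin{cases}
  1 & \text{if } (A, D) = (23, \text{pneumonia}) \\
  0 & \text{otherwise}
\end{cases}

% Generalized table: Disease is exact, Age is only known to lie in [21, 60].
G^{gen}_{t_1}(A, D) =
\begin{cases}
  1/40 & \text{if } A \in [21, 60] \text{ and } D = \text{pneumonia} \\
  0    & \text{otherwise}
\end{cases}

% Anatomized tables: Age is exact, Disease follows the counts of group 1 in the ST.
G^{ana}_{t_1}(A, D) =
\begin{cases}
  1/2 & \text{if } (A, D) = (23, \text{dyspepsia}) \\
  1/2 & \text{if } (A, D) = (23, \text{pneumonia}) \\
  0   & \text{otherwise}
\end{cases}
```

Under these assumptions the anatomized pdf keeps half of its mass on the true point, while the generalized pdf keeps only 1/40 of it there.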

Preserving Correlation (cont.)

Quality Metric
For each approximated pdf, we measure its error from the original pdf by their "L2 distance".
We aim at obtaining anatomized tables that minimize the reconstruction error (RCE).
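To make the metric concrete, the sketch below computes a per-tuple error for tuple t1 = (23, pneumonia) of the running example. It rests on several assumptions that are not confirmed by this transcript: the error is taken as the sum of squared differences over a discrete (Age, Disease) domain with no square root, Age is integer-valued, and the generalized tuple spreads its mass uniformly over the 40 ages in [21, 60]. The name `l2_error` is chosen here.

```python
from fractions import Fraction

def l2_error(original, approx):
    """Sum of squared differences between two discrete pdfs given as {point: mass} dicts."""
    points = set(original) | set(approx)
    zero = Fraction(0)
    return sum((approx.get(p, zero) - original.get(p, zero)) ** 2 for p in points)

# Tuple t1 in the microdata: all mass on (23, pneumonia).
original = {(23, "pneumonia"): Fraction(1)}

# Generalized table: Age in [21, 60], Disease = pneumonia (assumed uniform over 40 ages).
generalized = {(age, "pneumonia"): Fraction(1, 40) for age in range(21, 61)}

# Anatomized tables: Age = 23 exactly; group 1 holds 2 dyspepsia and 2 pneumonia tuples.
anatomized = {(23, "dyspepsia"): Fraction(1, 2), (23, "pneumonia"): Fraction(1, 2)}

err_generalized = l2_error(original, generalized)
err_anatomized = l2_error(original, anatomized)
print(err_generalized, err_anatomized)
```

Under these assumptions the generalized pdf is off by 39/40 and the anatomized pdf by 1/2; summing such per-tuple errors over the whole table is one natural reading of the RCE.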

Anatomize
An algorithm for computing anatomized tables that
– runs in I/O cost linear in the cardinality n of the microdata table
– minimizes the RCE when n is a multiple of l; otherwise it achieves an RCE that is higher than the lower bound by a factor of at most 1 + 1/n
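The slide states the properties of Anatomize but not its steps. As an illustration only, here is one bucket-based strategy for producing an l-diverse partition: hash tuples into buckets by sensitive value, repeatedly form a group by taking one tuple from each of the l currently largest buckets, then place each leftover tuple into a group that does not yet contain its value. This is an in-memory sketch that does not model I/O cost, it makes no claim about RCE optimality, and the function name `anatomize_partition` is an assumption made here.

```python
from collections import defaultdict

def anatomize_partition(sensitive_values, l):
    """Partition tuple indices into groups whose sensitive values are pairwise distinct."""
    buckets = defaultdict(list)
    for idx, value in enumerate(sensitive_values):
        buckets[value].append(idx)

    groups = []
    # Group creation: while at least l buckets are non-empty, draw one tuple
    # from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [v for v in buckets if buckets[v]]
        largest = sorted(non_empty, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue assignment: put each leftover tuple into a group lacking its value.
    for value, bucket in buckets.items():
        for idx in bucket:
            for group in groups:
                if all(sensitive_values[j] != value for j in group):
                    group.append(idx)
                    break
            else:
                raise ValueError("this sketch found no l-diverse partition for the input")
    return groups

diseases = ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia",
            "flu", "gastritis", "flu", "bronchitis"]
groups = anatomize_partition(diseases, 2)
print(groups)
```

Every returned group has at least l tuples with pairwise distinct sensitive values, so it is l-diverse. The partition it finds for the running example need not coincide with the two groups of four used on the earlier slides.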

Experimental Settings
Goal: to compare the accuracy of data analysis on the generalized / anatomized tables.
Real dataset with 9 attributes:
– Age, Gender, Education, Marital-status, Race, Work-class, Country,
– Occupation, Salary-class
Tables: OCC-d, SAL-d (d = 3, 4, 5, 6, 7)
– OCC-3: Age, Gender, Education, Occupation
– SAL-4: Age, Gender, Education, Marital-status, Salary-class
Cardinality: 100k, 200k, 300k, 400k, 500k

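Experimental Settings (cont.)
– Competitor: multi-dimensional generalization
– l = 10
– Average relative error over 10000 aggregate queries: |act – est| / act, where act is the actual answer and est is the estimated answer
– Query dimensionality qd = 1, 2, …, d
– Expected selectivity s = 1%, …, 5%, …, 10%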
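The error measure is simple to state in code. The sketch below applies it to query A from the earlier slides; skipping queries whose actual answer is 0 is an assumption made here to avoid division by zero, and the function names are illustrative.

```python
def relative_error(actual, estimate):
    """|act - est| / act for a single query with a non-zero actual answer."""
    return abs(actual - estimate) / actual

def average_relative_error(pairs):
    """Mean relative error over (actual, estimate) pairs, ignoring actual == 0 (assumption)."""
    errors = [relative_error(act, est) for act, est in pairs if act != 0]
    if not errors:
        raise ValueError("no query with a non-zero actual answer")
    return sum(errors) / len(errors)

# Query A: actual answer 1, estimate 0.1 from generalization, 1 from Anatomy.
err_generalized = relative_error(1, 0.1)
err_anatomized = relative_error(1, 1)
print(err_generalized, err_anatomized)
```

On query A alone, the generalized table is off by 90% while the anatomized tables are exact.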

Accuracy of Data Analysis (cont.) C.C. Aggarwal. On k-anonymity and the curse of dimensionality. VLDB 2005

Accuracy of Data Analysis (cont.)

Computation Overhead

Summary
– Anatomy outperforms generalization by allowing much more accurate data analysis on the published data.
– Anatomized tables (with a nearly optimal quality guarantee) can be computed in I/O cost linear in the database cardinality.

Thank you! Datasets and implementation are available for download at http://www.cse.cuhk.edu.hk/~taoyf

Anatomy vs. Generalization Revisit
Sometimes the adversary is not sure whether an individual appears in the microdata or not.

A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…

Anatomy vs. Generalization Revisit
From the adversary's perspective:
– Bob has 4 / 6 probability to be in the microdata (6 people on the voter list fall in the first QI-group's ranges, but the group has only 4 tuples)
– If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia
– So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia

A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…
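The 1/3 figure is a product of two conditional steps, which the sketch below spells out with exact fractions. It assumes the adversary treats each of the six matching voters as equally likely to be one of the four tuples in the group; the variable names are illustrative.

```python
from fractions import Fraction

# Voters whose Age, Sex and Zipcode fall inside the first QI-group's generalized ranges.
matching_voters = ["Bob", "Ken", "Peter", "Mark", "Ric", "Sam"]
group_size = 4         # tuples in the first QI-group of the generalized table
pneumonia_tuples = 2   # of which two carry pneumonia

p_in_microdata = Fraction(group_size, len(matching_voters))
p_pneumonia_given_in = Fraction(pneumonia_tuples, group_size)
p_bob_pneumonia = p_in_microdata * p_pneumonia_given_in
print(p_bob_pneumonia)
```

With generalization the membership uncertainty lowers the adversary's confidence from 1/2 to 1/3.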

Anatomy vs. Generalization Revisit
With the anatomized tables, the adversary knows that
– Bob must appear in the microdata (his exact QI values are in the QIT)
– There is 1/2 probability that Bob has contracted pneumonia

2-diverse QIT:

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…

2-diverse ST:

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…

Anatomy vs. Generalization Revisit
For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
But this is not always the case, since:
– the external database may not contain any irrelevant individuals
– the adversary may know that some individuals indeed appear in the microdata

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…