Xiaokui Xiao and Yufei Tao, Chinese University of Hong Kong


m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets
Xiaokui Xiao and Yufei Tao, Chinese University of Hong Kong

Privacy preserving data publishing

Microdata:
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000
Helen  36   27000
Jane   37   33000
Ken    40   35000
Linda  43   26000
Paul   52
Steve  56   34000

Inference attack

Published table:
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000
36   27000
37   33000
40   35000
43   26000
52
56   34000

An adversary knows Bob's quasi-identifier (QI) attributes:
Name  Age  Zipcode
Bob   21   12000

Generalization

Transform the QI values into less specific forms.

Original table:
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000
36   27000
37   33000
40   35000
43   26000
52
56   34000

Generalized table:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia, bronchitis
[23, 24]  [18k, 25k]  flu, gastritis
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]
[52, 56]  [33k, 34k]

Generalization

Transform each QI value into a less specific form.

A generalized table:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia, bronchitis
[23, 24]  [18k, 25k]  flu, gastritis
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]
[52, 56]  [33k, 34k]

An adversary knows:
Name  Age  Zipcode
Bob   21   12000

l-diversity

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.

A generalized table (5 QI groups):
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia, bronchitis
[23, 24]  [18k, 25k]  flu, gastritis
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]
[52, 56]  [33k, 34k]

l-diversity

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.

Each QI group contains at least l "well-represented" sensitive values; e.g., in each QI group, at most 1/l of the tuples have the same sensitive value.

A 2-diverse table (5 QI groups):
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia, bronchitis
[23, 24]  [18k, 25k]  flu, gastritis
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]
[52, 56]  [33k, 34k]
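As a concrete illustration (not from the slides), the frequency-based instantiation of l-diversity above can be checked with a short Python sketch; the helper name and the sample groups are hypothetical:

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Frequency-based l-diversity: in each QI group, at most a 1/l
    fraction of tuples may share the same sensitive value."""
    for sensitive_values in groups:
        counts = Counter(sensitive_values)
        if max(counts.values()) > len(sensitive_values) / l:
            return False
    return True

# The first two QI groups of the running example (diseases only).
groups = [["dyspepsia", "bronchitis"], ["flu", "gastritis"]]
print(is_l_diverse(groups, 2))                    # True
print(is_l_diverse([["flu", "flu", "gastritis"]], 2))  # False: 2/3 share flu
```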

Motivating Example

A hospital keeps track of the medical records collected in the last three months. The microdata table T(1) and its generalization T*(1) were published in Apr. 2007.

Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000
Helen  36   27000
Jane   37   33000
Ken    40   35000
Linda  43   26000
Paul   52
Steve  56   34000

2-diverse generalization T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia, bronchitis
2     [23, 24]  [18k, 25k]  flu, gastritis
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]
5     [52, 56]  [33k, 34k]

Motivating Example

Bob was hospitalized in Mar. 2007. The adversary knows:
Name  Age  Zipcode
Bob   21   12000

2-diverse generalization T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia, bronchitis
2     [23, 24]  [18k, 25k]  flu, gastritis
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]
5     [52, 56]  [33k, 34k]

Motivating Example

One month later, in May 2007.

Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000
Helen  36   27000
Jane   37   33000
Ken    40   35000
Linda  43   26000
Paul   52
Steve  56   34000

Motivating Example

One month later, in May 2007, some obsolete tuples are deleted from the microdata.

Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000
Helen  36   27000
Jane   37   33000
Ken    40   35000
Linda  43   26000
Paul   52
Steve  56   34000

Motivating Example

Bob's tuple stays.

Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Gary   41   20000    flu
Jane   37   33000
Linda  43   26000
Steve  56   34000

Motivating Example

Some new records are inserted.

Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

Motivating Example

The hospital published T*(2).

Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

2-diverse generalization T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia, gastritis
2     [25, 43]  [21k, 33k]  flu
3     [41, 46]  [20k, 30k]
4     [54, 56]  [31k, 34k]
5     [60, 65]  [36k, 44k]

Motivating Example

Consider the previous adversary:
Name  Age  Zipcode
Bob   21   12000

2-diverse generalization T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia, gastritis
2     [25, 43]  [21k, 33k]  flu
3     [41, 46]  [20k, 30k]
4     [54, 56]  [31k, 34k]
5     [60, 65]  [36k, 44k]

Motivating Example

What the adversary learns from T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia, bronchitis
……

What the adversary learns from T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia, gastritis
……

Name  Age  Zipcode
Bob   21   12000

So Bob must have contracted dyspepsia! A new generalization principle is needed.

The critical absence phenomenon

Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

What the adversary learns from T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia, bronchitis
……

Name  Age  Zipcode
Bob   21   12000

We refer to this phenomenon as the critical absence phenomenon. A new generalization method is needed.
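The attack sketched above amounts to intersecting Bob's candidate sensitive values across the two releases; a minimal sketch (variable names are hypothetical, the candidate sets come from the running example):

```python
# Bob's possible diseases, read off his QI group in each release.
candidates_t1 = {"dyspepsia", "bronchitis"}  # from T*(1)
candidates_t2 = {"dyspepsia", "gastritis"}   # from T*(2)

# Both releases are 2-diverse in isolation, but their intersection
# pins down Bob's disease exactly.
inferred = candidates_t1 & candidates_t2
print(inferred)  # {'dyspepsia'}
```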

Contributions

We propose a solution for privacy preserving publication of dynamic datasets. Our solution includes:
a framework for analyzing disclosure risk;
the m-invariance principle and the counterfeited generalization technique;
an efficient algorithm for computing m-invariant tables.

Related Work J. Byun, et al. Secure Anonymization for Incremental Datasets. The Third VLDB Workshop on Secure Data Management (SDM'06)

Outline Counterfeited Generalization Problem Definition Evaluation of Disclosure Risk The m-invariance Principle Experimental Results Conclusion

Counterfeited generalization

Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

Counterfeited generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]

The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1
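One way to read the auxiliary relation R: it tells an analyst how many counterfeit tuples each QI group contains, so counterfeits can be discounted during analysis. A minimal sketch under that reading (the group sizes and helper name are hypothetical, not the authors' code):

```python
# Published QI groups of T*(2): group id -> number of published tuples.
# Group 1 holds Bob + counterfeit c1; group 3 holds Jane, c2, Linda.
group_sizes = {1: 2, 2: 2, 3: 3, 4: 2, 5: 2, 6: 2}

# The auxiliary relation R(2): group id -> number of counterfeits.
counterfeits = {1: 1, 3: 1}

def effective_size(gid):
    """Group size after discounting the published counterfeit count."""
    return group_sizes[gid] - counterfeits.get(gid, 0)

print(effective_size(1))  # 1  (Bob only)
print(effective_size(3))  # 2  (Jane and Linda)
```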

Counterfeited generalization

Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]
Linda  4     [37, 43]  [26k, 35k]
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]

Counterfeited generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]

The adversary knows:
Name  Age  Zipcode
Bob   21   12000

The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1

Outline Counterfeited Generalization Problem Definition Evaluation of Disclosure Risk The m-invariance Principle Experimental Results Conclusion

Problem Definition

A dynamic microdata table T; denote the snapshot of T at time j as T(j).
n - 1 counterfeited generalizations {T*(1), R(1)}, …, {T*(n-1), R(n-1)} have been published.
Problem: given T(n), compute a counterfeited generalization {T*(n), R(n)} of T(n), such that the publication of {T*(n), R(n)} incurs a small risk of privacy disclosure.

Outline Counterfeited Generalization Problem Definition Evaluation of Disclosure Risk The m-invariance Principle Experimental Results Conclusion

Adversary Model The adversary has the following background knowledge: the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T;

Adversary Model

The adversary has the following background knowledge: the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T. For instance, in our running example:

Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
…

Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
…


Adversary Model The adversary has the following background knowledge: the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T; the generalization principle adopted by the data publisher.

Evaluation of Disclosure Risk

Let B denote the background knowledge of the adversary, and let o be an individual with a sensitive value v.

risk(o) = Pr(o has v | T*(1), R(1), …, T*(n), R(n), B)

The disclosure risk for Bob:
risk(Bob) = Pr(Bob has dyspepsia | T*(1), R(1), T*(2), R(2), B)

Name  Age  Zipcode  Disease
Bob   21   12000    dyspepsia

Objective: for each individual o, risk(o) <= a threshold.

Outline Counterfeited Generalization Problem Definition Evaluation of Disclosure Risk The m-invariance Principle (m-uniqueness, signature, m-invariance) Experimental Results Conclusion

m-uniqueness

A generalized table T*(j) is m-unique, if and only if
each QI group in T*(j) contains at least m tuples, and
all tuples in the same QI group have different sensitive values.

A 2-unique generalized table:
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia, bronchitis
2     [23, 24]  [18k, 25k]  flu, gastritis
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]
5     [52, 56]  [33k, 34k]
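The two conditions of m-uniqueness translate directly into a short check; a minimal Python sketch (helper name and sample groups are hypothetical):

```python
def is_m_unique(groups, m):
    """A generalized table is m-unique iff every QI group has at least
    m tuples and all its sensitive values are distinct."""
    return all(len(g) >= m and len(set(g)) == len(g) for g in groups)

# The first two QI groups of the 2-unique table above (diseases only).
groups = [["dyspepsia", "bronchitis"], ["flu", "gastritis"]]
print(is_m_unique(groups, 2))            # True
print(is_m_unique([["flu", "flu"]], 2))  # False: duplicate sensitive value
```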

Signature

The signature of Bob in T*(1) is {dyspepsia, bronchitis}.
The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}.

T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
…
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
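Computing a signature is a lookup of the sensitive values in the individual's QI group. A hedged sketch using (group-id, name, disease) triples from T*(1); Jane's own disease, dyspepsia, is implied by her stated signature rather than shown explicitly on the slide:

```python
def signature(table, name):
    """Signature of an individual in a generalized table: the set of
    sensitive values appearing in his/her QI group."""
    gid = next(g for g, n, _ in table if n == name)
    return {v for g, _, v in table if g == gid}

# (group-id, name, disease) triples from T*(1).
t1 = [(1, "Bob", "dyspepsia"), (1, "Alice", "bronchitis"),
      (4, "Jane", "dyspepsia"), (4, "Ken", "flu"), (4, "Linda", "gastritis")]

print(signature(t1, "Bob"))   # {'dyspepsia', 'bronchitis'}
```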

The m-invariance principle

A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
T*(1), …, T*(n) are m-unique, and
each individual has the same signature in every generalized table s/he is involved in.
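The signature condition of the principle can be checked by comparing each individual's signature across snapshots; a sketch with hypothetical helper names (the m-uniqueness condition would be checked separately):

```python
def same_signatures(sig_by_snapshot):
    """Check that every individual has an identical signature in every
    snapshot that contains them.
    sig_by_snapshot: list of {name: frozenset of sensitive values}."""
    seen = {}
    for sigs in sig_by_snapshot:
        for name, sig in sigs.items():
            if seen.setdefault(name, sig) != sig:
                return False
    return True

t1 = {"Bob": frozenset({"dyspepsia", "bronchitis"})}
t2_bad = {"Bob": frozenset({"dyspepsia", "gastritis"})}   # plain 2-diverse T*(2)
t2_good = {"Bob": frozenset({"dyspepsia", "bronchitis"})} # via counterfeit c1

print(same_signatures([t1, t2_bad]))   # False
print(same_signatures([t1, t2_good]))  # True
```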

The m-invariance principle (example)

A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if T*(1), …, T*(n) are m-unique, and each individual has the same signature in every generalized table s/he is involved in. For example, Bob's signature is {dyspepsia, bronchitis} in both T*(1) and T*(2) below; in T*(2) this is achieved with the counterfeit tuple c1.

Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]
Linda  4     [37, 43]  [26k, 35k]
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]

Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]


The m-invariance principle Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) <= 1/m

The m-invariance principle

Lemma: if {T*(1), …, T*(n-1)} is m-invariant, then {T*(1), …, T*(n-1), T*(n)} is also m-invariant if and only if {T*(n-1), T*(n)} is m-invariant.

Hence only T*(n-1) is needed for the generation of T*(n); T*(1), T*(2), …, T*(n-2) can be discarded.

Algorithm

Given T(n), T*(n-1), and a parameter m, our algorithm generates a counterfeited generalization T*(n) of T(n), such that {T*(1), …, T*(n)} is m-invariant.

Optimization goal: to impose as little generalization as possible.

Outline Counterfeited Generalization Problem Definition Evaluation of Disclosure Risk The m-invariance Principle Experimental Results Conclusion

Experiment Settings

Real dataset with 600k tuples and 6 attributes: Age, Gender, Education, Birthplace, Occupation, Salary-class.
Two derived datasets, OCC and SAL.
Two dynamic microdata tables, Tocc and Tsal.

Defect of l-diversity

Number of Counterfeits

Query Accuracy

SELECT COUNT(*) FROM Tocc(j)
WHERE Age < 30 AND Occupation = Manager

Query error = |act - est| / act

Each workload contains 10000 queries. We measure the median error of each workload.
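The workload above can be reproduced in miniature: estimate a COUNT query from a generalized table under the common uniformity assumption (a standard estimation heuristic, not necessarily the one used in the experiments), then compute the relative error. Helper names and the sample group are hypothetical:

```python
def estimate_count(groups, lo, hi):
    """Estimate SELECT COUNT(*) ... WHERE lo <= Age < hi from a
    generalized table, assuming ages are uniformly spread inside each
    group's interval. groups: list of (size, (age_min, age_max))."""
    est = 0.0
    for size, (a_min, a_max) in groups:
        span = a_max - a_min + 1
        overlap = max(0, min(hi - 1, a_max) - max(lo, a_min) + 1)
        est += size * overlap / span
    return est

def relative_error(act, est):
    return abs(act - est) / act

# Group 1 of T*(2): 2 tuples with Age in [21, 23].
groups = [(2, (21, 23))]
est = estimate_count(groups, 0, 30)  # the whole interval qualifies
print(relative_error(2, est))        # 0.0
```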

Query Accuracy

Efficiency of Our Algorithm

Conclusions

Existing solutions do not support re-publication of dynamic datasets.
We devise a framework for analyzing disclosure risks in the re-publication scenario.
We propose the m-invariance principle and the counterfeited generalization technique.
We develop an efficient algorithm for computing m-invariant tables.

Thank you for your attention!

Counterfeited generalization T*(2)

How many patients are below 30?

Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

Counterfeited generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]

The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1


Counterfeited generalization T*(2)

Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  bronchitis
Alice  1     [21, 22]  [12k, 14k]  dyspepsia
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]
Linda  4     [37, 43]  [26k, 35k]
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]

Counterfeited generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  bronchitis
c1     1     [21, 22]  [12k, 14k]  dyspepsia
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]

The adversary knows:
Name  Age  Zipcode
Bob   21   12000

The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1

Defect of l-diversity

risk(Bob) = 100%. We say that Bob's tuple is a vulnerable tuple.

2-diverse T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia, bronchitis
……

2-diverse T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia, gastritis
……

Name  Age  Zipcode
Bob   21   12000