Xiaokui Xiao and Yufei Tao Chinese University of Hong Kong

m-Invariance: Towards Privacy Preserving Republication of Dynamic Datasets

Privacy preserving data publishing
Microdata:
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000
Helen  36   27000
Jane   37   33000
Ken    40   35000
Linda  43   26000
Paul   52
Steve  56   34000

Inference attack
Age and Zipcode are quasi-identifier (QI) attributes.
Published table:
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000
36   27000
37   33000
40   35000
43   26000
52
56   34000
An adversary knows:
Name  Age  Zipcode
Bob   21   12000

Generalization
Transform the QI values into less specific forms.
Age  Zipcode  Generalized Age  Generalized Zipcode  Disease
21   12000    [21, 22]         [12k, 14k]           dyspepsia
22   14000    [21, 22]         [12k, 14k]           bronchitis
24   18000    [23, 24]         [18k, 25k]           flu
23   25000    [23, 24]         [18k, 25k]           gastritis
41   20000    [36, 41]         [20k, 27k]
36   27000    [36, 41]         [20k, 27k]
37   33000    [37, 43]         [26k, 35k]
40   35000    [37, 43]         [26k, 35k]
43   26000    [37, 43]         [26k, 35k]
52            [52, 56]         [33k, 34k]
56   34000    [52, 56]         [33k, 34k]

Generalization
Transform each QI value into a less specific form.
A generalized table:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]
[37, 43]  [26k, 35k]
[37, 43]  [26k, 35k]
[52, 56]  [33k, 34k]
[52, 56]  [33k, 34k]
An adversary knows:
Name  Age  Zipcode
Bob   21   12000

l-diversity
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.
Each QI-group contains at least l “well-represented” sensitive values, e.g., in each QI-group, at most 1/l of the tuples have the same sensitive value.
A 2-diverse table with 5 QI-groups:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]
[37, 43]  [26k, 35k]
[37, 43]  [26k, 35k]
[52, 56]  [33k, 34k]
[52, 56]  [33k, 34k]
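The 1/l rule quoted on this slide can be checked mechanically. The following is a minimal Python sketch and is not part of the original slides: the function name `is_l_diverse` and the data layout (a group ID mapped to the list of sensitive values in that QI-group) are assumptions made for illustration. Only the two QI-groups whose diseases are listed on the slide are used.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Return True if, in every (non-empty) QI-group, no sensitive value
    is carried by more than 1/l of the group's tuples."""
    for values in groups.values():
        # Count of the most frequent sensitive value in this group.
        top_count = Counter(values).most_common(1)[0][1]
        # top_count / len(values) > 1/l, written without division.
        if top_count * l > len(values):
            return False
    return True

# The first two QI-groups of the slide's table (group IDs are illustrative).
groups = {
    1: ["dyspepsia", "bronchitis"],
    2: ["flu", "gastritis"],
}

print(is_l_diverse(groups, 2))  # each value covers exactly 1/2 of its group
print(is_l_diverse(groups, 3))  # 1/2 exceeds 1/3, so the check fails
```

The check uses integer arithmetic (`top_count * l > len(values)`) rather than a floating-point ratio, which avoids rounding issues at the boundary.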

Motivating Example
A hospital keeps track of the medical records collected in the last three months.
The microdata table T(1) and its generalization T*(1), published in Apr. 2007:
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000
Helen  36   27000
Jane   37   33000
Ken    40   35000
Linda  43   26000
Paul   52
Steve  56   34000
2-diverse generalization T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
2     [23, 24]  [18k, 25k]  flu
2     [23, 24]  [18k, 25k]  gastritis
3     [36, 41]  [20k, 27k]
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]
4     [37, 43]  [26k, 35k]
4     [37, 43]  [26k, 35k]
5     [52, 56]  [33k, 34k]
5     [52, 56]  [33k, 34k]

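Motivating Example
Bob was hospitalized in Mar. 2007.
The adversary knows:
Name  Age  Zipcode
Bob   21   12000
2-diverse generalization T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
2     [23, 24]  [18k, 25k]  flu
2     [23, 24]  [18k, 25k]  gastritis
3     [36, 41]  [20k, 27k]
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]
4     [37, 43]  [26k, 35k]
4     [37, 43]  [26k, 35k]
5     [52, 56]  [33k, 34k]
5     [52, 56]  [33k, 34k]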


Motivating Example
One month later, in May 2007, some obsolete tuples are deleted from the microdata T(1): those of Alice, Andy, Helen, Ken and Paul.

Motivating Example
Bob’s tuple stays.
Microdata T(1) after the deletions:
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Gary   41   20000    flu
Jane   37   33000
Linda  43   26000
Steve  56   34000

Motivating Example
Some new records are inserted.
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

Motivating Example
The hospital published T*(2).
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
2-diverse generalization T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
2     [25, 43]  [21k, 33k]  flu
2     [25, 43]  [21k, 33k]
2     [25, 43]  [21k, 33k]
3     [41, 46]  [20k, 30k]
3     [41, 46]  [20k, 30k]
4     [54, 56]  [31k, 34k]
4     [54, 56]  [31k, 34k]
5     [60, 65]  [36k, 44k]
5     [60, 65]  [36k, 44k]

Motivating Example
Consider the previous adversary.
Name  Age  Zipcode
Bob   21   12000
2-diverse generalization T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
2     [25, 43]  [21k, 33k]  flu
2     [25, 43]  [21k, 33k]
2     [25, 43]  [21k, 33k]
3     [41, 46]  [20k, 30k]
3     [41, 46]  [20k, 30k]
4     [54, 56]  [31k, 34k]
4     [54, 56]  [31k, 34k]
5     [60, 65]  [36k, 44k]
5     [60, 65]  [36k, 44k]

Motivating Example
What the adversary learns from T*(1), and then from T*(2), given Bob’s QI values:
Name  Age  Zipcode
Bob   21   12000
T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
……
T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
……
Dyspepsia is the only disease common to Bob’s QI-groups in the two tables, so Bob must have contracted dyspepsia!
A new generalization principle is needed.
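The inference on this slide amounts to intersecting the sets of diseases in Bob's QI-group across the two releases. Below is a small Python sketch of that reasoning, not taken from the slides; the function name `candidate_values` and the dictionary layout are hypothetical, and only Bob's group from each release is modelled.

```python
def candidate_values(releases, victim_groups):
    """Intersect the sensitive values of the victim's QI-group across releases.

    releases      -- list of {group_id: [sensitive values]} dicts, one per release
    victim_groups -- the victim's group ID in each release, in the same order
    """
    candidates = None
    for release, gid in zip(releases, victim_groups):
        values = set(release[gid])
        candidates = values if candidates is None else candidates & values
    return candidates

# Bob's QI-group (group 1) in each published table.
t1_star = {1: ["dyspepsia", "bronchitis"]}
t2_star = {1: ["dyspepsia", "gastritis"]}

bob = candidate_values([t1_star, t2_star], [1, 1])
print(bob)  # a single remaining candidate means the disease is disclosed
```

Each table on its own leaves two candidates for Bob; it is the combination of releases that narrows the set to one value.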

The critical absence phenomenon
What the adversary learns from T*(1):
Name  Age  Zipcode
Bob   21   12000
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
……
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
Bronchitis, the only other candidate for Bob in T*(1), is absent from T(2).
We refer to this as the critical absence phenomenon.
A new generalization method is needed.

Contributions
We propose a solution for privacy preserving publication of dynamic datasets.
Our solution includes:
a framework for analyzing disclosure risk;
the m-invariance principle and the counterfeited generalization technique;
an efficient algorithm for computing m-invariant tables.

Related Work J. Byun, et al. Secure Anonymization for Incremental Datasets. The Third VLDB Workshop on Secure Data Management (SDM'06)

Outline
Counterfeited Generalization
Problem Definition
Evaluation of Disclosure Risk
The m-invariance Principle
Experimental Results
Conclusion

Counterfeited generalization T*(2)
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
Counterfeited generalization T*(2) (c1 and c2 are counterfeit tuples):
Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]
c2     3         [37, 43]  [26k, 33k]
Linda  3         [37, 43]  [26k, 33k]
Gary   4         [41, 46]  [20k, 30k]
Mary   4         [41, 46]  [20k, 30k]
Ray    5         [54, 56]  [31k, 34k]
Steve  5         [54, 56]  [31k, 34k]
Tom    6         [60, 65]  [36k, 44k]
Vince  6         [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1

Counterfeited Generalization T*(2)
The adversary knows:
Name  Age  Zipcode
Bob   21   12000
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]
Linda  4     [37, 43]  [26k, 35k]
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]
Counterfeited Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1
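The auxiliary relation records, for each QI-group that contains counterfeits, how many counterfeit tuples it holds. A minimal Python sketch of how such a relation could be derived is given below; it is an illustration only, not the authors' code. The function name `auxiliary_relation`, the `(name, group_id)` row layout and the explicit set of counterfeit names are all assumptions, and the rows follow the counterfeited T*(2) of the running example (c1 in group 1, c2 in group 3).

```python
from collections import Counter

def auxiliary_relation(rows, counterfeit_names):
    """Return {group_id: number of counterfeit tuples}, listing only
    the groups that contain at least one counterfeit."""
    counts = Counter(gid for name, gid in rows if name in counterfeit_names)
    return dict(sorted(counts.items()))

# (name, group_id) pairs of the counterfeited generalization T*(2).
t2_star_rows = [
    ("Bob", 1), ("c1", 1),
    ("David", 2), ("Emily", 2),
    ("Jane", 3), ("c2", 3), ("Linda", 3),
    ("Gary", 4), ("Mary", 4),
    ("Ray", 5), ("Steve", 5),
    ("Tom", 6), ("Vince", 6),
]

r2 = auxiliary_relation(t2_star_rows, counterfeit_names={"c1", "c2"})
print(r2)
```

Groups without counterfeits are omitted from the result, which matches the two-row shape of R(2) in the running example.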

Problem Definition
A dynamic microdata table T. Denote the snapshot of T at time j as T(j).
n – 1 counterfeited generalizations {T*(1), R(1)}, …, {T*(n-1), R(n-1)} have been published.
Problem: given T(n), compute a counterfeited generalization {T*(n), R(n)} of T(n), such that the publication of {T*(n), R(n)} incurs a small risk of privacy disclosure.

Adversary Model
The adversary has the following background knowledge:
the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T;
the generalization principle adopted by the data publisher.
For instance, in our running example, the first item covers the Name, Age and Zipcode columns of T(1) and T(2).
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
…
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
…

Evaluation of Disclosure Risk
Let B denote the background knowledge of the adversary.
Let o be an individual with a sensitive value v.
risk(o) = Pr(o has v | T*(1), R(1), …, T*(n), R(n), B).
The disclosure risk for Bob (Age 21, Zipcode 12000, Disease dyspepsia):
risk(Bob) = Pr(Bob has dyspepsia | T*(1), R(1), T*(2), R(2), B).
Objective: for each individual o, risk(o) <= a threshold.

Evaluation of Disclosure Risk
The m-invariance Principle
  m-uniqueness
  signature
  m-invariance
Experimental Results
Conclusion

m-uniqueness
A generalized table T*(j) is m-unique, if and only if
each QI-group in T*(j) contains at least m tuples, and
all tuples in the same QI-group have different sensitive values.
A 2-unique generalized table:
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
2     [23, 24]  [18k, 25k]  flu
2     [23, 24]  [18k, 25k]  gastritis
3     [36, 41]  [20k, 27k]
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]
4     [37, 43]  [26k, 35k]
4     [37, 43]  [26k, 35k]
5     [52, 56]  [33k, 34k]
5     [52, 56]  [33k, 34k]
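The two conditions of m-uniqueness translate directly into a short check. The Python sketch below is illustrative only; the function name `is_m_unique` and the mapping from group ID to a list of sensitive values are assumptions, and only the two QI-groups with listed diseases are used.

```python
def is_m_unique(groups, m):
    """Return True if every QI-group has at least m tuples and
    all tuples within a group carry distinct sensitive values."""
    return all(
        len(values) >= m and len(set(values)) == len(values)
        for values in groups.values()
    )

# Two QI-groups of the 2-unique table (group IDs as on the slide).
groups = {
    1: ["dyspepsia", "bronchitis"],
    2: ["flu", "gastritis"],
}

print(is_m_unique(groups, 2))                 # both conditions hold
print(is_m_unique(groups, 3))                 # groups are too small for m = 3
print(is_m_unique({1: ["flu", "flu"]}, 2))    # duplicate value in a group
```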

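Signature
The signature of an individual in a generalized table is the set of sensitive values in the QI-group containing his/her tuple.
The signature of Bob in T*(1) is {dyspepsia, bronchitis}.
The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}.
T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
…
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis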
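The two signatures stated on this slide can be reproduced with a few lines of Python. This is a sketch under assumptions: the function name `signature` and the `(name, group_id, disease)` row layout are hypothetical, and Jane's own disease is taken to be dyspepsia, as implied by her stated signature together with Ken's flu and Linda's gastritis.

```python
def signature(table, name):
    """Return the set of sensitive values in the QI-group that holds `name`."""
    gid = next(g for n, g, _ in table if n == name)
    return frozenset(d for _, g, d in table if g == gid)

# The rows of T*(1) shown on the slide: (name, group_id, disease).
t1_star = [
    ("Bob", 1, "dyspepsia"),
    ("Alice", 1, "bronchitis"),
    ("Jane", 4, "dyspepsia"),   # inferred from Jane's stated signature
    ("Ken", 4, "flu"),
    ("Linda", 4, "gastritis"),
]

print(sorted(signature(t1_star, "Bob")))
print(sorted(signature(t1_star, "Jane")))
```

Everyone in the same QI-group shares one signature, so `signature(t1_star, "Ken")` equals Jane's.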

The m-invariance principle
A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
T*(1), …, T*(n) are m-unique, and
each individual has the same signature in every generalized table s/he is involved in.
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]
Linda  4     [37, 43]  [26k, 35k]
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]
Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]
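The definition can be exercised on Bob's QI-group from the running example. The Python sketch below is an illustration, not the authors' algorithm: the helper names are hypothetical, rows are `(name, group_id, disease)` triples, and a counterfeit tuple is represented by the name `None` (an assumption of this sketch) so that it contributes a sensitive value to its group without being treated as an individual.

```python
def diseases_by_group(table):
    """Map each group ID to the list of sensitive values in that group."""
    groups = {}
    for _, gid, disease in table:
        groups.setdefault(gid, []).append(disease)
    return groups

def is_m_unique(table, m):
    return all(len(v) >= m and len(set(v)) == len(v)
               for v in diseases_by_group(table).values())

def is_m_invariant(tables, m):
    """Check m-uniqueness of every table and signature stability
    of every real individual across the tables s/he appears in."""
    if not all(is_m_unique(t, m) for t in tables):
        return False
    seen = {}  # individual -> signature observed so far
    for t in tables:
        groups = diseases_by_group(t)
        for name, gid, _ in t:
            if name is None:          # counterfeit: not an individual
                continue
            sig = frozenset(groups[gid])
            if seen.setdefault(name, sig) != sig:
                return False
    return True

# Bob's group only; other groups are omitted for brevity.
t1 = [("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis")]
t2_counterfeited = [("Bob", 1, "dyspepsia"), (None, 1, "bronchitis")]
t2_plain = [("Bob", 1, "dyspepsia"), ("David", 1, "gastritis")]

print(is_m_invariant([t1, t2_counterfeited], 2))  # Bob's signature is preserved
print(is_m_invariant([t1, t2_plain], 2))          # Bob's signature changes
```

With the counterfeit standing in for Alice's bronchitis, Bob keeps the signature {dyspepsia, bronchitis} in both releases; grouping him with David instead changes it, which is the situation the motivating example exploits.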

Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) <= 1/m

Lemma: if {T*(1), …, T*(n-1)} is m-invariant, then {T*(1), …, T*(n-1), T*(n)} is also m-invariant, if and only if {T*(n-1), T*(n)} is m-invariant.
Only T*(n-1) is needed for the generation of T*(n); T*(1), T*(2), …, T*(n-2) can be discarded.

Algorithm
Given T(n), T*(n-1) and a parameter m, our algorithm generates a counterfeited generalization T*(n) of T(n), such that {T*(1), …, T*(n)} is m-invariant.
Optimization goal: to impose as little generalization as possible.

Experiment Settings
Real dataset with 600k tuples and 6 attributes: Age, Gender, Education, Birthplace, Occupation, Salary-class.
Two derived datasets, OCC and SAL.
Two dynamic microdata tables, Tocc and Tsal.

Defect of l-diversity

Number of Counterfeits

Query Accuracy
SELECT COUNT(*) FROM Tocc(j)
WHERE Age < 30 AND Occupation = 'Manager'
Query error = |act – est| / act (act: actual result, est: estimated result)
Each workload contains queries. We measure the median error of each workload.
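The error metric on this slide is a relative error, summarized per workload by its median. A minimal Python sketch follows; the function name `relative_error` is an assumption, and the (actual, estimated) pairs are made-up numbers for illustration only, not results from the experiments.

```python
from statistics import median

def relative_error(actual, estimate):
    """|act - est| / act; assumes the actual answer is non-zero."""
    return abs(actual - estimate) / actual

# Hypothetical (actual, estimated) COUNT(*) answers for one workload.
workload = [(100, 90), (50, 60), (200, 210)]

errors = [relative_error(act, est) for act, est in workload]
print(median(errors))
```

Using the median rather than the mean keeps a few queries with very small actual counts (and hence very large relative errors) from dominating the summary.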

Query Accuracy

Efficiency of Our Algorithm

Conclusions
Existing solutions do not support republication of dynamic datasets.
We devise a framework for analyzing disclosure risks in the republication scenario.
We propose the m-invariance principle and the counterfeited generalization technique.
We develop an efficient algorithm for computing m-invariant tables.

Thank you for your attention!

How many patients are below 30?
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000
Linda  43   26000
Gary   41   20000
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
Counterfeited Generalization T*(2):
Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]
c2     3         [37, 43]  [26k, 33k]
Linda  3         [37, 43]  [26k, 33k]
Gary   4         [41, 46]  [20k, 30k]
Mary   4         [41, 46]  [20k, 30k]
Ray    5         [54, 56]  [31k, 34k]
Steve  5         [54, 56]  [31k, 34k]
Tom    6         [60, 65]  [36k, 44k]
Vince  6         [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1
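One plausible reading of the question on this slide is that R(2) lets an analyst discount counterfeit tuples when counting. The Python sketch below illustrates that reading and is not taken from the slides: the function name `count_below` and the data layout are hypothetical, R(2) is assumed to record one counterfeit in group 1 (c1) and one in group 3 (c2), and only groups whose whole age range lies below the limit are counted (a group straddling the limit would need an estimate, which this sketch does not attempt).

```python
def count_below(groups, counterfeits, age_limit):
    """Count real tuples in QI-groups whose age range lies entirely below age_limit.

    groups       -- {group_id: (age_low, age_high, number_of_tuples)}
    counterfeits -- {group_id: number_of_counterfeit_tuples}
    """
    total = 0
    for gid, (_, age_high, size) in groups.items():
        if age_high < age_limit:
            total += size - counterfeits.get(gid, 0)
    return total

# Age ranges and group sizes of the counterfeited generalization T*(2).
t2_star_groups = {
    1: (21, 22, 2),
    2: (23, 25, 2),
    3: (37, 43, 3),
    4: (41, 46, 2),
    5: (54, 56, 2),
    6: (60, 65, 2),
}
r2 = {1: 1, 3: 1}

print(count_below(t2_star_groups, r2, 30))
```

Under these assumptions the result agrees with the microdata T(2), where Bob (21), David (23) and Emily (25) are the patients below 30.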


The adversary knows:
Name  Age  Zipcode
Bob   21   12000
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  bronchitis
Alice  1     [21, 22]  [12k, 14k]  dyspepsia
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]
Ken    4     [37, 43]  [26k, 35k]
Linda  4     [37, 43]  [26k, 35k]
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]
Counterfeited Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  bronchitis
c1     1     [21, 22]  [12k, 14k]  dyspepsia
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]
c2     3     [37, 43]  [26k, 33k]
Linda  3     [37, 43]  [26k, 33k]
Gary   4     [41, 46]  [20k, 30k]
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1

Defect of l-diversity
risk(Bob) = 100%
We say that Bob’s tuple is a vulnerable tuple.
The adversary knows:
Name  Age  Zipcode
Bob   21   12000
2-diverse T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
……
2-diverse T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
……
