Xiaokui Xiao and Yufei Tao Chinese University of Hong Kong


1 m-Invariance: Towards Privacy Preserving Republication of Dynamic Datasets
Xiaokui Xiao and Yufei Tao, Chinese University of Hong Kong

2 Privacy preserving data publishing
Microdata:
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000
Steve  56   34000

3 Inference attack
Published table:
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000
56   34000
An adversary:
Name  Age  Zipcode
Bob   21   12000
Quasi-identifier (QI) attributes: Age, Zipcode

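4 Generalization
Transform the QI values into less specific forms.
Before generalization:
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000
56   34000
After generalization:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]
[52, 56]  [33k, 34k]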

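5 Generalization
Transform each QI value into a less specific form.
A generalized table:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]
[52, 56]  [33k, 34k]
An adversary:
Name  Age  Zipcode
Bob   21   12000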

7 l-diversity
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.
Each QI-group contains at least l “well-represented” sensitive values, e.g., in each QI-group, at most 1/l of the tuples have the same sensitive value.
A 2-diverse table (5 QI groups):
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]
[52, 56]  [33k, 34k]
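
The "at most 1/l" reading of l-diversity quoted on this slide can be checked mechanically. The sketch below is only an illustration: the function name is_l_diverse and the dictionary layout (QI-group ID mapped to the list of sensitive values in that group) are assumptions made for this example and do not come from the slides. Only the groups whose sensitive values are recoverable from the transcript are used.

```python
from collections import Counter


def is_l_diverse(groups, l):
    """Return True if, in every QI-group, at most 1/l of the tuples
    share the same sensitive value.

    groups: dict mapping a QI-group ID to its list of sensitive values
    (hypothetical layout chosen for this sketch).
    """
    for values in groups.values():
        if not values:
            continue  # an empty group has nothing to disclose
        most_frequent = Counter(values).most_common(1)[0][1]
        # most_frequent / len(values) <= 1 / l, written without division
        if most_frequent * l > len(values):
            return False
    return True


groups = {
    1: ["dyspepsia", "bronchitis"],
    2: ["flu", "gastritis"],
    4: ["dyspepsia", "flu", "gastritis"],
}
print(is_l_diverse(groups, 2))
print(is_l_diverse(groups, 3))
```

With these three groups the check passes for l = 2 and fails for l = 3, because the two-tuple groups cannot keep every value at or below one third of the group.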

8 Motivating Example
A hospital keeps track of the medical records collected in the last three months.
The microdata table T(1), and its generalization T*(1), published in Apr.
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000
Steve  56   34000
2-diverse generalization T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
2     [23, 24]  [18k, 25k]  flu
2     [23, 24]  [18k, 25k]  gastritis
3     [36, 41]  [20k, 27k]  flu
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]  dyspepsia
4     [37, 43]  [26k, 35k]  flu
4     [37, 43]  [26k, 35k]  gastritis
5     [52, 56]  [33k, 34k]
5     [52, 56]  [33k, 34k]

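9 Motivating Example
Bob was hospitalized in Mar. 2007.
2-diverse generalization T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
2     [23, 24]  [18k, 25k]  flu
2     [23, 24]  [18k, 25k]  gastritis
3     [36, 41]  [20k, 27k]  flu
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]  dyspepsia
4     [37, 43]  [26k, 35k]  flu
4     [37, 43]  [26k, 35k]  gastritis
5     [52, 56]  [33k, 34k]
5     [52, 56]  [33k, 34k]
An adversary:
Name  Age  Zipcode
Bob   21   12000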


11 Motivating Example
One month later, in May 2007.
Some obsolete tuples are deleted from the microdata.
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000
Steve  56   34000

12 Motivating Example
Bob’s tuple stays.
Microdata T(1) after the deletions:
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Gary   41   20000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Steve  56   34000

13 Motivating Example
Some new records are inserted.
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000

14 Motivating Example
The hospital published T*(2).
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
2-diverse generalization T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
2     [25, 43]  [21k, 33k]  flu
2     [25, 43]  [21k, 33k]  dyspepsia
2     [25, 43]  [21k, 33k]  gastritis
3     [41, 46]  [20k, 30k]  flu
3     [41, 46]  [20k, 30k]
4     [54, 56]  [31k, 34k]
4     [54, 56]  [31k, 34k]
5     [60, 65]  [36k, 44k]
5     [60, 65]  [36k, 44k]

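15 Motivating Example
Consider the previous adversary.
2-diverse generalization T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
2     [25, 43]  [21k, 33k]  flu
2     [25, 43]  [21k, 33k]  dyspepsia
2     [25, 43]  [21k, 33k]  gastritis
3     [41, 46]  [20k, 30k]  flu
3     [41, 46]  [20k, 30k]
4     [54, 56]  [31k, 34k]
4     [54, 56]  [31k, 34k]
5     [60, 65]  [36k, 44k]
5     [60, 65]  [36k, 44k]
An adversary:
Name  Age  Zipcode
Bob   21   12000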

16 Motivating Example
The adversary knows Bob's QI values:
Name  Age  Zipcode
Bob   21   12000
What the adversary learns from T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
……
What the adversary learns from T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
……
Dyspepsia is the only disease that appears in both of Bob's QI-groups. So Bob must have contracted dyspepsia!
A new generalization principle is needed.
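
The inference on this slide amounts to intersecting the sensitive values of the QI-groups that cover Bob in each release. A minimal sketch follows, assuming each release is represented as a dictionary from group ID to its sensitive values and that the adversary has already located Bob's group in each release from his QI values. The function name candidate_values and this data layout are hypothetical.

```python
def candidate_values(releases, group_ids):
    """Intersect the sensitive values of the victim's QI-group across releases.

    releases:  list of dicts, each mapping a group ID to a list of
               sensitive values (hypothetical layout for this sketch).
    group_ids: for each release, the ID of the group whose QI intervals
               cover the victim's known QI values.
    """
    candidates = None
    for release, gid in zip(releases, group_ids):
        values = set(release[gid])
        candidates = values if candidates is None else candidates & values
    return candidates


# Group 1 of T*(1) and group 1 of T*(2), as shown on the slide.
t1_star = {1: ["dyspepsia", "bronchitis"]}
t2_star = {1: ["dyspepsia", "gastritis"]}

print(candidate_values([t1_star, t2_star], [1, 1]))
```

Each release is 2-diverse on its own, yet the intersection leaves a single candidate for Bob.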

17 The critical absence phenomenon
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
The adversary knows:
Name  Age  Zipcode
Bob   21   12000
What the adversary learns from T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
……
We refer to such a phenomenon as the critical absence phenomenon.
A new generalization method is needed.

18 Contributions
We propose a solution for privacy preserving publication of dynamic datasets.
Our solution includes
a framework for analyzing disclosure risk;
the m-invariance principle and the counterfeited generalization technique;
an efficient algorithm for computing m-invariant tables.

19 Related Work
J. Byun et al. Secure Anonymization for Incremental Datasets. The Third VLDB Workshop on Secure Data Management (SDM'06).

20 Outline
Counterfeited Generalization
Problem Definition
Evaluation of Disclosure Risk
The m-invariance Principle
Experimental Results
Conclusion

21 Counterfeited generalization T*(2)
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
Counterfeited generalization T*(2) (c1 and c2 are counterfeit tuples):
Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]  dyspepsia
c2     3         [37, 43]  [26k, 33k]  flu
Linda  3         [37, 43]  [26k, 33k]  gastritis
Gary   4         [41, 46]  [20k, 30k]  flu
Mary   4         [41, 46]  [20k, 30k]
Ray    5         [54, 56]  [31k, 34k]
Steve  5         [54, 56]  [31k, 34k]
Tom    6         [60, 65]  [36k, 44k]
Vince  6         [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1
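
The auxiliary relation can be read as a per-group count of counterfeit tuples, published alongside the generalized table. A minimal sketch of how a publisher might derive it, assuming each row of T*(2) is available internally as a (group ID, is_counterfeit) pair. The function name auxiliary_relation and this row layout are hypothetical and are not taken from the slides.

```python
from collections import Counter


def auxiliary_relation(rows):
    """Count counterfeit tuples per QI-group.

    rows: iterable of (group_id, is_counterfeit) pairs (assumed layout).
    Returns a dict {group_id: count} listing only groups that contain
    at least one counterfeit.
    """
    counts = Counter(gid for gid, is_counterfeit in rows if is_counterfeit)
    return dict(sorted(counts.items()))


# T*(2) from the slide: c1 sits in group 1, c2 in group 3.
rows = [
    (1, False), (1, True),               # Bob, c1
    (2, False), (2, False),              # David, Emily
    (3, False), (3, True), (3, False),   # Jane, c2, Linda
    (4, False), (4, False),              # Gary, Mary
    (5, False), (5, False),              # Ray, Steve
    (6, False), (6, False),              # Tom, Vince
]
print(auxiliary_relation(rows))
```

Publishing only the counts lets a recipient correct aggregate statistics for the affected groups without learning which rows are counterfeit.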

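22 Counterfeited Generalization T*(2)
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]
Counterfeited generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1
An adversary:
Name  Age  Zipcode
Bob   21   12000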

23 Outline
Counterfeited Generalization
Problem Definition
Evaluation of Disclosure Risk
The m-invariance Principle
Experimental Results
Conclusion

24 Problem Definition
A dynamic microdata table T. Denote the snapshot of T at time j as T(j).
n - 1 counterfeited generalizations {T*(1), R(1)}, …, {T*(n-1), R(n-1)} have been published.
Problem: given T(n), compute a counterfeited generalization {T*(n), R(n)} of T(n) such that the publication of {T*(n), R(n)} incurs a small risk of privacy disclosure.

25 Outline
Counterfeited Generalization
Problem Definition
Evaluation of Disclosure Risk
The m-invariance Principle
Experimental Results
Conclusion

26 Adversary Model
The adversary has the following background knowledge:
the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T;

27 Adversary Model
The adversary has the following background knowledge:
the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T;
For instance, in our running example:
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
...
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
...


29 Adversary Model
The adversary has the following background knowledge:
the identity and the QI values of each individual, as well as the time his/her tuple is inserted into (deleted from) T;
the generalization principle adopted by the data publisher.

30 Evaluation of Disclosure Risk
Let B denote the background knowledge of the adversary.
Let o be an individual with a sensitive value v.
risk(o) = Pr(o has v | T*(1), R(1), …, T*(n), R(n), B)
The disclosure risk for Bob:
Name  Age  Zipcode  Disease
Bob   21   12000    dyspepsia
risk(Bob) = Pr(Bob has dyspepsia | T*(1), R(1), T*(2), R(2), B)
Objective: for each individual o, risk(o) <= a threshold.

31 Outline
Counterfeited Generalization
Problem Definition
Evaluation of Disclosure Risk
The m-invariance Principle
  m-uniqueness
  signature
  m-invariance
Experimental Results
Conclusion

32 m-uniqueness
A generalized table T*(j) is m-unique if and only if
each QI-group in T*(j) contains at least m tuples, and
all tuples in the same QI-group have different sensitive values.
A 2-unique generalized table:
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
2     [23, 24]  [18k, 25k]  flu
2     [23, 24]  [18k, 25k]  gastritis
3     [36, 41]  [20k, 27k]  flu
3     [36, 41]  [20k, 27k]
4     [37, 43]  [26k, 35k]  dyspepsia
4     [37, 43]  [26k, 35k]  flu
4     [37, 43]  [26k, 35k]  gastritis
5     [52, 56]  [33k, 34k]
5     [52, 56]  [33k, 34k]
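
Both conditions of m-uniqueness are simple per-group checks. The following sketch is illustrative only: the name is_m_unique and the dictionary layout (group ID mapped to the list of sensitive values in that group) are assumptions for the example, and only groups whose values are recoverable from the transcript are used.

```python
def is_m_unique(groups, m):
    """Check m-uniqueness of one generalized table.

    groups: dict mapping a QI-group ID to its list of sensitive values
    (hypothetical layout for this sketch).
    """
    for values in groups.values():
        if len(values) < m:
            return False  # condition 1: at least m tuples per group
        if len(set(values)) != len(values):
            return False  # condition 2: all sensitive values distinct
    return True


groups = {
    1: ["dyspepsia", "bronchitis"],
    2: ["flu", "gastritis"],
    4: ["dyspepsia", "flu", "gastritis"],
}
print(is_m_unique(groups, 2))
print(is_m_unique(groups, 3))
```

The groups above are 2-unique but not 3-unique, since groups 1 and 2 hold only two tuples each.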

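33 Signature
The signature of an individual in T*(j) is the set of sensitive values in his/her QI-group.
The signature of Bob in T*(1) is {dyspepsia, bronchitis}.
The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}.
T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis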

34 The m-invariance principle
A sequence of generalized tables T*(1), …, T*(n) is m-invariant if and only if
T*(1), …, T*(n) are m-unique, and
each individual has the same signature in every generalized table in which s/he is involved.
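
The definition can be turned into a direct check over a sequence of releases. The sketch below is a minimal illustration, not the authors' algorithm: it assumes each release is a list of (individual, group ID, sensitive value) rows, and the names signatures and is_m_invariant are hypothetical. The example rows come from groups 1 and 4 of T*(1) and groups 1 and 3 of the counterfeited T*(2); the value flu for counterfeit c2 is inferred from Jane's signature rather than read from the transcript.

```python
def signatures(release):
    """Map each individual to the set of sensitive values in its QI-group."""
    values_by_group = {}
    for _, gid, value in release:
        values_by_group.setdefault(gid, set()).add(value)
    return {name: frozenset(values_by_group[gid]) for name, gid, _ in release}


def is_m_invariant(releases, m):
    """Check m-uniqueness of every release and signature stability."""
    first_seen = {}
    for release in releases:
        groups = {}
        for _, gid, value in release:
            groups.setdefault(gid, []).append(value)
        for values in groups.values():
            if len(values) < m or len(set(values)) != len(values):
                return False  # release is not m-unique
        for name, sig in signatures(release).items():
            # setdefault stores the first signature and returns it afterwards
            if first_seen.setdefault(name, sig) != sig:
                return False  # signature changed between releases
    return True


t1_star = [("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis"),
           ("Jane", 4, "dyspepsia"), ("Ken", 4, "flu"), ("Linda", 4, "gastritis")]
t2_counterfeited = [("Bob", 1, "dyspepsia"), ("c1", 1, "bronchitis"),
                    ("Jane", 3, "dyspepsia"), ("c2", 3, "flu"), ("Linda", 3, "gastritis")]
t2_plain = [("Bob", 1, "dyspepsia"), ("David", 1, "gastritis")]

print(is_m_invariant([t1_star, t2_counterfeited], 2))
print(is_m_invariant([t1_star, t2_plain], 2))
```

The counterfeits c1 and c2 stand in for the deleted tuples of Alice and Ken, which keeps the signatures of Bob, Jane, and Linda unchanged. The plain 2-diverse regrouping changes Bob's signature and fails the check.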

35 A sequence of generalized tables T*(1), …, T*(n) is m-invariant if and only if
T*(1), …, T*(n) are m-unique, and
each individual has the same signature in every generalized table in which s/he is involved.
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]
Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]


38 The m-invariance principle
Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) <= 1/m

39 The m-invariance principle
Lemma: if {T*(1), …, T*(n-1)} is m-invariant, then {T*(1), …, T*(n-1), T*(n)} is also m-invariant if and only if {T*(n-1), T*(n)} is m-invariant.
Only T*(n-1) is needed for the generation of T*(n).
In the sequence T*(1), T*(2), …, T*(n-2), T*(n-1), T*(n), the releases T*(1), …, T*(n-2) can be discarded.

40 Algorithm
Given T(n), T*(n-1) and a parameter m, our algorithm generates a counterfeited generalization T*(n) of T(n), such that {T*(1), …, T*(n)} is m-invariant.
Optimization goal: to impose as little generalization as possible.

41 Outline
Counterfeited Generalization
Problem Definition
Evaluation of Disclosure Risk
The m-invariance Principle
Experimental Results
Conclusion

42 Experiment Settings
Real dataset with 600k tuples and 6 attributes: Age, Gender, Education, Birthplace, Occupation, Salary-class.
Two derived datasets, OCC and SAL.
Two dynamic microdata tables, Tocc and Tsal.

43 Defect of l-diversity

44 Number of Counterfeits

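45 Query Accuracy
SELECT COUNT(*) FROM Tocc(j) WHERE Age < 30 AND Occupation = Manager
Query error = |act - est| / act, where act is the actual query result on the microdata and est is the result estimated from the generalized table.
Each workload contains queries. We measure the median error of each workload.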
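
The error metric on this slide is a relative error per query, summarized by the median over a workload. A small sketch under stated assumptions: the function names relative_error and workload_median_error are made up for this example, the (actual, estimate) pairs are invented numbers rather than results from the experiments, and every actual count is assumed to be positive so that the division is defined.

```python
from statistics import median


def relative_error(actual, estimate):
    """Query error = |act - est| / act (assumes actual > 0)."""
    return abs(actual - estimate) / actual


def workload_median_error(results):
    """Median relative error over a workload.

    results: iterable of (actual, estimate) pairs, one per query.
    """
    return median(relative_error(act, est) for act, est in results)


# Invented example workload of three COUNT(*) queries.
workload = [(100, 90), (50, 60), (200, 200)]
print(workload_median_error(workload))
```

Using the median rather than the mean keeps a few queries with very small actual counts from dominating the summary.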

46 Query Accuracy

47 Efficiency of Our Algorithm

48 Conclusions
Existing solutions do not support republication of dynamic datasets.
We devise a framework for analyzing disclosure risks in the republication scenario.
We propose the m-invariance principle and the counterfeited generalization technique.
We develop an efficient algorithm for computing m-invariant tables.

49 Thank you for your attention!

50 Counterfeited Generalization T*(2)
How many patients are below 30?
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000
Ray    54   31000
Steve  56   34000
Tom    60   44000
Vince  65   36000
Counterfeited generalization T*(2):
Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]  dyspepsia
c2     3         [37, 43]  [26k, 33k]  flu
Linda  3         [37, 43]  [26k, 33k]  gastritis
Gary   4         [41, 46]  [20k, 30k]  flu
Mary   4         [41, 46]  [20k, 30k]
Ray    5         [54, 56]  [31k, 34k]
Steve  5         [54, 56]  [31k, 34k]
Tom    6         [60, 65]  [36k, 44k]
Vince  6         [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1


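52 Counterfeited Generalization T*(2)
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  bronchitis
Alice  1     [21, 22]  [12k, 14k]  dyspepsia
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]
Steve  5     [52, 56]  [33k, 34k]
Counterfeited generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  bronchitis
c1     1     [21, 22]  [12k, 14k]  dyspepsia
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]
Ray    5     [54, 56]  [31k, 34k]
Steve  5     [54, 56]  [31k, 34k]
Tom    6     [60, 65]  [36k, 44k]
Vince  6     [60, 65]  [36k, 44k]
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1
An adversary:
Name  Age  Zipcode
Bob   21   12000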

53 Defect of l-diversity
risk(Bob) = 100%
We say that Bob’s tuple is a vulnerable tuple.
An adversary:
Name  Age  Zipcode
Bob   21   12000
2-diverse T*(1):
G.ID  Age       Zipcode     Disease
1     [21, 22]  [12k, 14k]  dyspepsia
1     [21, 22]  [12k, 14k]  bronchitis
……
2-diverse T*(2):
G.ID  Age       Zipcode     Disease
1     [21, 23]  [12k, 25k]  dyspepsia
1     [21, 23]  [12k, 25k]  gastritis
……

