Anonymization of Set-Valued Data via Top-Down, Local Generalization
Yeye He, Jeffrey F. Naughton
University of Wisconsin-Madison


2 Overview
The problem:
– Anonymizing set-valued data presents challenges not seen in relational data
– Previous solutions explored parts but not all of the problem space
Our goals:
– Develop a scalable algorithm for the new variant of the problem
– Perform experiments to explore strengths and weaknesses of the approach

3 What is set-valued data?
“Relational data”
– One sensitive attribute for each tuple
  Zipcode | Gender | Age | … | Medical diagnosis
  53705 | male | 30 | … | flu
  98072 | female | 40 | … | diabetes
“Set-valued data”
– Logically: (personid, {item_1, item_2, …, item_n})
– Multiple sensitive values may appear in one record
  Person ID | Item set
  001 | {milk, sunglasses, viagra}
  002 | {beer, diapers, shampoo}
  003 | {beer, milk, diapers, pregnancy test, diabetes medicine}

4 An attack scenario
The retailer publishes market basket data
The adversary knows Alice has bought milk, beer, and diapers
The adversary infers that Alice has also bought a pregnancy test and diabetes medicine
  PID | Item set
  001 | {milk, sunglasses, viagra}
  002 | {beer, diapers, shampoo}
  003 | {beer, milk, diapers, pregnancy test, diabetes medicine}
Known items {beer, milk, diapers} → matching transaction {beer, milk, diapers, pregnancy test, diabetes medicine}
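
The attack on this slide amounts to a superset lookup. The sketch below is only an illustration: the table is copied from the slide, while the function name `matching_transactions` and the variable names are hypothetical.

```python
# Market basket table from the slide; PIDs and items are copied as shown.
published = {
    "001": {"milk", "sunglasses", "viagra"},
    "002": {"beer", "diapers", "shampoo"},
    "003": {"beer", "milk", "diapers", "pregnancy test", "diabetes medicine"},
}

def matching_transactions(db, known_items):
    """Return the PIDs whose item sets contain every item the adversary knows."""
    return [pid for pid, items in db.items() if known_items <= items]

known = {"milk", "beer", "diapers"}
matches = matching_transactions(published, known)

# A single match re-identifies the record and leaks its remaining items.
leaked = published[matches[0]] - known if len(matches) == 1 else set()
print(matches, sorted(leaked))
```

Only transaction 003 contains all three known items, so the remaining two items are exposed.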

5 Existing work: a priori QI/SI partition
Scenarios where partitioning set elements a priori into Quasi-Identifier (QI) items and Sensitive Items (SI) is possible
– {beer, milk, diapers, pregnancy test, diabetes medicine}
Substantial existing work & good algorithms
– [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07]
But what if a priori partitioning is not possible?
– Individuals may have different privacy requirements
– The adversary may see sensitive items and use them as QI
[Diagram: set-valued data anonymization, with branches “a priori QI/SI partition possible” and “?”]

6 Existing work: no QI/SI partition
Prior work [Terrovitis+08] proposed the k^m-anonymity model
k^m-anonymity
– For any transaction (data record) T, for any subset of m items in T, there are at least k-1 other transactions with the same m items
[Diagram: set-valued data anonymization, with branches “a priori QI/SI partition possible” and “No a priori QI/SI partition”]
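
A brute-force check of this definition might look like the sketch below. It is not the algorithm of [Terrovitis+08]; the function name `is_km_anonymous` is hypothetical, and it assumes "subset of m items" means every non-empty subset of at most m items, and that "transactions with the same m items" means transactions containing that subset.

```python
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """Brute-force k^m-anonymity check (exponential in m; illustration only)."""
    for t in transactions:
        # Assumption: check every non-empty subset of at most m items of t.
        for size in range(1, min(m, len(t)) + 1):
            for subset in combinations(sorted(t), size):
                wanted = set(subset)
                # Support counts t itself, so support >= k means
                # at least k-1 *other* transactions share the subset.
                support = sum(1 for other in transactions if wanted <= other)
                if support < k:
                    return False
    return True

# Market basket table from the attack-scenario slide.
baskets = [
    {"milk", "sunglasses", "viagra"},
    {"beer", "diapers", "shampoo"},
    {"beer", "milk", "diapers", "pregnancy test", "diabetes medicine"},
]
print(is_km_anonymous(baskets, k=2, m=1))  # False: "sunglasses" occurs only once
```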

7 The m in k^m-anonymity [Terrovitis+08]
Attack revisited
– With the data 10^3-anonymized, the adversary sees {beer, milk, diapers}
  – Cannot tell Alice’s transaction from the other 9
– Effective, assuming the adversary never sees more than m = 3 items
The m in k^m-anonymity
– Requires some identified m such that no adversary will ever see more than m items
What about the case where there is no such m?
– The case we consider
[Diagram: under “No a priori QI/SI partition”, two branches: “Has identified m” and “No identified m”]

8 Our model: k-anonymity for set-valued data
A transactional database D is k-anonymous if
– Every transaction (data record) occurs at least k times
Different from k^m-anonymity [Terrovitis+08]
– No limit on m, i.e., valid for any m
– Thus a stronger privacy model

9 k-anonymity subsumes k^m-anonymity [Terrovitis+08]
Every database D that satisfies k-anonymity also satisfies k^m-anonymity
There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity
– Example: 2^3-anonymous but not 2-anonymous
  T1 = {A, B, C}
  T2 = {A, B, C}
  T3 = {A, B}
  (T3 occurs only once, yet every subset of its items also appears in T1 and T2)
[Diagram: set-valued data anonymization, with branches “a priori QI/SI partition possible” and “No QI/SI partition”; the latter covers k^m-anon and k-anon]
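
The k-anonymity condition for set-valued data reduces to counting identical item sets. A minimal sketch, using the three-transaction example from this slide (the function name `is_k_anonymous` is hypothetical):

```python
from collections import Counter

def is_k_anonymous(transactions, k):
    """True if every distinct item set occurs at least k times."""
    counts = Counter(frozenset(t) for t in transactions)
    return all(c >= k for c in counts.values())

# T1, T2, T3 from the slide.
db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}]

print(is_k_anonymous(db, 2))      # False: {A, B} occurs only once
print(is_k_anonymous(db[:2], 2))  # True: {A, B, C} occurs twice
```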

10 Problem statement
Given a transactional database D, find a transformation D’ of D such that:
– D’ satisfies k-anonymity
– The transformation minimizes the information loss between D and D’

11 Hierarchical generalization
Hierarchy:
  All
    Alcohol
      Wine
      Beer
    Health care
      Pregnancy test
      Diaper
Transaction generalization
– T_i: {“Beer”, “Wine”, “Diaper”} → {“Alcohol”, “Health care”}
– Duplicates removed
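
The generalization step can be sketched with a child-to-parent map. The hierarchy below is reconstructed from the slide's flattened diagram (the assignment of leaves to "Alcohol" and "Health care" is an assumption consistent with the example), and the function name `generalize` is hypothetical.

```python
# Child -> parent map for the slide's hierarchy (reconstructed, see lead-in).
PARENT = {
    "Alcohol": "All", "Health care": "All",
    "Wine": "Alcohol", "Beer": "Alcohol",
    "Pregnancy test": "Health care", "Diaper": "Health care",
}

def generalize(transaction, levels=1):
    """Replace each item by its ancestor `levels` steps up the hierarchy.

    Building a set removes the duplicates created when several items
    share the same ancestor.
    """
    result = set()
    for item in transaction:
        node = item
        for _ in range(levels):
            node = PARENT.get(node, node)  # the root maps to itself
        result.add(node)
    return result

print(sorted(generalize({"Beer", "Wine", "Diaper"})))  # ['Alcohol', 'Health care']
```

"Beer" and "Wine" both map to "Alcohol", which is why the three-item transaction becomes a two-item one.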

12 Information loss metric
Normalized Certainty Penalty (NCP) [Xu+06]
– Also used in previous work [Terrovitis+08]
Hierarchy: All → {Alcohol, Health care}; Alcohol → {Wine, Beer}; Health care → {Pregnancy test, Diaper}
Example:
– Generalize “Beer” to “Alcohol”: 2/4 = 0.5 info loss
– Generalize “Beer” to “All”: 4/4 = 1 info loss
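
The two example values follow from dividing the number of leaves under the generalized node by the total number of leaves. The sketch below reproduces just that per-item ratio; the hierarchy layout is reconstructed from the slide, the function names are hypothetical, and the zero penalty for an ungeneralized leaf is an assumption not stated on the slide.

```python
# Parent -> children map for the slide's hierarchy (reconstructed).
CHILDREN = {
    "All": ["Alcohol", "Health care"],
    "Alcohol": ["Wine", "Beer"],
    "Health care": ["Pregnancy test", "Diaper"],
}

def leaves_under(node):
    """Number of leaf items in the subtree rooted at `node`."""
    kids = CHILDREN.get(node, [])
    return 1 if not kids else sum(leaves_under(c) for c in kids)

def ncp(node, root="All"):
    """Per-item penalty for publishing `node` instead of an original leaf."""
    if node not in CHILDREN:
        return 0.0  # assumption: a leaf that is not generalized costs nothing
    return leaves_under(node) / leaves_under(root)

print(ncp("Alcohol"))  # 0.5
print(ncp("All"))      # 1.0
```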

13 Our algorithm: partition-based anonymization
Top-down
– Generalize everything to the root representation
– This yields one initial partition
Divide and conquer
– Choose a node to specialize for each partition
  – Based on information gain heuristics
– Recursively partition the resulting sub-partitions

14 Example: 2-anonymization
  TID | Original data
  T1 | {a1}
  T2 | {a1, a2}
  T3, T4 | {b1, b2}
  T5, T6 | {a1, a2, b2}
  T7 | {a1, a2, b1, b2}

15 Generalize all data to root
Current representation: every transaction T1–T7 becomes {ALL}
One initial partition

16 Initial partition: specialize using ALL → {A, B}
Current representation:
– T1, T2: {A}
– T3, T4: {B}
– T5, T6, T7: {A, B}
Produces three sub-partitions

17 Green partition: specialize using A → {a1, a2}
Attempted representation: T1 → {a1}, T2 → {a1, a2}
Other partitions unchanged: {B}, {A, B}
Specialization violates 2-anonymity, so it is rolled back to {A}

18 Blue partition: specialize using B → {b1, b2}
New representation: T3, T4 → {b1, b2}
Other partitions unchanged: {A}, {A, B}
Specialization OK; reaches leaf level, stop

19 Red partition: specialize using A → {a1, a2}
New representation: T5, T6, T7 → {a1, a2, B}
Other partitions unchanged: {A}, {b1, b2}
A is chosen over B based on the max information gain heuristic

20 Red partition: specialize using B → {b1, b2}
Attempted representation: T5, T6 → {a1, a2, b2}; T7 → {a1, a2, b1, b2}
Other partitions unchanged: {A}, {b1, b2}
Specializing B violates 2-anonymity, so it is rolled back to {a1, a2, B}
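
The walkthrough on slides 14 to 20 can be reproduced with a short recursive sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: all names are hypothetical, T4 and T6 are assumed to share the item sets of T3 and T5, the input is assumed non-empty with at least k transactions, and the information gain heuristic is replaced by a placeholder ordering (most leaves first, then by name).

```python
from collections import defaultdict

# Two-level hierarchy of the running example.
CHILDREN = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def ancestors_or_self(item):
    """Yield the item, then each ancestor up to the root."""
    node = item
    yield node
    while node in PARENT:
        node = PARENT[node]
        yield node

def represent(transaction, cut):
    """Map every item to its ancestor-or-self that lies on the cut."""
    return frozenset(n for item in transaction
                     for n in ancestors_or_self(item) if n in cut)

def leaf_count(node):
    kids = CHILDREN.get(node)
    return 1 if not kids else sum(leaf_count(c) for c in kids)

def anonymize(transactions, k, cut=frozenset({"ALL"})):
    """Return a list of (representation, transactions) partitions."""
    rep = represent(transactions[0], cut)  # shared by the whole partition
    # Placeholder for the information gain heuristic (an assumption).
    candidates = sorted((n for n in rep if n in CHILDREN),
                        key=lambda n: (-leaf_count(n), n))
    for node in candidates:
        new_cut = (cut - {node}) | set(CHILDREN[node])
        groups = defaultdict(list)
        for t in transactions:
            groups[represent(t, new_cut)].append(t)
        if all(len(g) >= k for g in groups.values()):
            result = []
            for g in groups.values():  # recurse on each sub-partition
                result.extend(anonymize(g, k, new_cut))
            return result
        # Otherwise the specialization violates k-anonymity: roll back.
    return [(rep, transactions)]

data = [{"a1"}, {"a1", "a2"}, {"b1", "b2"}, {"b1", "b2"},
        {"a1", "a2", "b2"}, {"a1", "a2", "b2"}, {"a1", "a2", "b1", "b2"}]
for rep, members in anonymize(data, k=2):
    print(sorted(rep), len(members))
```

Each partition keeps its own cut through the hierarchy, which is what makes the recoding local: B stays generalized in one partition while it is specialized to {b1, b2} in another.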

21 Main advantages
Effective (less information loss)
– Even though we impose a stronger privacy criterion
– Local recoding vs. global recoding
Efficient (less execution time)
– Divide and conquer vs. bottom-up (exhaustive) enumeration
– Linear in the input data and the levels of the hierarchy vs. worst-case exponential in previous work

22 Experimental setup: market basket data
Real-world benchmark data
– BMS-WebView-1, BMS-WebView-2, BMS-POS
No accompanying hierarchy data
– Used a synthetic hierarchy (as in the previous work)
Comparing our partition-based algorithm (Partition) with the previous Apriori-Anonymization (AA) [Terrovitis+08]

23 An order of magnitude faster on market basket data

24 Less information loss on market basket data
Why? Local recoding

25 Sensitivity analysis: consistently faster with varied parameters

26 Sensitivity analysis: less information loss in most cases

27 Experimental setup: AOL query log
Viewed from a set-valued perspective
No accompanying hierarchy data, again
– Use an alphabetical hierarchy
– Use the WordNet hierarchy
Compare with earlier work [Adar07]

28 Less information loss than [Adar07] on AOL query log

29 Reasonably efficient on AOL query log
Efficient given the size of the query log (2.2GB)
Information loss not as satisfactory as on market basket data
– Words generalized to “event”, “process”, “thing”…

30 Conclusion
Developed a faster, better information-preserving anonymization algorithm
– For set-valued data with no QI/SI distinction
Performed well on market basket data
– Less satisfying for search log data
Open and important question: stronger privacy models
– What is a good privacy model stronger than k-anonymity for set-valued data with no QI/SI distinction?

