Anonymization of Set-Valued Data via Top-Down, Local Generalization Yeye He Jeffrey F. Naughton University of Wisconsin-Madison 1.

Overview
The problem:
– Anonymizing set-valued data presents challenges not seen in relational data
– Previous solutions explored parts but not all of the problem space
Our goals:
– Develop a scalable algorithm for the new variant of the problem
– Perform experiments to explore strengths and weaknesses of the approach

What is set-valued data?
“Relational data”
– One sensitive attribute for each tuple
  Zipcode  Gender  Age  …  Medical diagnosis
  53705    male    30   …  flu
  98072    female  40   …  diabetes
“Set-valued data”
– Logically: (personid, {item 1, item 2, …, item n})
– Multiple sensitive values possible in one record
  Person ID  Item set
  001        {milk, sunglasses, viagra}
  002        {beer, diapers, shampoo}
  003        {beer, milk, diapers, pregnancy test, diabetes medicine}

An attack scenario
Retailer publishes market basket data
The adversary knows Alice has bought milk, beer, and diapers
The adversary infers Alice has also bought a pregnancy test and diabetes medicine
  PID  Item set
  001  {milk, sunglasses, viagra}
  002  {beer, diapers, shampoo}
  003  {beer, milk, diapers, pregnancy test, diabetes medicine}
Known items {beer, milk, diapers} → matching record {beer, milk, diapers, pregnancy test, diabetes medicine}
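The attack amounts to a subset lookup over the published records. A minimal Python sketch using the three baskets from the slide; the function name `matching_transactions` is illustrative and not from the talk:

```python
# Published market-basket data from the slide (PID -> item set).
published = {
    "001": {"milk", "sunglasses", "viagra"},
    "002": {"beer", "diapers", "shampoo"},
    "003": {"beer", "milk", "diapers", "pregnancy test", "diabetes medicine"},
}

def matching_transactions(db, known_items):
    """Return the PIDs whose item set contains every item the adversary knows."""
    return [pid for pid, items in db.items() if known_items <= items]

known = {"milk", "beer", "diapers"}
candidates = matching_transactions(published, known)

# If exactly one record matches, the remaining items of that record are disclosed.
inferred = published[candidates[0]] - known if len(candidates) == 1 else set()
```

Only record 003 contains all three known items, so `inferred` holds the two remaining items of that record.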

Existing work: a priori QI/SI partition
Scenarios where a priori partitioning of set elements into Quasi-Identifier (QI) items and Sensitive Items (SI) is possible
– {beer, milk, diapers, pregnancy test, diabetes medicine}
Substantial existing work & good algorithms
– [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07]
But what if a priori partitioning is not possible?
– Individuals may have different privacy requirements
– The adversary may see sensitive items and use them as QI
[Diagram: Set-valued data anonymization → “a priori QI/SI partition possible” | “?”]

Existing work: no QI/SI partition
Prior work [Terrovitis+08] proposed the k^m-anonymity model
k^m-anonymity
– For any transaction (data record) T and any subset of m items in T, there are at least k-1 other transactions that contain the same m items
[Diagram: Set-valued data anonymization → “a priori QI/SI partition possible” | “No a priori QI/SI partition”]
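The definition can be checked by brute force on a small database. A minimal sketch, assuming transactions are Python sets and reading the definition as "every combination of at most m items that occurs in some transaction is contained in at least k transactions". The name `is_km_anonymous` and the toy data are illustrative; the enumeration grows quickly with m, so this only illustrates the definition and is not the algorithm of [Terrovitis+08].

```python
from itertools import combinations

def is_km_anonymous(db, k, m):
    """Brute-force check: each itemset of size <= m found in a transaction has support >= k."""
    for t in db:
        for size in range(1, min(m, len(t)) + 1):
            for combo in combinations(sorted(t), size):
                # Support = number of transactions containing all items of the combination.
                support = sum(1 for other in db if set(combo) <= other)
                if support < k:
                    return False
    return True

# Hypothetical toy database, not taken from the slides.
db = [
    {"beer", "milk"},
    {"beer", "milk"},
    {"beer", "diapers"},
    {"beer", "diapers"},
]
```

Here `is_km_anonymous(db, 2, 2)` holds, while `is_km_anonymous(db, 3, 2)` fails because `milk` appears in only two transactions.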

The m in k^m-anonymity [Terrovitis+08]
Attack revisited
– The data is 10^3-anonymized; the adversary sees {beer, milk, diapers}
    Cannot tell Alice’s transaction from the other 9
– Effective assuming the adversary never sees more than m=3 items
m in k^m-anonymity
– Requires some identified m s.t. no adversary will ever see more than m items
What about the case where there is no such m?
– The case we consider
[Diagram: Set-valued data anonymization → “a priori QI/SI partition possible” | “No a priori QI/SI partition” → “Has identified m” | “No identified m”]

Our model: k-anonymity for set-valued data
Transactional database D is k-anonymous if
– Every transaction (data record) occurs at least k times
Different from k^m-anonymity [Terrovitis+08]
– No limit on m, i.e., valid for any m
– Thus a stronger privacy model
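The k-anonymity condition reduces to counting identical item sets. A minimal sketch, assuming each transaction is a set of hashable items; `is_k_anonymous` and the sample data are illustrative names, not from the talk:

```python
from collections import Counter

def is_k_anonymous(db, k):
    """True if every distinct transaction (as a set of items) occurs at least k times."""
    counts = Counter(frozenset(t) for t in db)
    return all(c >= k for c in counts.values())

# Hypothetical data: the second database has one record that occurs only once.
ok_db = [{"beer", "milk"}, {"beer", "milk"}, {"diapers"}, {"diapers"}]
bad_db = [{"beer", "milk"}, {"beer", "milk"}, {"diapers"}]
```

`is_k_anonymous(ok_db, 2)` holds, while `is_k_anonymous(bad_db, 2)` fails on the single `{"diapers"}` record.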

k-anonymity subsumes k^m-anonymity [Terrovitis+08]
Every database D that satisfies k-anonymity also satisfies k^m-anonymity
There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity
– Example: 2^3-anonymous but not 2-anonymous
    T1 = {A, B, C}
    T2 = {A, B, C}
    T3 = {A, B}
[Diagram: Set-valued data anonymization → “a priori QI/SI partition possible” | “No QI/SI partition” → “k^m-anon” | “k-anon”]
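The three-transaction example can be verified directly. The sketch below computes two numbers for it: the smallest multiplicity of any transaction (the largest k for which the database is k-anonymous) and the smallest support of any itemset drawn from a transaction, over all sizes m. The variable names `k_level` and `km_level` are illustrative.

```python
from collections import Counter
from itertools import combinations

# The example from the slide: T1 = T2 = {A, B, C}, T3 = {A, B}.
db = [frozenset("ABC"), frozenset("ABC"), frozenset("AB")]

# Smallest number of identical copies of any transaction.
k_level = min(Counter(db).values())

# Smallest support of any non-empty itemset contained in some transaction,
# taken over every itemset size m at once.
km_level = min(
    sum(1 for t in db if set(combo) <= t)
    for t0 in db
    for m in range(1, len(t0) + 1)
    for combo in combinations(sorted(t0), m)
)
```

Every itemset is contained in at least two transactions (`km_level` is 2), yet T3 occurs only once (`k_level` is 1), which matches the slide's claim.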

Problem statement
Given a transactional database D, find a transformation D’ of D s.t.:
– D’ satisfies k-anonymity
– The transformation minimizes the information loss between D and D’

Hierarchical generalization
  All
    Alcohol
      Wine
      Beer
    Health care
      Pregnancy test
      Diaper
Transaction generalization
– T_i: {“Beer”, “Wine”, “Diaper”} → {“Alcohol”, “Health care”}
– Duplicates removed
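Generalizing a transaction means replacing each item by an ancestor in the hierarchy and letting the set collapse duplicates. A minimal sketch over the slide's hierarchy; the `PARENT` map and the `generalize` function with its `levels` parameter are illustrative choices, not the authors' code:

```python
# Child -> parent links for the hierarchy shown on the slide.
PARENT = {
    "Beer": "Alcohol", "Wine": "Alcohol",
    "Diaper": "Health care", "Pregnancy test": "Health care",
    "Alcohol": "All", "Health care": "All",
}

def generalize(transaction, levels=1):
    """Replace each item by its ancestor `levels` steps up; the set removes duplicates."""
    result = set()
    for item in transaction:
        for _ in range(levels):
            item = PARENT.get(item, item)  # the root has no parent and stays put
        result.add(item)
    return result

generalized = generalize({"Beer", "Wine", "Diaper"})
```

“Beer” and “Wine” both map to “Alcohol”, so the three-item transaction shrinks to the two-item set from the slide.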

Information loss metric
Normalized Certainty Penalty (NCP) [Xu+06]
– Also used in previous work [Terrovitis+08]
  All
    Alcohol
      Wine
      Beer
    Health care
      Pregnancy test
      Diaper
Example:
– Generalize “Beer” to “Alcohol”: 2/4 = 0.5 info loss
– Generalize “Beer” to “All”: 4/4 = 1 info loss
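The two example values are consistent with charging a generalized item the fraction of leaves that its new node covers. A minimal sketch under that reading; the zero penalty for an ungeneralized leaf is an assumption added here, and `leaf_count` and `ncp` are illustrative names:

```python
# Parent -> children links for the hierarchy shown on the slide.
CHILDREN = {
    "All": ["Alcohol", "Health care"],
    "Alcohol": ["Wine", "Beer"],
    "Health care": ["Pregnancy test", "Diaper"],
}

def leaf_count(node):
    """Number of leaf items in the subtree rooted at `node`."""
    kids = CHILDREN.get(node)
    if not kids:
        return 1
    return sum(leaf_count(child) for child in kids)

def ncp(node, root="All"):
    """Per-item penalty: leaves covered by `node` divided by all leaves.

    Assumption: an item left at leaf level incurs no loss.
    """
    if node not in CHILDREN:
        return 0.0
    return leaf_count(node) / leaf_count(root)
```

With four leaves in total, `ncp("Alcohol")` is 0.5 and `ncp("All")` is 1.0, matching the slide.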

Our algorithm: Partition-based anonymization
Top-down
– Generalize everything to the root representation
– This results in one initial partition
Divide and conquer
– Choose a node to specialize for each partition
    Based on information gain heuristics
– Recursively partition the resulting sub-partitions

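Example: 2-anonymization
  TID  Original Data
  T1   {a1}
  T2   {a1, a2}
  T3   {b1, b2}
  T4   {b1, b2}
  T5   {a1, a2, b2}
  T6   {a1, a2, b2}
  T7   {a1, a2, b1, b2}
(T4 and T6 appear as merged cells in the slide and are read here as repeating the row above; the values of the “2-anonymization” column did not survive in the transcript.)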

Generalize all data to root
  TID  Original Data      Current Representation
  T1   {a1}               {ALL}
  T2   {a1, a2}           {ALL}
  T3   {b1, b2}           {ALL}
  T4   {b1, b2}           {ALL}
  T5   {a1, a2, b2}       {ALL}
  T6   {a1, a2, b2}       {ALL}
  T7   {a1, a2, b1, b2}   {ALL}
One initial partition

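Initial partition: specialize using ALL → {A, B}
  TID  Original Data      Current Representation
  T1   {a1}               {A}     (green partition)
  T2   {a1, a2}           {A}     (green partition)
  T3   {b1, b2}           {B}     (blue partition)
  T4   {b1, b2}           {B}     (blue partition)
  T5   {a1, a2, b2}       {A, B}  (red partition)
  T6   {a1, a2, b2}       {A, B}  (red partition)
  T7   {a1, a2, b1, b2}   {A, B}  (red partition)
Produces three sub-partitions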

Green partition: specialize using A → {a1, a2}
  TID  Original Data      Current Representation
  T1   {a1}               {a1}, rolled back to {A}
  T2   {a1, a2}           {a1, a2}, rolled back to {A}
  T3   {b1, b2}           {B}
  T4   {b1, b2}           {B}
  T5   {a1, a2, b2}       {A, B}
  T6   {a1, a2, b2}       {A, B}
  T7   {a1, a2, b1, b2}   {A, B}
The specialization violates 2-anonymity ({a1} and {a1, a2} each occur once), so it is rolled back

Blue partition: specialize using B → {b1, b2}
  TID  Original Data      Current Representation
  T1   {a1}               {A}
  T2   {a1, a2}           {A}
  T3   {b1, b2}           {b1, b2}
  T4   {b1, b2}           {b1, b2}
  T5   {a1, a2, b2}       {A, B}
  T6   {a1, a2, b2}       {A, B}
  T7   {a1, a2, b1, b2}   {A, B}
Specialization ok; the partition reaches leaf level, so stop

Red partition: specialize using A → {a1, a2}
  TID  Original Data      Current Representation
  T1   {a1}               {A}
  T2   {a1, a2}           {A}
  T3   {b1, b2}           {b1, b2}
  T4   {b1, b2}           {b1, b2}
  T5   {a1, a2, b2}       {a1, a2, B}
  T6   {a1, a2, b2}       {a1, a2, B}
  T7   {a1, a2, b1, b2}   {a1, a2, B}
A is chosen over B based on the max info gain heuristic

Red partition: specialize using B → {b1, b2}
  TID  Original Data      Current Representation
  T1   {a1}               {A}
  T2   {a1, a2}           {A}
  T3   {b1, b2}           {b1, b2}
  T4   {b1, b2}           {b1, b2}
  T5   {a1, a2, b2}       {a1, a2, b2}, rolled back to {a1, a2, B}
  T6   {a1, a2, b2}       {a1, a2, b2}, rolled back to {a1, a2, B}
  T7   {a1, a2, b1, b2}   {a1, a2, b1, b2}, rolled back to {a1, a2, B}
Specializing B violates 2-anonymity ({a1, a2, b1, b2} occurs once), so it is rolled back
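The walkthrough can be reproduced with a short recursive sketch. It is not the authors' implementation. Several things are assumptions made here: T4 and T6 are taken to repeat T3 and T5, candidate nodes are tried in sorted order as a stand-in for the information gain heuristic, and a specialization is rolled back whenever any resulting sub-partition has fewer than k records. The names `represent` and `anonymize` are illustrative.

```python
from collections import defaultdict

# Toy hierarchy from the example: ALL -> {A, B}, A -> {a1, a2}, B -> {b1, b2}.
CHILDREN = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def represent(transaction, cut):
    """Map each item to its ancestor (or itself) in the cut; duplicates collapse."""
    out = set()
    for item in transaction:
        node = item
        while node not in cut:
            node = PARENT[node]
        out.add(node)
    return frozenset(out)

def anonymize(partition, cut, k, result):
    """Recursively specialize one partition (dict TID -> original item set).

    All records in a partition share the same representation under `cut`.
    """
    if not partition:
        return
    rep = represent(next(iter(partition.values())), cut)
    # Candidates: non-leaf nodes in this partition's current representation.
    # Sorted order is a stand-in for the information gain heuristic.
    for node in sorted(n for n in rep if n in CHILDREN):
        new_cut = (cut - {node}) | set(CHILDREN[node])
        groups = defaultdict(dict)
        for tid, items in partition.items():
            groups[represent(items, new_cut)][tid] = items
        if all(len(g) >= k for g in groups.values()):
            for g in groups.values():
                anonymize(g, new_cut, k, result)
            return
        # Some sub-partition is smaller than k: roll back and try the next node.
    for tid in partition:
        result[tid] = rep

data = {
    "T1": {"a1"}, "T2": {"a1", "a2"},
    "T3": {"b1", "b2"}, "T4": {"b1", "b2"},  # T4 assumed equal to T3
    "T5": {"a1", "a2", "b2"}, "T6": {"a1", "a2", "b2"},  # T6 assumed equal to T5
    "T7": {"a1", "a2", "b1", "b2"},
}
result = {}
anonymize(data, frozenset({"ALL"}), 2, result)
```

On this data the sketch ends in the same state as the last slide of the walkthrough: T1 and T2 stay at {A}, T3 and T4 reach {b1, b2}, and T5 to T7 stop at {a1, a2, B}. Because each partition keeps its own cut, different partitions can sit at different levels of the hierarchy, which is the local recoding the talk refers to.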

Main advantages
Effective (less information loss)
– Even though we impose a stronger privacy criterion
– Local recoding vs. global recoding
Efficient (less execution time)
– Divide and conquer vs. bottom-up (exhaustive) enumeration
– Linear in the input data & level of the hierarchy vs. worst-case exponential in previous work

Experimental setup: market basket data
Real-world benchmark data
– BMS-WebView-1, BMS-WebView-2, BMS-POS
No accompanying hierarchy data
– Used synthetic hierarchy (as in the previous work)
Comparing our partition-based algorithm (Partition) with the previous Apriori-Anonymization (AA) [Terrovitis+08]

An order of magnitude faster on market basket data

Less information loss on market basket data
Why? Local recoding

Sensitivity analysis: consistently faster with varied parameters

Sensitivity analysis: less information loss in most cases

Experimental setup: AOL query log
From a set-valued perspective
No accompanying hierarchy data, again
– Use alphabetical hierarchy
– Use WordNet hierarchy
Compare with earlier work [Adar07]

Less information loss than [Adar07] on AOL query log

Reasonably efficient on AOL query log
Efficient given the size of the query log (2.2GB)
Information loss not as satisfactory as in market basket data
– Words generalized to “event”, “process”, “thing”…

Conclusion
Developed a faster, better information-preserving anonymization algorithm
– For set-valued data with no QI/SI distinction
Performed well on market basket data
– Less satisfying for search log data
Open and important question: stronger privacy models
– What is a good privacy model stronger than k-anonymity for set-valued data with no QI/SI distinction?
