Privacy-preserving Anonymization of Set Value Data
Manolis Terrovitis, Institute for the Management of Information Systems (IMIS), RC Athena
Nikos Mamoulis, University of Hong Kong (HKU)
Panos Kalnis, King Abdullah University of Science and Technology (KAUST)
2 Motivation
Example: Helen buys Beer, 0% Milk, and a Pregnancy test.
The attacker can see up to m items of a transaction.
These can be any m items.
No distinction is made between sensitive and non-sensitive items.
3 Motivation (cont.)
Database:
  Helen: Beer, 0% Milk, Pregnancy test
  John: Cola, Cheese
  Tom: 2% Milk, Coffee
  ….
  Mary: Wine, Beer, Full-fat Milk
Published:
  t1: Beer, 0% Milk, Pregnancy test
  t2: Cola, Cheese
  t3: 2% Milk, Coffee
  ….
  tn: Wine, Beer, Full-fat Milk
Attacker: find all transactions that contain Beer & 0% Milk.
Published, with the milk items generalized to Milk:
  t1: Beer, Milk, Pregnancy test
  t2: Cola, Cheese
  t3: Milk, Coffee
  ….
  tn: Wine, Beer, Milk
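The attack on this slide can be sketched in a few lines of Python. This is only an illustration: it uses the four transactions that are spelled out on the slide (the elided ones are ignored), and the helper name `matching` is hypothetical.

```python
# The four transactions listed on the slide, as published (names removed).
published = [
    {"Beer", "0% Milk", "Pregnancy test"},   # t1
    {"Cola", "Cheese"},                      # t2
    {"2% Milk", "Coffee"},                   # t3
    {"Wine", "Beer", "Full-fat Milk"},       # tn
]

# The same transactions with every milk item generalized to "Milk".
generalized = [
    {"Beer", "Milk", "Pregnancy test"},
    {"Cola", "Cheese"},
    {"Milk", "Coffee"},
    {"Wine", "Beer", "Milk"},
]

def matching(transactions, known_items):
    """Return the indices of transactions that contain every known item."""
    known = set(known_items)
    return [i for i, t in enumerate(transactions) if known <= t]

# The attacker knows that Helen bought Beer and 0% Milk.
hits_raw = matching(published, {"Beer", "0% Milk"})
# After generalization the best available query is Beer and Milk.
hits_gen = matching(generalized, {"Beer", "Milk"})
print(hits_raw, hits_gen)
```

On the raw data the query isolates a single transaction; on the generalized data it matches two, so Helen's basket is no longer singled out.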
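4 k^m-anonymity
Set of items: I
Transaction: t
Database: D
Query terms
k^m-anonymity: a database is k^m-anonymous if an attacker who knows up to m items of a transaction cannot use them to identify fewer than k transactions.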
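A minimal check of the property, assuming the usual reading of k^m-anonymity: every combination of at most m items that occurs in some transaction must occur in at least k transactions. The function name `is_km_anonymous` and the toy data are assumptions for illustration, and the brute-force counting is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """Brute-force test: count every itemset of size 1..m that appears in
    the data and verify that none is supported by fewer than k transactions."""
    for size in range(1, m + 1):
        counts = Counter()
        for t in transactions:
            # sorted() gives each itemset one canonical tuple form
            counts.update(combinations(sorted(t), size))
        if any(c < k for c in counts.values()):
            return False
    return True

# Hypothetical toy database.
toy = [{"a", "b"}, {"a", "b"}, {"a"}]
print(is_km_anonymous(toy, k=2, m=2))
print(is_km_anonymous(toy, k=3, m=2))
```

With k=2 every item and the pair (a, b) appear at least twice, so the check passes; with k=3 the item b (two occurrences) breaks the property.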
5 Related Work: k-Anonymity [Swe02]
(a) Microdata: columns Age, ZipCode (quasi-identifier), Disease. Disease values: Flu, AIDS, Cancer, Gastritis, Dyspepsia, Bronchitis
(b) 2-anonymous microdata: columns Age, ZipCode, Disease. Disease values: Flu, AIDS, Cancer, Gastritis, Dyspepsia, Bronchitis
NOT suitable for high-dimensional data
[Swe02] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5).
6 Related Work: L-diversity in Transactions
Requires knowledge of (non)-sensitive attributes
[GTK08] G. Ghinita, Y. Tao, P. Kalnis. On the Anonymization of Sparse High-Dimensional Data. ICDE, 2008.
7 Our Approach: Employs Generalization
Generalization hierarchy
Information loss
Example parameters: k=2, m=2
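As a rough sketch of what applying a generalization means, the snippet below replaces items by a node higher in a hierarchy. The hierarchy fragment (three milk types under a "Milk" node) is an assumption modeled on the motivating example, and `generalize` is a hypothetical helper, not the authors' code.

```python
# Hypothetical fragment of a generalization hierarchy: child -> parent.
PARENT = {
    "0% Milk": "Milk",
    "2% Milk": "Milk",
    "Full-fat Milk": "Milk",
}

def generalize(transaction, gen_map):
    """Replace every item by its generalized form. Items that are not in
    gen_map stay as they are; duplicates collapse because the result is a set."""
    return {gen_map.get(item, item) for item in transaction}

# Global recoding: every milk type is mapped to "Milk" in all transactions.
gen_map = dict(PARENT)

t1 = generalize({"Beer", "0% Milk", "Pregnancy test"}, gen_map)
t2 = generalize({"0% Milk", "2% Milk"}, gen_map)
print(t1, t2)
```

The second call shows where information loss comes from: two distinct items collapse into one generalized item, and the published data can no longer tell them apart.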
8 Lattice of Generalizations
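One way to picture the lattice is to enumerate every generalization as a cut of the hierarchy: a set of nodes such that each leaf item is represented by exactly one node. The sketch below does this for a hypothetical two-level hierarchy (`ALL` over `A`, `B`, each with two leaves); both the hierarchy and the function name `cuts` are assumptions for illustration.

```python
from itertools import product

# Hypothetical hierarchy: parent -> children.
CHILDREN = {
    "ALL": ["A", "B"],
    "A": ["a1", "a2"],
    "B": ["b1", "b2"],
}

def cuts(node):
    """Return all cuts of the subtree rooted at node. A cut is either the
    node itself or the union of one cut from each child subtree."""
    result = [frozenset([node])]
    kids = CHILDREN.get(node, [])
    if kids:
        for combo in product(*(cuts(child) for child in kids)):
            result.append(frozenset().union(*combo))
    return result

lattice_nodes = cuts("ALL")
for cut in sorted(lattice_nodes, key=lambda c: (len(c), sorted(c))):
    print(sorted(cut))
```

For this small hierarchy there are five candidate generalizations, from no generalization at all (the four leaves) up to everything mapped to `ALL`. The number of cuts grows quickly with the size of the hierarchy, which is why exhaustive search over the lattice becomes expensive.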
9 Optimal Algorithm
10 Count Tree
All generalized forms of the paths reside in the tree.
We can easily find which anonymizations are needed.
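A simplified count tree can be sketched as a trie over itemsets. In this sketch each transaction is extended with the ancestors of its items, so generalized forms of every path are counted too. The hierarchy, the lexicographic item order, and all names (`Node`, `build_count_tree`, `support`) are assumptions for illustration; the slides do not specify these details.

```python
from itertools import combinations

# Hypothetical hierarchy: child -> parent.
PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

class Node:
    def __init__(self):
        self.count = 0       # support of the itemset spelled by the path
        self.children = {}   # item -> Node

def ancestors(item):
    """All proper ancestors of an item in the hierarchy."""
    out = []
    while item in PARENT:
        item = PARENT[item]
        out.append(item)
    return out

def related(x, y):
    """True if one item is a generalization of the other."""
    return x in ancestors(y) or y in ancestors(x)

def build_count_tree(transactions, m):
    root = Node()
    for t in transactions:
        # Extend the transaction with every ancestor of its items.
        extended = sorted(set(t) | {a for i in t for a in ancestors(i)})
        for size in range(1, m + 1):
            for combo in combinations(extended, size):
                # Skip itemsets that mix an item with its own ancestor.
                if any(related(x, y) for x, y in combinations(combo, 2)):
                    continue
                node = root
                for item in combo:
                    node = node.children.setdefault(item, Node())
                node.count += 1
    return root

def support(root, itemset):
    """Look up the count stored for an itemset (0 if the path is absent)."""
    node = root
    for item in sorted(itemset):
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count

tree = build_count_tree([{"a1", "b1"}, {"a2", "b1"}, {"a1", "b2"}], m=2)
print(support(tree, {"a1", "b1"}), support(tree, {"A", "b1"}))
```

The pair (a1, b1) has support 1, while its generalized form (A, b1) has support 2, so the tree directly shows which generalization would lift a path to a given k.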
11 Apriori-based Anonymization
Global optimal vs. local optimal solution for each path
We examine the paths by size (Apriori principle)
Paths with invalid nodes are skipped
12 Apriori-based Anonymization
1. Initialize gen_map
2. For i := 1 to m do
   1. For all t ∈ D do
      1. Extend t according to gen_map
      2. Add all i-subsets of extended t to count-tree
   2. Check all paths in count-tree and update gen_map
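The loop above can be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: it uses a flat `Counter` instead of a count tree, recounts after every change, and repairs a violating itemset by greedily lifting one of its items one level up rather than choosing the generalization with the lowest information loss. The hierarchy and all names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical hierarchy: child -> parent (the root "ALL" has no parent).
PARENT = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "A": "ALL", "B": "ALL"}

def ancestors(item):
    """The item itself followed by its ancestors up to the root."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def apriori_anonymize(transactions, k, m):
    leaves = {i for t in transactions for i in t}
    gen_map = {leaf: leaf for leaf in leaves}        # 1. initialize gen_map
    for size in range(1, m + 1):                     # 2. for i := 1 to m
        while True:
            counts = Counter()
            for t in transactions:
                g = sorted({gen_map[i] for i in t})  # apply gen_map to t
                counts.update(combinations(g, size)) # count all i-subsets
            bad = [s for s, c in counts.items() if c < k]
            if not bad:
                break                                # this size is safe
            # Greedy repair (assumption): lift the first liftable item.
            target = next((x for x in bad[0] if x in PARENT), None)
            if target is None:
                raise ValueError("hierarchy cannot satisfy this k and m")
            parent = PARENT[target]
            for leaf in leaves:                      # global recoding
                if parent in ancestors(leaf):
                    gen_map[leaf] = parent
    return gen_map

D = [{"a1", "b1"}, {"a2", "b1"}, {"a1", "b2"}, {"a2", "b2"}]
result = apriori_anonymize(D, k=2, m=2)
print(sorted(result.items()))
```

In this toy run every single item already appears twice, so nothing changes for i = 1. For i = 2 the pair (a1, b1) appears only once, and lifting a1 and a2 to A is enough to make every pair appear twice. Because generalization can only merge items and raise counts, itemset sizes checked in earlier iterations stay valid.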
13 Small Datasets (2-15K, BMS-WebView2) |I|=40..60, k=100, m=3
14 Small Datasets (BMS-WebView2) |D|=10K, k=100, m=1..4
15 Apriori Anonymization for Large Datasets
k=5, m=3
Running times shown: 500 sec, 10 sec, 100 sec
Surviving |D| and |I| values: 515K, 497, 77K, 3340
16 Points to Remember
Anonymization of Transactional Data
  Attacker knows m items
  Any m items can be the quasi-identifier
Global recoding method
  Optimal solution: too slow
  Apriori Anonymization: fast and low information loss
Extensions (VLDBJ 2010)
  Local recoding (sort by Gray order and partition)
  Global recoding (by partitioning the data domain)