1
On the Anonymization of Sparse High-Dimensional Data
Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)
(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
(2) Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk
2
Publishing Transaction Data
Retail chain-owned shopping cart data
Infer consumer spending patterns
Correlations among purchased items, e.g., 90% of cereal buyers also buy milk
What about privacy?
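A correlation such as "90% of cereal buyers also buy milk" is the confidence of an association rule. The sketch below shows one way to compute it; the function name `confidence` and the toy `carts` data are illustrative assumptions, not taken from the slides.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


# Toy shopping carts (hypothetical data): each transaction is a set of items.
carts = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal", "eggs"},
    {"milk", "bread"},
]

# Two of the three cereal buyers also bought milk.
print(confidence(carts, "cereal", "milk"))
```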
3
Privacy Threat
[Figure: quasi-identifying items and sensitive items]
4
Privacy Paradigm
ℓ-diversity: prevent association between quasi-identifier and sensitive attributes (SA)
Create groups of transactions such that the frequency of an SA value in a group is < 1/p
Objective: enforce privacy and preserve correlations among items
Challenge: high data dimensionality
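The group condition can be checked mechanically. The following is a minimal sketch, assuming transactions are sets of items and that the bound is inclusive (frequency at most 1/p); the slide writes "< 1/p", so the inclusive reading is an assumption, as is the function name `group_satisfies_privacy`.

```python
from collections import Counter


def group_satisfies_privacy(group, sensitive_items, p):
    """Return True if no sensitive item occurs in more than a 1/p fraction of the group.

    Assumption: the bound is treated as inclusive (count / |group| <= 1/p).
    """
    counts = Counter(
        item for transaction in group for item in transaction if item in sensitive_items
    )
    # count / len(group) <= 1/p, written without division to avoid rounding issues
    return all(count * p <= len(group) for count in counts.values())


# Hypothetical group of three transactions; "pregnancy test" is the sensitive item.
group = [{"milk", "cereal"}, {"milk", "pregnancy test"}, {"bread"}]
print(group_satisfies_privacy(group, {"pregnancy test"}, p=3))
```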
5
Data Re-organization
Band Matrix Organization
PRESERVES CORRELATIONS!
6
Published Data
Summary of Sensitive Items
7
Contributions
Novel data representation: preserves correlation among items
Efficient heuristic for group formation: time linear in the data size
Supports multiple sensitive items
8
State-of-the-art: Mondrian [FWR06]
Generalization-based
Data-space partitioning similar to k-d trees
Split recursively as long as the privacy condition still holds
Constrained global recoding
[Figure: partitioning of the Age x Weight data space, k = 2]
GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.
9
State-of-the-art: Anatomy [XT06]
Permutation-based method
Discloses exact QID values

Disease: Ulcer(1), Pneumonia(1), Flu(1), Dyspepsia(1), Gastritis(1), Dyspepsia(1)

Age  ZipCode
42   52000
47   43000
51   32000
62   41000
55   27000
67   55000

Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia

"Anatomized" table
|G|! permutations
RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS
[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
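The permutation-based idea can be illustrated with a short sketch: exact QID values are published next to a group id, and sensitive values are published only as per-group counts. The grouping below simply cuts the table into consecutive blocks of `group_size` rows, which is an assumption for illustration; it does not check any diversity condition and is not the grouping procedure of [XT06].

```python
from collections import Counter


def anatomize(records, group_size):
    """Split (age, zipcode, disease) records into a QID table and a sensitive summary.

    Assumption: groups are consecutive blocks of `group_size` rows.
    """
    qid_table, sensitive_table = [], []
    for start in range(0, len(records), group_size):
        group = records[start:start + group_size]
        gid = start // group_size + 1
        # Exact quasi-identifier values are published with the group id only.
        for age, zipcode, _ in group:
            qid_table.append((age, zipcode, gid))
        # Sensitive values are published as counts per group.
        for disease, count in Counter(d for _, _, d in group).items():
            sensitive_table.append((gid, disease, count))
    return qid_table, sensitive_table


records = [
    (42, 52000, "Ulcer"), (47, 43000, "Pneumonia"), (51, 32000, "Flu"),
    (55, 27000, "Gastritis"), (62, 41000, "Dyspepsia"), (67, 55000, "Dyspepsia"),
]
qid_table, sensitive_table = anatomize(records, group_size=3)
for row in sensitive_table:
    print(row)
```

Within a group, any assignment of the listed sensitive values to the listed QID rows is consistent with the published tables, which is where the |G|! permutations come from.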
10
Band Matrix Representation
Bandwidth = U + L + 1 (U: upper bandwidth, L: lower bandwidth)
Minimizing bandwidth is NP-hard
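As a small worked example of the bandwidth formula, the sketch below computes U, L and U + L + 1 for a dense 0/1 matrix given as a list of rows. The function name `bandwidth` and the sample matrix are illustrative assumptions.

```python
def bandwidth(matrix):
    """Return U + L + 1, where U and L are the largest distances of a non-zero
    entry above and below the main diagonal, respectively."""
    upper = lower = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                upper = max(upper, j - i)  # entry above the diagonal
                lower = max(lower, i - j)  # entry below the diagonal
    return upper + lower + 1


m = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]
# U = 1 (entries at (0,1) and (1,2)), L = 2 (entry at (2,0)), so bandwidth = 4.
print(bandwidth(m))
```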
11
Reverse Cuthill-McKee (RCM)
Heuristic for bandwidth minimization
Solves the corresponding graph labeling problem
Permutes rows and columns
Complexity: N * D * log D
N = matrix rows (# transactions)
D = maximum degree of any vertex
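The graph labeling view can be sketched in a few lines: do a breadth-first traversal from a low-degree vertex, visit neighbours in order of increasing degree, and reverse the resulting order. This is a textbook-style sketch on an undirected graph given as an adjacency dictionary, not the implementation used in the talk; the names `reverse_cuthill_mckee` and `max_label_gap` and the sample graph are assumptions.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """Return a vertex ordering for an undirected graph {vertex: [neighbours]}."""
    degree = {v: len(ns) for v, ns in adj.items()}
    visited, order = set(), []
    # Start each connected component from its lowest-degree unvisited vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Sorting up to D neighbours costs O(D log D) per vertex.
            for u in sorted(adj[v], key=lambda u: (degree[u], u)):
                if u not in visited:
                    visited.add(u)
                    queue.append(u)
    return order[::-1]


def max_label_gap(adj, order):
    """Largest distance between the new labels of two adjacent vertices."""
    position = {v: i for i, v in enumerate(order)}
    return max(abs(position[v] - position[u]) for v in adj for u in adj[v])


# A path 0-3-1-4-2 whose original labels scatter its edges far from the diagonal.
graph = {0: [3], 1: [3, 4], 2: [4], 3: [0, 1], 4: [1, 2]}
rcm_order = reverse_cuthill_mckee(graph)
print(rcm_order)
print(max_label_gap(graph, sorted(graph)), "->", max_label_gap(graph, rcm_order))
```

For real workloads, SciPy ships a sparse-matrix version as `scipy.sparse.csgraph.reverse_cuthill_mckee`.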
12
Group Formation
Correlation-aware Anonymization of High-Dimensional Data (CAHD)
Use the order given by RCM
Consecutive transactions are highly correlated
O(pN) complexity
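To make the role of the RCM order concrete, here is a deliberately simplified greedy grouping sketch. It is not the CAHD heuristic itself: it assumes each transaction carries at most one sensitive item, pairs every sensitive transaction with its p - 1 nearest non-sensitive neighbours in the band order, treats the privacy bound as inclusive (frequency at most 1/p), and returns ungrouped transactions separately without checking them. The names `form_groups` and `neighbours_by_distance` are hypothetical.

```python
def neighbours_by_distance(i, n):
    """Yield positions around i in order of increasing distance."""
    for d in range(1, n):
        if i - d >= 0:
            yield i - d
        if i + d < n:
            yield i + d


def form_groups(sensitive, p):
    """sensitive[i] is the sensitive item of the i-th transaction in band order, or None.

    Returns (groups, leftover): groups of p positions with exactly one
    sensitive transaction each, plus the positions left ungrouped.
    """
    n = len(sensitive)
    assigned = [False] * n
    groups = []
    for i in range(n):
        if sensitive[i] is None or assigned[i]:
            continue
        picked = []
        for j in neighbours_by_distance(i, n):
            if len(picked) == p - 1:
                break
            if not assigned[j] and sensitive[j] is None:
                picked.append(j)
        if len(picked) < p - 1:
            break  # not enough non-sensitive transactions left (simplification)
        for j in picked + [i]:
            assigned[j] = True
        groups.append(sorted(picked + [i]))
    leftover = [j for j in range(n) if not assigned[j]]
    return groups, leftover


# Hypothetical band order with two sensitive transactions, p = 3.
band_order = [None, "pills", None, None, "test", None, None]
groups, leftover = form_groups(band_order, p=3)
print(groups, leftover)
```

Because neighbours in the band order tend to share items, picking nearby transactions keeps each group's quasi-identifying items similar, which is the intuition behind using the RCM order.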
13
Group Formation
14
Experimental Evaluation
15
RCM Visualization
16
Experimental Setting
BMS dataset
Compare with hybrid PermMondrian (PM), which combines Mondrian with Anatomy
Query workload
Reconstruction error
17
Reconstruction Error vs p
18
Execution Time
19
Conclusions
Anonymizing transaction data: high dimensionality, preserving correlation
Future work: different encodings for data representation; enhance correlation among consecutive rows