Data Anonymization (1)
Outline
- Problem
- Concepts
- Algorithms on domain generalization hierarchy
- Algorithms on numerical data
The Massachusetts Governor Privacy Breach
- Medical data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
- Voter list: Name, Address, Date Registered, Party affiliation, Date last voted
- Shared attributes: Zip, Birth date, Sex
- The Governor of MA was uniquely identified using ZIP code, birth date, and sex, so the name could be linked to the diagnosis
- Quasi-identifier: {Zip, Birth date, Sex} uniquely identifies 87% of the US population (Sweeney, IJUFKS 2002)
Definition
- Table: columns are attributes, rows are records
- Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
- k-anonymity: every combination of QI values that appears in the table appears in at least k records
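The k-anonymity definition can be checked mechanically by counting QI value combinations. The following is a minimal sketch; the function name `is_k_anonymous`, the dictionary-based table layout, and the toy values are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Return True if every QI value combination occurs in at least k records."""
    # Project each record onto the quasi-identifier attributes and count the combinations.
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Toy table with made-up values (illustration only).
table = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
```

With this toy table, the third record is alone in its QI group, so the table is 1-anonymous but not 2-anonymous on (zip, sex).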
Basic techniques
- Generalization
  - Zip: {02138, 02139} → 0213*
  - Domain generalization hierarchy: A0 → A1 → … → An
  - E.g. {02138, 02139} → 0213* → 021* → 02* → 0* → *
  - This hierarchy is a tree structure
- Suppression
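The ZIP code hierarchy above can be walked with a small helper. This is a sketch only: `generalize_zip` and its `level` parameter are hypothetical names, and the rule "drop one trailing digit per level" is read off the example chain on the slide.

```python
def generalize_zip(zip_code, level):
    """Generalize a ZIP code by `level` steps up the hierarchy 02138 -> 0213* -> ... -> *."""
    if level <= 0:
        return zip_code          # level 0 is the original value
    if level >= len(zip_code):
        return "*"               # top of the hierarchy: everything is hidden
    # Drop the last `level` digits and mark the generalized part with a single '*'.
    return zip_code[:len(zip_code) - level] + "*"

chain = [generalize_zip("02138", level) for level in range(6)]
```

Both 02138 and 02139 map to the same value from level 1 upward, which is what lets them fall into one group.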
Balance
- A better privacy guarantee comes with lower data utility
- Many schemes satisfy the k-anonymity specification; we want the one that minimizes the distortion of the table, in order to maximize data utility
- Suppression is required if we cannot find a k-anonymity group for a record
Criteria
- Minimal generalization: the minimal generalization that satisfies the k-anonymization specification
- Minimal table distortion: a minimal generalization with minimal utility loss
  - Use precision to evaluate the loss [Sweeney papers]
- Application-specific utility
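As a rough illustration of the precision idea, the sketch below assumes the usual form of Sweeney's metric: one minus the average, over all cells, of the cell's generalization level divided by the height of that attribute's hierarchy. The function name and the input layout are hypothetical.

```python
def precision(levels, hierarchy_heights):
    """Precision of a generalized table.

    levels[i][j]: how many steps cell (record i, attribute j) was generalized.
    hierarchy_heights[j]: height of attribute j's generalization hierarchy.
    """
    n_records = len(levels)
    n_attrs = len(hierarchy_heights)
    # Each cell contributes its level as a fraction of the hierarchy height.
    distortion = sum(
        row[j] / hierarchy_heights[j] for row in levels for j in range(n_attrs)
    )
    return 1 - distortion / (n_records * n_attrs)

# Two records; ZIP (hierarchy height 5) generalized one step, sex (height 1) untouched.
example = precision([[1, 0], [1, 0]], [5, 1])
```

A table with no generalization scores 1, and a table generalized to the top of every hierarchy scores 0.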
Complexity of finding the optimal solution on generalization
- NP-hard (Bayardo, ICDE05)
- So all proposed algorithms are approximate algorithms
Shared features in different solutions
- Always satisfy the k-anonymity specification
  - If some records do not, suppress them
- Differences lie in the utility loss/cost function
  - Sweeney's precision metric
  - Discernibility & classification metrics
  - Information-privacy metric
- Algorithms
  - Assume the domain generalization hierarchy is given
  - Efficiency
  - Utility maximization
Metrics to be optimized
- Two cost metrics that we want to minimize (Bayardo, ICDE05)
  - Discernibility: # of items in the k-anonymity group
  - Classification: the dataset has a class label column and we want to preserve the classification model; # of records in minor classes in the group
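The two cost metrics can be sketched as follows. Several details here are assumptions rather than statements from the slide: each record in a group of size at least k is charged the group size, each record in a smaller (suppressed) group is charged the table size, and the classification cost is left as a raw count of minority-class and suppressed records without normalization. Function names are hypothetical.

```python
from collections import Counter

def discernibility(groups, k):
    """groups: list of groups, each a list of class labels (one per record)."""
    total = sum(len(g) for g in groups)
    # Records in a valid group pay the group size; suppressed records pay the table size.
    return sum(len(g) ** 2 if len(g) >= k else total * len(g) for g in groups)

def classification_penalty(groups, k):
    """Count records that are suppressed or not in their group's majority class."""
    penalty = 0
    for g in groups:
        if len(g) < k:
            penalty += len(g)                         # whole group is suppressed
        else:
            majority = Counter(g).most_common(1)[0][1]
            penalty += len(g) - majority              # minority-class records
    return penalty

groups = [["flu", "flu", "cold"], ["flu", "cold", "cold", "cold"], ["flu"]]
dm = discernibility(groups, k=2)
cm = classification_penalty(groups, k=2)
```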
Metrics
- Information-privacy metric: a combination of information loss and anonymity gain (Wang, ICDE04)
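Metrics
- Information loss
  - The dataset has class labels
  - Entropy: for a set S labeled by different classes, entropy is used to calculate the impurity of the labels
    $\mathrm{Info}(S) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the percentage of label $i$ in $S$
  - Information loss of a generalization G: {c1, c2, …, cn} → p
    $I(G) = \mathrm{Info}(S_p) - \sum_i \frac{|S_{c_i}|}{|S_p|}\,\mathrm{Info}(S_{c_i})$, where $S_p$ is the set of records holding the generalized value p and $S_{c_i}$ the set of records holding the child value $c_i$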
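The entropy-based information loss can be computed directly from the class labels of the child values being merged. This is a sketch under one stated assumption: each child's entropy is weighted by its share of the merged records, as in the usual information-gain formula. The names `info` and `information_loss` are hypothetical.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(child_label_sets):
    """Loss from generalizing child values c1..cn into one parent value p.

    child_label_sets: one list of class labels per child value.
    """
    parent = [label for child in child_label_sets for label in child]
    # Assumption: each child's entropy is weighted by its share of the parent's records.
    weighted = sum(len(c) / len(parent) * info(c) for c in child_label_sets)
    return info(parent) - weighted

# Merging two pure children with different labels loses one full bit.
loss_pure = information_loss([["y", "y"], ["n", "n"]])
# Merging two children that are already equally mixed loses nothing.
loss_mixed = information_loss([["y", "n"], ["y", "n"]])
```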
Anonymity gain
- A(VID): # of records with the VID
- A_G(VID) >= A(VID): generalization improves or does not change A(VID)
- Anonymity gain: P(G) = x - A(VID), where x = A_G(VID) if A_G(VID) <= k, and x = k otherwise
- As long as k-anonymity is satisfied, further generalization of the VID does not gain anything
Information-privacy combined metric
- IP = information loss / anonymity gain = I(G)/P(G)
- We want to minimize IP
  - If P(G) = 0, use I(G) only
  - Either a small I(G) or a large P(G) will reduce IP
  - If the P(G)s are the same, pick the one with minimum I(G)
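The anonymity gain and the combined IP score can be sketched as two small functions. The function names, and the candidate tuples in the usage line, are hypothetical; the information loss I(G) is passed in as a plain number, and A(VID) is assumed not to exceed k already.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at k once k-anonymity is reached."""
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(info_loss, gain):
    """IP = I(G) / P(G); fall back to I(G) alone when the gain is zero."""
    return info_loss if gain == 0 else info_loss / gain

# Hypothetical candidates: (name, I(G), A(VID) before, A_G(VID) after), with k = 5.
candidates = [("G1", 0.6, 2, 10), ("G2", 0.6, 2, 3), ("G3", 0.9, 2, 10)]
best = min(candidates, key=lambda c: ip_score(c[1], anonymity_gain(c[2], c[3], 5)))
```

G1 and G3 both reach the cap of k and so have the same gain; G1 wins because its information loss is smaller.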
Domain-hierarchy based algorithms
- Sweeney's algorithm
- Bayardo's tree pruning algorithm
- Wang's top-down and bottom-up algorithms
- They are all dimension-by-dimension methods
Multidimensional techniques
- Categorical data?
  - Categories are mapped to numbers ("numerized"); see the Bayardo 05 paper
  - Does the order matter? (no research on that)
- Numerical data
  - k-anonymization → n-dim space partitioning
  - Many existing techniques can be applied
Single-dimensional vs. multidimensional
The evolving procedure
- Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]
- → numerized categories, single-dimensional [Bayardo05]
- → numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
Method 1: Mondrian
- Numerize categorical data
- Apply a top-down partitioning process
- (Figure labels: Step 1, Step 2.1, Step 2.2)
Allowable cut
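The top-down partitioning can be sketched as a recursive median split. This is a simplified illustration, not the published algorithm: a cut is treated as allowable when both sides keep at least k records, dimensions are tried widest range first, and the function name `mondrian` and the tuple layout are assumptions. The input is assumed to be non-empty.

```python
def mondrian(records, k):
    """Recursively partition numeric records; returns a list of partitions."""
    def spread(d):
        values = [r[d] for r in records]
        return max(values) - min(values)

    # Try the dimension with the widest range first (a heuristic assumed here).
    for d in sorted(range(len(records[0])), key=spread, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        # Allowable cut: both sides must still hold at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut remains, so this partition is final

# Toy (age, salary in thousands) records, made up for illustration.
people = [(25, 30), (27, 32), (45, 80), (47, 85)]
partitions = mondrian(people, k=2)
```

Each final partition becomes one k-anonymity group, and its records are published with the partition's ranges instead of their exact values.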
Method 2: spatial indexing
- Multidimensional spatial techniques
  - Kd-tree (similar to the Mondrian algorithm)
  - R-tree and its variations
- (Figure: R-tree and R+-tree, with leaf layer and upper layer)
Compacting bounds
- Example
  - uncompacted: age [1-80], salary [10k-100k]
  - compacted: age [20-40], salary [10k-50k]
- The original Mondrian does not consider compacting bounds
- For the R+-tree, it is done automatically
- Information is better preserved
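Compacting amounts to replacing a partition's nominal ranges with the tightest ranges that still cover its records. A minimal sketch, with a hypothetical function name and made-up records chosen to match the example ranges above:

```python
def compact_bounds(group):
    """Return the tightest (min, max) range per attribute for a group of numeric tuples."""
    # zip(*group) turns the list of records into one column of values per attribute.
    return [(min(column), max(column)) for column in zip(*group)]

# Toy (age, salary in thousands) records for one group.
group = [(20, 10), (35, 50), (40, 25)]
bounds = compact_bounds(group)
```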
Benefits of using the R+-tree
- Scalable: originally designed for indexing large disk-based data
- Multi-granularity k-anonymity: layers
- Better performance
- Better quality
Performance Mondrian
Utility metrics
- Discernibility penalty
- KL divergence: describes the difference between a pair of distributions
  - Anonymized data distribution
- Certainty penalty
  - T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
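The certainty penalty formula itself is not given here. The sketch below assumes one common unweighted form, in which every record t contributes the ratio of its generalized range t.Ai to the total range T.Ai, summed over the m attributes. The function name and the (low, high) range layout are hypothetical.

```python
def certainty_penalty(generalized, domain):
    """Sum of (generalized range / total range) over all records and attributes.

    generalized: per record, a list of (low, high) ranges, one per attribute.
    domain: one (low, high) total range per attribute.
    """
    return sum(
        (high - low) / (d_high - d_low)
        for record in generalized
        for (low, high), (d_low, d_high) in zip(record, domain)
    )

# Total ranges: age [0-80], salary [10k-100k]; two records share one generalized box.
domain = [(0, 80), (10, 100)]
generalized = [[(20, 40), (10, 55)], [(20, 40), (10, 55)]]
cp = certainty_penalty(generalized, domain)
```

Tighter generalized ranges give a lower penalty, so compacted bounds score better than uncompacted ones under this form.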
Other issues
- Sparse high-dimensionality
  - Transactional data → boolean matrix
  - "On the anonymization of sparse high-dimensional data", ICDE08
- Relates to the clustering problem of transactional data!
  - The above one uses matrix-based clustering
  - Item-based clustering (?)
Other issues
- Effect of numerizing categorical data
  - The ordering of categories may have a certain impact on quality
- General-purpose utility metrics vs. special task-oriented utility metrics
- Attacks on the k-anonymity definition