CS573 Data Privacy and Security Anonymization methods


1 CS573 Data Privacy and Security Anonymization methods
Li Xiong

2 Today Clustering based anonymization (cont)
Permutation based anonymization Other privacy principles

3 Microaggregation/Clustering
Two steps:
1. Partition the original dataset into clusters of similar records, each containing at least k records
2. For each cluster, compute an aggregate and use it to replace the original records, e.g., the mean for continuous data, the median for categorical data

4 What is Clustering? Finding groups of objects (clusters)
Objects in the same group are similar to one another; objects in different groups are different from each other. Intra-cluster distances are minimized; inter-cluster distances are maximized.
Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
Unsupervised learning: no predefined classes.
Typical applications:
As a stand-alone tool to get insight into the data distribution
As a preprocessing step for other algorithms
What is not clustering:
Supervised classification - has class label information
Simple segmentation - dividing students into registration groups alphabetically, by last name
Results of a query - groupings are the result of an external specification
Graph partitioning - some mutual relevance and synergy, but the areas are not identical
August 2, 2019

5 Clustering Approaches
Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue
Others

6 K-Means Clustering: Lloyd Algorithm
1. Given k, randomly choose k initial cluster centers
2. Partition objects into k nonempty subsets by assigning each object to the cluster with the nearest centroid
3. Update the centroids, i.e., the mean point of each cluster
4. Go back to Step 2; stop when there are no new assignments
Based on a heuristic known as Lloyd's algorithm. Lloyd's algorithm and k-means are often used synonymously, but in reality Lloyd's algorithm is a heuristic for solving the k-means problem.
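The four steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's code: records are tuples of numbers and squared Euclidean distance is assumed.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's heuristic for k-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # Step 1: random initial centers
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Step 3: update each centroid to the mean point of its cluster
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # Step 4: stop on no new assignments
            break
        centers = new_centers
    return centers, clusters
```

On four points forming two obvious pairs, any choice of two distinct starting centers converges to the pair means after a few iterations.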

7 The K-Means Clustering Method
Example (figure): with K = 2, arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign, repeating until the assignments stabilize.
August 2, 2019

8 Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

9 Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time

10 Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix
4. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
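The basic loop above can be rendered as a short Python sketch. As an assumption, single-link proximity and a caller-supplied distance function are used, and for brevity pairwise proximities are recomputed each round rather than maintaining an explicit matrix (so this is O(n^3), unoptimized).

```python
def agglomerative(points, k, dist):
    """Bottom-up clustering: start with singletons, repeatedly merge the
    closest pair of clusters (single link) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with minimum single-link proximity
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)       # merge cluster j into cluster i
    return clusters
```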

11 Starting Situation
Start with clusters of individual points (p1, p2, p3, ...) and a proximity matrix between them. This is a type of graph-based algorithm.

12 Intermediate Situation
After some merging steps, we have some clusters (C1, ..., C5) and a proximity matrix between the clusters (figure).

13 How to Define Inter-Cluster Similarity
Given the proximity matrix, how do we measure similarity between clusters? Common choices:
MIN (single link)
MAX (complete link)
Group average
Distance between centroids
Other methods driven by an objective function, e.g., Ward's method uses squared error

14 Distance Between Clusters
Single link: minimum distance between any pair of elements (nearest-neighbour clustering)
dmin(C, C*) = min { d(x, y) : x in C, y in C* }
Similarity of two clusters is based on the two most similar (closest) points in the different clusters; determined by one pair of points, i.e., by one link in the proximity graph
Complete link: maximum distance between any pair of elements (farthest-neighbour clustering)
dmax(C, C*) = max { d(x, y) : x in C, y in C* }
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters; determined by all pairs of points in the two clusters
Average link: average distance between all pairs of elements
davg(C, C*) = (1 / (|C| |C*|)) Σ d(x, y) over all x in C, y in C*
Centroid: distance between the cluster centroids

15 Clustering for Anonymization
Are these algorithms directly applicable to anonymization? Which ones? K-means and hierarchical clustering produce clusters of arbitrary size, but here each cluster has to have size at least k.

16 Anonymization And Clustering
k-Member Clustering Problem: from a given set of n records, find a set of clusters such that:
Each cluster contains at least k records, and
The total intra-cluster distance is minimized.
The problem is NP-complete.

17 Anonymization using Microaggregation or Clustering
Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

18 Multivariate microaggregation algorithm
Basic idea:
While enough records remain, form two k-member clusters at each step
Then form one k-member cluster from the remaining records, if enough are available
Finally, form one cluster from any records left over

19 Multivariate microaggregation algorithm (Maximum Distance to Average Vector)
MDAV-generic(R: dataset, k: integer)
while |R| ≥ 3k:
1. compute the average record ~x of all records in R
2. find the most distant record xr from ~x
3. find the most distant record xs from xr
4. form two clusters: one from xr and the k-1 records closest to xr, another from xs and the k-1 records closest to xs
5. remove the two clusters from R and run MDAV-generic on the remaining dataset
end while
if 2k ≤ |R| ≤ 3k-1:
1. compute the average record ~x of the remaining records in R
2. find the most distant record xr from ~x
3. form a cluster from xr and the k-1 records closest to xr
4. form another cluster containing the remaining records
else (fewer than 2k records in R):
form a new cluster from the remaining records
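The pseudocode above might look roughly as follows in Python. This is a sketch, not Domingo-Ferrer's implementation: `dist` and `mean` are supplied by the caller, and a guard is added for the rare case where xs is absorbed into xr's cluster.

```python
def mdav(records, k, dist, mean):
    """MDAV-generic sketch. `records` is a list of records, `dist(a, b)` a
    distance function, `mean(rs)` an averaging function over records."""
    R = list(records)
    clusters = []

    def take_cluster(seed):
        # cluster = seed plus its k-1 nearest neighbours among R
        R.remove(seed)
        closest = sorted(R, key=lambda r: dist(r, seed))[:k - 1]
        for r in closest:
            R.remove(r)
        clusters.append([seed] + closest)

    while len(R) >= 3 * k:
        x_bar = mean(R)
        xr = max(R, key=lambda r: dist(r, x_bar))   # most distant from average
        xs = max(R, key=lambda r: dist(r, xr))      # most distant from xr
        take_cluster(xr)
        if xs in R:                                 # guard: xs may be absorbed
            take_cluster(xs)
    if len(R) >= 2 * k:                             # 2k <= |R| <= 3k-1
        x_bar = mean(R)
        xr = max(R, key=lambda r: dist(r, x_bar))
        take_cluster(xr)
    if R:                                           # fewer than 2k remain
        clusters.append(R)
    return clusters
```

For example, with nine 1-D records in three natural groups and k = 3, the sketch recovers the three groups.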

20 MDAV-generic for continuous attributes
Use the arithmetic mean and Euclidean distance. Standardize attributes (subtract the mean and divide by the standard deviation) so they carry equal weight when computing distances, avoiding scale effects such as age vs. weight vs. income. After MDAV-generic, destandardize (rescale) the attributes.

21 MDAV-generic for categorical attributes
The distance between two ordinal values a and b of an attribute Vi (assuming a total order):
dord(a, b) = |{i | a ≤ i < b}| / |D(Vi)|
i.e., the number of categories separating a and b divided by the total number of categories in the attribute's domain
The distance between two nominal values is defined according to equality: 0 if they are equal, 1 otherwise
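Both distances are one-liners. A minimal sketch, assuming the ordinal domain is given as an ordered list of category names:

```python
def d_ordinal(a, b, categories):
    """Normalized ordinal distance: number of categories separating a and b
    divided by the domain size. `categories` is the ordered domain D(Vi)."""
    i, j = sorted((categories.index(a), categories.index(b)))
    return (j - i) / len(categories)

def d_nominal(a, b):
    """Nominal distance: 0 if the values are equal, 1 otherwise."""
    return 0 if a == b else 1
```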

22 Empirical Results
Continuous attributes: from the U.S. Current Population Survey (1995); 1080 records described by 13 continuous attributes; k-anonymity computed for k = 3, ..., 9 and quasi-identifiers with 6 and 13 attributes
Categorical attributes: from the U.S. Housing Survey (1993); three ordinal and eight nominal attributes; k-anonymity computed for k = 2, ..., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes

23 IL measures for continuous attributes
IL1 = mean variation of individual attributes between the original and k-anonymous datasets
IL2 = mean variation of attribute means in both datasets
IL3 = mean variation of attribute variances
IL4 = mean variation of attribute covariances
IL5 = mean variation of attribute Pearson's correlations
IL6 = 100 times the average of IL1-IL5

24 MDAV-generic preserves means and variances (IL2 and IL3)
The impact on the non-preserved statistics grows with the quasi-identifier length, as one would expect. For a fixed quasi-identifier length, the impact on the non-preserved statistics grows with k.

25 Anonymization using Microaggregation or Clustering
Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007

26 Greedy Algorithm
Basic idea:
Find k-member clusters, one cluster at a time
Assign the remaining (fewer than k) points to the previously formed clusters
Some details:
How to compute distances between records?
How to find the centroid?
How to find the best point to join the current cluster?

27 Distance between two categorical values
Without a taxonomy, categorical values are treated as equally different from each other: distance 0 if they are the same, 1 if they are different. With taxonomy trees, relationships between values can be easily captured (e.g., taxonomy trees for Country and Occupation).

28 Distance between two categorical values
Definition: Let D be a categorical domain and TD a taxonomy tree defined for D. The normalized distance between two values vi, vj in D is defined as:
d(vi, vj) = H(Λ(vi, vj)) / H(TD)
where Λ(x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T.
Example (taxonomy tree of Country, height 3): the distance between India and USA is 3/3 = 1; the distance between India and Iran is 2/3 ≈ 0.66.
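The definition can be sketched over a child-to-parent map. The taxonomy below is hypothetical: the regional levels (North/South America, West/South Asia) are invented so the tree height is 3 and the numbers match the slide's 3/3 and 2/3 example.

```python
# Hypothetical Country taxonomy as child -> parent edges (assumption, chosen
# to reproduce the slide's example distances).
PARENT = {
    "USA": "North America", "Canada": "North America", "Brazil": "South America",
    "North America": "America", "South America": "America",
    "India": "South Asia", "Iran": "West Asia",
    "South Asia": "Asia", "West Asia": "Asia",
    "America": "Country", "Asia": "Country",
}

def path_to_root(node):
    """Path from a value up to the root of the taxonomy."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def height(node):
    """Height of the subtree rooted at node (leaves have height 0)."""
    children = [c for c, p in PARENT.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)

def tax_distance(a, b):
    """H(subtree at lowest common ancestor of a, b) / H(whole tree)."""
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(n for n in pa if n in pb)    # first common node is the LCA
    return height(lca) / height("Country")
```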

29 Cost Function - Information loss (IL)
The amount of distortion (i.e., information loss) caused by the generalization process. Note: records in each cluster are generalized to share the same quasi-identifier value, which must represent every original quasi-identifier value in the cluster.
Definition: Let e = {r1, ..., rk} be a cluster (i.e., equivalence class). The amount of information loss in e, denoted IL(e), is defined as:
IL(e) = |e| · D(e)
where |e| is the number of records in e, and D(e) sums, over each numeric attribute, the range of that attribute's values within e divided by |N| (the size of the numeric domain N), and, over each categorical attribute Cj, H(Λ(Cj)) / H(Tj), where Λ(Cj) is the subtree rooted at the lowest common ancestor of every value in Cj and H(T) is the height of tree T.

30 Cost Function - Information loss (IL)
Example (taxonomy tree of Country)
Original table (Age, Country, Occupation, Salary, Diagnosis):
r1: 41, USA, Armed-Forces, ≥50K, Cancer
r2: 57, India, Tech-support, <50K, Flu
r3: 40, Canada, Teacher, <50K, Obesity
r4: 38, Iran, ...
r5: 24, Brazil, Doctor, ...
r6: 45, Greece, Salesman, ..., Fever
Cluster e1 = {r1, r3, r5}; Cluster e2 = {r1, r2, r5}
D(e1) = (41-24)/33 + 2/3 + 1 ≈ 2.18, so IL(e1) = 3 · D(e1) ≈ 6.55
D(e2) = (57-24)/33 + 3/3 + 1 = 3, so IL(e2) = 3 · 3 = 9
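The IL(e) computation can be sketched generically. As an assumption, the taxonomy lookup is abstracted into a caller-supplied `subtree_ratio(attr, values)` function returning H(Λ(Cj))/H(Tj); the test below stubs it out with 1.0 (LCA at the root), matching the e2 example.

```python
def distortion(cluster, numeric_attrs, domain_size, cat_attrs, subtree_ratio):
    """D(e): numeric attributes contribute (max - min) / |N|; categorical
    attributes contribute H(subtree at LCA) / H(tree) via subtree_ratio."""
    d = 0.0
    for a in numeric_attrs:
        vals = [r[a] for r in cluster]
        d += (max(vals) - min(vals)) / domain_size[a]
    for a in cat_attrs:
        d += subtree_ratio(a, {r[a] for r in cluster})
    return d

def info_loss(cluster, numeric_attrs, domain_size, cat_attrs, subtree_ratio):
    """IL(e) = |e| * D(e)."""
    return len(cluster) * distortion(cluster, numeric_attrs, domain_size,
                                     cat_attrs, subtree_ratio)
```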

31 Greedy k-member clustering algorithm

32 Classification Metric (CM)
Preserves the correlation between the quasi-identifier and class labels (non-sensitive values):
CM = (1/N) Σ Penalty(row r)
where N is the total number of records, and Penalty(row r) = 1 if r is suppressed or the class label of r differs from the class label of the majority in its equivalence group (0 otherwise).
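The metric is a straightforward count. A minimal sketch, assuming each record is a dict with a "class" key and the anonymized data is given as equivalence groups plus a list of suppressed records:

```python
from collections import Counter

def classification_metric(groups, suppressed):
    """CM = (1/N) * sum of Penalty(r): penalty 1 if r is suppressed or its
    class label differs from its equivalence group's majority label."""
    penalty = 0
    for g in groups:                      # g: list of records in one group
        majority = Counter(r["class"] for r in g).most_common(1)[0][0]
        penalty += sum(1 for r in g if r["class"] != majority)
    penalty += len(suppressed)            # every suppressed record pays 1
    n = sum(len(g) for g in groups) + len(suppressed)
    return penalty / n
```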

33 Experimental Results: Experimental Setup
Data: Adult dataset from the UC Irvine Machine Learning Repository; 10 attributes (2 numeric, 7 categorical, 1 class)
Compared with 2 other algorithms:
Median partitioning (Mondrian algorithm)
k-Nearest neighbor

34 Experimental Results

35 Conclusion
The k-anonymity problem can be transformed into the k-member clustering problem. Overall, the greedy algorithm produced better results than the other algorithms, at the cost of efficiency.

36 Today Clustering based anonymization (cont)
Permutation based anonymization Other privacy principles

37 Anonymization methods
Non-perturbative: do not distort the data
Generalization
Suppression
Perturbative: distort the data
Microaggregation/clustering
Additive noise
Anatomization and permutation: de-associate the relationship between the QID and the sensitive attribute

38 Problems with Generalization and Clustering
Query A:
SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001, 20000]
On the original microdata this query returns 1, and that answer is associated with the first tuple in Table 1. With generalization, however, we would actually see 2 tuples.

39 Querying generalized table
R1 and R2 are the anonymized QID groups and Q is the query range. Under the uniform assumption:
p = Area(R1 ∩ RQ) / Area(R1) = (10·10) / (50·40) = 0.05
Estimated answer for A: 4 · 0.05 = 0.2
An inherent problem of generalization is that it prevents an analyst from correctly understanding how the data is distributed within a QI-group.
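The uniform-assumption estimate generalizes to any axis-aligned QI-group and query range. A small sketch (the interval representation is an assumption; the slide's numbers, a count of 4 in a 50×40 group with a 10×10 overlap, reproduce the 0.2 estimate):

```python
def uniform_estimate(group_count, group_box, query_box):
    """Estimated answer under the uniform assumption: group count times the
    fraction of the group's bounding box that overlaps the query range.
    Boxes are lists of (lo, hi) intervals, one per QI attribute."""
    overlap = area = 1.0
    for (lo, hi), (qlo, qhi) in zip(group_box, query_box):
        area *= hi - lo
        overlap *= max(0.0, min(hi, qhi) - max(lo, qlo))  # interval overlap
    return group_count * overlap / area
```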

40 Concept of the Anatomy Algorithm
Release two tables: a quasi-identifier table (QIT) and a sensitive table (ST). Use the same QI groups (satisfying l-diversity), but replace the sensitive attribute values in the QIT with a Group-ID column. Then produce a sensitive table with the disease statistics per group.

41 Concept of the Anatomy Algorithm
Does it satisfy k-anonymity? l-diversity? What about query results?
This preserves privacy since the sensitive value is not disclosed within the QIT. Even if the adversary knows specific information, as in the earlier example with Bob, the attacker has at most a 50% probability of guessing the sensitive value.
The advantage of this concept is that it allows better data analysis: Query A now gets a more accurate approximate answer because no assumption about the data distribution is needed. Here p = 0.50, so 2 · 0.50 = 1, and 1 is the actual query result from the microdata:
SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30 AND Zipcode IN [10001, 20000]

42 Specifications of Anatomy cont.
DEFINITION 3 (Anatomy). Given an l-diverse partition, anatomy creates the QIT and ST tables as follows:
QIT has schema (Aqi1, Aqi2, ..., Aqid, Group-ID)
ST has schema (Group-ID, As, Count)
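Definition 3 translates almost mechanically into code. A minimal sketch, assuming records are dicts and the l-diverse partition is already given as a list of groups:

```python
from collections import Counter

def anatomize(partition, qi_attrs, sensitive):
    """Build the QIT and ST of Definition 3 from an l-diverse partition
    (a list of groups, each a list of record dicts)."""
    qit, st = [], []
    for gid, group in enumerate(partition, start=1):
        for r in group:                        # QIT row: QI values + Group-ID
            qit.append(tuple(r[a] for a in qi_attrs) + (gid,))
        counts = Counter(r[sensitive] for r in group)
        for value, count in counts.items():    # ST row: (Group-ID, As, Count)
            st.append((gid, value, count))
    return qit, st
```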

43 Privacy Properties
THEOREM 1. Given a pair of QIT and ST, the probability of inferring the sensitive value of any individual is at most 1/l.

44 Comparison with generalization
Compare with generalization under two assumptions:
A1: the adversary has the QI values of the target individual
A2: the adversary also knows that the individual is definitely in the microdata
If A1 and A2 hold, anatomy is as good as generalization: the 1/l bound holds
If A1 holds and A2 does not, generalization is stronger (suppose we don't know that someone has been to the hospital, but we have a publicly available listing of QI information)
If neither A1 nor A2 holds, generalization is still stronger
The intention of this algorithm is not to eliminate generalization, just to present an alternative option for privacy preservation. Anatomy may yield a higher breach probability, but it is always bounded by 1/l.

45 Preserving Data Correlation
Examine the correlation between Age and Disease in T using a probability density function (pdf). A good publication method should preserve both privacy and the correlation between QI and sensitive attributes. Every publication loses some information, but it should remain useful for data analysis, so the quality of the preserved correlation depends on the accuracy of the re-constructed model.
These attributes can be plotted in the 2D space DS(A, D): for example, tuple t1 corresponds to the point (t1[A], t1[D]), where t1[A] is age 23 and t1[D] is the disease 'pneumonia'.

46 Preserving Data Correlation cont.
To re-construct an approximate pdf of t1 from the generalization table:

47 Preserving Data Correlation cont.
To re-construct an approximate pdf of t1 from the QIT and ST tables:

48 Preserving Data Correlation cont.
For a more rigorous comparison, calculate the "L2 distance" between the actual and re-constructed pdfs. The distance for anatomy is 0.5, while the distance for generalization is 22.5. Even though only one tuple was used, the gap already shows how much better anatomy preserves the correlation.

49 Preserving Data Correlation cont.
Idea: measure the error of each re-constructed pdf. Since the above calculation is for a single tuple, expand it over all tuples t in T; the objective is to obtain a minimal re-construction error (RCE).
Algorithm: the nearly-optimal anatomizing algorithm.

50 Experiments
Dataset: CENSUS, containing the personal information of 500k American adults, with 9 discrete attributes. Two sets of microdata tables were created:
Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Occupation as the sensitive attribute As
Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI attributes and Salary-class as the sensitive attribute As

51 Experiments cont.

52 Today Clustering based anonymization (cont)
Permutation based anonymization Other privacy principles

53 Attacks on k-Anonymity
k-Anonymity does not provide privacy if:
Sensitive values in an equivalence class lack diversity (homogeneity attack)
The attacker has background knowledge (background knowledge attack)
Example: a 3-anonymous patient table (Zipcode, Age, Disease) with equivalence classes (476**, 2*), (4790*, ≥40), and (476**, 3*), and diseases Heart Disease, Flu, Cancer.
Homogeneity attack: Bob (Zipcode 47678, Age 27) falls in a class whose records all share the same disease.
Background knowledge attack: Carl (Zipcode 47673, Age 36) falls in a class where outside knowledge rules out all but one disease.

54 l-Diversity [Machanavajjhala et al. ICDE ‘06]
Sensitive attributes must be “diverse” within each quasi-identifier equivalence class (illustrated with two equivalence classes, Caucas/787XX and Asian-AfrAm/78XXX, each containing Flu, Shingles, and Acne).

55 Distinct l-Diversity Each equivalence class has at least l well-represented sensitive values Doesn’t prevent probabilistic inference attacks 8 records have HIV 10 records 2 records have other values

56 Other Versions of l-Diversity
Probabilistic l-diversity: the frequency of the most frequent value in an equivalence class is bounded by 1/l
Entropy l-diversity: the entropy of the distribution of sensitive values in each equivalence class is at least log(l)
Recursive (c, l)-diversity: r1 < c(rl + r(l+1) + ... + rm), where ri is the frequency of the i-th most frequent value. Intuition: the most frequent value does not appear too frequently.
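The three variants can be measured on a class's multiset of sensitive values. A small sketch reporting, for each variant, the largest l that the class satisfies (entropy l-diversity holds for l up to exp(H) when using natural logarithms):

```python
import math
from collections import Counter

def distinct_l(values):
    """Distinct l-diversity: number of distinct sensitive values."""
    return len(set(values))

def probabilistic_l(values):
    """Largest l with max frequency <= 1/l, i.e. 1 / (max relative freq)."""
    top = Counter(values).most_common(1)[0][1]
    return len(values) / top

def entropy_l(values):
    """Largest l with entropy >= log(l), i.e. exp(entropy)."""
    n = len(values)
    probs = [c / n for c in Counter(values).values()]
    return math.exp(-sum(p * math.log(p) for p in probs))
```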

57 Neither Necessary, Nor Sufficient
Original dataset (figure): 99% of the records have Cancer; the rest have Flu.

58 Neither Necessary, Nor Sufficient
Original dataset: 99% have Cancer. Anonymization A: each quasi-identifier group (Q1, Q2) contains 50% Cancer and 50% Flu, so every group is "diverse", yet this anonymization leaks a ton of information relative to the 99% baseline.

59 Neither Necessary, Nor Sufficient
Original dataset: 99% have Cancer. Anonymization A: each group is 50% Cancer, so the groups are "diverse", yet this leaks a ton of information. Anonymization B: each group is 99% Cancer, so the groups are not "diverse", yet the anonymized database does not leak anything. Diversity is therefore neither necessary nor sufficient.

60 Limitations of l-Diversity
Example: the sensitive attribute is HIV+ (1%) or HIV- (99%). Very different degrees of sensitivity!
l-diversity is unnecessary: 2-diversity is unnecessary for an equivalence class that contains only HIV- records
l-diversity is difficult to achieve: suppose there are 10000 records in total; to have distinct 2-diversity, there can be at most 10000 × 1% = 100 equivalence classes

61 Skewness Attack
Example: the sensitive attribute is HIV+ (1%) or HIV- (99%). Consider an equivalence class that contains an equal number of HIV+ and HIV- records: it is diverse, but potentially violates privacy!
l-diversity does not differentiate between:
Equivalence class 1: 49 HIV+ and 1 HIV-
Equivalence class 2: 1 HIV+ and 49 HIV-
l-diversity does not consider the overall distribution of sensitive values!

62 Sensitive Attribute Disclosure
Similarity attack on a 3-diverse patient table (Zipcode, Age, Salary, Disease): the equivalence class (476**, 2*) contains salaries 20K, 30K, 40K and diseases Gastric Ulcer, Gastritis, Stomach Cancer; class (4790*, ≥40) contains salaries 50K, 100K, 70K with diseases including Flu and Bronchitis; class (476**, 3*) contains salaries 60K, 80K, 90K with diseases including Pneumonia.
Bob (Zip 47678, Age 27) falls in the first class. Conclusion: Bob's salary is in [20k, 40k], which is relatively low, and Bob has some stomach-related disease.
l-diversity does not consider the semantics of sensitive values!

63 t-Closeness [Li et al. ICDE ‘07]
The distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database (illustrated with groups Caucas/787XX and Asian-AfrAm/78XXX over Flu, Shingles, Acne).
Trick question: why publish quasi-identifiers at all?
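Checking closeness requires a distance between distributions. The t-closeness paper uses Earth Mover's Distance; for a nominal attribute with a uniform ground distance, EMD reduces to the total variation distance sketched here (the function name is ours, not the paper's):

```python
from collections import Counter

def t_closeness_distance(group_vals, table_vals):
    """Total variation distance between a QI-group's sensitive-value
    distribution and the whole table's distribution."""
    pg, pt = Counter(group_vals), Counter(table_vals)
    ng, nt = len(group_vals), len(table_vals)
    keys = set(pg) | set(pt)
    return 0.5 * sum(abs(pg[k] / ng - pt[k] / nt) for k in keys)
```

A group satisfies t-closeness (under this ground distance) when the returned value is at most t.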

64 Anonymous, “t-Close” Dataset
This dataset (Caucas/787XX with HIV+, Flu; Asian-AfrAm/78XXX with HIV-, Shingles, Acne) is k-anonymous, l-diverse, and t-close... so it is secure, right?

65 What Does Attacker Know?
"Bob is Caucasian and I heard he was admitted to hospital with flu..." (same table as the previous slide)
The sensitive attribute itself can act as background knowledge or as a quasi-identifier.

66 What Does Attacker Know?
"Bob is Caucasian and I heard he was admitted to hospital. And I know three other Caucasians admitted to hospital with Acne or Shingles..." (same table as the previous slide)

67 k-Anonymity and Partition-based notions
Syntactic: focuses on the data transformation, not on what can be learned from the anonymized dataset; a "k-anonymous" dataset can still leak sensitive information
"Quasi-identifier" fallacy: assumes a priori that the attacker will not know certain information about the target
Relies on locality
Destroys the utility of many real-world datasets

68 Coming up Statistical data release and differential privacy

