Anonymization Algorithms - Microaggregation and Clustering Li Xiong CS573 Data Privacy and Anonymity

Anonymization using Microaggregation or Clustering
- Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
- Ordinal, Continuous and Heterogeneous k-anonymity through Microaggregation, Domingo-Ferrer, DMKD 2005
- Achieving Anonymity via Clustering, Aggarwal, PODS 2006
- Efficient k-anonymization Using Clustering Techniques, Byun, DASFAA 2007

Anonymization Methods
Perturbative: distort the data
- statistics computed on the perturbed dataset should not differ significantly from statistics on the original dataset
- examples: microaggregation, additive noise
Non-perturbative: don't distort the data
- generalization: combine several categories to form a new, less specific category
- suppression: remove the values of a few attributes in some records, or remove entire records

Types of data
- Continuous: the attribute is numeric and arithmetic operations can be performed on it
- Categorical: the attribute takes values over a finite set, and standard arithmetic operations don't make sense
  - Ordinal: an ordered range of categories; the <=, min and max operations are meaningful
  - Nominal: unordered; only the equality comparison operation is meaningful

Measuring tradeoffs
k-Anonymity: a dataset satisfies k-anonymity for k > 1 if at least k records exist for each combination of quasi-identifier values.
Assuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures.

Critique of Generalization/Suppression
- Satisfying k-anonymity using generalization and suppression is NP-hard
- Finding the optimal generalization is computationally costly
- Determining the subset of appropriate generalizations requires the semantics of the categories and the intended use of the data; e.g., for ZIP codes, {08201, 08205} -> 0820* makes sense, while {08201, 05201} -> 0*201 doesn't

Problems (cont.)
- Applying a generalization globally may generalize records that don't need it
- Applying generalizations locally is difficult to automate and analyze, and the number of possible generalizations is even larger
- Generalization and suppression are unsuitable for continuous data: a numeric attribute becomes categorical and loses its numeric semantics

Problems (cont.)
- How to optimally combine generalization and suppression is unknown
- Use of suppression is not homogeneous: some approaches suppress entire records, others only some attributes of some records; some blank a suppressed value, others replace it with a neutral value

Microaggregation/Clustering
Two steps:
1. Partition the original dataset into clusters of similar records, each containing at least k records
2. For each cluster, compute an aggregation operation and use it to replace the original records, e.g., the mean for continuous data, the median for categorical data (see the sketch below)
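As a minimal illustration of the two steps, here is a Python sketch that hard-codes the clusters from the Age/Salary example a few slides below and performs only the aggregation step; the data and the cluster assignment are taken from that example, not computed.

```python
clusters = [
    [(25, 50), (27, 60), (29, 100)],   # (age, salary): Amy, Brian, Carol
    [(35, 110), (39, 120)],            # David, Evelyn
]

for cluster in clusters:
    ages, salaries = zip(*cluster)
    center = (sum(ages) / len(ages), sum(salaries) / len(salaries))
    # step 2: every record in the cluster is replaced by the cluster center
    print([center] * len(cluster))     # first cluster center: (27.0, 70.0)
```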

Advantages:
- A unified approach, unlike the combination of generalization and suppression
- Near-optimal heuristics exist
- Doesn't generate new categories
- Suitable for continuous data, without removing their numeric semantics

Advantages (cont.)
Reduces data distortion: k-anonymity requires an attribute to be generalized or suppressed even if all but one tuple in the set have the same value; clustering allows a cluster center to be published instead, "enabling us to release more information."

Original Table

Name   | Age | Salary
Amy    | 25  | 50
Brian  | 27  | 60
Carol  | 29  | 100
David  | 35  | 110
Evelyn | 39  | 120

2-Anonymity with Generalization

[Table: Amy, Brian, Carol, David and Evelyn with Age and Salary replaced by generalized ranges; the generalized values did not survive extraction]

Generalization allows only pre-specified ranges.

2-Anonymity with Clustering

Name   | Age     | Salary
Amy    | [25-29] | [50-100]
Brian  | [25-29] | [50-100]
Carol  | [25-29] | [50-100]
David  | [35-39] | [110-120]
Evelyn | [35-39] | [110-120]

The cluster centers ([27, 70] and [37, 115]) are published:
27 = (25 + 27 + 29) / 3, 70 = (50 + 60 + 100) / 3
37 = (35 + 39) / 2, 115 = (110 + 120) / 2

Another example: a group of records with no common value in any attribute

Generalization vs. clustering
- A generalized version of the table would need to suppress all attributes.
- A clustered version of the table would publish the cluster center as (1, 1, 1, 1) and the radius as 1.

Anonymization using Microaggregation or Clustering
- Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
- Ordinal, Continuous and Heterogeneous k-anonymity through Microaggregation, Domingo-Ferrer, DMKD 2005
- Achieving Anonymity via Clustering, Aggarwal, PODS 2006
- Efficient k-anonymization Using Clustering Techniques, Byun, DASFAA 2007

Multivariate microaggregation algorithm
- MDAV-generic: a generic version of the MDAV algorithm (Maximum Distance to Average Vector) from the authors' previous papers
- Works with any type of data (continuous, ordinal, nominal), any aggregation operator and any distance

MDAV-generic(R: dataset, k: integer)
while |R| >= 3k:
 1. compute the average record x̄ of all records in R
 2. find the record x_r most distant from x̄
 3. find the record x_s most distant from x_r
 4. form one cluster from x_r and the k-1 records closest to x_r, and another from x_s and the k-1 records closest to x_s
 5. remove the two clusters from R and repeat
end while
if 2k <= |R| <= 3k-1:
 1. compute the average record x̄ of the remaining records in R
 2. find the record x_r most distant from x̄
 3. form a cluster from x_r and the k-1 records closest to x_r
 4. form another cluster containing the remaining records
else (fewer than 2k records in R):
 form a new cluster from the remaining records
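A runnable sketch of this pseudocode for continuous data, assuming the arithmetic mean as aggregation operator and Euclidean distance; this is an illustration, not the authors' reference implementation.

```python
import numpy as np

def mdav(records: np.ndarray, k: int) -> list[np.ndarray]:
    """Partition row indices of `records` into clusters of >= k records.
    Assumes len(records) >= k."""
    remaining = list(range(len(records)))
    clusters = []

    def farthest_from(point: np.ndarray) -> int:
        dists = np.linalg.norm(records[remaining] - point, axis=1)
        return remaining[int(np.argmax(dists))]

    def pop_cluster(seed: int) -> None:
        # the seed plus its k-1 nearest remaining records form one cluster
        dists = np.linalg.norm(records[remaining] - records[seed], axis=1)
        members = [remaining[i] for i in np.argsort(dists)[:k]]
        clusters.append(np.array(members))
        for m in members:
            remaining.remove(m)

    while len(remaining) >= 3 * k:
        centroid = records[remaining].mean(axis=0)
        xr = farthest_from(centroid)        # most distant from the average
        xs = farthest_from(records[xr])     # most distant from x_r
        pop_cluster(xr)
        pop_cluster(xs)

    if len(remaining) >= 2 * k:             # between 2k and 3k-1 records left
        centroid = records[remaining].mean(axis=0)
        pop_cluster(farthest_from(centroid))
    clusters.append(np.array(remaining))    # fewer than 2k: one last cluster
    return clusters
```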

MDAV-generic for continuous attributes
- Use the arithmetic mean and Euclidean distance
- Standardize attributes (subtract the mean and divide by the standard deviation) to give them equal weight when computing distances
- After MDAV-generic, destandardize the attributes:
  x'_ij = (x_ij - m_1(j)) / sqrt(m_2(j)) * sqrt(u_2(j)) + u_1(j)
  where x_ij is the value of the k-anonymized jth attribute for the ith record, m_1(j) and m_2(j) are the mean and variance of the k-anonymized jth attribute, and u_1(j) and u_2(j) are the mean and variance of the original jth attribute
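A sketch of the standardize / anonymize / destandardize pipeline, reusing the `mdav` sketch above; the variable names u1/u2 and m1/m2 mirror the slide, and the data is the Age/Salary toy table.

```python
import numpy as np

X = np.array([[25, 50], [27, 60], [29, 100], [35, 110], [39, 120]], float)
u1, u2 = X.mean(axis=0), X.var(axis=0)   # original mean / variance
Z = (X - u1) / np.sqrt(u2)               # standardized: equal attribute weight

Z_anon = Z.copy()
for cluster in mdav(Z, k=2):             # replace each cluster by its centroid
    Z_anon[cluster] = Z[cluster].mean(axis=0)

m1, m2 = Z_anon.mean(axis=0), Z_anon.var(axis=0)
# destandardize so the released data keeps the original means and variances
X_anon = (Z_anon - m1) / np.sqrt(m2) * np.sqrt(u2) + u1
```

By construction, `X_anon` has exactly the original per-attribute means and variances, which matches the empirical observation a few slides below that MDAV-generic preserves means and variances.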

MDAV-generic for ordinal attributes
- The distance between two categories a and b of an attribute V_i (with a <= b):
  d_ord(a, b) = |{i | a <= i < b}| / |D(V_i)|
  i.e., the number of categories separating a and b, divided by the number of categories in the attribute's domain
MDAV-generic for nominal attributes
- The distance between two values is defined according to equality: 0 if they're equal, 1 otherwise
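A hedged sketch of the two categorical distances; the `education` domain below is an illustrative assumption, not from the paper.

```python
def d_ord(a: str, b: str, categories: list[str]) -> float:
    """Number of categories separating a and b, over the domain size."""
    i, j = sorted((categories.index(a), categories.index(b)))
    return (j - i) / len(categories)

def d_nom(a: str, b: str) -> float:
    """Nominal values: only equality is meaningful."""
    return 0.0 if a == b else 1.0

education = ["none", "primary", "secondary", "bachelor", "master", "phd"]
print(d_ord("primary", "master", education))  # 3/6 = 0.5
print(d_nom("blue", "green"))                 # 1.0
```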

Empirical Results
Continuous attributes:
- from the U.S. Current Population Survey (1995)
- 1080 records described by 13 continuous attributes
- computed k-anonymity for k = 3, ..., 9 and quasi-identifiers with 6 and 13 attributes
Categorical attributes:
- from the U.S. Housing Survey (1993)
- three ordinal and eight nominal attributes
- computed k-anonymity for k = 2, ..., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes

IL measures for continuous attributes  IL1 = mean variation of individual attributes in original and k-anonymous datasets  IL2 = mean variation of attribute means in both datasets  IL3 = mean variation of attribute variances  IL4 = mean variation of attribute covariances  IL5 = mean variation of attribute Pearson's correlations  IL6 = 100 times the average of IL1-6

MDAV-generic preserves means and variances.
- The impact on the non-preserved statistics grows with the quasi-identifier length, as one would expect.
- For a fixed quasi-identifier length, the impact on the non-preserved statistics grows with k.

IL measures for categorical attributes
- Dist: direct comparison of original and protected values using a categorical distance
- CTBIL': mean variation of frequencies in contingency tables for the original and protected data (based on another paper by Domingo-Ferrer and Torra)
- ACTBIL': CTBIL' divided by the total number of cells in all considered tables
- EBIL: entropy-based information loss (based on another paper by Domingo-Ferrer and Torra)

Ordinal attribute protection using median

Ordinal attribute protection using convex median

Anonymization using Microaggregation or Clustering
- Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
- Ordinal, Continuous and Heterogeneous k-anonymity through Microaggregation, Domingo-Ferrer, DMKD 2005
- Achieving Anonymity via Clustering, Aggarwal, PODS 2006
- Efficient k-anonymization Using Clustering Techniques, Byun, DASFAA 2007

r-Clustering
- Attributes from a table are first redefined as points in a metric space. These points are clustered, and the cluster centers are published rather than the original quasi-identifiers.
- r is the lower bound on the number of members in each cluster.
- r is used instead of k to denote the minimum degree of anonymity because k is typically used in clustering to denote the number of clusters.

Data published for clusters
Three features are published for the clustered data:
- the quasi-identifying attributes of the cluster center
- the number of points within the cluster
- the set of sensitive values for the cluster (which remain unchanged, as with k-anonymity)
A measure of the quality of the clusters is also published.

Defining the records in metric space
- Some attributes, such as age and height, are easily mapped to a metric space.
- Others, such as ZIP code, may first need to be converted, for example to longitude and latitude.
- Some attributes may need to be scaled, such as location, whose values may differ by thousands of miles.
- Some attributes, such as race or nationality, may not convert to points in a metric space easily.

How to measure the quality of the clusters
A quality measure captures how much the clustering distorts the original data:
- Maximum radius (the r-GATHER problem): the maximum radius among all clusters
- Cellular cost (the r-CELLULAR CLUSTERING problem): each cluster incurs a "facility cost" to set up its cluster center and a "service cost" equal to its radius times the number of points in the cluster; the cellular cost is the sum of the facility and service costs over all clusters

[Figure: points arranged in three clusters, one with 25 points and radius 10, one with 14 points and radius 8, and one with 17 points and radius 7]

Cluster quality measurements for the example
- Maximum radius = 10
- Facility cost plus service cost:
  - facility cost = f(c) per cluster c
  - service cost = (17 x 7) + (14 x 8) + (25 x 10) = 481
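A few lines recomputing the example's costs; the uniform per-cluster facility cost `f` is an assumed placeholder, since the slide leaves f(c) symbolic.

```python
clusters = [(17, 7), (14, 8), (25, 10)]    # (number of points, radius)
service = sum(n * r for n, r in clusters)  # 17*7 + 14*8 + 25*10 = 481
max_radius = max(r for _, r in clusters)   # r-GATHER objective: 10
f = 1.0                                    # assumed facility cost per cluster
cellular = len(clusters) * f + service     # r-CELLULAR CLUSTERING objective
print(max_radius, service, cellular)
```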

r-GATHER problem
"The r-Gather problem is to cluster n points in a metric space into a set of clusters, such that each cluster has at least r points. The objective is to minimize the maximum radius among the clusters."

"Outlier" points
- r-GATHER and r-CELLULAR CLUSTERING, like k-anonymity, are sensitive to outlier points (i.e., points far removed from the rest of the data).
- The clustering solutions in this paper are generalized to allow an ε fraction of outliers to be removed from the data; that is, an ε fraction of the tuples can be suppressed.

(r, ε)-GATHER Clustering
- The (r, ε)-GATHER clustering formulation allows an ε fraction of the outlier points to remain unclustered (i.e., those tuples are suppressed).
- The paper shows that a polynomial-time algorithm provides a 4-approximation for the (r, ε)-GATHER problem.

r-CELLULAR CLUSTERING defined
The r-CELLULAR CLUSTERING problem is to arrange n points into clusters such that each cluster has at least r points, while minimizing the total cellular cost.

(r, ε)-CELLULAR CLUSTERING
- There is also an (r, ε)-CELLULAR CLUSTERING problem in which an ε fraction of the points can be excluded.
- The details of the constant-factor approximation for this problem are deferred to the full version of the paper.

Anonymization using Microaggregation or Clustering
- Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002
- Ordinal, Continuous and Heterogeneous k-anonymity through Microaggregation, Domingo-Ferrer, DMKD 2005
- Achieving Anonymity via Clustering, Aggarwal, PODS 2006
- Efficient k-anonymization Using Clustering Techniques, Byun, DASFAA 2007

Anonymization and Clustering
k-Member Clustering Problem: from a given set of n records, find a set of clusters such that
- each cluster contains at least k records, and
- the total intra-cluster distance is minimized.
The problem is NP-complete.

Distance Metrics
A distance metric for records measures the dissimilarity between two data points; here it is the sum of the dissimilarities between corresponding attributes, with separate definitions for
- numerical values
- categorical values

Distance between two numerical values
Definition: let D be a finite numeric domain. The normalized distance between two values v_i, v_j ∈ D is defined as
  δ_N(v_i, v_j) = |v_i - v_j| / |D|
where |D| is the domain size, measured as the difference between the maximum and minimum values in D.

Record | Age | Country | Occupation   | Salary | Diagnosis
r1     | 41  | USA     | Armed-Forces | ≥50K   | Cancer
r2     | 57  | India   | Tech-support | <50K   | Flu
r3     | 40  | Canada  | Teacher      | <50K   | Obesity
r4     | 38  | Iran    | Tech-support | ≥50K   | Flu
r5     | 24  | Brazil  | Doctor       | ≥50K   | Cancer
r6     | 45  | Greece  | Salesman     | <50K   | Fever

Example 1: the distance between r1 and r2 with respect to Age is |57 - 41| / |57 - 24| = 16/33 ≈ 0.485.
Example 2: the distance between r5 and r6 with respect to Age is |24 - 45| / |57 - 24| = 21/33 ≈ 0.636.
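The definition as a small Python function, checked against the slide's two Age examples:

```python
def delta_num(vi: float, vj: float, domain: list[float]) -> float:
    """Normalized numeric distance: |vi - vj| over the domain size."""
    return abs(vi - vj) / (max(domain) - min(domain))

ages = [41, 57, 40, 38, 24, 45]         # Age column of the example table
print(delta_num(41, 57, ages))          # 16/33 = 0.4848...
print(delta_num(24, 45, ages))          # 21/33 = 0.6363...
```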

Distance between two categorical values
- Simplest view: all values are equally different from each other, so the distance is 0 if they are the same and 1 if they are different.
- Richer relationships can be captured in a taxonomy tree.
[Figures: taxonomy tree of Country, taxonomy tree of Occupation]

Distance between two categorical values (cont.)
Definition: let D be a categorical domain and T_D a taxonomy tree defined for D. The normalized distance between two values v_i, v_j ∈ D is defined as
  δ_C(v_i, v_j) = H(Λ(v_i, v_j)) / H(T_D)
where Λ(x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T) is the height of tree T.
[Figure: taxonomy tree of Country]
Example: the distance between India and USA is 3/3 = 1; the distance between India and Iran is 2/3 ≈ 0.67.
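A sketch of the taxonomy-based distance. The Country tree below is an assumption: the intermediate nodes are invented so that the heights match the slide's numbers (India and Iran share an ancestor of height 2; India and USA share only the root, of height 3); the paper's actual tree may differ.

```python
parent = {
    "USA": "North America", "Canada": "North America",
    "Brazil": "South America", "India": "South Asia",
    "Iran": "West Asia", "Greece": "South Europe",
    "North America": "America", "South America": "America",
    "South Asia": "Asia", "West Asia": "Asia", "South Europe": "Europe",
    "America": "Country", "Asia": "Country", "Europe": "Country",
}

def ancestors(v: str) -> list[str]:
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def height(node: str) -> int:
    # height of the subtree rooted at node; leaves have height 0
    children = [c for c, p in parent.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)

def delta_cat(vi: str, vj: str) -> float:
    ai, aj = ancestors(vi), ancestors(vj)
    lca = next(a for a in ai if a in aj)   # lowest common ancestor
    return height(lca) / height("Country")

print(delta_cat("India", "USA"))   # 3/3 = 1.0
print(delta_cat("India", "Iran"))  # 2/3 = 0.66...
```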

Distance between two records
Definition: let Q_T = {N_1, ..., N_m, C_1, ..., C_n} be the quasi-identifier of table T, where each N_i (i = 1, ..., m) is an attribute with a numeric domain and each C_j (j = 1, ..., n) is an attribute with a categorical domain. The distance between two records r_1, r_2 ∈ T is defined as
  Δ(r_1, r_2) = Σ_{i=1..m} δ_N(r_1[N_i], r_2[N_i]) + Σ_{j=1..n} δ_C(r_1[C_j], r_2[C_j])
where δ_N is the distance function for numeric attributes and δ_C is the distance function for categorical attributes.

Distance between two records (cont.)
[Figures: taxonomy trees of Country and Occupation; the example table from above]
Example, with quasi-identifier {Age, Country, Occupation}:
- the distance between r1 and r2 is (16/33) + (3/3) + 1 ≈ 2.485
- the distance between r1 and r3 is (1/33) + (1/3) + 1 ≈ 1.364
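Combining the sketches above into the record distance; the Occupation terms (1 for both pairs) are taken from the slide, since its Occupation taxonomy tree is not reproduced here.

```python
# reuses delta_num, ages and delta_cat from the previous sketches
d_r1_r2 = delta_num(41, 57, ages) + delta_cat("USA", "India") + 1.0
print(d_r1_r2)  # 16/33 + 3/3 + 1 = 2.4848...

d_r1_r3 = delta_num(41, 40, ages) + delta_cat("USA", "Canada") + 1.0
print(d_r1_r3)  # 1/33 + 1/3 + 1 = 1.3636...
```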

Cost Function - Information Loss (IL)
IL measures the amount of distortion (i.e., information loss) caused by the generalization process.
Note: records in each cluster are generalized to share the same quasi-identifier value, which must represent every original quasi-identifier value in the cluster.
Definition: let e = {r_1, ..., r_k} be a cluster (i.e., an equivalence class). The amount of information loss in e, denoted IL(e), is defined as
  IL(e) = |e| · D(e), with D(e) = Σ_i (MAX_{N_i} - MIN_{N_i}) / |N_i| + Σ_j H(Λ(∪C_j)) / H(T_{C_j})
where |e| is the number of records in e, |N_i| is the size of numeric domain N_i, MAX and MIN are taken over the values of N_i in e, Λ(∪C_j) is the subtree rooted at the lowest common ancestor of every value of C_j appearing in e, and H(T) is the height of tree T.

Cost Function - Information Loss (IL): Example
[Figure: taxonomy tree of Country]

Cluster e_1:
Age | Country | Occupation   | Salary | Diagnosis
41  | USA     | Armed-Forces | ≥50K   | Cancer
40  | Canada  | Teacher      | <50K   | Obesity
24  | Brazil  | Doctor       | ≥50K   | Cancer

D(e_1) = (41 - 24)/33 + (2/3) + 1 ≈ 2.18
IL(e_1) = 3 x D(e_1) ≈ 6.55

Cluster e_2:
Age | Country | Occupation   | Salary | Diagnosis
41  | USA     | Armed-Forces | ≥50K   | Cancer
57  | India   | Tech-support | <50K   | Flu
24  | Brazil  | Doctor       | ≥50K   | Cancer

D(e_2) = (57 - 24)/33 + (3/3) + 1 = 3
IL(e_2) = 3 x 3 = 9
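A quick check of these numbers; the helper below is hypothetical and hard-codes the example's constants (Age domain size 33, H(T_Country) = 3, Occupation term 1).

```python
def il(ages_in_e: list[int], h_country_lca: int) -> float:
    """IL(e) = |e| * D(e) for the example's three-attribute quasi-identifier."""
    d = (max(ages_in_e) - min(ages_in_e)) / 33 + h_country_lca / 3 + 1.0
    return len(ages_in_e) * d

print(il([41, 40, 24], 2))  # e_1: 3 * 2.1818... = 6.5454...
print(il([41, 57, 24], 3))  # e_2: 3 * 3.0 = 9.0
```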

Greedy k-member clustering algorithm
[Algorithm figure]
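The algorithm figure did not survive extraction. Below is a hedged Python sketch of the greedy procedure as the paper describes it: seed each cluster with the record furthest from the previous seed, grow it to k records by repeatedly adding the record that raises information loss least, then attach leftover records to the clusters they distort least. `dist` and `il_increase` are assumed callables (e.g., the record distance and IL from the preceding slides).

```python
def greedy_k_member(records: list, k: int, dist, il_increase) -> list[list]:
    """Greedy k-member clustering sketch; assumes len(records) >= k."""
    remaining = list(records)
    clusters = []
    r = remaining[0]                       # arbitrary starting record
    while len(remaining) >= k:
        # seed the next cluster with the record furthest from the last seed
        r = max(remaining, key=lambda x: dist(x, r))
        remaining.remove(r)
        cluster = [r]
        while len(cluster) < k:
            # add the record that increases the cluster's information loss least
            best = min(remaining, key=lambda x: il_increase(cluster, x))
            remaining.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    for leftover in remaining:
        # attach each leftover record to the cluster it distorts least
        target = min(clusters, key=lambda e: il_increase(e, leftover))
        target.append(leftover)
    return clusters
```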

Diversity Metrics
- The Equal Diversity metric (ED) assumes all sensitive attribute values are equally sensitive. It penalizes a cluster e by φ(e, s) = 1 if every record in e has the same value for sensitive attribute s, and φ(e, s) = 0 otherwise. The greedy algorithm is modified to account for this penalty.
- The Sensitive Diversity metric (SD) assumes a sensitive attribute has two types of values: truly-sensitive and not-so-sensitive. It uses ψ(e, s) = 1 if every record in e has the same truly-sensitive s value, and ψ(e, s) = 0 otherwise. The greedy algorithm is modified accordingly.

Classification Metric (CM)
- CM preserves the correlation between the quasi-identifier and class labels (non-sensitive values):
  CM = (Σ_{all rows r} Penalty(row r)) / N
  where N is the total number of records, and Penalty(row r) = 1 if r is suppressed or the class label of r differs from the class label of the majority in its equivalence group (0 otherwise).
- The greedy algorithm is modified accordingly.
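A minimal sketch of CM for already-formed groups, ignoring suppression; the grouping shown is illustrative, not from the paper.

```python
from collections import Counter

def cm(groups: list[list[str]]) -> float:
    """Fraction of records whose class label disagrees with their
    group's majority label."""
    n = sum(len(g) for g in groups)
    penalty = sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups)
    return penalty / n

print(cm([["Flu", "Flu", "Cancer"], ["Obesity", "Obesity"]]))  # 1/5 = 0.2
```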

Experimental Setup
- Data: Adult dataset from the UC Irvine Machine Learning Repository
- 10 attributes (2 numeric, 7 categorical, 1 class)
- Compared with two other algorithms:
  - median partitioning (the Mondrian algorithm)
  - k-nearest neighbor

Experimental Results
[Results figures]

Conclusion
- The k-anonymity problem is transformed into the k-member clustering problem.
- Overall, the greedy algorithm produced better results than the other algorithms, at the cost of efficiency.