Data Anonymization (1). Outline: problem; concepts; algorithms on domain generalization hierarchy; algorithms on numerical data.

Presentation transcript:

Data Anonymization (1)

Outline
- Problem
- Concepts
- Algorithms on domain generalization hierarchy
- Algorithms on numerical data

The Massachusetts Governor Privacy Breach
- Medical data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex
- Voter list: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex
- Quasi-identifier shared by both tables: Zip, Birth date, Sex
- The Governor of MA was uniquely identified using Zip code, birth date, and sex, so his name was linked to his diagnosis
- 87% of the US population is uniquely identified by this quasi-identifier (Sweeney, IJUFKS 2002)
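The linking step can be illustrated with a small sketch. The records, names, and field names below are invented for illustration only; the code simply joins two tables on the quasi-identifier and reports the voters who match exactly one medical record.

```python
# Toy, made-up records (hypothetical names and values, not real data).
medical = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "birth_date": "1962-07-15", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1962-07-15", "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_date": "1971-03-09", "sex": "F"},
]
QI = ("zip", "birth_date", "sex")  # the quasi-identifier attributes


def link(medical, voters, qi=QI):
    """Join the tables on the quasi-identifier.

    Returns (name, diagnosis) pairs for voters whose QI values match
    exactly one medical record, i.e. the re-identified individuals.
    """
    index = {}
    for rec in medical:
        index.setdefault(tuple(rec[a] for a in qi), []).append(rec)
    linked = []
    for voter in voters:
        matches = index.get(tuple(voter[a] for a in qi), [])
        if len(matches) == 1:
            linked.append((voter["name"], matches[0]["diagnosis"]))
    return linked


linked = link(medical, voters)
```

A voter whose quasi-identifier matches several medical records is not reported, which is the situation k-anonymity tries to enforce for everyone.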

Definition
- Table: columns are attributes, rows are records
- Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
- K-anonymity: every combination of QI values that occurs in the table occurs in at least k records
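A minimal sketch of checking this definition, assuming the table is a list of dictionaries and the quasi-identifier is a tuple of column names (both are illustrative choices, not part of the slides):

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """Return True if every QI value combination present in `rows`
    is shared by at least k records."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical, already generalized toy table.
rows = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
```

Here the third record is alone in its QI group, so the table is 1-anonymous but not 2-anonymous; suppressing that record would make the remainder 2-anonymous.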

Basic techniques
- Generalization
  - Zip {02138, 02139} → 0213*
  - Domain generalization hierarchy: A0 → A1 → … → An
  - E.g. {02138, 02139} → 0213* → 021* → 02* → 0* → *
  - This hierarchy is a tree structure
- Suppression
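The Zip hierarchy in the example can be sketched as a function that climbs a given number of levels. The function name and the level numbering (0 = original value) are assumptions made for this illustration.

```python
def generalize_zip(zip_code, level):
    """Climb `level` steps up the Zip generalization hierarchy by
    replacing the last `level` digits with a single '*'."""
    if level <= 0:
        return zip_code
    level = min(level, len(zip_code))
    return zip_code[: len(zip_code) - level] + "*"


# The full path from the leaf 02138 to the root '*'.
path = [generalize_zip("02138", level) for level in range(6)]
```

Two values fall into the same group as soon as they share an ancestor at the chosen level, for instance 02138 and 02139 at level 1.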

Balance
- Better privacy guarantee vs. lower data utility
- There are many schemes that satisfy the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.
- Suppression is required if we cannot find a k-anonymity group for a record.

Criteria
- Minimal generalization: the minimal generalization that satisfies the k-anonymization specification
- Minimal table distortion: a minimal generalization with minimal utility loss
  - Use precision to evaluate the loss [Sweeney papers]
  - Application-specific utility

Complexity of finding the optimal generalization
- NP-hard (Bayardo ICDE05)
- So the proposed algorithms are all approximate algorithms

Shared features in different solutions
- Always satisfy the k-anonymity specification
  - If some records do not, suppress them
- Differences lie in the utility loss/cost function
  - Sweeney's precision metric
  - Discernibility & classification metrics
  - Information-privacy metric
- Algorithms
  - Assume the domain generalization hierarchy is given
  - Efficiency
  - Utility maximization

Metrics to be optimized
- Two cost metrics that we want to minimize (Bayardo ICDE05)
  - Discernibility: each record is penalized by the # of items in its k-anonymity group
  - Classification: the # of records in minority classes in the group
    - The dataset has a class label column; the goal is to preserve the classification model
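A simplified sketch of the two cost metrics follows. It assumes each k-anonymity group is given as a list of class labels and it leaves out the extra penalty that suppressed records would receive, so it should be read as an illustration of the idea rather than the exact cost functions of the cited paper.

```python
from collections import Counter


def discernibility(groups):
    """Each record is charged the size of its group, so a group of
    size n contributes n * n to the total."""
    return sum(len(group) ** 2 for group in groups)


def classification_penalty(groups):
    """Count the records whose class label is not the majority label
    of their group (the 'minority class' records)."""
    return sum(
        len(group) - max(Counter(group).values())
        for group in groups
        if group
    )


# Hypothetical groups, each listing the class labels of its records.
groups = [["yes", "yes", "no"], ["no", "no"]]
```

Larger groups raise the discernibility cost quadratically, while the classification cost only rises when a group mixes class labels.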

Metrics
- A combination of information loss and anonymity gain (Wang ICDE04)
  - Information loss, anonymity gain
  - Information-privacy metric

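Metrics
- Information loss
  - The dataset has class labels
  - Entropy: for a set S labeled by different classes, entropy measures the impurity of the labels:
    $\mathrm{Info}(S) = -\sum_i P_i \log_2 P_i$, where $P_i$ is the percentage of label $i$ in $S$
  - Information loss of a generalization G: {c1, c2, …, cn} → p:
    $I(G) = \mathrm{Info}(S_p) - \sum_{i=1}^{n} \frac{|R_{c_i}|}{|S_p|}\,\mathrm{Info}(R_{c_i})$, where $R_{c_i}$ is the set of records with value $c_i$ and $S_p$ is the set of records generalized to $p$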
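The entropy-based information loss can be sketched as follows, assuming the standard Shannon entropy over class labels and a size-weighted average over the child values being merged. The function names and the input layout (one list of class labels per child value) are illustrative assumptions.

```python
import math
from collections import Counter


def info(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(labels).values()
    )


def information_loss(children):
    """Entropy of the merged parent set minus the size-weighted
    entropy of the child sets that are generalized into it."""
    parent = [label for child in children for label in child]
    n = len(parent)
    weighted = sum(len(child) / n * info(child) for child in children if child)
    return info(parent) - weighted


# Merging two pure children loses all class information (1 bit);
# merging two evenly mixed children loses none.
loss_pure = information_loss([["yes", "yes"], ["no", "no"]])
loss_mixed = information_loss([["yes", "no"], ["yes", "no"]])
```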

- Anonymity gain
  - A(VID): # of records with the VID
  - A_G(VID) >= A(VID): generalization improves or does not change A(VID)
  - Anonymity gain: P(G) = x − A(VID), where x = A_G(VID) if A_G(VID) <= K, and x = K otherwise
  - As long as k-anonymity is satisfied, further generalization of the VID does not gain anything

- Information-privacy combined metric
  - IP = info loss / anonymity gain = I(G)/P(G)
  - We want to minimize IP
  - If P(G) == 0, use I(G) only
  - Either a small I(G) or a large P(G) reduces IP
  - If the P(G)s are the same, pick the generalization with minimum I(G)
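A small sketch of the anonymity gain and the combined score, assuming the VID does not yet satisfy k-anonymity before the generalization (A(VID) < K). The function names and the candidate tuples are hypothetical.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), where x is A_G(VID) capped at k."""
    x = a_after if a_after <= k else k
    return x - a_before


def ip_score(info_loss, gain):
    """Information-privacy score I(G)/P(G); falls back to I(G)
    alone when the gain is zero."""
    return info_loss if gain == 0 else info_loss / gain


# Hypothetical candidates: (name, I(G), A(VID), A_G(VID)), with k = 3.
candidates = [("G1", 0.6, 1, 2), ("G2", 0.6, 1, 5), ("G3", 0.9, 1, 4)]
k = 3
best = min(
    candidates,
    key=lambda c: (ip_score(c[1], anonymity_gain(c[2], c[3], k)), c[1]),
)
```

Sorting by the pair (IP, I(G)) breaks ties in favor of the smaller information loss; when gains are equal this amounts to picking the candidate with minimum I(G).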

Domain-hierarchy based algorithms
- Sweeney's algorithm
- Bayardo's tree pruning algorithm
- Wang's top-down and bottom-up algorithms
- They are all dimension-by-dimension methods

Multidimensional techniques
- Categorical data?
  - Categories are mapped to numbers (numerized)
    - Bayardo 05 paper
    - Does the order matter? (no research on that)
- Numerical data
  - K-anonymization → n-dim space partitioning
  - Many existing techniques can be applied

Single-dimensional vs. multidimensional

The evolving procedure
Categorical (domain hierarchy) [Sweeney, top-down/bottom-up] → numerized categories, single-dimensional [Bayardo05] → numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

Method 1: Mondrian
- Numerize categorical data
- Apply a top-down partitioning process
(Figure: step 1, step 2.1, step 2.2)

Allowable cut
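The top-down partitioning and the allowable-cut test can be sketched as below. This is a simplified illustration, assuming records are tuples of numbers, that a cut is allowable when both sides keep at least k records, and that dimensions are tried in order of widest raw range with a median split; the published Mondrian algorithm differs in details such as range normalization.

```python
def mondrian(records, k):
    """Recursively split `records` (tuples of numbers) into partitions
    that each hold at least k records."""
    if len(records) < 2 * k:
        return [records]  # no cut can leave k records on both sides

    def spread(dim):
        values = [record[dim] for record in records]
        return max(values) - min(values)

    dims = sorted(range(len(records[0])), key=spread, reverse=True)
    for dim in dims:
        ordered = sorted(records, key=lambda record: record[dim])
        median = ordered[len(ordered) // 2][dim]
        left = [record for record in ordered if record[dim] < median]
        right = [record for record in ordered if record[dim] >= median]
        # Allowable cut: both halves still contain at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut on any dimension


# Hypothetical (age, salary in thousands) records.
records = [(25, 30), (27, 35), (30, 40), (45, 80), (50, 90), (52, 95)]
partitions = mondrian(records, k=2)
```

Each resulting partition becomes one k-anonymity group, and its records are published with the partition's range on every dimension.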

Method 2: spatial indexing
- Multidimensional spatial techniques
  - Kd-tree (similar to the Mondrian algorithm)
  - R-tree and its variations
(Figure: R-tree and R+-tree, showing the leaf layer and the upper layer)

Compacting bounds
- Example
  - Uncompacted: age [1-80], salary [10k-100k]
  - Compacted: age [20-40], salary [10k-50k]
- The original Mondrian does not consider compacting bounds
- For the R+-tree, it is done automatically
- Information is better preserved
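Compacting amounts to replacing a partition's nominal range with the tightest range that still covers its records. A minimal sketch, assuming a partition is a non-empty list of numeric tuples; the sample values are invented to match the example ranges:

```python
def compact_bounds(partition):
    """Return the tightest (min, max) range per dimension that still
    covers every record in the partition."""
    dims = range(len(partition[0]))
    return [
        (min(record[d] for record in partition),
         max(record[d] for record in partition))
        for d in dims
    ]


# Hypothetical (age, salary in thousands) records inside a partition
# whose uncompacted bounds were age [1-80], salary [10k-100k].
partition = [(20, 10), (33, 50), (40, 25)]
bounds = compact_bounds(partition)
```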

Benefits of using the R+-tree
- Scalable: originally designed for indexing large disk-based data
- Multi-granularity k-anonymity: layers
- Better performance
- Better quality

Performance (figure labeled: Mondrian)

Utility
- Metrics
  - Discernibility penalty
  - KL divergence: describes the difference between a pair of distributions, e.g. against the anonymized data distribution
  - Certainty penalty, defined over T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
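The formulas for these metrics did not survive in the transcript, so the sketch below uses one common form as an assumption: the certainty penalty sums, over every record t and attribute Ai, the width of the generalized range t.Ai divided by the width of the total range T.Ai, and the KL divergence is the standard discrete definition in base 2.

```python
import math


def certainty_penalty(generalized, total_ranges):
    """Sum over records and attributes of
    (generalized range width) / (total range width)."""
    penalty = 0.0
    for record in generalized:
        for (low, high), (t_low, t_high) in zip(record, total_ranges):
            penalty += (high - low) / (t_high - t_low)
    return penalty


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins;
    q must be positive wherever p is positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical table with m = 2 attributes and two generalized records.
total_ranges = [(0, 100), (0, 10)]
generalized = [[(20, 40), (0, 5)], [(20, 40), (0, 5)]]
cp = certainty_penalty(generalized, total_ranges)
kl = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

A lower certainty penalty means tighter published ranges, and a KL divergence of zero means the two distributions are identical.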

Other issues
- Sparse high-dimensionality
  - Transactional data → boolean matrix
  - "On the anonymization of sparse high-dimensional data", ICDE08
  - Relates to the clustering problem of transactional data!
    - The above paper uses matrix-based clustering
    - Item-based clustering (?)

Other issues
- Effect of numerizing categorical data
  - The ordering of categories may have a certain impact on quality
- General-purpose utility metrics vs. special task-oriented utility metrics
- Attacks on the k-anonymity definition