Data Anonymization - Generalization Algorithms Li Xiong CS573 Data Privacy and Anonymity.

Slides:



Advertisements
Similar presentations
Jeremiah Blocki CMU Ryan Williams IBM Almaden ICALP 2010.
Advertisements

Greedy Algorithms.
Wang, Lakshmanan Probabilistic Privacy Analysis of Published Views, IDAR'07 Probabilistic Privacy Analysis of Published Views Hui (Wendy) Wang Laks V.S.
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
Dynamic Programming.
Lecture 12: Revision Lecture Dr John Levine Algorithms and Complexity March 27th 2006.
S. J. Shyu Chap. 1 Introduction 1 The Design and Analysis of Algorithms Chapter 1 Introduction S. J. Shyu.
Anatomy: Simple and Effective Privacy Preservation Xiaokui Xiao, Yufei Tao Chinese University of Hong Kong.
Fast Data Anonymization with Low Information Loss 1 National University of Singapore 2 Hong Kong University
Anonymization Algorithms - Microaggregation and Clustering Li Xiong CS573 Data Privacy and Anonymity.
Approximating Sensor Network Queries Using In-Network Summaries Alexandra Meliou Carlos Guestrin Joseph Hellerstein.
Exploratory Data Mining and Data Preparation
Semidefinite Programming
CPSC 322, Lecture 12Slide 1 CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12 (Textbook Chpt ) January, 29, 2010.
Privacy and k-Anonymity Guy Sagy November 2008 Seminar in Databases (236826)
Finding Personally Identifying Information Mark Shaneck CSCI 5707 May 6, 2004.
1 On the Anonymization of Sparse High-Dimensional Data 1 National University of Singapore 2 Chinese University of Hong.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Protecting Privacy when Disclosing Information Pierangela Samarati Latanya Sweeney.
CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.
Anonymization of Set-Valued Data via Top-Down, Local Generalization Yeye He Jeffrey F. Naughton University of Wisconsin-Madison 1.
Task 1: Privacy Preserving Genomic Data Sharing Presented by Noman Mohammed School of Computer Science McGill University 24 March 2014.
Preserving Privacy in Published Data
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Data mining and machine learning A brief introduction.
Beyond k-Anonymity Arik Friedman November 2008 Seminar in Databases (236826)
Publishing Microdata with a Robust Privacy Guarantee
CS573 Data Privacy and Security Statistical Databases
Database Management 9. course. Execution of queries.
Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo.
Differentially Private Data Release for Data Mining Noman Mohammed*, Rui Chen*, Benjamin C. M. Fung*, Philip S. Yu + *Concordia University, Montreal, Canada.
Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis
K-Anonymity & Algorithms
Data Anonymization (1). Outline  Problem  concepts  algorithms on domain generalization hierarchy  Algorithms on numerical data.
Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security.
Mining various kinds of Association Rules
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Reporter : Yu Shing Li 1.  Introduction  Querying and update in the cloud  Multi-dimensional index R-Tree and KD-tree Basic Structure Pruning Irrelevant.
On the Approximability of Geometric and Geographic Generalization and the Min- Max Bin Covering Problem Michael T. Goodrich Dept. of Computer Science joint.
Lecture 3: Uninformed Search
1 Solving problems by searching 171, Class 2 Chapter 3.
CS4432: Database Systems II Query Processing- Part 2.
Technology Mapping. 2 Technology mapping is the phase of logic synthesis when gates are selected from a technology library to implement the circuit. Technology.
Privacy-preserving data publishing
Presented by: Sandeep Chittal Minimum-Effort Driven Dynamic Faceted Search in Structured Databases Authors: Senjuti Basu Roy, Haidong Wang, Gautam Das,
The Impact of Duality on Data Representation Problems Panagiotis Karras HKU, June 14 th, 2007.
1/3/ A Framework for Privacy- Preserving Cluster Analysis IEEE ISI 2008 Benjamin C. M. Fung Concordia University Canada Lingyu.
Multi-dimensional Search Trees CS302 Data Structures Modified from Dr George Bebis.
CSCI 347, Data Mining Data Anonymization.
Anonymizing Data with Quasi-Sensitive Attribute Values Pu Shi 1, Li Xiong 1, Benjamin C. M. Fung 2 1 Departmen of Mathematics and Computer Science, Emory.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Efficient.
Data Anonymization - Generalization Algorithms Li Xiong, Slawek Goryczka CS573 Data Privacy and Anonymity.
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
Personalized Privacy Preservation: beyond k-anonymity and ℓ-diversity SIGMOD 2006 Presented By Hongwei Tian.
Data Mining And Privacy Protection Prepared by: Eng. Hiba Ramadan Supervised by: Dr. Rakan Razouk.
Transforming Data to Satisfy Privacy Constraints 컴퓨터교육 전공 032CSE15 최미희.
Artificial Intelligence Solving problems by searching.
Lecture 3: Uninformed Search
Adversarial Search and Game-Playing
Fast Data Anonymization with Low Information Loss
ACHIEVING k-ANONYMITY PRIVACY PROTECTION USING GENERALIZATION AND SUPPRESSION International Journal on Uncertainty, Fuzziness and Knowledge-based Systems,
Data Driven Resource Allocation for Distributed Learning
Automatic Physical Design Tuning: Workload as a Sequence
Sungho Kang Yonsei University
Data Transformations targeted at minimizing experimental variance
TELE3119: Trusted Networks Week 4
CS639: Data Management for Data Science
Towards identity-anonymization on graphs
Presentation transcript:

Data Anonymization - Generalization Algorithms Li Xiong CS573 Data Privacy and Anonymity

Generalization and Suppression Generalization Replace the value with a less specific but semantically consistent value #ZipAgeNationalityCondition < 40*Heart Disease < 40*Heart Disease < 40*Cancer < 40*Cancer Suppression Do not release a value at all Z0 = {41075, 41076, 41095, 41099} Z1 = {4107*. 4109*} Z2 = {410**} S0 = {Male, Female} S1 = {Person}

3 Complexity Search Space: Number of generalizations =  (Max level of generalization for attribute i + 1) attrib i If we allow generalization to a different level for each value of an attribute: Number of generalizations =  (Max level of generalization for attribute i + 1) attrib i #tuples

Hardness result Given some data set R and a QI Q, does R satisfy k-anonymity over Q? Easy to tell in polynomial time, NP! Finding an optimal anonymization is not easy NP-hard: reduction from k-dimensional perfect matching A polynomial solution implies P = NP A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS’04.

Taxonomy of Generalization Algorithms Top-down specialization vs. bottom-up generalization Global (single dimensional) vs. local (multi- dimensional) Complete (optimal) vs. greedy (approximate) Hierarchy-based (user defined) vs. partition- based (automatic) K. LeFerve, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. In SIGMOD 05

Generalization algorithms Early systems µ-Argus, Hundpool, Global, bottom-up, greedy Datafly, Sweeney, Global, bottom-up, greedy k-anonymity algorithms AllMin, Samarati, Global, bottom-up, complete, impractical MinGen, Sweeney, Global, bottom-up, complete, impractical Bottom-up generalization, Wang, 2004 – Global, bottom-up, greedy TDS (Top-Down Specialization), Fung, Global, top-down, greedy K-OPTIMIZE, Bayardo, 2005 – Global, top-down, partition-based, complete Incognito, LeFevre, 2005 – Global, bottom-up, hierarchy-based, complete Mondrian, LeFevre, 2006 – Local, top-down, partition-based, greedy

Hundpool and Willenborg, 1996 Greedy approach Global generalization with tuple suppression Not guaranteeing k-anonymity µ-Argus

µ-Argus algorithm

µ-Argus

Problems With µ-Argus 1. Only 2- and 3- combinations are examined, there may exist 4 combinations that are unique – may not always satisfy k-anonymity 2. Enforce generalization at the attribute level (global) – may over generalize

The Datafly System Sweeney, 1997 Greedy approach Global generalization with tuple suppression

Datafly Algorithm Core Datafly Algorithm

Datafly MGT resulting from Datafly, k=2, QI={Race, Birthdate, Gender, ZIP}

Problems With Datafly 1.Generalizing all values associated with an attribute (global) 2.Suppressing all values within a tuple (global) 3.Selecting the attribute with the greatest number of distinct values as the one to generalize first – computationally efficient but may over generalize

Generalization algorithms Early systems µ-Argus, Hundpool, Global, bottom-up, greedy Datafly, Sweeney, Global, bottom-up, greedy k-anonymity algorithms AllMin, Samarati, Global, bottom-up, complete, impractical MinGen, Sweeney, Global, bottom-up, complete, impractical Bottom-up generalization, Wang, 2004 – Global, bottom-up, greedy TDS (Top-Down Specialization), Fung, Global, top-down, greedy K-OPTIMIZE, Bayardo, 2005 – Global, top-down, partition-based, complete Incognito, LeFevre, 2005 – Global, bottom-up, hierarchy-based, complete Mondrian, LeFevre, 2006 – Local, top-down, partition-based, greedy

5/7/ K-OPTIMIZE Practical solution to guarantee optimality Main techniques Framing the problem into a set-enumeration search problem Tree-search strategy with cost-based pruning and dynamic search rearrangement Data management strategies

5/7/ Anonymization Strategies Local suppression Delete individual attribute values E.g. Global attribute generalization Replace specific values with more general ones for an attribute Numeric data: partitioning of the attribute domain into intervals. E.g. Age={[1-10],...,[ ]} Categorical data: generalization hierarchy supplied by users. E.g. Gender = [M or F]

5/7/ K-Anonymization with Suppression K-anonymization with suppression Global attribute generalization with local suppression of outlier tuples. Terminologies Dataset: D Anonymization: {a 1, …, a m } Equivalent classes: E v n,m v 1,n … v 1,m …v 1,1 a1a1 amam E{E{

5/7/ Finding Optimal Anonymization Optimal anonymization determined by a cost metric Cost metrics Discernibility metric: penalty for non- suppressed tuples and suppressed tuples Classification metric

5/7/ Modeling Anonymizations Assume a total order over the set of all attribute domain Set representation for anonymization E.g. Age:, Gender:, Marital Status: {1, 2, 4, 6, 7, 9} -> {2, 7, 9} Power set representation for entire anonymization space Power set of {2, 3, 5, 7, 8, 9} - order of 2 n ! {} – most general anonymization {2,3,5,7,8,9} – most specific anonymization

5/7/ Optimal Anonymization Problem Goal Find the best anonymization in the powerset with lowest cost Algorithm set enumeration search through tree expansion - size 2 n Top-down depth first search Heuristics Cost-based pruning Dynamic tree rearrangement Set enumeration tree over powerset of {1,2,3,4}

5/7/ Node Pruning through Cost Bounding Intuitive idea prune a node H if none of its descendents can be optimal Cost lower-bound of subtree of H Cost of suppressed tuples bounded by H Cost of non-suppressed tuples bounded by A H A

5/7/ Useless Value Pruning Intuitive idea Prune useless values that have no hope of improving cost Useless values Only split equivalence classes into suppressed equivalence classes (size < k)

5/7/ Tree Rearrangement Intuitive idea Dynamically reorder tree to increase pruning opportunities Heuristics sort the values based on the number of equivalence classes induced

5/7/ Experiments Adult census dataset 30k records and 9 attributes Fine: powerset of size Evaluations of performance and optimal cost Comparison with greedy/stochastic method 2-phase greedy generalization/specialization Repeated process

5/7/ Results – Comparison None of the other optimal algorithms can handle the census data Greedy approaches, while executing quickly, produce highly sub- optimal anonymizations Comparison with 2-phase method (greedy + stochastic)

5/7/ Comments Interesting things to think about Domains without hierarchy or total order restrictions Other cost metrics Global generalization vs. local generalization

Generalization algorithms Early systems µ-Argus, Hundpool, Global, bottom-up, greedy Datafly, Sweeney, Global, bottom-up, greedy k-anonymity algorithms AllMin, Samarati, Global, bottom-up, complete, impractical MinGen, Sweeney, Global, bottom-up, complete, impractical Bottom-up generalization, Wang, 2004 – Global, bottom-up, greedy TDS (Top-Down Specialization), Fung, Global, top-down, greedy K-OPTIMIZE, Bayardo, 2005 – Global, top-down, partition-based, complete Incognito, LeFevre, 2005 – Global, bottom-up, hierarchy-based, complete Mondrian, LeFevre, 2006 – Local, top-down, partition-based, greedy

Mondrian Top-down partitioning Greedy Local (multidimensional) – tuple/cell level

Global Recoding Mapping domains of quasi-identifiers to generalized or altered values using a single function Notation D x is the domain of attribute X i in table T Single Dimensional φ i : D xi  D’ for each attribute X i of the quasi- id φ i applied to values of X i in tuple of T

Local Recoding Multi-Dimensional Recode domain of value vectors from a set of quasi-identifier attributes φ : D x1 x … x D xn  D’ φ applied to vector of quasi-identifier attributes in each tuple in T

Partitioning Single Dimensional For each X i, define non-overlapping single dimensional intervals that covers D xi Use φ i to map x ε D x to a summary stat Strict Multi-Dimensional Define non-overlapping multi-dimensional intervals that covers D x1 … D xd Use φ to map (x x1 …x xd ) ε D x1 … D xd to a summary stat for its region

Global Recoding Example Multi-Dimensional Single Dimensional Partitions Age : {[25-28]} Sex: {Male, Female} Zip : {[ ], 53712} Partitions {Age: [25-26],Sex: Male, Zip: 53711} {Age: [25-27],Sex: Female, Zip: 53712} {Age: [27-28],Sex: Male, Zip: [ ]} k = 2 Quasi Identifiers Age, Sex, Zipcode

Global Recoding Example 2 k = 2 Quasi Identifiers Age, Zipcode Patient Data Single Dimensional Multi-Dimensional

Greedy Partitioning Algorithm Problem Need an algorithm to find multi-dimensional partitions Optimal k-anonymous strict multi-dimensional partitioning is NP-hard Solution Use a greedy algorithm Based on k-d trees Complexity O(nlogn)

Greedy Partitioning Algorithm

Algorithm Example k = 2 Dimension determined heuristically Quasi-identifiers Zipcode Age Patient Data Anonymized Data

Algorithm Example Iteration # 1 (full table) partition dim = Zipcode splitVal = ` LHSRHS fs

Algorithm Example continued Iteration # 2 (LHS from iteration # 1) partition dim = Age splitVal = 26 LHSRHS fs `

Algorithm Example continued Iteration # 3 (LHS from iteration # 2) partition No Allowable Cut ` Summary: Age = [25-26] Zip= [53711] Iteration # 4 (RHS from iteration # 2) partition No Allowable Cut ` Summary: Age = [27-28] Zip= [ ] `

Algorithm Example continued Iteration # 5 (RHS from iteration # 1) partition No Allowable Cut ` Summary: Age = [25-27] Zip= [53712] `

Experiment Adult dataset Data quality metric (cost metric) Discernability Metric (C DM ) C DM = Σ EquivalentClasses E |E| 2 Assign a penalty to each tuple Normalized Avg. Eqiv. Class Size Metric (C AVG ) C AVG = (total_records/total_equiv_classes)/k

Comparison results Full-domain method: Incognito Single-dimensional method: K-OPTIMIZE

Data partitioning comparison

Mondrian