Protecting Privacy when Disclosing Information Pierangela Samarati Latanya Sweeney.


INTRODUCTION
Today's society places increasing demands on person-specific data.
More and more historically public information is also available electronically; when such sources are combined, individuals and their personal information can be identified.
This paper addresses the problem of releasing person-specific data while preserving the anonymity of the persons involved.
k-anonymity: released information maps ambiguously to at least k persons.

EXAMPLE

RELATED WORK
Several protection techniques exist for statistical databases: scrambling, adding noise, swapping values, etc.
Suppression and generalization techniques have been used, but without a formal foundation.
The problem differs from traditional access control, which protects the data itself; here the concern is the identity of the individuals to whom the data refer.

OUTLINE
Formal foundation for the anonymity problem and for protection against linking.
Quasi-identifiers: attributes that can be exploited for linking.
k-anonymity: the degree of protection of data with respect to inference by linking.
Preferred generalization: allows the user to select among the possible minimal generalizations (choice of attributes).
The approach protects the link between identity and data, not the data itself.

DEFINITIONS & ASSUMPTIONS
Quasi-identifier: Let T(A1,…,An) be a table. A quasi-identifier of T is a set of attributes {Ai,…,Aj} ⊆ {A1,…,An} whose release must be controlled.
Goal: allow the release of information in the table only if it relates to at least a given number k of individuals, where k is set by the data holder.
k-anonymity requirement: each release of the data must be such that every combination of values of quasi-identifiers can be indistinctly matched to at least k individuals.
Issue: the data holder cannot verify this requirement directly, because it cannot match the released data against all externally available data.

DEFINITIONS & ASSUMPTIONS
Although the data holder knows which attributes are externally available (and therefore contribute to quasi-identifiers), the specific external values cannot be assumed.
Key: translate the requirement into a condition on the released data.
Assumption: all attributes in table PT that are to be released and that are externally available in combination to a data recipient are defined in a quasi-identifier.
This is not a trivial assumption: Sweeney examines this risk and shows that it cannot be perfectly resolved.
k-anonymity for a table: Let T(A1,…,An) be a table and QT be the set of quasi-identifiers of T. T is said to satisfy k-anonymity iff, for each quasi-identifier QI in QT, each sequence of values in T[QI] appears with at least k occurrences in T[QI].
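The table-level definition can be checked by counting how often each combination of quasi-identifier values occurs. The following is a minimal Python sketch, assuming rows are dictionaries and a single quasi-identifier; the function name, column names and toy rows are hypothetical and not taken from the paper.

```python
from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifier, k):
    """True iff every combination of quasi-identifier values
    that occurs in rows occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical toy table (illustration only).
table = [
    {"race": "asian", "zip": "94138", "disease": "flu"},
    {"race": "asian", "zip": "94138", "disease": "cold"},
    {"race": "black", "zip": "94139", "disease": "flu"},
    {"race": "black", "zip": "94139", "disease": "asthma"},
]
QI = ("race", "zip")

print(satisfies_k_anonymity(table, QI, 2))
print(satisfies_k_anonymity(table, QI, 3))
```

Only the projection T[QI] matters for the check; the remaining attributes (here, disease) are ignored.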

GENERALIZING DATA
The first approach is based on the definition and use of generalization relationships between domains and between the values that attributes can assume.
Example: Z0 is the ZIP code domain and Z1 is the domain in which the last digit is replaced by 0.
To achieve k-anonymity, attribute values are mapped from domain Z0 to the more general domain Z1.
This mapping between domains is stated by means of a generalization relationship, which is a partial order ≤D on the set Dom of domains such that:
– each domain Di has at most one direct generalized domain;
– all maximal elements of Dom are singletons (eventually every domain can be generalized to a single value).
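The ZIP code example can be illustrated with a small Python sketch. The slide only describes the step from Z0 to Z1; treating each further level as zeroing one more trailing digit is an assumption made here for illustration, and `generalize_zip` is a hypothetical helper name.

```python
def generalize_zip(zip_code, level):
    """Map a ZIP code from Z0 to the level-th generalized domain by
    replacing its last `level` digits with 0 (level 0 returns it unchanged)."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level must be between 0 and the number of digits")
    keep = len(zip_code) - level
    return zip_code[:keep] + "0" * level

# Two distinct Z0 values collapse to the same Z1 value.
print(generalize_zip("94138", 1), generalize_zip("94139", 1))
```

At the highest level every ZIP code maps to the same value, which matches the requirement that the maximal element of the hierarchy is a singleton.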

DOMAIN & VALUE GENERALIZATION HIERARCHIES

DOMAIN GENERALIZATION HIERARCHY
Let Dom be the set of domains. Given a tuple DT = (D1,…,Dn) such that Di ∈ Dom for i = 1,…,n, the domain generalization hierarchy of DT is DGH_DT = DGH_D1 × … × DGH_Dn, where the Cartesian product is ordered coordinate-wise.
Each path from DT to the unique maximal element of DGH_DT defines a possible alternative path that the generalization process can follow.
The set of nodes in each such path, together with the generalization relationship, is called a generalization strategy for DGH_DT.

GENERALIZED TABLE
Tj is a generalized table of Ti, written Ti ≤ Tj, iff:
– Ti and Tj have the same number of tuples;
– the domain of each attribute in Tj, denoted dom(Az,Tj), is equal to or a generalization of the domain of the corresponding attribute in Ti; and
– each tuple ti in Ti has a corresponding tuple tj in Tj (and vice versa) such that the value of each attribute in tj is equal to or a generalization of the value of the corresponding attribute in ti.
Not all generalized tables are satisfactory: an extremely generalized table is not needed if a more specific table exists that satisfies k-anonymity. This leads to the notion of k-minimal generalization.

Distance vector: Let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj. The distance vector of Tj from Ti is the vector DVi,j = [d1,…,dn], where dz is the length of the unique path between dom(Az,Ti) and dom(Az,Tj) in the domain generalization hierarchy DGH_Dz, with Dz = dom(Az,Ti).
Given two distance vectors DV = [d1,…,dn] and DV' = [d1',…,dn'], DV ≤ DV' iff di ≤ di' for all i = 1,…,n; DV < DV' iff DV ≤ DV' and DV ≠ DV'.
k-minimal generalization: Let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj. Tj is said to be a k-minimal generalization of Ti iff:
– Tj satisfies k-anonymity;
– there is no Tz such that Ti ≤ Tz, Tz satisfies k-anonymity, and DVi,z < DVi,j.

EXAMPLE For k=2, GT[1,0] and GT[0,1] are k-minimal generalizations, but not GT[0,2] and GT[1,1] For k=3, GT[1,0] and GT[0,2] are k-minimal generalizations.
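The partial order on distance vectors, and the selection of minimal vectors, can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the input list is assumed to contain the distance vectors of generalizations already known to satisfy k-anonymity (here, the four vectors named in the k=2 example).

```python
def leq(dv, dv_prime):
    """DV <= DV' iff every component of DV is <= the matching component of DV'."""
    return all(d <= dp for d, dp in zip(dv, dv_prime))

def less(dv, dv_prime):
    """DV < DV' iff DV <= DV' and DV != DV'."""
    return leq(dv, dv_prime) and tuple(dv) != tuple(dv_prime)

def minimal_vectors(anonymous_vectors):
    """Keep the vectors that no other vector in the list is strictly below."""
    vs = [tuple(v) for v in anonymous_vectors]
    return [v for v in vs if not any(less(w, v) for w in vs)]

# Distance vectors from the k=2 example, assumed to satisfy 2-anonymity.
candidates = [(1, 0), (0, 1), (0, 2), (1, 1)]
print(minimal_vectors(candidates))
```

The vectors [1,0] and [0,1] are incomparable, which is why more than one k-minimal generalization can exist.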

SUPPRESSING DATA
Suppression is a complementary approach to generalization. It is used to moderate the generalization process when a limited number of tuples have fewer than k occurrences.
Generalized table with suppression: Let Ti(A1,…,An) and Tj(A1,…,An) be two tables defined on the same attributes. Tj is said to be a generalization of Ti, written Ti ≤ Tj, iff:
– sizeof(Tj) ≤ sizeof(Ti);
– for all z = 1,…,n: dom(Az,Ti) ≤D dom(Az,Tj);
– there is an injective mapping between Ti and Tj that associates tuples ti (in Ti) and tj (in Tj) such that ti[Az] ≤ tj[Az].
Minimal required suppression: Let Tj be a generalization of Ti satisfying k-anonymity. Tj is said to enforce minimal required suppression iff there is no Tz such that Ti ≤ Tz, DVi,z = DVi,j, sizeof(Tj) < sizeof(Tz), and Tz satisfies k-anonymity.

EXAMPLE The tuples written in bold face and marked with double lines in each table are the tuples that must be suppressed to achieve k-anonymity of 2. Suppression of any superset would not satisfy minimal required suppression.
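For a fixed level of generalization, minimal required suppression amounts to removing exactly the tuples that occur fewer than k times. A minimal Python sketch, assuming each tuple is given as a hashable tuple of (already generalized) quasi-identifier values; the function name and the toy data are hypothetical:

```python
from collections import Counter

def suppress_outliers(rows, k):
    """Remove the tuples with fewer than k occurrences.
    Returns the remaining tuples and the number of suppressed tuples."""
    counts = Counter(rows)
    kept = [r for r in rows if counts[r] >= k]
    return kept, len(rows) - len(kept)

# Hypothetical generalized table projected on (race, ZIP).
rows = [
    ("asian", "94130"), ("asian", "94130"),
    ("black", "94130"), ("black", "94130"),
    ("white", "94140"),
]
kept, n_suppressed = suppress_outliers(rows, 2)
print(kept, n_suppressed)
```

Suppressing fewer tuples would leave an outlier in the table, and suppressing more would violate minimality.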

k-minimal generalization with suppression
Generalization and suppression are used in conjunction to obtain k-anonymity, which introduces a tradeoff between the two.
The data holder specifies an acceptable suppression threshold MaxSup. Within the threshold, suppression is considered preferable to generalization, because generalization affects all tuples whereas suppression affects single tuples.
k-minimal generalization with suppression: Let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj, and let MaxSup be the specified threshold of acceptable suppression. Tj is a k-minimal generalization of Ti iff:
– Tj satisfies k-anonymity;
– sizeof(Ti) - sizeof(Tj) ≤ MaxSup;
– there is no Tz such that Ti ≤ Tz, Tz satisfies the two conditions above, and DVi,z < DVi,j.
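The tradeoff between generalization and suppression can be illustrated on a single attribute, where the distance vector has one component and the first acceptable level is therefore minimal. This is a sketch under stated assumptions, not the paper's algorithm: the function names are hypothetical, and each generalization level is assumed to zero one more trailing ZIP digit.

```python
from collections import Counter

def count_outliers(values, k):
    """Number of tuples whose value occurs fewer than k times."""
    counts = Counter(values)
    return sum(c for c in counts.values() if c < k)

def lowest_acceptable_level(zips, k, max_sup):
    """Return (level, suppressed) for the lowest generalization level at which
    suppressing all outliers stays within max_sup, or None if no level works."""
    n_digits = len(zips[0])
    for level in range(n_digits + 1):
        generalized = [z[:n_digits - level] + "0" * level for z in zips]
        suppressed = count_outliers(generalized, k)
        if suppressed <= max_sup:
            return level, suppressed
    return None

zips = ["94138", "94138", "94139", "94139", "94142"]
print(lowest_acceptable_level(zips, k=2, max_sup=0))
print(lowest_acceptable_level(zips, k=2, max_sup=1))
```

With MaxSup = 0 the single outlier forces two generalization steps on every tuple; allowing one suppressed tuple avoids generalization altogether.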

EXAMPLE

PREFERENCES
There may be more than one minimal generalization. Which one should be chosen?
Let Tj be a generalization of Ti with distance vector DVi,j = [d1,…,dn].
– Absdist_i,j = Σ_{z=1..n} dz
– Reldist_i,j = Σ_{z=1..n} dz/hz, where hz is the height of the domain generalization hierarchy of dom(Az,Ti)
Policies:
– Minimum absolute distance (smallest total number of generalization steps)
– Minimum relative distance (smallest total number of relative steps)
– Maximum distribution (greatest number of distinct tuples)
– Minimum suppression (greatest number of tuples retained)
The choice depends on the application.
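The two distance measures can be computed directly from a distance vector. In the Python sketch below the function names are hypothetical, and the hierarchy heights (1 for the first attribute, 2 for the second) are assumed values chosen for illustration.

```python
def absdist(dv):
    """Absolute distance: total number of generalization steps."""
    return sum(dv)

def reldist(dv, heights):
    """Relative distance: each step count divided by the height of its hierarchy."""
    return sum(d / h for d, h in zip(dv, heights))

heights = [1, 2]  # assumed hierarchy heights for the two attributes
for dv in ([1, 0], [0, 1], [0, 2]):
    print(dv, absdist(dv), reldist(dv, heights))
```

Under these assumed heights, [1,0] and [0,2] differ in absolute distance but tie in relative distance, so the two policies can rank the same candidates differently.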

COMPUTING A PREFERRED GENERALIZATION
A generalization is computed for each quasi-identifier independently.
Local minimal generalization: a generalization that is minimal with respect to the set of generalizations in a given strategy.
Theorem: Let T(A1,…,An) = PT[QI] be the table to be generalized and let DT = (D1,…,Dn) be the tuple where Dz = dom(Az,T), z = 1,…,n. Every k-minimal generalization of T is a local minimal generalization for some strategy of DGH_DT.
By this theorem, following each generalization strategy bottom-up yields its local minimal generalization; the k-minimal generalizations, and eventually a preferred generalization, are chosen from among these.
If preference policies are considered, the search has to be extended beyond the first result, which may be expensive.

IMPROVEMENT
Distance vector between tuples: Let x = (v1,…,vn) and y = (v1',…,vn') be two tuples of T. The distance vector between x and y is the vector Vx,y = [d1,…,dn], where di is the length of the paths from vi and vi' to their closest common ancestor in the value generalization hierarchy VGH_Di.
Theorem: Let Ti and Tj be two tables such that Ti ≤ Tj. If Tj is a k-minimal generalization of Ti, then DVi,j = Vx,y for some tuples x and y in Ti such that either x or y has fewer than k occurrences.
This implies that the distance vector of a minimal generalization falls within the set of vectors between outliers and the other tuples in the table.
The authors exploit this property to prune the number of generalizations considered.
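The distance vector between two tuples can be sketched in Python by walking each value up its value generalization hierarchy. The hierarchies below are hypothetical child-to-parent maps invented for illustration, and the sketch assumes both values of an attribute come from the same ground domain, so the two paths to the common ancestor have equal length.

```python
def ancestors(value, parent):
    """The value followed by its successive generalizations up to the root."""
    chain = [value]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def steps_to_common_ancestor(v, w, parent):
    """Number of generalization steps from v to the closest ancestor shared with w."""
    chain_w = ancestors(w, parent)
    for steps, a in enumerate(ancestors(v, parent)):
        if a in chain_w:
            return steps
    raise ValueError("values do not share an ancestor")

def tuple_distance(x, y, hierarchies):
    """Distance vector V_{x,y}: one component per attribute."""
    return [steps_to_common_ancestor(v, w, h) for v, w, h in zip(x, y, hierarchies)]

# Hypothetical value generalization hierarchies (child -> parent).
RACE = {"asian": "person", "black": "person", "white": "person"}
ZIP = {"94138": "94130", "94139": "94130", "94141": "94140",
       "94142": "94140", "94130": "94100", "94140": "94100"}

print(tuple_distance(("asian", "94138"), ("black", "94139"), [RACE, ZIP]))
print(tuple_distance(("asian", "94138"), ("asian", "94142"), [RACE, ZIP]))
```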

ALGORITHM - OUTLINE
All the distinct tuples in PT[QI] are determined, along with their numbers of occurrences.
The distance vectors between each outlier and every tuple in the table are computed.
A DAG is constructed whose nodes are all the distance vectors found. There is an arc from each vector to each of the smallest vectors dominating it in the set.
Each path is followed until a local minimal generalization is found. Because paths may not be disjoint, the algorithm keeps track of visited nodes.
After all the paths have been examined, the k-minimal and preferred generalizations are determined.
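The DAG construction step can be sketched as follows: each vector gets an arc to every minimal element among the vectors that strictly dominate it. This is an illustrative Python sketch, not the authors' implementation; the function names and the sample vectors are hypothetical.

```python
def less(a, b):
    """a < b in the component-wise partial order on distance vectors."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def build_dag(vectors):
    """Map each vector to the smallest vectors in the set that dominate it."""
    nodes = sorted(set(tuple(v) for v in vectors))
    arcs = {}
    for v in nodes:
        above = [w for w in nodes if less(v, w)]
        # Keep only the dominating vectors with no other dominating vector below them.
        arcs[v] = [w for w in above if not any(less(u, w) for u in above)]
    return arcs

dag = build_dag([(0, 1), (1, 0), (1, 1), (0, 2), (1, 2)])
for node, successors in dag.items():
    print(node, "->", successors)
```

In this sample the node (1, 1) is reachable from both (0, 1) and (1, 0), which shows why paths may overlap and visited nodes need to be tracked.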

EXISTENCE
Theorem: Let T be a table, MaxSup ≤ sizeof(T) be the acceptable suppression threshold, and k be a natural number. If sizeof(T) ≥ k, then there is at least one k-minimal generalization for T. If sizeof(T) < k, there are no non-empty k-minimal generalizations for T.
Experiments: cost reduction
– Computing distance vectors greatly reduces the cost.
– Generalizations are not computed but foreseen by looking at the tuples.
– Because the algorithm keeps track of evaluated generalizations, it can stop evaluating a path as soon as it crosses one that has already been visited.