1
Protecting Privacy when Disclosing Information
Pierangela Samarati, Latanya Sweeney
2
INTRODUCTION
– Today's society places increasing demands on person-specific data.
– More and more historically public information is also available electronically.
– When such sources are combined, individuals and their personal information can be identified.
– This paper addresses the problem of releasing person-specific data while preserving the anonymity of the people it describes.
– k-anonymity: each piece of released information maps ambiguously to at least k persons.
3
EXAMPLE
4
RELATED WORK
– Several protection techniques exist for statistical databases: scrambling, adding noise, swapping values, etc.
– Suppression and generalization techniques have also been used, but without a formal foundation.
– The problem differs from traditional access control: the goal is to protect the identity of the people the data refers to, rather than the data itself.
5
OUTLINE
– A formal foundation for the anonymity problem and for protection against linking.
– Quasi-identifiers: attributes that can be exploited for linking.
– k-anonymity: the degree of protection of data with respect to inference by linking.
– Preferred generalization: lets the user select among the possible minimal generalizations (e.g. by choosing attributes).
– The approach protects the link between identity and data, not the data itself.
6
DEFINITIONS & ASSUMPTIONS
– Quasi-identifier: let T(A1,…,An) be a table. A quasi-identifier is a set of attributes (Ai,…,Aj) ⊆ (A1,…,An) whose release must be controlled.
– Goal: allow the release of information in the table only if it relates to at least a given number k of individuals, where k is set by the data holder.
– k-anonymity requirement: each release of data must be such that every combination of quasi-identifier values can be indistinctly matched to at least k individuals.
– Issue: the data holder cannot check this requirement directly, because it cannot match the released data against all externally available data.
7
DEFINITIONS & ASSUMPTIONS
– The data holder may know which attributes are externally available (and therefore contribute to quasi-identifiers), but cannot assume knowledge of the specific external values.
– Key: translate the requirement into a condition on the released data.
– Assumption: all attributes in table PT that are to be released and that are externally available in combination to a data recipient are defined in a quasi-identifier.
– This is not a trivial assumption: Sweeney examines this risk and shows that it cannot be perfectly resolved.
– k-anonymity for a table: let T(A1,…,An) be a table and QT be the set of quasi-identifiers of T. T satisfies k-anonymity iff, for each quasi-identifier QI ∈ QT, each sequence of values in T[QI] appears with at least k occurrences in T[QI].
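The table-level definition can be checked mechanically by counting value combinations. The following is a minimal sketch, assuming the table is a list of dictionaries and a single quasi-identifier is given as a list of attribute names; the function name and the toy rows are illustrative and do not come from the paper.

```python
from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifier, k):
    """Return True if every combination of quasi-identifier values
    that appears in `rows` appears at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical toy table; attribute names and values are illustrative only.
table = [
    {"race": "asian", "zip": "94138", "diagnosis": "flu"},
    {"race": "asian", "zip": "94138", "diagnosis": "cold"},
    {"race": "black", "zip": "94139", "diagnosis": "flu"},
]

print(satisfies_k_anonymity(table, ["race", "zip"], 2))
```

When a table has several quasi-identifiers, the definition requires the check to hold for each of them, so the function would be called once per quasi-identifier.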
8
GENERALIZING DATA
– The first approach is based on defining and using generalization relationships between domains and between the values that attributes can assume.
– Example: Z0 is the ZIP code domain and Z1 is the domain in which the last digit is replaced by 0.
– To achieve k-anonymity, map attribute values from domain Z0 to the more general domain Z1.
– The mapping between domains is stated by a generalization relationship, which is a partial order ≤_D on the set Dom of domains, such that:
  – each domain Di has at most one direct generalized domain;
  – all maximal elements of Dom are singletons (every domain can eventually be generalized to a single value).
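As a small illustration of the Z0 to Z1 mapping, the sketch below zeroes trailing digits of a ZIP code. The slide only states the first step (last digit replaced by 0); continuing the same pattern for higher levels is an assumption made here, and the function name is hypothetical.

```python
def generalize_zip(zip_code: str, level: int) -> str:
    """Generalize a ZIP code by replacing its last `level` digits with 0.

    Level 0 is the original value (domain Z0); level 1 corresponds to Z1.
    Higher levels are an assumed continuation of the same pattern.
    """
    if level < 0 or level > len(zip_code):
        raise ValueError("level must be between 0 and the number of digits")
    keep = len(zip_code) - level
    return zip_code[:keep] + "0" * level

print(generalize_zip("94138", 1))  # 94130
print(generalize_zip("94139", 1))  # 94130
```

Two ZIP codes that differ only in the last digit become indistinguishable at level 1, which is what makes the generalized value harder to link.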
9
DOMAIN & VALUE GENERALIZATION HIERARCHIES
10
DOMAIN GENERALIZATION HIERARCHY
– Let Dom be the set of domains. Given a tuple DT = (D1,…,Dn) such that Di ∈ Dom for i = 1,…,n, define DGH_DT = DGH_D1 × … × DGH_Dn, where the Cartesian product is ordered coordinate-wise.
– Each path from DT to the unique maximal element of DGH_DT defines a possible alternative path for the generalization process.
– The set of nodes in each such path, together with the generalization relationship between them, is called a generalization strategy for DGH_DT.
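Because each domain has at most one direct generalization, each single-attribute hierarchy is a chain, so a node of DGH_DT can be represented by a tuple of level numbers. The sketch below enumerates the nodes and the generalization strategies under that representation; the heights (1 for one attribute, 2 for the other) are assumed for illustration.

```python
from itertools import product

def lattice_nodes(heights):
    """All nodes of the product hierarchy, as tuples of generalization levels."""
    return list(product(*(range(h + 1) for h in heights)))

def strategies(heights):
    """All paths from the bottom node (0,...,0) to the top node `heights`,
    raising exactly one coordinate by one level at each step."""
    top = tuple(heights)

    def extend(path):
        node = path[-1]
        if node == top:
            yield path
            return
        for z, h in enumerate(heights):
            if node[z] < h:
                nxt = node[:z] + (node[z] + 1,) + node[z + 1:]
                yield from extend(path + [nxt])

    return list(extend([tuple(0 for _ in heights)]))

# Assumed heights: attribute 1 has one generalization step, attribute 2 has two.
for path in strategies((1, 2)):
    print(path)
```

With these heights the lattice has six nodes and three strategies, one for each position at which the single step on the first attribute can be taken.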
11
GENERALIZED TABLE
Tj is a generalized table of Ti, written Ti ≤ Tj, iff:
– Ti and Tj have the same number of tuples;
– the domain of each attribute Az in Tj, denoted dom(Az,Tj), is equal to or a generalization of dom(Az,Ti); and
– each tuple ti in Ti has a corresponding tuple tj in Tj (and vice versa) such that the value of each attribute in tj is equal to or a generalization of the value of the corresponding attribute in ti.
Not all generalized tables are equally satisfactory: an extremely generalized table is unnecessary if a more specific table already satisfies k-anonymity. This motivates the notion of k-minimal generalization.
12
– Distance vector: let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj. The distance vector of Tj from Ti is the vector DV_i,j = [d1,…,dn], where dz is the length of the unique path between dom(Az,Ti) and dom(Az,Tj) in DGH_Dz, with Dz = dom(Az,Ti).
– Ordering: given two distance vectors DV = [d1,…,dn] and DV' = [d1',…,dn'], DV ≤ DV' iff di ≤ di' for all i = 1,…,n; DV < DV' iff DV ≤ DV' and DV ≠ DV'.
– k-minimal generalization: let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj. Tj is a k-minimal generalization of Ti iff:
  – Tj satisfies k-anonymity;
  – there is no Tz such that Ti ≤ Tz, Tz satisfies k-anonymity, and DV_i,z < DV_i,j.
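The ordering on distance vectors and the minimality condition translate directly into code. The sketch below assumes generalization without suppression, so that a generalized table is identified by its distance vector, and takes as input the vectors already known to satisfy k-anonymity; the input list is hypothetical.

```python
def leq(dv, dv2):
    """DV <= DV' iff every component of DV is <= the matching component of DV'."""
    return len(dv) == len(dv2) and all(a <= b for a, b in zip(dv, dv2))

def lt(dv, dv2):
    """DV < DV' iff DV <= DV' and the two vectors differ."""
    return leq(dv, dv2) and tuple(dv) != tuple(dv2)

def minimal_vectors(satisfying):
    """Keep the vectors that are not strictly dominated from below by another one."""
    vectors = [tuple(v) for v in satisfying]
    return [v for v in vectors if not any(lt(w, v) for w in vectors)]

# Hypothetical set of distance vectors whose tables satisfy k-anonymity.
satisfying = [(1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
print(minimal_vectors(satisfying))  # [(1, 0), (0, 1)]
```

Note that [1,0] and [0,1] are incomparable under this ordering, which is why more than one k-minimal generalization can exist.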
13
EXAMPLE
Here GT[d1,d2] denotes the generalized table with distance vector [d1,d2].
– For k = 2, GT[1,0] and GT[0,1] are k-minimal generalizations, but GT[0,2] and GT[1,1] are not.
– For k = 3, GT[1,0] and GT[0,2] are k-minimal generalizations.
14
SUPPRESSING DATA
– Suppression is a complementary approach to generalization.
– It moderates the generalization process when only a limited number of tuples (those with fewer than k occurrences) would otherwise force further generalization.
– Generalized table with suppression: let Ti(A1,…,An) and Tj(A1,…,An) be two tables defined on the same attributes. Tj is a generalization of Ti iff:
  – sizeof(Tj) ≤ sizeof(Ti);
  – for all z = 1,…,n: dom(Az,Ti) ≤ dom(Az,Tj);
  – there is an injective mapping from the tuples of Tj to the tuples of Ti that associates each tj with a ti such that ti[Az] ≤ tj[Az].
– Minimal required suppression: let Tj be a generalization of Ti satisfying k-anonymity. Tj enforces minimal required suppression iff there is no Tz such that Ti ≤ Tz, DV_i,z = DV_i,j, sizeof(Tj) < sizeof(Tz), and Tz satisfies k-anonymity.
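For a fixed level of generalization, removing exactly the tuples that occur fewer than k times yields the largest k-anonymous table, which matches the idea of minimal required suppression. A minimal sketch, assuming rows are already generalized and represented as tuples of quasi-identifier values (the data is illustrative):

```python
from collections import Counter

def suppress_outliers(rows, k):
    """Drop every tuple whose value combination occurs fewer than k times.

    Returns the remaining rows and the number of suppressed tuples.
    """
    counts = Counter(rows)
    kept = [row for row in rows if counts[row] >= k]
    return kept, len(rows) - len(kept)

# Hypothetical generalized rows (race, ZIP).
rows = [
    ("person", "94138"), ("person", "94138"),
    ("person", "94139"), ("person", "94139"),
    ("person", "94141"),
]
kept, suppressed = suppress_outliers(rows, 2)
print(len(kept), suppressed)  # 4 1
```

Suppressing any additional tuple would still give a k-anonymous table, but a smaller one, so it would not be minimal.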
15
EXAMPLE
The tuples written in bold face and marked with double lines in each table are the tuples that must be suppressed to achieve k-anonymity with k = 2. Suppressing any strict superset of these tuples would not satisfy minimal required suppression.
16
k-minimal generalization with suppression
– Generalization and suppression are used together to obtain k-anonymity.
– There is a tradeoff between generalization and suppression, governed by an acceptable suppression threshold MaxSup.
– Within the threshold, suppression is considered preferable. Reason: generalization affects all tuples, whereas suppression affects only single tuples.
– k-minimal generalization with suppression: let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj, and let MaxSup be the specified threshold of acceptable suppression. Tj is a k-minimal generalization of Ti iff:
  – Tj satisfies k-anonymity;
  – sizeof(Ti) - sizeof(Tj) ≤ MaxSup;
  – there is no Tz such that Ti ≤ Tz, Tz satisfies the first two conditions, and DV_i,z < DV_i,j.
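The three conditions can be illustrated with a brute-force search over all distance vectors. This is only a sketch under stated assumptions, not the paper's algorithm: the two value generalization functions, the hierarchy heights, and the toy rows are all hypothetical, and each candidate table is assumed to use minimal required suppression.

```python
from collections import Counter
from itertools import product

# Assumed value generalizations (illustrative, not taken from the paper).
def gen_race(value, level):
    return value if level == 0 else "person"

def gen_zip(value, level):
    return value[:len(value) - level] + "0" * level

GENERALIZERS = [gen_race, gen_zip]
HEIGHTS = [1, 2]

def generalize(rows, dv):
    """Generalize every attribute of every row to the level given by dv."""
    return [tuple(g(v, d) for g, v, d in zip(GENERALIZERS, row, dv)) for row in rows]

def required_suppression(rows, k):
    """Number of tuples that occur fewer than k times and must be suppressed."""
    return sum(c for c in Counter(rows).values() if c < k)

def k_minimal_with_suppression(rows, k, max_sup):
    """Distance vectors that meet the threshold and are not dominated by a smaller one."""
    ok = [dv for dv in product(*(range(h + 1) for h in HEIGHTS))
          if required_suppression(generalize(rows, dv), k) <= max_sup]
    return [v for v in ok
            if not any(w != v and all(a <= b for a, b in zip(w, v)) for w in ok)]

rows = [("asian", "94138"), ("asian", "94138"), ("black", "94138"),
        ("black", "94139"), ("white", "94139")]
print(k_minimal_with_suppression(rows, k=2, max_sup=0))  # [(1, 0)]
print(k_minimal_with_suppression(rows, k=2, max_sup=1))  # [(0, 1), (1, 0)]
```

Raising MaxSup from 0 to 1 admits a second minimal solution that generalizes only the ZIP code and drops one tuple, which shows the tradeoff between the two techniques.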
17
EXAMPLE
18
PREFERENCES
– There may be more than one minimal generalization. Which one should be chosen?
– Let Tj be a generalization of Ti with distance vector DV_i,j = [d1,…,dn]. Then:
  – Absdist_i,j = ∑_{z=1}^{n} dz
  – Reldist_i,j = ∑_{z=1}^{n} dz/hz, where hz is the height of the DGH of dom(Az,Ti)
– Policies:
  – Minimum absolute distance (smallest total number of generalization steps)
  – Minimum relative distance (smallest total number of relative steps)
  – Maximum distribution (greatest number of distinct tuples)
  – Minimum suppression (greatest number of tuples retained)
– The right policy depends on the application.
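The two distance measures are simple sums over the distance vector. The sketch below uses exact fractions for the relative distance; the candidate vectors and the hierarchy heights (1 and 2) are assumed for illustration.

```python
from fractions import Fraction

def absolute_distance(dv):
    """Total number of generalization steps."""
    return sum(dv)

def relative_distance(dv, heights):
    """Sum of steps taken on each attribute divided by that hierarchy's height."""
    return sum(Fraction(d, h) for d, h in zip(dv, heights))

# Hypothetical minimal generalizations and hierarchy heights.
candidates = [(1, 0), (0, 1)]
heights = (1, 2)

for dv in candidates:
    print(dv, absolute_distance(dv), relative_distance(dv, heights))

preferred = min(candidates, key=lambda dv: relative_distance(dv, heights))
print(preferred)  # (0, 1)
```

Both candidates take one step, so the minimum absolute distance policy cannot separate them, while the minimum relative distance policy prefers the step taken in the taller hierarchy.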
19
COMPUTING A PREFERRED GENERALIZATION
– A generalization is computed for each quasi-identifier independently.
– Local minimal generalization: a generalization that is minimal with respect to the set of generalizations in a given strategy.
– Theorem: let T(A1,…,An) = PT[QI] be the table to be generalized and let DT = (D1,…,Dn) be the tuple where Dz = dom(Az,T), z = 1,…,n. Every k-minimal generalization of T is a local minimal generalization for some strategy of DGH_DT.
– By this theorem, following each generalization strategy bottom-up reveals its local minimal generalization; the k-minimal generalizations, and eventually a preferred generalization, are chosen from among these.
– If preference policies are considered, the search has to be extended beyond the first result, which may be expensive.
20
IMPROVEMENT
– Distance vector between tuples: let x = (v1,…,vn) and y = (v1',…,vn') be two tuples in T. Their distance vector is V_x,y = [d1,…,dn], where di is the length of the paths from vi and vi' to their closest common ancestor in the value generalization hierarchy (VGH).
– Theorem: let Ti and Tj be two tables such that Ti ≤ Tj. If Tj is a k-minimal generalization of Ti, then DV_i,j = V_x,y for some tuples x and y in Ti such that either x or y has fewer than k occurrences.
– Consequently, the distance vector of a minimal generalization falls within the set of vectors between outliers and the other tuples in the table.
– The authors exploit this property to prune the number of generalizations considered.
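The distance vector between two tuples can be computed by generalizing both values level by level until they coincide. The sketch below assumes chain-shaped hierarchies described by level-based generalization functions; those functions, the heights, and the sample tuples are hypothetical.

```python
# Assumed value generalizations (illustrative, not taken from the paper).
def gen_race(value, level):
    return value if level == 0 else "person"

def gen_zip(value, level):
    return value[:len(value) - level] + "0" * level

def steps_to_common_ancestor(v, w, generalize, height):
    """Smallest level at which the two values generalize to the same value."""
    for level in range(height + 1):
        if generalize(v, level) == generalize(w, level):
            return level
    raise ValueError("values have no common ancestor within the hierarchy")

def tuple_distance_vector(x, y, generalizers, heights):
    """Distance vector V_x,y, computed attribute by attribute."""
    return tuple(steps_to_common_ancestor(a, b, g, h)
                 for a, b, g, h in zip(x, y, generalizers, heights))

generalizers, heights = [gen_race, gen_zip], [1, 2]
print(tuple_distance_vector(("asian", "94138"), ("black", "94139"),
                            generalizers, heights))  # (1, 1)
print(tuple_distance_vector(("asian", "94138"), ("asian", "94141"),
                            generalizers, heights))  # (0, 2)
```

The resulting vector is the least amount of generalization at which the two tuples become identical, which is why vectors involving outliers are the natural candidates for a minimal generalization.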
21
ALGORITHM - OUTLINE
– Determine all distinct tuples in PT[QI], along with their number of occurrences.
– Compute the distance vectors between each outlier and every tuple in the table.
– Construct a DAG whose nodes are the distance vectors found, with an arc from each vector to each of the smallest vectors dominating it in the set.
– Follow each path until a local minimal generalization is found. Because paths may not be disjoint, keep track of visited nodes.
– After all paths have been examined, the k-minimal and preferred generalizations are determined.
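The DAG construction step can be sketched as follows. Given a set of candidate distance vectors, each vector gets an arc to the smallest vectors that strictly dominate it. The input set is hypothetical and the dictionary representation is a choice made here, not something the slides specify.

```python
def strictly_dominates(w, v):
    """True if w >= v in every component and w differs from v."""
    return w != v and all(a >= b for a, b in zip(w, v))

def build_dag(vectors):
    """Map each vector to the minimal vectors among those strictly dominating it."""
    dag = {}
    for v in vectors:
        above = [w for w in vectors if strictly_dominates(w, v)]
        dag[v] = sorted(w for w in above
                        if not any(strictly_dominates(w, u) for u in above))
    return dag

# Hypothetical candidate vectors, e.g. computed between outliers and other tuples.
vectors = {(0, 1), (1, 0), (1, 1), (0, 2), (1, 2)}
dag = build_dag(vectors)
for node in sorted(dag):
    print(node, "->", dag[node])
```

Several nodes share successors, here (1, 1) and (1, 2), so a traversal that records visited nodes avoids evaluating the same generalization twice.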
22
EXISTENCE
– Theorem: let T be a table, MaxSup ≤ sizeof(T) be the acceptable suppression threshold, and k be a natural number. If sizeof(T) ≥ k, then there is at least one k-minimal generalization for T. If sizeof(T) < k, there are no non-empty k-minimal generalizations for T.
– Experiments (cost reduction):
  – Computing distance vectors greatly reduces the cost.
  – Generalizations are not computed but foreseen by looking at the tuples.
  – Because the algorithm keeps track of evaluated generalizations, evaluation can stop whenever it reaches a path that has already been visited.