Approximation Algorithms for k-Anonymity

Authors: Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu
Presented by Paul Yelton

Outline
- Review of k-Anonymity
- NP-hardness of k-Anonymity with Suppression
- Algorithm for general k-Anonymity
- Improved Algorithm for 2-Anonymity
- Improved Algorithm for 3-Anonymity
- Conclusion

Review of k-Anonymity

Review of k-Anonymity
- Suppress or generalize some entries so that, for every tuple, “... there are at least k-1 other tuples in the modified table that are identical to it along the quasi-identifying attributes”.
- The objective is to minimize the extent of suppression and generalization.

Review of k-Anonymity
- Example: the previous table anonymized with k=2.
- Values are suppressed with *s to provide anonymization.
- k must be chosen according to the application to ensure the required level of privacy.

Review of k-Anonymity
- The input is a table T with n rows and m quasi-identifying attributes.
- T is treated as a set of n m-dimensional vectors x_1,...,x_n.
- Focus on a special case, k-Anonymity with Suppression, where only suppression is performed.
- A suppression function t maps each x_i to t(x_i) by replacing some of its quasi-identifier values with *s; t is k-Anonymous if every t(x_i) is identical to at least k-1 other modified vectors.
- A k-Anonymous t partitions the n row vectors into clusters of size ≥ k.

Review of k-Anonymity
Formal Definition: k-Anonymity with Suppression
- Given: x_1, x_2, ..., x_n ∈ Σ^m and an integer k.
- Find a suppression function t such that for every x_i there are at least k-1 other vectors x_p with t(x_i) = t(x_p).
- Minimize the cost c(t), the total number of *s over all t(x_i).
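The definition above can be made concrete with a small sketch. It assumes the clusters are already given and that, within a cluster, exactly the columns on which the rows disagree are replaced by *s. The function names and the 4-row table are hypothetical illustrations, not taken from the slides.

```python
def suppress_cluster(rows):
    """Return the rows of one cluster with every disagreeing column replaced by '*'."""
    m = len(rows[0])
    # A column is kept only if all rows of the cluster share the same value there.
    agree = [len({r[j] for r in rows}) == 1 for j in range(m)]
    return [tuple(r[j] if agree[j] else "*" for j in range(m)) for r in rows]


def suppression_cost(clusters):
    """c(t): total number of '*' entries over all modified rows."""
    return sum(row.count("*")
               for cluster in clusters
               for row in suppress_cluster(cluster))


# Hypothetical table over a binary alphabet, partitioned into two clusters (k = 2).
table = [(0, 1, 1), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
clusters = [[table[0], table[1]], [table[2], table[3]]]
cost = suppression_cost(clusters)
```

Each cluster disagrees on a single column, so every row receives one *, for a total cost of 4.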

Review of k-Anonymity
- The next problem is k-Anonymity with Generalization, where suppression is also allowed.
- Example: an attribute named “Quality”.

Review of k-Anonymity
Formal Definition: k-Anonymity with Generalization
- Given: x_1, x_2, ..., x_n ∈ Σ^m and an integer k.
- Each attribute j has domain D_j and a generalization hierarchy with levels D_j^h for 0 ≤ h ≤ l_j, where D_j^0 = D_j.
- Find a generalization function h, where h(i,j) is the level to which attribute j of x_i is generalized, such that the generalized x_i is identical to the generalized x_p for at least k-1 values of p ≠ i.
- Minimize the cost c(h) = ∑_i ∑_j h(i,j)/l_j.
- Note: k-Anonymity with Suppression is the special case l_j = 1 for all j.
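As a rough illustration of the cost formula, the sketch below evaluates c(h) for a given assignment of generalization levels. The two-attribute hierarchy depths and the level assignments are made-up values; the sketch does not check that the assignment is actually k-Anonymous.

```python
from fractions import Fraction


def generalization_cost(levels, max_levels):
    """c(h) = sum over rows i and attributes j of h(i,j) / l_j."""
    return sum(Fraction(h_ij, l_j)
               for row in levels
               for h_ij, l_j in zip(row, max_levels))


# Hypothetical instance: attribute 0 has l_0 = 2 levels, attribute 1 has l_1 = 1.
max_levels = [2, 1]
# Both rows are generalized to level 1 on both attributes.
levels = [[1, 1], [1, 1]]
cost = generalization_cost(levels, max_levels)

# Suppression special case: every l_j = 1, so the cost is just the number of *s.
suppression_only = generalization_cost([[1, 0, 1]], [1, 1, 1])
```

Here each row contributes 1/2 + 1, so the total cost is 3, and the suppression-only row with two suppressed entries costs 2.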

NP-hardness of k-Anonymity with Suppression

Present the proof that was formulated:
- Equivalence to edge partition into triangles and 4-stars
- Equivalence to edge partition into triangles

Theorem 1
k-Anonymity with Suppression is NP-hard even for a ternary alphabet, for example Σ = {0,1,2}.
Proof
By reduction from the NP-hard problem Edge Partition into Triangles: given a graph G = (V,E) with |E| = 3m for an integer m, can the edges of G be partitioned into m edge-disjoint triangles?

- Construct a table T with 3m rows, one per edge of G.
- For each of the n vertices y_1,...,y_n of G, create an attribute/column.
- The row for an edge e = (y_a,y_b) has 1s in columns y_a and y_b and 0s elsewhere.
- The optimal 3-Anonymity cost for T is ≤ 9m if and only if E can be partitioned into m disjoint triangles and 4-stars.

Example:
T =                  y_1  y_2  y_3  ...  y_n
e_1  = (y_1,y_2)      1    1    0   ...   0
e_2  = (y_2,y_3)      0    1    1   ...   0
...                  ...  ...  ...  ...  ...
e_3m = (y_1,y_n)      1    0    0   ...   1

Consider a triangle with vertices y_1, y_2, y_3. Suppressing the columns of those vertices gives a cluster of 3 rows with three *s in each modified row.

             y_1  y_2  y_3                  y_1  y_2  y_3
(y_1,y_2)     1    1    0     (y_1,y_2)      *    *    *
(y_2,y_3)     0    1    1     (y_2,y_3)      *    *    *
(y_3,y_1)     1    0    1     (y_3,y_1)      *    *    *

Consider a 4-star with vertices y_1, y_2, y_3, y_4, where y_4 is the center vertex. Suppressing the columns of y_1, y_2, y_3 gives a cluster of 3 rows with three *s in each modified row.

             y_1  y_2  y_3  y_4                  y_1  y_2  y_3  y_4
(y_1,y_4)     1    0    0    1     (y_1,y_4)      *    *    *    1
(y_2,y_4)     0    1    0    1     (y_2,y_4)      *    *    *    1
(y_3,y_4)     0    0    1    1     (y_3,y_4)      *    *    *    1

- From the constructions above, a partition into triangles and 4-stars gives a cost of 9m.
- Because G is a simple graph, any three rows are distinct and differ in at least three positions.
- So there are at least three *s in each modified row, and the cost is ≥ 9m.
- A cost of exactly 9m requires clusters of size 3 with exactly three *s per row, which leaves two possibilities:
  - the edges of a cluster form either a triangle or a 4-star
  - modified rows in a triangle have three *s and 0s elsewhere, while modified rows in a 4-star have three *s, a single 1 and 0s elsewhere
- Such a solution corresponds to a partition of the edges of the graph into triangles and 4-stars.
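The construction can be tried on a small graph. The sketch below uses the complete graph on 4 vertices as a hypothetical example (it is not from the slides): its 6 edges (m = 2) split into one triangle and one 4-star, and the resulting cost is 9m = 18. The helper names are illustrative.

```python
from itertools import combinations


def reduction_table(n, edges):
    """One row per edge: 1 in the columns of its two endpoints, 0 elsewhere."""
    return {e: tuple(1 if v in e else 0 for v in range(1, n + 1)) for e in edges}


def cluster_cost(rows):
    """Number of *s when every column on which the rows disagree is suppressed."""
    stars_per_row = sum(1 for col in zip(*rows) if len(set(col)) > 1)
    return stars_per_row * len(rows)


edges = list(combinations(range(1, 5), 2))    # K4: 6 edges, so m = 2
T = reduction_table(4, edges)

triangle = [T[(1, 2)], T[(2, 3)], T[(1, 3)]]  # triangle on vertices 1, 2, 3
star = [T[(1, 4)], T[(2, 4)], T[(3, 4)]]      # 4-star centered at vertex 4
path = [T[(1, 2)], T[(2, 3)], T[(3, 4)]]      # neither a triangle nor a 4-star

total = cluster_cost(triangle) + cluster_cost(star)
```

The triangle and the 4-star each cost 9, while the path cluster needs four *s per row and costs 12, which is why only triangles and 4-stars can appear in a solution of cost 9m.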

Equivalence to edge partition into triangles
- Create a table T' as a replication of T so that the 4-stars are forced to pay more *s.
- Use the following number of replications: t = ⌈log2(3m + 1)⌉.
- T' has t blocks, each with n columns.
- As before, each edge is e = (a,b).
- Fix an arbitrary ordering of the edges in E and write the rank of e in binary notation as e_1 e_2 ... e_t.
- Within each block, the row for e has 0s everywhere except in positions a and b.
- Block i of the row for e is in configuration conf_0 if e_i = 0 and conf_1 if e_i = 1:
  - conf_0 has a 1 in position a and a 2 in position b
  - conf_1 has a 2 in position a and a 1 in position b

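Example with n = 4 vertices, five edges ranked 1 to 5, and t = 3 blocks (Block 1 holds the most significant bit of the rank):

                 Block 1    Block 2    Block 3
edge    rank     1 2 3 4    1 2 3 4    1 2 3 4
(3,4)   001      0 0 1 2    0 0 1 2    0 0 2 1
(1,4)   010      1 0 0 2    2 0 0 1    1 0 0 2
(1,2)   011      1 2 0 0    2 1 0 0    2 1 0 0
(1,3)   100      2 0 1 0    1 0 2 0    1 0 2 0
(2,3)   101      0 2 1 0    0 1 2 0    0 2 1 0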

- The optimal cost for T' is ≤ 9mt if and only if E can be partitioned into m disjoint triangles.
- Every triangle in such a partition corresponds to a cluster with 3t *s in each modified row.
- This proves that k-Anonymity with Suppression is NP-hard over a ternary alphabet for k = 3.
- The proof can be extended to k = C(r,2), the binomial coefficient "r choose 2", for r ≥ 3.
- Replicating the graph in the reduction also extends the result to k = α∙C(r,2) for any integer α and r ≥ 3.
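A small sketch of the replicated table shows why 4-stars become more expensive than triangles. It assumes the edges are ranked 1, 2, ... in list order, that each edge (a,b) is stored with a fixed orientation, and that the first block carries the most significant bit of the rank. The graph (complete graph on 4 vertices) and the helper names are hypothetical.

```python
from itertools import combinations


def replicated_table(n, edges):
    """Build T': t blocks of n columns, block i set by bit i of the edge's rank."""
    # For an integer N >= 1, N.bit_length() equals ceil(log2(N + 1)).
    t = len(edges).bit_length()
    table = {}
    for rank, (a, b) in enumerate(edges, start=1):
        row = []
        for bit in format(rank, f"0{t}b"):
            block = [0] * n
            if bit == "0":                       # conf_0: 1 at a, 2 at b
                block[a - 1], block[b - 1] = 1, 2
            else:                                # conf_1: 2 at a, 1 at b
                block[a - 1], block[b - 1] = 2, 1
            row.extend(block)
        table[(a, b)] = tuple(row)
    return t, table


def cluster_cost(rows):
    """Number of *s when every column on which the rows disagree is suppressed."""
    return sum(1 for col in zip(*rows) if len(set(col)) > 1) * len(rows)


edges = list(combinations(range(1, 5), 2))       # K4, 3m = 6 edges
t, T2 = replicated_table(4, edges)

triangle_cost = cluster_cost([T2[(1, 2)], T2[(1, 3)], T2[(2, 3)]])
star_cost = cluster_cost([T2[(1, 4)], T2[(2, 4)], T2[(3, 4)]])
```

With t = 3, the triangle cluster pays 3t = 9 *s per row (27 in total), while the 4-star must also suppress its center column in every block where two of its edges have different configurations, so it pays 36.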

Algorithm for general k-Anonymity

- Show an O(k)-approximation for the problem of k-Anonymity with Generalization.
- Create an edge-weighted complete graph G = (V,E):
  - the vertex set V contains one vertex for each vector of the input table
  - h_{a,b}(j) is the lowest level of generalization at which attribute j of vectors a and b takes the same value
  - the weight of an edge e = (a,b) is w(e) = ∑_j h_{a,b}(j)/l_j
- Recall: attribute j has l_j levels of generalization.
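In the suppression-only special case (every l_j = 1), two vectors become equal on attribute j at level 0 if they already agree and at level 1 otherwise, so the edge weight reduces to the Hamming distance. The sketch below builds the weighted complete graph for that special case only; the general hierarchy case is not covered, and the sample vectors are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def weight_graph(vectors):
    """Edge weights w(a,b) of the complete graph, keyed by vertex index pairs."""
    return {(i, j): hamming(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}


vectors = [(0, 0, 0), (0, 0, 1), (1, 1, 1)]
w = weight_graph(vectors)
```

For these three vectors the weights are w(0,1) = 1, w(0,2) = 3 and w(1,2) = 2.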

Limitations of the Graph Representation
- Some information about the structure of the problem is lost.
- Cannot achieve better than a Θ(k) approximation factor.

Definitions
- Charge of a vertex: the total generalization cost of the vector it represents.
- OPT denotes the cost of an optimal k-Anonymity solution; for example, in the previous proof, OPT = 9m.
- F = {T_1, T_2, ..., T_s} is a spanning forest containing all the vertices.
- Each T_i is a tree with ≥ k vertices and is a subgraph of G.
- The weight of T_i is W(T_i) = ∑_{e ∈ E(T_i)} w(e).
- c(F) = ∑_i |V(T_i)| W(T_i)
- L is the size of the largest component.

Outline of an Algorithm
1. Create a forest F with total weight ≤ OPT in which every tree has at least k vertices.
2. Decompose the forest by deleting edges so that every tree T_i has k ≤ |V(T_i)| ≤ max{ 2k–1, 3k–5 } vertices.

The FOREST algorithm yields a forest F in which every tree has at least k vertices and whose total weight is ≤ OPT.
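The slide's algorithm listing is not part of this transcript, so the following is only a sketch of one way to build such a forest, under stated assumptions: suppression-only weights (Hamming distance), and a greedy rule that repeatedly takes a component with fewer than k vertices, finds its one vertex without an outgoing edge, and links it to its nearest vertex outside the component. Because the component has at most k-2 other vertices, that nearest outside vertex is among the k-1 nearest neighbors of the chosen vertex. All names are illustrative.

```python
def hamming(a, b):
    """Number of attributes on which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def forest(vectors, k):
    """Greedy sketch: grow components until each has at least k vertices."""
    n = len(vectors)
    if n < k:
        raise ValueError("need at least k vectors")
    comp = list(range(n))   # component label of each vertex
    out_edge = {}           # u -> v: each vertex gets at most one outgoing edge
    while True:
        sizes = {}
        for c in comp:
            sizes[c] = sizes.get(c, 0) + 1
        small = [c for c, s in sizes.items() if s < k]
        if not small:
            break
        c = small[0]
        members = [v for v in range(n) if comp[v] == c]
        # A tree with s vertices has s - 1 edges, so one member has no out-edge yet.
        u = next(v for v in members if v not in out_edge)
        outside = [v for v in range(n) if comp[v] != c]
        v = min(outside, key=lambda x: hamming(vectors[u], vectors[x]))
        out_edge[u] = v
        comp = [comp[v] if label == c else label for label in comp]
    weight = sum(hamming(vectors[a], vectors[b]) for a, b in out_edge.items())
    return sorted(out_edge.items()), comp, weight


vectors = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0), (0, 1, 1)]
edges, comp, weight = forest(vectors, k=2)
```

On this hypothetical input every component ends with at least 2 vertices and the chosen edges have total weight 3. The sketch makes no attempt to bound the component sizes from above; that is the job of the decomposition step.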

The DECOMPOSE-COMPONENT algorithm breaks any component with more than max{ 2k-1, 3k-5 } vertices into smaller components, each of size at least k.

Theorem 5
There is a polynomial-time algorithm for k-Anonymity that achieves an approximation ratio of max{ 2k–1, 3k–5 }.
Proof
Create a forest with the FOREST algorithm above, then repeatedly apply DECOMPOSE-COMPONENT to any component with more than max{ 2k–1, 3k–5 } vertices.
Note: Both algorithms terminate in O(kn^2) time.

- This algorithm can also be used when attributes are assigned weights and the goal is to minimize the weighted generalization cost.
- It can also be extended to allow deleting an entire row instead of forcing it to pair with k-1 other rows:
  - the distance between any two vertices is then no more than the cost of deleting a vertex

Improved Algorithm for 2-Anonymity

- The previous section gives a 3-approximation algorithm for this case; this section improves on it with a polynomial-time 1.5-approximation.
- Use a minimum-weight [1,2]-factor F of the graph G: a spanning subgraph in which each vertex has degree 1 or 2.
- W(F) = ∑_{e ∈ F} w(e)
- F is a vertex-disjoint collection of edges and pairs of adjacent edges.
- Each component of F is treated as a cluster: keep the bits on which all vectors of the cluster agree and replace all other bits with *s.

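Theorem 6
The number of *s introduced by the above algorithm is at most 1.5 times the number of *s in an optimal 2-Anonymity solution.

Observation 7
- Let x_1, x_2, x_3 have pairwise Hamming distances α, β, γ, let x_med be their median vector, and let p, q, r be the distances from x_med to x_1, x_2, x_3.
- If x_1, x_2, x_3 form a cluster in a k-Anonymity solution, the number of *s in each modified vector is p + q + r = ½(α + β + γ).
- If the cluster contains additional vectors, the number of *s in each modified vector is at least p + q + r.

Notation
- c_OFAC is the weight of an optimal [1,2]-factor
- c_ALG is the cost of the 2-Anonymity solution produced by the algorithm
- c_FAC is the weight of the [1,2]-factor built from the optimal 2-Anonymity solution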
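Observation 7 can be checked numerically on binary vectors. The sketch below assumes a binary alphabet and takes the median to be the bitwise majority of the three vectors; the sample vectors and function names are made up for illustration.

```python
def hamming(a, b):
    """Number of positions on which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def median(x1, x2, x3):
    """Bitwise majority of three binary vectors."""
    return tuple(1 if a + b + c >= 2 else 0 for a, b, c in zip(x1, x2, x3))


def stars_per_row(rows):
    """Number of columns on which the rows of a cluster disagree."""
    return sum(1 for col in zip(*rows) if len(set(col)) > 1)


x1 = (0, 0, 1, 1, 0, 1)
x2 = (0, 1, 1, 0, 0, 1)
x3 = (1, 0, 1, 0, 0, 0)

x_med = median(x1, x2, x3)
p, q, r = (hamming(x_med, x) for x in (x1, x2, x3))
alpha, beta, gamma = hamming(x1, x2), hamming(x2, x3), hamming(x1, x3)
stars = stars_per_row([x1, x2, x3])
```

For these vectors p + q + r = 1 + 1 + 2 = 4 and α + β + γ = 2 + 3 + 3 = 8, and the cluster indeed needs 4 *s in each row.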

Lemma
c_ALG ≤ 3 ∙ c_OFAC
Proof
- For a cluster of size 2 joined by an edge of weight α, the total number of *s is 2α ≤ 3α.
- For a cluster of size 3, the number of *s in each row is (α + β + γ)/2.
- The total number of *s is 3/2(α + β + γ) ≤ 3(α + β), where γ ≤ α + β by the triangle inequality.
- The optimal [1,2]-factor contains the 2 lighter edges of the triangle, at a cost of α + β for this cluster.
- Summing over all clusters gives c_ALG ≤ 3 ∙ c_OFAC.

Lemma
c_FAC ≤ ½OPT
Proof
- Cluster of size 2: the cost of the [1,2]-factor FAC is ½ of the cost of OPT for this cluster.
- Cluster of size 3: the cost of FAC is α + β ≤ ⅔(α + β + γ) = 4/3(p + q + r); the inequality holds because γ ≥ α, β.
- The cost of OPT for this cluster is 3(p + q + r), so the cost of FAC is at most ½ of it.
- Summing over all clusters gives c_FAC ≤ ½OPT.
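The two lemmas combine into the 1.5 factor of Theorem 6. The chain below assumes that c_FAC is the weight of a [1,2]-factor derived from the optimal clusters, so that the minimum-weight factor can only be lighter (c_OFAC ≤ c_FAC); that reading is an interpretation of the slides rather than a quoted statement.

```latex
\[
c_{\mathrm{ALG}}
\;\le\; 3\,c_{\mathrm{OFAC}}
\;\le\; 3\,c_{\mathrm{FAC}}
\;\le\; 3\cdot\tfrac{1}{2}\,\mathrm{OPT}
\;=\; 1.5\,\mathrm{OPT}
\]
```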

Improved Algorithm for 3-Anonymity

- Produce a polynomial-time 2-approximation.
- The idea is very similar to the 1.5-approximation algorithm above, using a 2-factor in place of a [1,2]-factor.
Lemma
The cost c_OFAC of the optimal 2-factor on the graph G corresponding to the vectors in the 3-Anonymity instance is at most ⅔OPT: c_OFAC ≤ ⅔OPT.
Proof
- Clusters of the optimal solution have 3, 4 or 5 vertices; clusters with more than 5 vertices can be broken down into smaller groups of at least 3.
- For every cluster, pick the min-weight cycle through all the vertices of the cluster.

Consider the following 3 cases:
- Cluster i has size 3: the cycle is a triangle.
  - a, b and c are the weights of its edges.
  - The cost of OPT is OPT_i = 3/2(a + b + c).
  - FAC has a cost of c_FAC,i = a + b + c = ⅔OPT_i.
- Cluster i has size 4: let τ be the sum of the weights of all C(4,2) = 6 edges.
  - FAC pays c_FAC,i ≤ ⅔τ.
  - OPT_i ≥ 4 ∙ ½ ∙ 2/4 ∙ τ = τ
  - So c_FAC,i ≤ ⅔OPT_i.
- Cluster i has size 5: let τ be the sum of the weights of all C(5,2) = 10 edges.
  - Similar to size 4, FAC pays c_FAC,i ≤ 5/10 τ = ½τ.
  - OPT_i ≥ 5 ∙ ½ ∙ 3/10 ∙ τ = ¾τ
  - So c_FAC,i ≤ ⅔OPT_i.
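The bounds on what FAC pays follow from averaging over all cycles through the cluster: the cheapest cycle cannot cost more than the average one. The brute-force sketch below checks the ⅔τ bound (4 vertices) and the ½τ bound (5 vertices) on random binary vectors with Hamming-distance weights. The random instances and the helper names are hypothetical and only serve as a sanity check, not a proof.

```python
import random
from fractions import Fraction
from itertools import combinations, permutations


def hamming(a, b):
    """Number of positions on which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def total_weight(vectors):
    """tau: sum of the weights of all edges of the complete graph."""
    return sum(hamming(a, b) for a, b in combinations(vectors, 2))


def min_cycle_weight(vectors):
    """Weight of the cheapest cycle through all vectors (brute force)."""
    n = len(vectors)
    best = None
    for perm in permutations(range(1, n)):   # fix vertex 0 as the start
        order = (0,) + perm
        w = sum(hamming(vectors[order[i]], vectors[order[(i + 1) % n]])
                for i in range(n))
        best = w if best is None else min(best, w)
    return best


def check_bound(size, trials=50, length=8, seed=0):
    """Check min cycle <= (2/3) tau for size 4 and <= (1/2) tau for size 5."""
    bound = {4: Fraction(2, 3), 5: Fraction(1, 2)}[size]
    rng = random.Random(seed)
    for _ in range(trials):
        vectors = [tuple(rng.randint(0, 1) for _ in range(length))
                   for _ in range(size)]
        if min_cycle_weight(vectors) > bound * total_weight(vectors):
            return False
    return True


ok4 = check_bound(4)
ok5 = check_bound(5)
```

Both checks pass for any edge weights, since every edge lies in the same fraction of the cycles (2 of 3 for four vertices, 6 of 12 for five).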

Lemma
Given a 2-factor F with cost c_F, we achieve a solution for 3-Anonymity with a cost c_ALG ≤ 3 ∙ c_F.
Proof
- Every cycle in F of size 3, 4 or 5 becomes a cluster; longer cycles are split into smaller clusters.
- Depending on the size of the cycle C, with len(C) its total weight, ALG pays as follows:
  - For a triangle, ALG pays 3 ∙ ½ len(C) ≤ 3 ∙ len(C).
  - For a 4-cycle, ALG pays at most 4 ∙ ½ len(C) ≤ 3 ∙ len(C).
  - For a 5-cycle, ALG pays at most 5 ∙ ½ len(C) ≤ 3 ∙ len(C).
  - For a (3x+1)-cycle, ALG pays at most (6(x-1) + 12)/(3x + 1) ∙ len(C) ≤ 3 ∙ len(C).
  - For a (3x+2)-cycle, ALG pays at most (6(x-2) + 24)/(3x + 2) ∙ len(C) ≤ 3 ∙ len(C).
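The last two bounds can be verified directly, assuming the flattened fractions on the slide are meant as a numerator divided by the cycle length (3x+1 or 3x+2):

```latex
\[
\frac{6(x-1)+12}{3x+1} = \frac{6x+6}{3x+1} \le 3
\iff 6x+6 \le 9x+3
\iff x \ge 1
\]
\[
\frac{6(x-2)+24}{3x+2} = \frac{6x+12}{3x+2} \le 3
\iff 6x+12 \le 9x+6
\iff x \ge 2
\]
```

Under that reading, the first bound holds for every cycle of length 3x+1 with x ≥ 1, and the second for every cycle of length 3x+2 with x ≥ 2; the 5-cycle (x = 1) is the case handled separately in the list.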

Conclusion
- Demonstrated that k-Anonymity is NP-hard even when attribute values are ternary and only suppression is allowed.
- Gave an O(k)-approximation algorithm for an arbitrary value of k and alphabet size.
- Showed improved approximations for k=2 (factor 1.5) and k=3 (factor 2).
- It is not possible to achieve an approximation factor better than k/4 using the graph representation.
- It would be interesting to see the hardness of approximation for k-Anonymity without using the graph representation.
- It would be useful to extend the k-Anonymity framework to handle inserts, deletes and updates to the database.
