Approximation Algorithms for k-Anonymity


Approximation Algorithms for k-Anonymity
Authors: Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu
Presented by Paul Yelton

Outline
- Review of k-Anonymity
- NP-hardness of k-Anonymity with Suppression
- Algorithm for general k-Anonymity
- Improved Algorithm for 2-Anonymity
- Improved Algorithm for 3-Anonymity
- Conclusion

Review of k-Anonymity

Review of k-Anonymity
Suppress or generalize some entries so that, for every tuple, "... there are at least k-1 other tuples in the modified table that are identical to it along the quasi-identifying attributes."
The objective is to minimize the extent of suppression and generalization.

Review of k-Anonymity
Example: the previous table anonymized with k=2. Values are suppressed with *s to provide anonymization.
k must be chosen according to the application to ensure the required level of privacy.

Review of k-Anonymity
- The input is a table T with n rows and m quasi-identifying attributes.
- T is treated as a set of n m-dimensional vectors x_1, ..., x_n.
- Focus on a special case, k-Anonymity with Suppression, where only suppression is performed.
- A k-Anonymous suppression function t maps each x_i to a modified vector by replacing some of its quasi-identifier values with *s, so that t(x_i) is identical to at least k-1 other modified vectors.
- t partitions the n row vectors into clusters of size ≥ k.

Review of k-Anonymity
Formal Definition: k-Anonymity with Suppression
Given: x_1, x_2, ..., x_n ∈ Σ^m and an anonymity parameter k.
Find: a suppression function t such that every t(x_i) is identical to t(x_p) for at least k-1 indices p ≠ i.
Minimize the cost c(t), which is the total number of *s over all t(x_i).
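
The definition can be made concrete with a short sketch. It assumes that rows are tuples and that the clusters are already given as lists of row indices; the names `suppress_clusters`, `cost` and `is_k_anonymous` are illustrative and do not come from the slides. Within a cluster, a column is kept only if all rows of the cluster agree on it.

```python
from collections import Counter

def suppress_clusters(rows, clusters):
    """Apply t: inside each cluster, replace every column on which
    the rows disagree with '*'."""
    out = [None] * len(rows)
    for cluster in clusters:
        first = rows[cluster[0]]
        keep = [all(rows[i][j] == first[j] for i in cluster)
                for j in range(len(first))]
        for i in cluster:
            out[i] = tuple(rows[i][j] if keep[j] else "*"
                           for j in range(len(first)))
    return out

def cost(suppressed):
    """c(t): total number of *s in the modified table."""
    return sum(row.count("*") for row in suppressed)

def is_k_anonymous(suppressed, k):
    """Every modified row must occur at least k times."""
    counts = Counter(suppressed)
    return all(counts[row] >= k for row in suppressed)

rows = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
t = suppress_clusters(rows, [[0, 1], [2, 3]])
print(t, cost(t), is_k_anonymous(t, 2))
```

Finding the clustering that minimizes `cost` is the hard part of the problem; the sketch only evaluates a clustering that is already chosen.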

Review of k-Anonymity Next problem deals with k-Anonymity with Generalization, where suppression is also used. Example of an attribute named “Quality”.

Review of k-Anonymity
Formal Definition: k-Anonymity with Generalization
Given: x_1, x_2, ..., x_n ∈ Σ^m and k.
- Each attribute j has a generalization hierarchy with l_j levels: D_j^h is the domain of attribute j at level h, for 1 ≤ h ≤ l_j, and D_j^0 = D_j.
- Find a generalization function h that maps every entry of x_i in attribute j to its generalized value at level h(i,j), such that h(x_i) = h(x_p) for at least k-1 rows p ≠ i.
- Cost: c(h) = Σ_i Σ_j h(i,j)/l_j
Note: k-Anonymity with Suppression is the special case where l_j = 1 for every attribute j.

NP-hardness of k-Anonymity with Suppression

NP-hardness of k-Anonymity with Suppression
Present the proof in two steps:
- Equivalence to edge partition into triangles and 4-stars
- Equivalence to edge partition into triangles

NP-hardness of k-Anonymity with Suppression
Theorem 1: k-Anonymity with Suppression is NP-hard even for a ternary alphabet, for example Σ = {0,1,2}.
Proof: by reduction from Edge Partition into Triangles. Given a graph G = (V,E) with |E| = 3m for an integer m, can the edges of G be partitioned into m edge-disjoint triangles?

NP-hardness of k-Anonymity with Suppression
- Construct a table T with 3m rows, one per edge.
- For each of the n vertices y_1, ..., y_n of G, create an attribute/column.
- The row for an edge e = (a,b) has a 1 in columns a and b and 0s elsewhere.
- The optimal 3-Anonymity solution for T costs ≤ 9m if and only if E can be partitioned into m disjoint triangles and 4-stars.
Example:
                   y1  y2  y3  ...  yn
  e1  = (y1,y2)     1   1   0  ...   0
  e2  = (y2,y3)     0   1   1  ...   0
  ...
  e3m = (y1,yn)     1   0   0  ...   1

NP-hardness of k-Anonymity with Suppression
Consider a triangle with vertices y1, y2, y3. Suppressing the columns of those vertices gives a cluster of 3 rows with three *s in each modified row.
  Before:                 After:
            y1  y2  y3              y1  y2  y3
  (y1,y2)    1   1   0    (y1,y2)    *   *   *
  (y2,y3)    0   1   1    (y2,y3)    *   *   *
  (y3,y1)    1   0   1    (y3,y1)    *   *   *

NP-hardness of k-Anonymity with Suppression
Consider a 4-star with vertices y1, y2, y3, y4, where y4 is the center vertex. Suppressing the columns of y1, y2, y3 gives a cluster of 3 rows with three *s in each modified row.
  Before:                     After:
            y1  y2  y3  y4              y1  y2  y3  y4
  (y1,y4)    1   0   0   1    (y1,y4)    *   *   *   1
  (y2,y4)    0   1   0   1    (y2,y4)    *   *   *   1
  (y3,y4)    0   0   1   1    (y3,y4)    *   *   *   1

NP-hardness of k-Anonymity with Suppression
From the two clusters above, a partition of E into m triangles and 4-stars gives a solution of cost 9m. Conversely:
- Because G is a simple graph, any three rows are distinct and differ in at least three positions.
- There are therefore at least three *s in each modified row, so the cost is ≥ 9m.
- There are two ways to form a cluster of three rows with exactly three *s per row: the edges form either a triangle or a 4-star.
- Modified rows in a triangle have three *s and 0s elsewhere, while modified rows in a 4-star have three *s, a single 1, and 0s elsewhere.
A solution of cost 9m therefore corresponds to a partition of the edges of the graph into triangles and 4-stars.
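
The cluster costs for the triangle and the 4-star can be checked with a small sketch. It assumes 1-based vertex labels and builds one row per edge with a 1 in the two endpoint columns; `edge_table` and `cluster_cost` are illustrative names, not part of the slides. A three-edge path is included for contrast, as an example of a cluster that is neither a triangle nor a 4-star.

```python
def edge_table(n, edges):
    """One row per edge (a, b): 1 in columns a and b, 0 elsewhere."""
    table = []
    for a, b in edges:
        row = [0] * n
        row[a - 1] = 1
        row[b - 1] = 1
        table.append(row)
    return table

def cluster_cost(rows):
    """Total number of *s when all rows are put in a single cluster:
    every column on which the rows disagree is suppressed in each row."""
    disagreeing = sum(1 for j in range(len(rows[0]))
                      if len({row[j] for row in rows}) > 1)
    return disagreeing * len(rows)

triangle = edge_table(4, [(1, 2), (2, 3), (1, 3)])
star = edge_table(4, [(1, 4), (2, 4), (3, 4)])
path = edge_table(4, [(1, 2), (2, 3), (3, 4)])

print(cluster_cost(triangle), cluster_cost(star), cluster_cost(path))
```

The triangle and the 4-star each cost 9 *s, while the path needs a fourth suppressed column and costs 12.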

NP-hardness of k-Anonymity with Suppression
Equivalence to edge partition into triangles
- Create a table T' by replicating the columns of T, so that 4-stars are forced to pay more *s.
- Number of replications: t = log2(3m + 1), rounded up to an integer.
- T' has t blocks of n columns each.
- As before, an edge is e = (a,b).
- Fix an arbitrary ordering of the edges in E and write the rank of e in binary notation as e_1 e_2 ... e_t.
- In every block, the row for e has 0s everywhere except in positions a and b.
- Block i of the row for e is in one of two configurations, selected by the bit e_i:
  - conf0 has a 1 in position a and a 2 in position b
  - conf1 has a 2 in position a and a 1 in position b
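
A minimal sketch of the replicated table, under a few assumptions that the slide does not fix: edges are ranked 1, 2, ... in the order given, ranks are written most significant bit first, and each edge (a,b) is stored with a < b. The names `replicated_table` and `stars_per_row` are illustrative. The example uses the complete graph on 4 vertices (6 edges, so m = 2 and t = 3).

```python
def replicated_table(n, edges):
    """Build T': t blocks of n columns; block i of the row for edge
    (a, b) is conf0 (1 at a, 2 at b) if bit i of the edge's rank is 0,
    and conf1 (2 at a, 1 at b) if it is 1."""
    # len(edges).bit_length() equals ceil(log2(len(edges) + 1))
    t = len(edges).bit_length()
    table = []
    for rank, (a, b) in enumerate(edges, start=1):
        row = []
        for bit in format(rank, f"0{t}b"):
            block = [0] * n
            if bit == "0":
                block[a - 1], block[b - 1] = 1, 2
            else:
                block[a - 1], block[b - 1] = 2, 1
            row.extend(block)
        table.append(row)
    return table

def stars_per_row(rows):
    """Number of columns on which the rows of one cluster disagree."""
    return sum(1 for j in range(len(rows[0]))
               if len({row[j] for row in rows}) > 1)

edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)]
table = replicated_table(4, edges)
triangle = table[0:3]   # edges (1,2), (1,3), (2,3)
star = table[3:6]       # edges (1,4), (2,4), (3,4), center 4
print(stars_per_row(triangle), stars_per_row(star))
```

In this example the triangle needs 3t = 9 *s per row, while the 4-star needs 11, because the center column disagrees in every block where the three ranks do not share the same bit.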

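NP-hardness of k-Anonymity with Suppression
Example with n = 4 vertices and t = 3 blocks. The edges (3,4), (1,4), (1,2), (1,3), (2,3) are ranked 1 through 5 (binary 001, 010, 011, 100, 101); a 0 bit selects conf0 and a 1 bit selects conf1.
           Block 1      Block 2      Block 3
           1 2 3 4      1 2 3 4      1 2 3 4
  (3,4)    0 0 1 2      0 0 1 2      0 0 2 1
  (1,4)    1 0 0 2      2 0 0 1      1 0 0 2
  (1,2)    1 2 0 0      2 1 0 0      2 1 0 0
  (1,3)    2 0 1 0      1 0 2 0      1 0 2 0
  (2,3)    0 2 1 0      0 1 2 0      0 2 1 0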

NP-hardness of k-Anonymity with Suppression
- The optimal cost for T' is ≤ 9mt if and only if E can be partitioned into m disjoint triangles.
- Every triangle in such a partition corresponds to a cluster with 3t *s in each modified row.
- This proves that k-Anonymity with Suppression is NP-hard over a ternary alphabet for k = 3.
- The proof can be extended to k = C(r,2) for r ≥ 3, where C(r,2) is the binomial coefficient "r choose 2".
- Replicating the graph in the reduction also extends it to k = α·C(r,2) for any integer α and r ≥ 3.

Algorithm for general k-Anonymity

Algorithm for general k-Anonymity
Show an O(k)-approximation for the problem of k-Anonymity with Generalization.
Create a graph:
- an edge-weighted complete graph G = (V,E)
- the vertex set V contains one vertex for each vector in the k-Anonymity instance
- h_{a,b}(j) is the lowest level of generalization at which the j-th components of a and b coincide, i.e. h(a)_j = h(b)_j
- weight function w(e) = Σ_j h_{a,b}(j)/l_j
Recall: attribute j has l_j levels of generalization, and e refers to an edge e = (a,b).
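
The edge weight can be illustrated with a short sketch. The representation is an assumption made for the example: each attribute's hierarchy is a list of dictionaries, one per level, where level 0 is the identity and the top level maps every value to a single symbol. The attributes ("age", "city") and the function names are hypothetical.

```python
def lowest_common_level(levels, u, v):
    """h_{a,b}(j): lowest level at which the two values coincide."""
    for h, mapping in enumerate(levels):
        if mapping[u] == mapping[v]:
            return h
    raise ValueError("top level must map all values to one symbol")

def edge_weight(a, b, hierarchies):
    """w(e) = sum over attributes j of h_{a,b}(j) / l_j."""
    total = 0.0
    for j, levels in enumerate(hierarchies):
        l_j = len(levels) - 1          # number of generalization levels
        total += lowest_common_level(levels, a[j], b[j]) / l_j
    return total

# Hypothetical hierarchies: age has l = 2 levels, city has l = 1 level
# (generalizing city is the same as suppressing it).
age = [
    {25: 25, 27: 27, 41: 41},
    {25: "20-29", 27: "20-29", 41: "40-49"},
    {25: "*", 27: "*", 41: "*"},
]
city = [
    {"A": "A", "B": "B"},
    {"A": "*", "B": "*"},
]

print(edge_weight((25, "A"), (27, "B"), [age, city]))   # 1/2 + 1/1
print(edge_weight((25, "A"), (41, "A"), [age, city]))   # 2/2 + 0
```

When every attribute has l_j = 1, the weight reduces to the number of attributes on which the two vectors differ.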

Algorithm for general k-Anonymity
Limitations of the Graph Representation
- Some information about the structure of the problem is lost.
- Cannot achieve better than a Θ(k) approximation factor.
Definitions
- Charge of a vertex: the total generalization cost of the vector it represents.
- OPT denotes the cost of an optimal k-Anonymity solution; for example, in the reduction above, OPT = 9m.
- F = { T_1, T_2, ..., T_s } is a spanning forest containing all the vertices.
- Each T_i is a tree with ≥ k vertices and is a subgraph of G.
- The weight of T_i is W(T_i) = Σ_{e ∈ E(T_i)} w(e).
- c(F) = Σ_i |V(T_i)| · W(T_i)
- L is the size of the largest component.

Algorithm for general k-Anonymity
Outline of the Algorithm
1. Create a forest F with cost ≤ OPT in which every tree has at least k vertices.
2. Compute a decomposition of the forest, where deleting edges is allowed, so that every component T_i satisfies k ≤ |V(T_i)| ≤ max{ 2k–1, 3k–5 }.

Algorithm for general k-Anonymity
The FOREST algorithm yields a forest in which every tree has at least k vertices and whose cost is at most OPT.

Algorithm for general k-Anonymity
The DECOMPOSE-COMPONENT algorithm below breaks any component with more than max{ 2k–1, 3k–5 } vertices into components of size at least k.
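
The cost c(F) is a direct computation once the forest is known. The sketch below assumes each tree is given as a pair (vertex set, edge list) and that the weights are stored in a dictionary keyed by edge; this representation and the names `tree_weight` and `forest_cost` are chosen only for illustration.

```python
def tree_weight(tree_edges, w):
    """W(T_i): sum of the weights of the edges of one tree."""
    return sum(w[e] for e in tree_edges)

def forest_cost(forest, w):
    """c(F) = sum over trees of |V(T_i)| * W(T_i)."""
    return sum(len(vertices) * tree_weight(tree_edges, w)
               for vertices, tree_edges in forest)

w = {(0, 1): 1, (1, 2): 2, (3, 4): 4}
forest = [
    ({0, 1, 2}, [(0, 1), (1, 2)]),   # W = 3, contributes 3 * 3
    ({3, 4}, [(3, 4)]),              # W = 4, contributes 2 * 4
]
print(forest_cost(forest, w))
```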
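
The algorithm itself is not reproduced in the transcript. The sketch below shows one way to build a forest in which every tree has at least k vertices; it is an assumption about the procedure, not a copy of the slide. It repeatedly takes a component with fewer than k vertices, picks the vertex in it that has no outgoing edge yet, and connects that vertex to its nearest vertex outside the component. Hamming distance is used as the edge weight, which corresponds to the suppression-only case; `build_forest` and `hamming` are illustrative names.

```python
from collections import Counter

def hamming(u, v):
    """Number of positions on which two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def build_forest(vectors, k, dist=hamming):
    """Return (directed edges, component label per vertex)."""
    n = len(vectors)
    if n < k:
        raise ValueError("need at least k vectors")
    comp = list(range(n))        # every vertex starts as its own tree
    has_out = [False] * n        # each vertex gets at most one out-edge
    edges = []
    while True:
        sizes = Counter(comp)
        # a vertex without an out-edge in a component that is too small
        u = next((x for x in range(n)
                  if sizes[comp[x]] < k and not has_out[x]), None)
        if u is None:            # every component has >= k vertices
            return edges, comp
        # nearest vertex outside u's component
        v = min((y for y in range(n) if comp[y] != comp[u]),
                key=lambda y: dist(vectors[u], vectors[y]))
        edges.append((u, v))
        has_out[u] = True
        old, new = comp[u], comp[v]
        comp = [new if c == old else c for c in comp]   # merge trees

vectors = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1),
           (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 0)]
edges, comp = build_forest(vectors, k=3)
print(edges, comp)
```

Each merge joins two trees with a single edge, so the result stays acyclic, and the loop stops as soon as no component has fewer than k vertices.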

Algorithm for general k-Anonymity

Algorithm for general k-Anonymity
Theorem 5: There is a polynomial-time algorithm that achieves an approximation ratio of max{ 2k–1, 3k–5 }.
Proof: Create a forest with the FOREST algorithm above, then repeatedly apply DECOMPOSE-COMPONENT to any component with more than max{ 2k–1, 3k–5 } vertices.
Note: Both algorithms terminate in O(kn^2) time.

Algorithm for general k-Anonymity
- The algorithm also applies when attributes are assigned weights and the goal is to minimize the weighted generalization cost.
- It can also be extended to allow deleting an entire row instead of forcing it to pair with k-1 other rows.
  - In that case the distance between any two vertices is no more than the cost of deleting a vertex.

Improved Algorithm for 2-Anonymity

Improved Algorithm for 2-Anonymity
- The previous section gives a 3-approximation algorithm for this case; this result is improved to a polynomial-time 1.5-approximation.
- Use a minimum-weight [1,2]-factor of the graph, i.e. a spanning subgraph in which every vertex has degree 1 or 2.
- F is a subgraph of G with weight W(F) = Σ_{e ∈ F} w(e).
- F is a vertex-disjoint collection of edges and pairs of adjacent edges.
- Each component of F is treated as a cluster: display the bits on which all vectors of the cluster agree and replace all other bits with *s.

Improved Algorithm for 2-Anonymity
Theorem 6: The number of *s introduced by the above algorithm is at most 1.5 times the number of *s in an optimal 2-Anonymity solution.
Observation 7: Let vertices x_1, x_2, x_3 have pairwise distances α, β, γ, let x_med be their median vector, and let p, q, r be the number of bits on which x_1, x_2, x_3 differ from x_med.
- If x_1, x_2, x_3 form a cluster, the number of *s in each modified vector is p + q + r = ½(α + β + γ).
- In any k-Anonymity solution whose cluster contains x_1, x_2, x_3, the number of *s in each modified vector is at least p + q + r.
Notation:
- c_OFAC is the weight of an optimal [1,2]-factor
- c_ALG is the cost of the 2-Anonymity solution produced by the algorithm
- c_FAC is the weight of the [1,2]-factor FAC constructed from an optimal 2-Anonymity solution

Improved Algorithm for 2-Anonymity
Lemma: c_ALG ≤ 3 ∙ c_OFAC
Proof:
- For a cluster of size 3, the number of *s in each row is (α + β + γ)/2.
- Total number of *s = 3/2(α + β + γ) ≤ 3(α + β), where γ ≤ α + β follows from the triangle inequality.
- The optimal [1,2]-factor contains the 2 lighter edges of the triangle, so it pays α + β for this cluster.
- Summing over all clusters gives c_ALG ≤ 3 ∙ c_OFAC.

Improved Algorithm for 2-Anonymity
Lemma: c_FAC ≤ ½OPT
Proof:
- Cluster of size 2: the cost of the [1,2]-factor FAC is ½ of the cost OPT pays for this cluster.
- Cluster of size 3: the cost of FAC is α + β ≤ ⅔(α + β + γ) = 4/3(p + q + r); the inequality follows from γ ≥ α, β.
- The cost of OPT for this cluster is 3(p + q + r), so the cost of FAC is at most ½ of it.
- Summing over all clusters gives c_FAC ≤ ½OPT.
Since c_OFAC ≤ c_FAC, the two lemmas together give c_ALG ≤ 3 ∙ c_OFAC ≤ 3/2 ∙ OPT.
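
Observation 7 can be checked numerically. The sketch assumes binary vectors (the identity relies on every disagreeing position having exactly one vector in the minority) and uses illustrative names `hamming`, `median` and `stars_per_row`; the three example vectors are made up.

```python
def hamming(u, v):
    """Number of positions on which two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def median(x1, x2, x3):
    """Bitwise majority of three binary vectors."""
    return tuple(1 if a + b + c >= 2 else 0
                 for a, b, c in zip(x1, x2, x3))

def stars_per_row(cluster):
    """*s per modified row: positions on which the cluster disagrees."""
    return sum(1 for column in zip(*cluster) if len(set(column)) > 1)

x1, x2, x3 = (0, 0, 1, 1, 0), (0, 1, 1, 0, 0), (1, 0, 1, 0, 0)
med = median(x1, x2, x3)
p, q, r = hamming(x1, med), hamming(x2, med), hamming(x3, med)
alpha, beta, gamma = hamming(x1, x2), hamming(x2, x3), hamming(x1, x3)

print(med, p + q + r, (alpha + beta + gamma) / 2,
      stars_per_row([x1, x2, x3]))
```

For these vectors all three quantities equal 3: each vector differs from the median in one position, and every pairwise distance is 2.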

Improved Algorithm for 3-Anonymity

Improved Algorithm for 3-Anonymity
- Produce a polynomial-time 2-approximation.
- The idea is very similar to the 1.5-approximation algorithm above.
Lemma: The cost c_OFAC of the optimal 2-factor on the graph G corresponding to the vectors in the 3-Anonymity instance is at most ⅔OPT, i.e. c_OFAC ≤ ⅔OPT.
Proof:
- Clusters have 3, 4 or 5 vertices; clusters with more than 5 vertices can be broken into smaller clusters of at least 3 vertices.
- For every cluster, pick the minimum-weight cycle through the vertices of the cluster.

Improved Algorithm for 3-Anonymity
Consider the following 3 cases:
- Cluster i has size 3, so it forms a triangle.
  - a, b and c are the lengths of the edges.
  - The cost of OPT is OPT_i = 3/2(a + b + c).
  - FAC has a cost of c_FAC,i = a + b + c = ⅔OPT_i.
- Cluster i has size 4; τ is the sum of the weights of all C(4,2) = 6 edges.
  - FAC pays c_FAC,i ≤ ⅔τ.
  - OPT_i ≥ 4 ∙ ½ ∙ 2/4 ∙ τ = τ
  - c_FAC,i ≤ ⅔OPT_i
- Cluster i has size 5; τ is the sum of the weights of all C(5,2) = 10 edges.
  - As in the size-4 case, FAC pays c_FAC,i ≤ 5/10 ∙ τ.
  - OPT_i ≥ 5 ∙ ½ ∙ 3/10 ∙ τ = ¾τ
  - c_FAC,i ≤ ½τ = ⅔ ∙ ¾τ ≤ ⅔OPT_i

Improved Algorithm for 3-Anonymity
Lemma: Given a 2-factor F with cost c_F, we obtain a solution for 3-Anonymity with cost c_ALG ≤ 3 ∙ c_F.
Proof:
- Every cycle in F of size 3, 4 or 5 becomes a cluster; longer cycles are broken into clusters of 3 or 4 consecutive vertices.
- Depending on the size of the cycle C, ALG pays as follows:
  - For a triangle, ALG pays 3 ∙ ½ ∙ len(C) ≤ 3 ∙ len(C).
  - For a 4-cycle, ALG pays at most 4 ∙ ½ ∙ len(C) ≤ 3 ∙ len(C).
  - For a 5-cycle, ALG pays at most 5 ∙ ½ ∙ len(C) ≤ 3 ∙ len(C).
  - For a (3x+1)-cycle, ALG pays at most (6(x-1) + 12)/(3x+1) ∙ len(C) ≤ 3 ∙ len(C).
  - For a (3x+2)-cycle, ALG pays at most (6(x-2) + 24)/(3x+2) ∙ len(C) ≤ 3 ∙ len(C).

Conclusion
- Demonstrated that k-Anonymity is NP-hard even when the attribute values are ternary and only suppression is allowed.
- Gave an O(k)-approximation algorithm for an arbitrary value of k and arbitrary alphabet size.
- Showed improved approximations for k=2 (factor 1.5) and k=3 (factor 2).
- It is not possible to achieve an approximation factor better than k/4 by using the graph representation.
- It would be interesting to determine the hardness of approximation for k-Anonymity without using the graph representation.
- It would be useful to extend the k-Anonymity framework to handle inserts, deletes and updates to the database.