
Anonymizing Tables for Privacy Protection Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu

Krishnaram Kenthapadi, PORTIA Workshop, 8 July

An example: Medical Records

Identifying                      Sensitive
SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
614  Rob    46   Hisp   94042    Arthritis

Medical Records: De-identify & Release

                     Sensitive
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Not sufficient! [Swe02, SS98]
• Combined with a public database, the released attributes can uniquely identify you!

                     Sensitive
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

• Quasi-identifiers: reveal less information
• k-anonymity model

k-anonymity – Problem Definition
• Input: A database consisting of n rows, each with m attributes drawn from a finite alphabet.
• Goal: Suppress some entries in the table so that each modified row becomes identical to at least k-1 other rows.
• The more entries are suppressed, the lower the utility of the modified table.
• Objective: Minimize the number of suppressed entries.
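
The definition above can be checked mechanically. The following is a minimal sketch, not code from the talk: the function names, the use of "*" as the suppression marker, and the toy table are all assumptions made for illustration.

```python
from collections import Counter

STAR = "*"  # assumed marker for a suppressed entry


def is_k_anonymous(rows, k):
    """Return True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in rows)
    return all(count >= k for count in counts.values())


def suppression_cost(rows):
    """Count the suppressed entries, i.e. the objective to be minimized."""
    return sum(cell == STAR for row in rows for cell in row)


# Hypothetical toy table with two attributes per row.
toy = [
    ("*", "Cauc"),
    ("*", "Cauc"),
    ("41", "Afr-A"),
    ("41", "Afr-A"),
]

print(is_k_anonymous(toy, 2))  # every row has one identical twin
print(is_k_anonymous(toy, 3))  # no group has three identical rows
print(suppression_cost(toy))
```

Counting identical rows with a `Counter` keeps the check linear in the table size; finding the cheapest set of entries to suppress is the hard part that the rest of the talk addresses.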

Medical Records: 2-anonymized table

Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis

Suppress entries. Cost = 10
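
As a sanity check on the cost of 10, here is a small sketch that derives the suppressed table from the original quasi-identifier columns and a grouping of the rows. The helper name `anonymize_by_partition` and the explicit grouping are assumptions for illustration. The sketch also assumes that only Age, Race and Zipcode are counted, with Disease left untouched, which is what the table above suggests.

```python
def anonymize_by_partition(rows, groups):
    """Suppress, within each group, every column on which the members disagree."""
    result = [None] * len(rows)
    for group in groups:
        members = [rows[i] for i in group]
        # A column survives only if all members of the group share one value.
        keep = [len(set(column)) == 1 for column in zip(*members)]
        for i in group:
            result[i] = tuple(v if k else "*" for v, k in zip(rows[i], keep))
    return result


# Quasi-identifier columns (Age, Race, Zipcode) of the medical records example.
records = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"),
    ("46", "Hisp", "94042"),
]
# Grouping read off the 2-anonymized table: rows with the same Race.
groups = [[0, 1, 2], [3, 4], [5, 6]]

anonymized = anonymize_by_partition(records, groups)
cost = sum(cell == "*" for row in anonymized for cell in row)
for row in anonymized:
    print(row)
print("cost =", cost)
```

The three groups contribute 6, 2 and 2 suppressed entries respectively.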

k-anonymity – Results
• [MW04]
    • NP-hardness for a linear size alphabet
    • O(k log k)-approximation algorithm
• NP-hardness (even for ternary alphabet)
• O(k)-approximation for k-anonymity
• An approximation algorithm for 2-anonymity
• 2-approximation for 3-anonymity

Krishnaram KenthapadiPORTIA Workshop, 8 July O(k)-approximation algorithm (for k = 3)  Create a complete graph s.t.  Each row vector in the table is a vertex.  Weight of an edge is the number of attributes on which the two rows differ (Hamming distance). AgeRaceZipcod e 31Cauc Cauc Afr-A Afr-A

Krishnaram KenthapadiPORTIA Workshop, 8 July O(k)-approximation algorithm (for k = 3)  We create a forest as follows:  Each node picks its nearest neighbor and connects to it.  If the resulting graph has a component with only two nodes, connect this component to the second nearest neighbor of one of the two nodes.

An example graph
[Figure: example graph. Legend: nearest-neighbor edges, other edges.]

The forest obtained
[Figure: the forest obtained from the example graph.]

Krishnaram KenthapadiPORTIA Workshop, 8 July O(k)-approximation algorithm (for k = 3)  The forest has:  Components of size at least 3.  The total cost of edges in the forest is no more than the cost of the optimal solution.  In optimal solution, each node has at least as many *s as its Hamming distance to its second nearest neighbor.  Each node has at most as many *s as the cost of the tree containing the node.  If there is any component with size greater than 5, break it into components of size at least 3 (resp. k).

The final partition
[Figure: the final partition of the example graph.]

Analysis of the algorithm
• Cluster the row vectors according to this partition.
• Cost incurred ≤ OPT * (size of largest partition) = 5 * OPT.
• For general k, the cost of this solution is within a factor of max{3k-5, 2k-1} of the cost of the optimal solution.
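
As a quick check that the general bound agrees with the factor of 5 stated for k = 3, the arithmetic can be written out as follows (a worked instance only, not an additional result):

```latex
\max\{3k-5,\; 2k-1\}\Big|_{k=3} = \max\{4,\; 5\} = 5,
\qquad
3k-5 \;\ge\; 2k-1 \iff k \ge 4 .
```

So the 2k-1 term determines the factor only for k = 3, the two terms coincide at k = 4, and the 3k-5 term is the larger one for every k above 4.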

Better than O(k)-approximation?
• Not possible, using only the graph representation
    • Lose information about the structure of the problem
• There exist two instances with:
    • Same underlying graph
    • k-anonymity costs differing by a factor of O(k)

Open problems
• Lower bounds on the approximation factor (without assuming the graph representation)
• Extend the k-anonymity model to account for changes in the database:
    • Handle inserts, deletes and updates