Anonymizing Tables for Privacy Protection
Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu

Krishnaram Kenthapadi, PORTIA Workshop, 8 July

An example: Medical Records

Identifying attributes: SSN, Name. Sensitive attribute: Disease.

SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
614  Rob    46   Hisp   94042    Arthritis

Medical Records: De-identify & Release

Sensitive attribute: Disease.

Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Not sufficient! [Swe02, SS98]

A public database can be linked with the released table to uniquely identify you!

Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Quasi-identifiers (Age, Race, Zipcode): reveal less information. This is the k-anonymity model.

k-anonymity – Problem Definition

Input: A database consisting of n rows, each with m attributes drawn from a finite alphabet.
Goal: Suppress some entries in the table so that each modified row becomes identical to at least k-1 other rows.
The more entries are suppressed, the lower the utility of the modified table.
Objective: Minimize the number of suppressed entries.
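
To make the objective concrete, here is a minimal brute-force sketch that tries every grouping of the rows and reports the smallest number of suppressed entries. It is only usable on tiny tables and is not the algorithm from the talk. The function names (`partitions`, `group_cost`, `optimal_cost`) are hypothetical, and the sample rows are the Age, Race and Zipcode columns of the medical records example.

```python
def partitions(items):
    """Yield every way to split a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into one of the existing groups ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + part


def group_cost(table, group):
    """Entries suppressed when all rows of `group` are made identical."""
    columns = range(len(table[0]))
    differing = sum(len({table[r][j] for r in group}) > 1 for j in columns)
    return differing * len(group)


def optimal_cost(table, k):
    """Minimum suppressed entries over all groupings whose groups have >= k rows.

    Returns None when no such grouping exists (fewer than k rows).
    """
    costs = [
        sum(group_cost(table, g) for g in part)
        for part in partitions(list(range(len(table))))
        if all(len(g) >= k for g in part)
    ]
    return min(costs) if costs else None


# Age, Race, Zipcode of the seven medical records.
records = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"),
    ("46", "Hisp", "94042"),
]
print(optimal_cost(records, 2))  # 10
```

The search is exponential in the number of rows, which is consistent with the hardness results and the reason an approximation algorithm is of interest.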

Medical Records: 2-anonymized table

Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis

Suppressed entries are marked *. Cost = 10.
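
The two properties of this table can be checked mechanically. The sketch below assumes that only the quasi-identifier columns (Age, Race, Zipcode) take part in the comparison and that the Disease column is released unchanged. `is_k_anonymous` and `suppression_cost` are hypothetical helper names.

```python
from collections import Counter

SUPPRESSED = "*"

# Age, Race, Zipcode columns of the 2-anonymized table.
anonymized = [
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("41", "Afr-A", "*"),
    ("41", "Afr-A", "*"),
    ("*", "Hisp", "94042"),
    ("*", "Hisp", "94042"),
]


def is_k_anonymous(rows, k):
    """True if every row is identical to at least k-1 other rows."""
    return all(count >= k for count in Counter(rows).values())


def suppression_cost(rows):
    """Total number of suppressed entries in the table."""
    return sum(cell == SUPPRESSED for row in rows for cell in row)


print(is_k_anonymous(anonymized, 2))  # True
print(is_k_anonymous(anonymized, 3))  # False: two groups have only 2 rows
print(suppression_cost(anonymized))   # 10
```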

k-anonymity – Results

[MW04]:
  NP-hardness for a linear-size alphabet
  O(k log k)-approximation algorithm

This work:
  NP-hardness (even for a ternary alphabet)
  O(k)-approximation for k-anonymity
  approximation for 2-anonymity
  2-approximation for 3-anonymity

O(k)-approximation algorithm (for k = 3)

Create a complete graph such that:
  Each row vector in the table is a vertex.
  The weight of an edge is the number of attributes on which the two rows differ (Hamming distance).

[Figure: rows of the example table (Age, Race, Zipcode) shown as vertices]
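
A minimal sketch of this graph construction, assuming rows are tuples of attribute values. The names `hamming` and `build_graph` are hypothetical, and the four sample rows are taken from the medical records example.

```python
from itertools import combinations


def hamming(u, v):
    """Number of attributes on which rows u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def build_graph(rows):
    """Complete graph as a dict: (i, j) with i < j -> Hamming distance."""
    return {
        (i, j): hamming(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


rows = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
]
graph = build_graph(rows)
print(graph[(0, 1)])  # 2: Age and Zipcode differ
print(graph[(0, 2)])  # 2: Age and Race differ
print(graph[(2, 3)])  # 1: only Zipcode differs
```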

O(k)-approximation algorithm (for k = 3)

We create a forest as follows:
  Each node picks its nearest neighbor and connects to it.
  If the resulting graph has a component with only two nodes, connect this component to the second nearest neighbor of one of the two nodes.
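
The two steps can be sketched as follows for k = 3. This is an illustration under stated assumptions, not the authors' implementation: ties are broken by lowest row index, a union-find structure (`DisjointSet`, a hypothetical helper) tracks components, and `build_forest` is a made-up name. The sample rows are the Age, Race and Zipcode columns of the medical records.

```python
def hamming(u, v):
    """Number of attributes on which rows u and v differ."""
    return sum(a != b for a, b in zip(u, v))


class DisjointSet:
    """Union-find that also tracks component sizes."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def build_forest(rows):
    """Return (edges, disjoint_set) for the k = 3 forest construction."""
    n = len(rows)
    if n < 3:
        raise ValueError("need at least 3 rows for k = 3")
    ds = DisjointSet(n)
    edges = []

    def nearest(u, candidates):
        # Assumption: ties are broken by lowest row index.
        return min(candidates, key=lambda v: (hamming(rows[u], rows[v]), v))

    # Step 1: every node connects to its nearest neighbor.
    # A mutual pair proposes the same edge twice; union() keeps it once.
    for u in range(n):
        v = nearest(u, [w for w in range(n) if w != u])
        if ds.union(u, v):
            edges.append((u, v))

    # Step 2: a two-node component is attached via the nearest node outside
    # it, which is the second nearest neighbor of the chosen node.
    for u in range(n):
        if ds.size[ds.find(u)] == 2:
            outside = [w for w in range(n) if ds.find(w) != ds.find(u)]
            v = nearest(u, outside)
            ds.union(u, v)
            edges.append((u, v))
    return edges, ds


records = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"),
    ("46", "Hisp", "94042"),
]
edges, ds = build_forest(records)
sizes = sorted(ds.size[r] for r in {ds.find(u) for u in range(len(records))})
print(sizes)  # [7]: with this tie-breaking the rows end up in a single tree
```

Because an edge is only kept when it joins two different components, the result is always a forest, and after step 2 no component has fewer than three nodes.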

An example graph

[Figure: example graph; legend: nearest-neighbor edges, other edges]

The forest obtained

[Figure: the forest obtained from the example graph]

O(k)-approximation algorithm (for k = 3)

The forest has:
  Components of size at least 3.
  Total edge cost no more than the cost of the optimal solution.
    In an optimal solution, each node has at least as many *s as its Hamming distance to its second nearest neighbor.
Each node has at most as many *s as the cost of the tree containing the node.
If any component has size greater than 5, break it into components of size at least 3 (resp. k).
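
The bound on the forest's total edge cost can be spelled out as a short charging argument for k = 3. This is a sketch, and the notation is introduced here rather than taken from the slides: d(u, v) is the Hamming distance, d_2(u) the distance from row u to its second nearest neighbor, w(F) the total edge weight of the forest F, and s*(u) the number of suppressed entries of row u in an optimal solution.

```latex
% Each forest edge is added on behalf of one node u and goes to u's nearest or
% second nearest neighbor, so its weight is at most d_2(u). A mutual
% nearest-neighbor pair shares a single edge, so the extra edge added for a
% two-node component can be charged to the node that is still uncharged.
% Hence every node is charged for at most one edge:
\[
  w(F) \;\le\; \sum_{u} d_2(u).
\]
% In an optimal 3-anonymous table, row u becomes identical to at least two
% other rows v and w, so u is suppressed wherever it differs from either one:
\[
  s^{*}(u) \;\ge\; \max\{\, d(u,v),\; d(u,w) \,\} \;\ge\; d_2(u).
\]
% Summing over all rows:
\[
  w(F) \;\le\; \sum_{u} d_2(u) \;\le\; \sum_{u} s^{*}(u) \;=\; \mathrm{OPT}.
\]
```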

The final partition

[Figure: the final partition]

Analysis of the algorithm

Cluster the row vectors according to this partition.
Cost incurred ≤ OPT × (size of the largest component of the partition) ≤ 5 × OPT.
For general k, the cost of this solution is within a factor of max{3k-5, 2k-1} of the cost of the optimal solution.
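
Clustering by a partition means suppressing, within each group, every column on which the group's rows disagree. The sketch below shows that step together with the stated factor. The names `anonymize_by_partition` and `approximation_factor` are hypothetical. The sample partition is the grouping behind the 2-anonymized medical records table, used only as sample input; it is not the output of the forest algorithm.

```python
def anonymize_by_partition(rows, partition, star="*"):
    """Suppress, in each group, the columns on which its rows disagree."""
    out = [None] * len(rows)
    for group in partition:
        width = len(rows[group[0]])
        keep = [len({rows[r][j] for r in group}) == 1 for j in range(width)]
        for r in group:
            out[r] = tuple(
                cell if keep[j] else star for j, cell in enumerate(rows[r])
            )
    return out


def approximation_factor(k):
    """Factor stated on the slide for general k."""
    return max(3 * k - 5, 2 * k - 1)


# Age, Race, Zipcode of the seven medical records.
records = [
    ("31", "Cauc", "94305"),
    ("34", "Cauc", "94307"),
    ("27", "Cauc", "94301"),
    ("41", "Afr-A", "94305"),
    ("41", "Afr-A", "94059"),
    ("65", "Hisp", "94042"),
    ("46", "Hisp", "94042"),
]
partition = [[0, 1, 2], [3, 4], [5, 6]]

table = anonymize_by_partition(records, partition)
cost = sum(cell == "*" for row in table for cell in row)
print(table[0])                 # ('*', 'Cauc', '*')
print(cost)                     # 10
print(approximation_factor(3))  # 5
```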

Better than O(k)-approximation?

Not possible using only the graph representation:
  The graph loses information about the structure of the problem.
  There exist two instances with the same underlying graph whose k-anonymity costs differ by a factor of Ω(k).

Open problems

Lower bounds on the approximation factor (without assuming the graph representation).
Extend the k-anonymity model to account for changes in the database: handle inserts, deletes and updates.