1
Towards Achieving Anonymity
An Zhu
2
Introduction
Collect and analyze personal data
Infer trends and patterns
Making the personal data "public"
Joining multiple sources
Third-party involvement
Privacy concerns
Q: How to share such data?
3
Example: Medical Records

  Identifiers                                  Sensitive Info
  SSN   Name    Age   Race    Zipcode          Disease
  614   Sara    31    Cauc    94305            Flu
  615   Joan    34    Cauc    94307            Cold
  629   Kelly   27    Cauc    94301            Diabetes
  710   Mike    41    Afr-A   94305            Flu
  840   Carl    41    Afr-A   94059            Arthritis
  780   Joe     65    Hisp    94042            Heart problem
  616   Rob     46    Hisp    94042            Arthritis
4
De-identified Records

                                Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis
5
Not Sufficient! [Sweeney '00]

                                Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis

(figure: joining with a public database makes these attributes unique identifiers!)
6
Not Sufficient! [Sweeney '00]

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis

(figure: joining with a public database makes the quasi-identifiers unique identifiers!)
7
Anonymize the Quasi-Identifiers!

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  *     *       *               Flu
  *     *       *               Cold
  *     *       *               Diabetes
  *     *       *               Flu
  *     *       *               Arthritis
  *     *       *               Heart problem
  *     *       *               Arthritis

(figure: the public-database join from the previous slides)
8
Q: How to share such data?
Anonymize the quasi-identifiers
  Suppress information
    Privacy guarantee: anonymity
    Quality: the amount of suppressed information
  Clustering
    Privacy guarantee: cluster size
    Quality: various clustering measures
10
k-anonymized Table [Samarati '01]

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis
11
k-anonymized Table [Samarati '01]
Each row is identical to at least k-1 other rows

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  *     Cauc    *               Flu
  *     Cauc    *               Cold
  *     Cauc    *               Diabetes
  41    Afr-A   *               Flu
  41    Afr-A   *               Arthritis
  *     Hisp    94042           Heart problem
  *     Hisp    94042           Arthritis
12
Definition: k-anonymity
Input: a table consisting of n rows, each with m attributes (the quasi-identifiers)
Output: suppress some entries so that each row is identical to at least k-1 other rows
Objective: minimize the number of suppressed entries
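A minimal sketch (my code, not the talk's) of this definition: checking k-anonymity of a suppressed table and computing the objective, on the quasi-identifiers of the anonymized table above.

```python
# A minimal sketch, assuming '*' marks a suppressed entry.
from collections import Counter

def is_k_anonymous(rows, k):
    """Every row's quasi-identifier tuple must occur at least k times."""
    counts = Counter(tuple(r) for r in rows)
    return all(counts[tuple(r)] >= k for r in rows)

def suppression_cost(rows):
    """Objective value: total number of suppressed entries."""
    return sum(cell == "*" for r in rows for cell in r)

# Quasi-identifiers (Age, Race, Zipcode) from the anonymized table above.
table = [
    ["*", "Cauc", "*"], ["*", "Cauc", "*"], ["*", "Cauc", "*"],
    ["41", "Afr-A", "*"], ["41", "Afr-A", "*"],
    ["*", "Hisp", "94042"], ["*", "Hisp", "94042"],
]
print(is_k_anonymous(table, 2))   # True
print(is_k_anonymous(table, 3))   # False: the Afr-A and Hisp groups have only 2 rows
print(suppression_cost(table))    # 10 suppressed entries
```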
13
Past Work and New Results
[MW '04]
  NP-hardness for a large-size alphabet
  O(k log k)-approximation
[AFKMPTZ '05]
  NP-hardness even for a ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity
15
Graph Representation
  A: 001000
  B: 100101
  C: 010101
  D: 001000
  E: 110111
  F: 011011
w(e) = Hamming distance between the two rows
(figure: graph on the vertices A-F with weighted edges)
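A small sketch (my code, not the talk's) of this construction: one vertex per row, one edge per pair of rows, weighted by Hamming distance.

```python
from itertools import combinations

rows = {
    "A": "001000", "B": "100101", "C": "010101",
    "D": "001000", "E": "110111", "F": "011011",
}

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

# One vertex per row; edge weight = Hamming distance between the two rows.
edges = {(u, v): hamming(rows[u], rows[v]) for u, v in combinations(rows, 2)}
print(edges[("A", "D")])  # 0: identical rows
print(edges[("B", "C")])  # 2
```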
16
Edge Selection I (k = 3)
Each node selects its lightest-weight edge
  A: 001000
  B: 100101
  C: 010101
  D: 001000
  E: 110111
  F: 011011
(figure: the selected edges among A-F with their weights)
17
Edge Selection II (k = 3)
For components with fewer than k vertices, add more edges
  A: 001000
  B: 100101
  C: 010101
  D: 001000
  E: 110111
  F: 011011
(figure: edges added until every component has at least k = 3 vertices)
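A rough sketch of these two selection steps, under my own simplifying assumptions (rows given as a dict of binary strings, naive tie-breaking, at least k rows); the paper's exact charging rule is not reproduced here.

```python
# Step I: every node picks its lightest incident edge.
# Step II: any component with fewer than k vertices picks its lightest outgoing edge.
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def select_edges(rows, k):
    names = list(rows)
    parent = {v: v for v in names}                     # union-find over components

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def components():
        comps = {}
        for v in names:
            comps.setdefault(find(v), []).append(v)
        return list(comps.values())

    chosen = set()

    def add(u, v):
        chosen.add(tuple(sorted((u, v))))
        parent[find(u)] = find(v)

    # Step I: each node selects its lightest-weight edge.
    for u in names:
        v = min((w for w in names if w != u),
                key=lambda w: hamming(rows[u], rows[w]))
        add(u, v)

    # Step II: small components grab their lightest outgoing edge until
    # every component has at least k vertices.
    small = [c for c in components() if len(c) < k]
    while small:
        comp = small[0]
        u, v = min(((a, b) for a in comp for b in names if find(b) != find(a)),
                   key=lambda e: hamming(rows[e[0]], rows[e[1]]))
        add(u, v)
        small = [c for c in components() if len(c) < k]
    return chosen

rows = {"A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011"}
print(sorted(select_edges(rows, 3)))
# [('A', 'D'), ('A', 'F'), ('B', 'C'), ('B', 'E')] -> components {A, D, F}, {B, C, E}
```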
18
Lemma
The total weight of the selected edges is no more than OPT
In the optimal solution, each vertex pays at least the weight of its (k-1)-st lightest edge
The selection is a forest: at most one edge is charged to each vertex
By construction, each charged edge weighs no more than that vertex's (k-1)-st lightest edge
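The same charging argument in symbols, with e_v the edge charged to vertex v and d_{k-1}(v) the weight of v's (k-1)-st lightest edge:

```latex
\[
w(F) \;=\; \sum_{v} w(e_v) \;\le\; \sum_{v} d_{k-1}(v) \;\le\; \mathrm{OPT}
\]
% at most one edge $e_v$ is charged to each vertex; each charged edge weighs at
% most $d_{k-1}(v)$; and in OPT, row $v$ needs at least $d_{k-1}(v)$ suppressed entries.
```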
19
Grouping
Ideally, each connected component forms a group
Anonymize the vertices within a group
Total cost of a group: (total edge weights) × (number of nodes), e.g. (2+2+3+3) × 6 = 60 for the component shown
(figure: a single component on A-F with edge weights 0, 2, 2, 3, 3)
Small groups: size O(k)
20
Dividing a Component
Root the tree arbitrarily
Divide if both the sub-tree and the rest have ≥ k vertices
Aim: all remaining sub-trees have < k vertices
(figure: sub-trees annotated with ≥ k and < k)
21
Dividing a Component
Root the tree arbitrarily
Divide if both the sub-tree and the rest have ≥ k vertices
Rotate the tree if necessary
(figure: sub-trees annotated with ≥ k)
22
Dividing a Component
Root the tree arbitrarily
Divide if both the sub-tree and the rest have ≥ k vertices
Termination condition: each remaining component has at most max(2k-1, 3k-5) vertices
(figure: all sub-trees annotated with < k)
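A small sketch of the splitting idea, under my own simplifying assumptions: the tree is given as an already-rooted child map, and the rotation step and the max(2k-1, 3k-5) size analysis are omitted.

```python
# A small sketch; the tree is a dict: node -> list of children (already rooted).
def split_component(children, root, k):
    groups = []

    def size(v):
        return 1 + sum(size(c) for c in children.get(v, []))

    def nodes(v):
        out = [v]
        for c in children.get(v, []):
            out += nodes(c)
        return out

    def find_cut(v, total):
        # Look for a subtree with >= k vertices whose removal leaves >= k vertices.
        for c in children.get(v, []):
            s = size(c)
            if s >= k and total - s >= k:
                return v, c
            found = find_cut(c, total)
            if found:
                return found
        return None

    def walk(v):
        cut = find_cut(v, size(v))
        if cut is None:
            groups.append(nodes(v))       # nothing more to cut: emit one group
            return
        parent, child = cut
        children[parent].remove(child)    # cut the edge (parent, child)
        walk(child)
        walk(v)

    walk(root)
    return groups

# Example: a path a-b-c-d-e-f rooted at a, split with k = 3.
tree = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": ["f"], "f": []}
print(split_component(tree, "a", 3))   # [['d', 'e', 'f'], ['a', 'b', 'c']]
```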
23
An Example
  A: 001000
  B: 100101
  C: 010101
  D: 001000
  E: 110111
  F: 011011
(figure: the selected edges on A-F, with weights 0, 2, 2, 3, 3)
24
An Example
  A: 001000
  B: 100101
  C: 010101
  D: 001000
  E: 110111
  F: 011011
(figure: the resulting forest over A-F, drawn as rooted trees with edge weights)
25
An Example
  A: 0*10**
  B: **01*1
  C: **01*1
  D: 0*10**
  E: **01*1
  F: 0*10**
Estimated cost: 4·3 + 3·3 = 21
Optimal cost: 3·3 + 3·3 = 18
(figure: the two groups {A, D, F} and {B, C, E} in the forest)
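A quick check (my code, not the talk's) of these numbers, assuming the grouping {A, D, F} and {B, C, E} shown above; a group's suppression cost is (number of columns where its rows disagree) × (group size).

```python
# A quick check, assuming the grouping {A, D, F} and {B, C, E} shown above.
rows = {"A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011"}

def group_cost(names):
    cols = zip(*(rows[n] for n in names))
    suppressed = sum(len(set(col)) > 1 for col in cols)   # columns to suppress
    return suppressed * len(names)                         # ... in every row of the group

print(group_cost("ADF") + group_cost("BCE"))   # 9 + 9 = 18, the optimal cost above
# The tree-based estimate instead charges (total edge weight) x (group size):
# (0 + 3) x 3 + (2 + 2) x 3 = 21.
```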
26
Past Work and New Results
[MW '04]
  NP-hardness for a large-size alphabet
  O(k log k)-approximation
[AFKMPTZ '05]
  NP-hardness even for a ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity
27
1.5-approximation
  A: 001000
  B: 000000
  C: 111111
  D: 001000
  E: 110111
  F: 110111
w(e) = Hamming distance between the two rows
(figure: graph on the vertices A-F with weighted edges)
28
Minimum {1,2}-matching
Each vertex is matched to 1 or 2 other vertices
  A: 001000
  B: 000000
  C: 111111
  D: 001000
  E: 110111
  F: 110111
(figure: the matching edges among A-F with their weights)
29
Properties
Each component has at most 3 nodes
Components with > 3 nodes are either not optimal or not possible (each vertex has degree at most 2)
30
Cost
Cost ≤ 2·OPT; for a binary alphabet: ≤ 1.5·OPT
(figure: a matched pair with weight a, and a matched triple with weights p, q, r where p, q ≤ r)
Pair: OPT pays 2a; we pay 2a
Triple: OPT pays p + q + r; we pay 3(p + q) ≤ 2(p + q + r)
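A one-line check of the final inequality, using the labeling p, q ≤ r from the figure:

```latex
\[
p, q \le r \;\Rightarrow\; p + q \le 2r
\;\Rightarrow\; 3(p+q) = 2(p+q) + (p+q) \;\le\; 2(p+q) + 2r = 2(p+q+r).
\]
```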
31
Past Work and New Results
[MW '04]
  NP-hardness for a large-size alphabet
  O(k log k)-approximation
[AFKMPTZ '05]
  NP-hardness even for a ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity
32
Open Problems
Can we improve O(k)?
Ω(k) gap for the graph representation
33
Open Problems
Can we improve O(k)?
Ω(k) gap for the graph representation
  1111111100000000000000000000000000000000
  0000000011111111000000000000000000000000
  0000000000000000111111110000000000000000
  0000000000000000000000001111111100000000
  0000000000000000000000000000000011111111
k = 5, d = 16, c = k·d/2
35
Open Problems
Can we improve O(k)?
Ω(k) gap for the graph representation
  10101010101010101010101010101010
  11001100110011001100110011001100
  11110000111100001111000011110000
  11111111000000001111111100000000
  11111111111111110000000000000000
k = 5, d = 16, c = 2·d
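An illustrative check (my code, not the talk's): the two instances above induce the same weighted graph, every pairwise Hamming distance being d = 16, yet they behave differently when all k rows are anonymized together; the naive per-row count below need not match the talk's exact constants c.

```python
# An illustrative check; the per-row cost counts non-constant columns.
from itertools import combinations

instance_1 = ["1" * 8 + "0" * 32, "0" * 8 + "1" * 8 + "0" * 24,
              "0" * 16 + "1" * 8 + "0" * 16, "0" * 24 + "1" * 8 + "0" * 8,
              "0" * 32 + "1" * 8]
instance_2 = ["10" * 16, "1100" * 8, "11110000" * 4,
              "1" * 8 + "0" * 8 + "1" * 8 + "0" * 8, "1" * 16 + "0" * 16]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def per_row_cost(rows):
    """Suppressed entries per row when all rows form a single group."""
    return sum(len(set(col)) > 1 for col in zip(*rows))

for inst in (instance_1, instance_2):
    distances = {hamming(x, y) for x, y in combinations(inst, 2)}
    print(distances, per_row_cost(inst))   # identical distance sets, different costs
```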
37
Q: How to share such data?
Anonymize the quasi-identifiers
  Suppress information
    Privacy guarantee: anonymity
    Quality: the amount of suppressed information
  Clustering
    Privacy guarantee: cluster size
    Quality: various clustering measures
38
Clustering Approach [AFKKPTZ '06]

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis
39
Transform into a Metric…

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis
40
Clusters and Centers

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
  34    Cauc    94307           Cold
  27    Cauc    94301           Diabetes
  41    Afr-A   94305           Flu
  41    Afr-A   94059           Arthritis
  65    Hisp    94042           Heart problem
  46    Hisp    94042           Arthritis
41
Clusters and Centers

  Quasi-Identifiers             Sensitive Info
  Age   Race    Zipcode         Disease
  31    Cauc    94305           Flu
                                Cold
                                Diabetes
                                Flu
  41    Afr-A   94059           Arthritis
                                Heart problem
  46    Hisp    94042           Arthritis
42
Measure
How good are the clusters? "Tight" clusters are better
Minimize the max radius: Gather-k
Minimize the distortion error: Cellular-k (radius × num_nodes, summed over clusters)
(figure: example clusters with cost Gather-k: 10, Cellular-k: 624)
43
Measure
How good are the clusters? "Tight" clusters are better
Minimize the max radius: Gather-k
Minimize the distortion error: Cellular-k (radius × num_nodes, summed over clusters)
Handle outliers
Constant approximations!
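A minimal sketch (my code, not the talk's) of how the two objectives score a given clustering, with the centers given and any cluster setup cost ignored.

```python
# A minimal sketch: score a given clustering under both objectives.
def gather_cost(clusters, centers, dist):
    """Gather-k objective: the maximum cluster radius."""
    return max(max(dist(p, c) for p in cl) for cl, c in zip(clusters, centers))

def cellular_cost(clusters, centers, dist):
    """Cellular-k objective: radius x num_nodes, summed over clusters."""
    return sum(max(dist(p, c) for p in cl) * len(cl)
               for cl, c in zip(clusters, centers))

# Toy example on the real line with two clusters.
dist = lambda a, b: abs(a - b)
clusters = [[1, 2, 3], [10, 12]]
centers = [2, 11]
print(gather_cost(clusters, centers, dist))    # 1
print(cellular_cost(clusters, centers, dist))  # 1*3 + 1*2 = 5
```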
44
Comparison (k = 5)
5-anonymity: suppress all entries; more distortion
Clustering: can pick R5 as the center; less distortion
Distortion is directly related to the pair-wise distances

  R1  0111
  R2  1011
  R3  1101
  R4  1110
  R5  1111
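A quick check (my code, not the talk's) of this comparison: with these five rows and k = 5, anonymization suppresses every entry, while clustering around R5 keeps every row within Hamming distance 1 of the published center.

```python
# A quick check of the comparison above.
rows = {"R1": "0111", "R2": "1011", "R3": "1101", "R4": "1110", "R5": "1111"}

# 5-anonymity: every column contains a single 0, so all columns are suppressed.
suppressed_columns = sum(len(set(col)) > 1 for col in zip(*rows.values()))
print(suppressed_columns * len(rows))   # 4 columns x 5 rows = 20 suppressed entries

# Clustering: publish R5 = 1111 as the center of the single cluster.
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
print(max(hamming(r, rows["R5"]) for r in rows.values()))  # radius 1 around R5
```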
45
Results [AFKKPTZ '06]
Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation
Cellular-k
  Primal-dual constant approximation
  Extensions as well
47
2-approximation
Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
(figure: a node A with its R-ball and 2R-ball)
48
2-approximation
Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R.
Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
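A rough sketch (my code, not the talk's exact procedure) of the selection step for a guessed radius R; the matching-based reassignment that follows in the talk is not implemented here.

```python
# A rough sketch for a guessed optimal radius R.
def select_centers(points, dist, R, k):
    # Quick pruning: every point (counting itself) needs at least k points
    # within distance 2R, otherwise the guess R is too small.
    for p in points:
        if sum(dist(p, q) <= 2 * R for q in points) < k:
            return None

    remaining = list(points)
    centers = []
    while remaining:
        c = remaining[0]                  # pick an arbitrary remaining node
        centers.append(c)
        remaining = [q for q in remaining if dist(c, q) > 2 * R]
    return centers

# Toy example on the line with k = 2 and a guessed R = 1.
points = [0, 1, 2, 10, 11]
dist = lambda a, b: abs(a - b)
print(select_centers(points, dist, 1, 2))   # [0, 10]
```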
49
Example: k = 5
50
Optimal Solution
(figure: two optimal clusters, labeled 1 and 2, each of radius R)
51
Center Selection
52
(figure: the first center, labeled 1, is picked)
53
(figure: the 2R-ball around center 1)
54
Center Selection
(figure: center 1 with its 2R-ball; the nodes inside are removed)
55
Center Selection
(figure: center 2 is picked next, with its 2R-ball)
57
Reassignment
(figure: the remaining nodes are reassigned to centers 1 and 2)
58
Degree-Constrained Matching
Each selected center must receive at least k-1 nodes; each node is assigned to exactly 1 center
(figure: the matching between centers 1, 2 and the remaining nodes)
59
Actual Clustering
(figure: the two clusters produced by the algorithm, around centers 1 and 2)
60
Optimal Clustering
(figure: the two optimal clusters, for comparison)
61
Our guarantees
Return clusters of radius no more than 2R
If R is guessed correctly, then reassignment is possible
Each cluster has at least k nodes
A binary search on the value of R suffices
62
Binary Search on R
Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R.
Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
63
Binary Search on R
Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
  (Not necessary, but useful for quick pruning)
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R.
Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
64
Binary Search on R
Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
  (Not necessary, but useful for quick pruning)
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R.
Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
  If successful, R could be smaller; otherwise, R should be larger.
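A rough sketch (my code, not the talk's) of the search over R, restricted to the pairwise distances as candidate radii; `feasible` below is only a crude stand-in for the full test (center selection plus the matching-based reassignment).

```python
# A rough sketch of the search; `feasible` is only a crude stand-in.
from itertools import combinations

def binary_search_radius(points, dist, feasible):
    candidates = sorted({dist(p, q) for p, q in combinations(points, 2)})
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            best = candidates[mid]        # feasible: try a smaller radius
            hi = mid - 1
        else:
            lo = mid + 1                  # infeasible: the radius must grow
    return best

# Crude stand-in test: every point has at least k points within distance 2R.
def make_feasible(points, dist, k):
    return lambda R: all(sum(dist(p, q) <= 2 * R for q in points) >= k
                         for p in points)

points = [0, 1, 2, 10, 11]
dist = lambda a, b: abs(a - b)
print(binary_search_radius(points, dist, make_feasible(points, dist, 2)))   # 1
```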
65
Results [AFKKPTZ '06]
Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation
Cellular-k
  Primal-dual constant approximation
  Extensions
66
Ignore Cluster Size Constraint
Similar to Facility Location: radius × num_nodes vs. individual distance to center
Caveat: assigning one distant node to an existing cluster increases the cost in proportion to the number of nodes in that cluster
Each cluster is a (center, radius) pair
67
Intermediate Step I
Primal-dual constant approximation for radius × num_nodes
  No cluster size constraint
  Arbitrary cluster setup cost
We want radius × num_nodes with
  A cluster size constraint
  No cluster setup cost
68
Enforce Cluster Size
Introduce an extra cluster setup cost
The setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k·r
This at most doubles the actual cost of any size-constrained cluster solution:
each cluster's total cost is at least k·r
69
Intermediate Step II
Shared solution!
For each cluster with fewer than k nodes, additional nodes can join the cluster
  At no additional cost: this is paid for by the cluster setup cost
Now nodes can be shared among multiple clusters
Key: convert a "shared" solution into a disjoint solution
70
Attached Separation
Start from the small-radius clusters
"Open" a cluster as long as there are enough nodes
The leftover points in the other clusters "attach" to the intersecting smaller-radius (open) clusters
(figure: an open cluster with attached clusters)
71
Regroup (k = 5)
An open cluster has ≥ k nodes; an attached cluster has < k nodes
Group clusters to create bigger ones
Choose the "fat" cluster's center as the new center
(figure: clusters of sizes 3, 2, 4, and 6 being merged)
72
What About Cluster Cost?
These clusters intersect with the open cluster
73
What About Cluster Cost?
These clusters intersect with the open cluster
The routing cost is only a constant blowup w.r.t. the fat radius
74
What About Cluster Cost?
These clusters intersect with the open cluster
The routing cost is only a constant blowup w.r.t. the fat radius
Need to make sure the merged cluster is of reasonable size
75
Recap
Anonymize the quasi-identifiers
  Suppress information
    Privacy guarantee: anonymity
    Quality: the amount of suppressed information
  Clustering
    Privacy guarantee: cluster size
    Quality: various clustering measures
76
Thanks!