De-identifying Health Data: Measuring and Controlling Disclosure Risk


1 De-identifying Health Data: Measuring and Controlling Disclosure Risk
Traian Marius Truta

2 Content of the Talk
- Global Disclosure Risk
  - Remove Identifiers
  - Sampling
  - Microaggregation
  - Any combination of masking techniques
- Anonymity models
  - Information Loss
  - Greedy algorithms
  - Constrained k-anonymity
April 30, 2009 Traian Marius Truta – DIMACS Tutorial

3 Global Disclosure Risk Measures
Assumptions
- The intruder does not know any confidential information.
- The intruder knows all the key and identifier values for the population.
Objectives
- DR measures for specific DC methods (Remove Identifiers, Sampling, Microaggregation, etc.).
- DR measures for any combination of DC methods.
Proposed measures
- DRmin ≤ DRW ≤ DRmax [Truta 2003, Truta 2004]

4 Notations for IM and IMM
n – the number of entities in the population.
F – the number of clusters with the same values for key attributes.
Ak – the set of elements from the k-th cluster, for all k, 1 ≤ k ≤ F.
Fi = | {Ak | |Ak| = i, k = 1, .., F} | for all i, 1 ≤ i ≤ n. Fi represents the number of clusters of size i.
ni = | {x ∈ Ak | |Ak| = i, k = 1, .., F} | for all i, 1 ≤ i ≤ n. ni represents the number of records in clusters of size i.

6 Disclosure Risk Measures for Remove Identifiers Method
RecID  Age  State  Diagnosis     Income  Billing
1      44   MI     AIDS          45,500  1,200
2                  Asthma        37,900  2,500
3      55                        67,000  3,000
4                                21,000  1,000
5                                90,000  900
6      45          Diabetes      48,000  750
7      25   IN                   49,000
8      35                        66,000  2,200
9                                69,000  4,200
10                 Tuberculosis  34,000  3,100
Clusters: {1, 2, 4}, {3, 5, 9}, {6, 10}, {7}, {8}
n = 10; n1 = 2, n2 = 2, n3 = 6; F = 5; F1 = 2, F2 = 1, F3 = 2
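The counts on this slide follow directly from the cluster list. A minimal Python sketch, assuming the clusters are given as sets of record IDs (the function name `cluster_statistics` is made up for illustration):

```python
from collections import Counter

# Clusters of record IDs that share the same key-attribute values (from the slide).
clusters = [{1, 2, 4}, {3, 5, 9}, {6, 10}, {7}, {8}]

def cluster_statistics(clusters):
    """Return (n, F, F_i, n_i) for a partition of the records into clusters."""
    sizes = [len(c) for c in clusters]
    F_i = Counter(sizes)                              # F_i: number of clusters of size i
    n_i = {i: i * count for i, count in F_i.items()}  # n_i: records in clusters of size i
    return sum(sizes), len(clusters), dict(F_i), n_i

n, F, F_i, n_i = cluster_statistics(clusters)
print(n, F, F_i, n_i)
```

Sizes that do not occur (for example i = 4) are simply absent from the dictionaries, which corresponds to Fi = ni = 0.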

7 Disclosure Risk Measures for Remove Identifiers Method
DRmin - percentage of unique records.
DRmax - considers probabilistic linkage.
DRW - weights defined by data owner.
w = (w1, w2, …, wn) is the disclosure risk weight vector.
Properties
a) wi ∈ R+ for all i = 1, .., n;
b) wi ≥ wj for all i ≤ j, i, j = 1, .., n;

8 Disclosure Risk Measures for Remove Identifiers Method
Microdata as on slide 6.
n = 10; n1 = 2, n2 = 2, n3 = 6; F = 5; F1 = 2, F2 = 1, F3 = 2
w1 = (5, 5, 0, 0, ..., 0); w2 = (4, 3, 3, 0, ..., 0)
DRmin = 0.2, DRw1 = 0.3, DRw2 = 0.425, DRmax = 0.5
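The formula images for these measures did not survive in the transcript. The sketch below uses formulas that are an assumption, chosen because they reproduce all four values on this slide: DRmin = n1 / n, DRmax = (1/n) Σ ni / i, and DRW = (1 / (n · w1)) Σ wi · ni / i. The function names are illustrative only.

```python
def dr_min(n_i, n):
    """Share of records that are unique on their key values."""
    return n_i.get(1, 0) / n

def dr_max(n_i, n):
    """Each record in a cluster of size i is linked with probability 1/i."""
    return sum(count / size for size, count in n_i.items()) / n

def dr_w(n_i, n, w):
    """Weighted measure; w[0] holds w1, w[1] holds w2, and so on."""
    total = sum(w[size - 1] * count / size for size, count in n_i.items())
    return total / (n * w[0])

n = 10
n_i = {1: 2, 2: 2, 3: 6}                 # records in clusters of size 1, 2, 3
w1 = [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
w2 = [4, 3, 3, 0, 0, 0, 0, 0, 0, 0]

print(dr_min(n_i, n), dr_w(n_i, n, w1), dr_w(n_i, n, w2), dr_max(n_i, n))
```

Under these assumed formulas, a weight vector that is zero beyond w1 gives DRmin and a constant weight vector gives DRmax, which is consistent with the ordering DRmin ≤ DRW ≤ DRmax.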

10 Notations for Masked Microdata
f – the number of clusters with the same values for key attributes in M. We cluster all records from M based on their key values.
Bk – the set of elements from the k-th cluster, for all k, 1 ≤ k ≤ f.
fi = | {Bk | |Bk| = i, k = 1, .., f} | for all i, 1 ≤ i ≤ n. fi represents the number of clusters of size i.
ti = | {x ∈ Bk | |Bk| = i, k = 1, .., f} | for all i, 1 ≤ i ≤ n. ti represents the number of records in clusters of size i.
C – the classification matrix. For all i, j = 1, .., n: cij = | {x | x ∈ Bk and x ∈ Ap, with |Bk| = i and |Ap| = j, for some k = 1, .., f and p = 1, .., F} |. Each element cij of C represents the number of records that appear in clusters of size i in the masked microdata and appeared in clusters of size j in the initial microdata.

11 Disclosure Risk Measures for Sampling
Initial microdata as on slide 6: n = 10; n1 = 2, n2 = 2, n3 = 6; F = 5; F1 = 2, F2 = 1, F3 = 2
Sampled (masked) microdata:
RecID  Age  State  Diagnosis  Income  Billing
1      44   MI     AIDS       45,500  1,200
2                  Asthma     37,900  2,500
4                             21,000  1,000
8      35                     66,000  2,200
9      55                     69,000  4,200
t = 5; t1 = 2, t2 = 0, t3 = 3; f = 3; f1 = 2, f2 = 0, f3 = 1

12 Algorithm for Creating Classification Matrix
Initialize each element of C with 0.
For each element s from the masked microdata MM do
  Count the number of occurrences of the key values of s in the masked microdata MM. Let i be this number.
  Count the number of occurrences of the key values of s in the initial microdata IM. Let j be this number.
  Increment cij by 1.
End for.
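The loop above can be sketched in Python. This is a minimal illustration, assuming each record is represented only by its key values; the labels "A" to "E" are hypothetical stand-ins for the five key-value combinations of the sampling example, and C is stored sparsely as a dictionary instead of an n × n matrix.

```python
from collections import Counter

def classification_matrix(initial_keys, masked_keys):
    """Return {(i, j): c_ij}, following the slide's algorithm step by step."""
    im_counts = Counter(initial_keys)   # occurrences of each key in IM
    mm_counts = Counter(masked_keys)    # occurrences of each key in MM
    c = Counter()                       # every c_ij starts at 0
    for key in masked_keys:
        i = mm_counts[key]              # cluster size in the masked microdata
        j = im_counts[key]              # cluster size in the initial microdata
        c[(i, j)] += 1
    return dict(c)

# Records 1..10 of the initial microdata, labelled by cluster.
initial_keys = ["A", "A", "B", "A", "B", "C", "D", "E", "B", "C"]
# Sampled records 1, 2, 4, 8 and 9.
masked_keys = ["A", "A", "A", "E", "B"]

print(classification_matrix(initial_keys, masked_keys))
```

For the sample, record 8 stays unique (c11), record 9 becomes unique although it came from a cluster of size 3 (c13), and records 1, 2 and 4 stay together (c33).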

13 Disclosure Risk Measures for Sampling
- disclosure risk weight vector

14 Disclosure Risk Measures for Sampling
Initial and sampled microdata as on slide 11.
DRmin DRw1 DRw2 DRmax 0.1 0.144 0.233
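The formulas for the sampling case are not legible in the transcript. As a hedged check, the sketch below assumes DRmin = c11 / n and DRmax = (1/n) Σ cij / j, where j is the cluster size in the initial microdata. These assumed forms reproduce the first and last values listed on the slide (0.1 and 0.233); the weighted measures are left out because the weight matrices are not given.

```python
def sampling_dr_min(c, n):
    """Records unique in both the masked and the initial microdata."""
    return c.get((1, 1), 0) / n

def sampling_dr_max(c, n):
    """A record from an initial cluster of size j is linked with probability 1/j."""
    return sum(count / j for (i, j), count in c.items()) / n

# Classification matrix of the sampling example, stored sparsely.
c = {(1, 1): 1, (1, 3): 1, (3, 3): 3}
n = 10  # size of the initial microdata

print(round(sampling_dr_min(c, n), 3), round(sampling_dr_max(c, n), 3))
```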

16 Disclosure Risk Measures for Microaggregation Method
Initial Microdata
RecID  Name           SSN  Age  Sex     Diagnosis
1      John Wayne          8    Male    AIDS
2      Pete Gore           10           Asthma
3      John Banks          19
4      Jessica Casey       23   Female
5      Mary Stone          37
6      Patricia Kopi       43           Diabetes
7      Stan Simms          68
8      Kim Wood            72

17 Disclosure Risk Measures for Microaggregation Method
Univariate microaggregation for attribute Age with group size = 2, 4, 8. Name and SSN are removed; Sex and Diagnosis are as in the initial microdata.
RecID  Age in Masked Microdata 1 (size 2)  Age in Masked Microdata 2 (size 4)  Age in Masked Microdata 3 (size 8)
1      9                                   15                                  35
2      9                                   15                                  35
3      21                                  15                                  35
4      21                                  15                                  35
5      40                                  55                                  35
6      40                                  55                                  35
7      70                                  55                                  35
8      70                                  55                                  35
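The three masked Age columns can be reproduced with a short sketch of fixed-size univariate microaggregation: sort the values, cut them into consecutive groups, and replace each value by its group mean. The function name is illustrative, and leaving a short final group as it is (when the number of records is not a multiple of the group size) is a simplification.

```python
def microaggregate(values, size):
    """Replace each value by the mean of its group of `size` neighbours in sorted order."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    masked = [None] * len(values)
    for start in range(0, len(order), size):
        group = order[start:start + size]
        mean = sum(values[idx] for idx in group) / len(group)
        for idx in group:
            masked[idx] = mean
    return masked

ages = [8, 10, 19, 23, 37, 43, 68, 72]   # Age column of the initial microdata

for size in (2, 4, 8):
    print(size, microaggregate(ages, size))
```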

18 Disclosure Risk Measures for Microaggregation Method

19 Disclosure Risk Measures for Microaggregation Method
Example – Disclosure risk values
W1 W2 W3 W4
MM1 0.5 0.75 0.612
MM2 0.25 0.367
MM3

20 Disclosure Risk Measures for Sampling and Microaggregation Methods

21 Disclosure Risk Measures for Sampling and Microaggregation Methods

22 Global Disclosure Risk Measures
The measures apply to:
- Remove Identifiers
- Sampling
- Microaggregation, Top and Bottom Coding, etc.
- Combination of those methods
This approach does not work for:
- Random Noise, Data Swapping
- Global and Local Recoding, etc.

24 General Disclosure Risk Measures
Ordered Attribute, Partial Ordered Attribute, Unordered Attribute
Inversion, Weak and strong change, Change
Inversion Factor, Change Factor, Inversion-Change Factor

25 Inversion Factor
Initial Microdata
RecID  Age  Zip    Diagnosis  Income
1      17   48202  AIDS       17,000
2      24                     68,000
3      44   48201  Asthma     80,000
4      55   48310             55,000
5      71          Diabetes   23,000
Masked Microdata: identical to the initial microdata except that the Age of record 1 is 34 and the Age of record 3 is 81.
Inversions for Age: (1, 2); (3, 4) and (3, 5)
ifAge = 3 / 5 = 0.6
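The inversion count can be checked with a small sketch. It treats a pair of records as an inversion when their Age order in the initial microdata is reversed in the masked microdata, and it divides by the number of records only because that matches the slide's 3 / 5; the general normalization is an assumption here.

```python
from itertools import combinations

def inversions(initial, masked):
    """Return the 1-based record pairs whose relative order is reversed by masking."""
    pairs = []
    for a, b in combinations(range(len(initial)), 2):
        before = initial[a] - initial[b]
        after = masked[a] - masked[b]
        if before * after < 0:          # strict order flipped
            pairs.append((a + 1, b + 1))
    return pairs

initial_age = [17, 24, 44, 55, 71]
masked_age = [34, 24, 81, 55, 71]

pairs = inversions(initial_age, masked_age)
inversion_factor = len(pairs) / len(initial_age)
print(pairs, inversion_factor)
```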

26 Change Factor
Initial Microdata as on slide 25.
Masked Microdata: identical to the initial microdata except for Zip, which is 48235 for record 3, 89340 for record 4 and 48310 for record 5.
Strong change for Zip: record 4
Weak change for Zip: record 3
cfZip = (wcf(48201, 48235) + 1) / 5 = (0.4 + 1) / 5 = 0.28

27 General Disclosure Risk Measures
icfk – inversion-change factor for attribute k.
p – number of key attributes.
v – binary vector associated to key attribute.

28 General Disclosure Risk Measures
Lemma
For every disclosure risk weights matrix W the following relations are true:
- DRmin ≤ DRW ≤ DRmax
- 0 ≤ DRW ≤ 1

29 General Disclosure Risk Measures
Initial Microdata
RecID SSN Zip Code Age Gender
1 48202 20 M
2 25 F
3 30
4 48201 35
5
6 48310 40
7
8 67890 42
9 48319
10
Masked Microdata
RecID Zip Code Age Gender
1 482 20 M
2 25 F
3
4 35
5
6 483
7 40
8 678 30
9 42
10
The Age values of records 2, 3, 4, 6, 8, 9 and 10 are changed, followed by global recoding of the Zip Code attribute.

30 General Disclosure Risk Measures
K(1,1,1) = {Zip, Age, Sex}; K(1,0,1) = {Zip, Sex}; K(0,1,1) = {Age, Sex}; K(0,0,1) = {Sex}
Result V
DRmin 0.176 0.12 0.146 vmin=(1,1,1)
DRW 0.205 0.220 vW=(0,1,1)
DRmax 0.308 0.24 0.440 0.2 vmax=(0,1,1)

31 Experimental Data
Simulated medical record billing data.
Attributes: Age, Sex, Zip and Amount_Billed.
Three initial microdata sets: n = 1,000 (called IM1000), n = 5,000 (IM5000), n = 25,000 (IM25000).
Key attributes: KA = {Age, Sex, Zip}.

32 Results for Sampling
Sampling when KA is the set of key attributes.

33 Results for Sampling and Microaggregation
Sampling, followed by microaggregation for Age when IM5000 and KA are used.

34 Results for Sampling and Microaggregation
Sampling and microaggregation for Age when IM5000 and KA are used.

36 K-anonymization by Clustering
Let IM be the initial microdata set. The k-anonymization by clustering problem is to find a partition S = {cl1, cl2, …, clv} of IM, where clj ⊆ IM, j = 1..v, are called clusters, and:
- cl1 ∪ cl2 ∪ … ∪ clv = IM;
- cli ∩ clj = ∅, i, j = 1..v, i ≠ j;
- |clj| ≥ k, j = 1..v;
- the cost measure (information loss, IL) is minimized.

37 Generalization Information
Let cl = {r1, r2, …, rq} ∈ S be a cluster, KN = {N1, N2, ..., Ns} be the set of numerical quasi-identifier attributes and KC = {C1, C2, …, Ct} be the set of categorical quasi-identifier attributes. The generalization information of cl, w.r.t. the quasi-identifier attribute set K = KN ∪ KC, is the "tuple" gen(cl), having the scheme K, where:
- For each categorical attribute Cj ∈ KC, gen(cl)[Cj] = the lowest common ancestor in HCj of {r1[Cj], …, rq[Cj]};
- For each numerical attribute Nj ∈ KN, gen(cl)[Nj] = the interval [min{r1[Nj], …, rq[Nj]}, max{r1[Nj], …, rq[Nj]}].
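A minimal sketch of gen(cl) follows, assuming a value generalization hierarchy is stored as a child-to-parent dictionary. The `PARENT` map is an assumption: it is reconstructed from the Location example later in the talk, and the helper names are made up for illustration.

```python
# Hypothetical child -> parent edges of a Location hierarchy (root: "United States").
PARENT = {
    "San Diego": "California", "Los Angeles": "California",
    "California": "West Coast", "West Coast": "United States",
    "Wichita": "Kansas", "Kansas City": "Kansas", "Kansas": "Midwest",
    "Lincoln": "Nebraska", "Nebraska": "Midwest", "Midwest": "United States",
}

def ancestors(value, parent):
    """Path from a value up to the root, starting with the value itself."""
    path = [value]
    while value in parent:
        value = parent[value]
        path.append(value)
    return path

def lowest_common_ancestor(values, parent):
    values = list(values)
    common = ancestors(values[0], parent)
    for v in values[1:]:
        others = set(ancestors(v, parent))
        common = [node for node in common if node in others]
    return common[0]                     # lowest node shared by all paths

def gen(cluster, numeric_attrs, categorical_hierarchies):
    """Generalization information: intervals for numeric, LCA for categorical attributes."""
    info = {}
    for attr in numeric_attrs:
        vals = [r[attr] for r in cluster]
        info[attr] = (min(vals), max(vals))
    for attr, parent in categorical_hierarchies.items():
        info[attr] = lowest_common_ancestor((r[attr] for r in cluster), parent)
    return info

cluster = [{"Age": 32, "Location": "San Diego"}, {"Age": 30, "Location": "Los Angeles"}]
print(gen(cluster, ["Age"], {"Location": PARENT}))
```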

39 Two IL Measures
Discernability metric (DM) [Bayardo 2005]
- penalizes each tuple with the size of the group it belongs to.
- intuitively, the ideal grouping is the one in which all groups have size k.
- DM(S) = Σ |clj|², summed over j = 1..v
Normalized average cluster size metric (AVG) [LeFevre 2006]
- inversely proportional to the number of clusters (v).
- minimizing AVG is equivalent to maximizing the total number of clusters.
- AVG(S) = n / (v · k)
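Both metrics are easy to compute from the cluster sizes alone. The sketch below assumes their commonly cited forms, DM(S) = Σ |clj|² and AVG(S) = n / (v · k); the example partition is hypothetical.

```python
def discernibility_metric(clusters):
    """Each tuple is penalized with the size of the cluster it belongs to."""
    return sum(len(cl) ** 2 for cl in clusters)

def normalized_avg_cluster_size(clusters, k):
    """Average cluster size divided by k; 1.0 means every cluster has exactly k tuples."""
    n = sum(len(cl) for cl in clusters)
    return n / (len(clusters) * k)

# Hypothetical 2-anonymous partition of seven records.
S = [[1, 2], [3, 4], [5, 6, 7]]

print(discernibility_metric(S), normalized_avg_cluster_size(S, k=2))
```

Merging the first two clusters would raise DM from 17 to 25, which illustrates why both metrics favour many small clusters of size close to k.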

40 Information Loss for a Cluster

41 Total Information Loss
Total information loss for a solution S = {cl1, cl2, …, clv} of the k-anonymization by clustering problem is the sum of the information loss measures of all the clusters in S.

43 Algorithms for k-clustering problem
Greedy_k-member_Clustering [Byun 2006]
1. Create one cluster cl with one tuple randomly selected from IM.
2. Find the tuple r that is "closest" to cl with respect to IL. Add r to cl.
3. Repeat step 2 until cl has k tuples.
4. Save cl in the set of final clusters S.
5. IM = IM – cl.
6. Restart from step 1 with the new IM.
Note: The tuples left in the last IM (fewer than k) are added to the last computed cluster.
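The steps above can be sketched in Python as follows. This is an illustration of the procedure as summarized on the slide, not the published algorithm's exact details. The `information_loss` function is a hypothetical stand-in (sum of value ranges over numeric attributes), and records are assumed to be tuples of numbers.

```python
import random

def information_loss(cluster):
    """Hypothetical IL: sum of the value ranges of each numeric attribute."""
    return sum(max(col) - min(col) for col in zip(*cluster))

def greedy_k_member_clustering(records, k, seed=0):
    if len(records) < k:
        raise ValueError("need at least k records")
    rng = random.Random(seed)            # seeded so runs are repeatable
    remaining = list(records)
    clusters = []
    while len(remaining) >= k:
        # Step 1: start a cluster from a randomly selected tuple.
        cluster = [remaining.pop(rng.randrange(len(remaining)))]
        # Steps 2-3: add the tuple that keeps IL lowest until the cluster has k tuples.
        while len(cluster) < k:
            best = min(remaining, key=lambda r: information_loss(cluster + [r]))
            remaining.remove(best)
            cluster.append(best)
        clusters.append(cluster)         # Steps 4-5: save cl, IM = IM - cl.
    clusters[-1].extend(remaining)       # Note: leftovers join the last cluster.
    return clusters

records = [(25, 48202), (27, 48201), (31, 48310), (44, 48202),
           (45, 48235), (52, 48310), (60, 48201), (63, 48319)]
for cl in greedy_k_member_clustering(records, k=3):
    print(cl)
```

With eight records and k = 3, the loop builds two clusters of three tuples and appends the two leftover tuples to the second one.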

44 Other Algorithms for K-Anonymity
The general version of optimal k-anonymization for a microdata set is an NP-hard problem. [Aggarwal 2006, Meyerson 2004]
Curse of dimensionality – for many QI attributes. [Aggarwal 2005]
Binary Search [Samarati 2001]
Incognito [LeFevre 2005]
Mondrian [LeFevre 2006]
Genetic [Samarati 2001]
Clustering [Aggarwal 2006, Byun 2006]

45 Content of the Talk
- Global Disclosure Risk
  - Remove Identifiers
  - Sampling
  - Microaggregation
  - Any combination of masking techniques
- Anonymity models
  - Information Loss
  - Greedy algorithms
  - Constrained k-anonymity [Miller 2008]

46 Maximum Allowed Generalization Value
Let Q be a quasi-identifier attribute (categorical or numerical), and HQ its predefined value generalization hierarchy. For every leaf value v ∈ HQ, the maximum allowed generalization value of v, denoted by MAGVal(v), is the value (leaf or not-leaf) in HQ situated on the path from v to the root, such that:
- for any released microdata, the value v is permitted to be generalized only up to MAGVal(v), and
- when several MAGVals exist on the path between v and the hierarchy root, then MAGVal(v) is the first MAGVal that is reached when following the path from v to the root node.

47 Maximum Allowed Generalization Value
Value generalization hierarchy (MAGVals are marked with *):
United States
  West Coast
    *California*
      San Diego
      Los Angeles
  *Midwest*
    *Kansas*
      Wichita
      Kansas City
    Nebraska
      Lincoln
MAGVal("San Diego") = "California".
MAGVal("Wichita") = "Kansas" (Part 2 of the definition!).
MAGVal("Lincoln") = "Midwest".
MAGSet(Country) = {"California", "Kansas", "Midwest"}.
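The three MAGVal results can be reproduced by walking from a leaf towards the root and stopping at the first marked node. In this sketch the child-to-parent map is an assumption about the hierarchy's shape, and returning the root when no node is marked follows the convention used later in the talk for unconstrained attributes.

```python
# Hypothetical child -> parent edges; the root is "United States".
PARENT = {
    "San Diego": "California", "Los Angeles": "California",
    "California": "West Coast", "West Coast": "United States",
    "Wichita": "Kansas", "Kansas City": "Kansas", "Kansas": "Midwest",
    "Lincoln": "Nebraska", "Nebraska": "Midwest", "Midwest": "United States",
}
MAG_SET = {"California", "Kansas", "Midwest"}   # nodes marked as MAGVals

def mag_val(leaf, parent, mag_set):
    """First marked node on the path from the leaf to the root (root if none is marked)."""
    node = leaf
    while node not in mag_set and node in parent:
        node = parent[node]
    return node

for city in ("San Diego", "Wichita", "Lincoln"):
    print(city, "->", mag_val(city, PARENT, MAG_SET))
```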

48 Constrained K-Anonymity
Constraint Violation
We say that the masked microdata MM has a constraint violation if one quasi-identifier value, v, in IM, is generalized in one tuple in MM beyond its specific maximal generalization value, MAGVal(v).
Constrained K-Anonymity
The masked microdata MM satisfies the constrained k-anonymity property if it satisfies k-anonymity and it does not have any constraint violation.
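A constraint violation for a single value can be tested by comparing positions on the path from the leaf to the root. The sketch below is illustrative: the hierarchy edges are assumed, the MAGVal is passed in directly, and the released value is assumed to lie on the leaf's path (otherwise `list.index` raises `ValueError`).

```python
# Hypothetical child -> parent edges of a Location hierarchy.
PARENT = {
    "Wichita": "Kansas", "Kansas City": "Kansas", "Kansas": "Midwest",
    "Lincoln": "Nebraska", "Nebraska": "Midwest", "Midwest": "United States",
}

def path_to_root(value, parent):
    path = [value]
    while value in parent:
        value = parent[value]
        path.append(value)
    return path

def violates(leaf, released, mag, parent):
    """True if `released` lies strictly above `mag` on the path from `leaf` to the root."""
    path = path_to_root(leaf, parent)
    return path.index(released) > path.index(mag)

print(violates("Wichita", "Midwest", "Kansas", PARENT))   # generalized past Kansas
print(violates("Lincoln", "Midwest", "Midwest", PARENT))  # stops exactly at its MAGVal
```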

49 Constrained K-Anonymity Example
RecID  Name     Age  Location     Sex     Race   Diagnosis
1      Alice    32   San Diego    Male    White  AIDS
2      Bob      30   Los Angeles                 Asthma
3      Charley  42   Wichita
4      Dave          Kansas City
5      Eva      35   Lincoln      Female         Diabetes
6      John     20                        Black
7      Casey    25
We use the MAGVals defined for the Location attribute. For Age, Sex, and Race we assume the root of the hierarchy is the MAGVal of every leaf value (these QI attributes are unconstrained).

50 Constrained K-Anonymity Example
RecID Age Location Sex Race Diagnosis
1 30-32 California Male White AIDS
2 Asthma
3 30-42 Midwest *
4
5 Diabetes
6 20-25 Black
7
Satisfies 2-anonymity.
Does not satisfy constrained 2-anonymity (Kansas City and Wichita are generalized past their MAGVals).

51 Constrained K-Anonymity Example
RecID Age Location Sex Race Diagnosis
1 30-32 California Male White AIDS
2 Asthma
3 25-42 Kansas *
4
5 20-35 Midwest Diabetes
6
7
Satisfies 2-anonymity and constrained 2-anonymity.

52 A Few Definitions
Constrained K-Anonymization by Clustering
- Find a partition S = {cl1, cl2, …, clv, clv+1}.
- clv+1 – tuples that must be suppressed (minimum set).
- Cost measure is optimized (IL, see its definition in paper).
Generalization Information
- gen(cl) – the least generalized tuple that represents the entire cluster cl.
Maximum Allowed Microdata (MAM)
- Every QI value v is generalized to MAGVal(v).

53 A Few Properties
- For a given IM, if its maximum allowed microdata MAM is not k-anonymous, then any masked microdata obtained from IM by applying generalization only will not satisfy constrained k-anonymity.
- If MAM satisfies k-anonymity, then MAM satisfies the constrained k-anonymity property.
- An initial microdata, IM, can be masked to comply with constrained k-anonymity using only generalization if and only if its corresponding MAM satisfies k-anonymity.

54 Can constrained k-anonymity be achieved?
Can an initial microdata IM be masked to satisfy the constrained k-anonymity property using generalization only? We follow the next two steps:
1. Compute MAM for IM. This is done by replacing each quasi-identifier attribute value with its corresponding MAGVal.
2. If all QI-clusters from MAM have at least k entities, then IM can be masked to satisfy constrained k-anonymity.

55 More Properties
- OUT represents all entities from QI-clusters from MAM with size < k.
- IM \ OUT can be masked using generalization only to comply with constrained k-anonymity.
- Any subset of IM that contains one or more entities from OUT cannot be masked using generalization only to achieve constrained k-anonymity.
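Computing MAM and OUT amounts to grouping records by their MAGVal-generalized quasi-identifier tuple and collecting the groups smaller than k. In this sketch the `MAG` lookup table and the record data are hypothetical (only Location is modelled as a quasi-identifier), and `split_by_mam` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical MAGVal lookup for Location values.
MAG = {
    "San Diego": "California", "Los Angeles": "California",
    "Wichita": "Kansas", "Kansas City": "Kansas", "Lincoln": "Midwest",
}

def split_by_mam(records, mag_of, k):
    """records: {rec_id: tuple of QI values}. Returns (QI-clusters of size >= k, OUT)."""
    groups = defaultdict(list)
    for rec_id, qi_values in records.items():
        mam_tuple = tuple(mag_of(v) for v in qi_values)   # the record's row in MAM
        groups[mam_tuple].append(rec_id)
    keep = {key: ids for key, ids in groups.items() if len(ids) >= k}
    out = sorted(rid for ids in groups.values() if len(ids) < k for rid in ids)
    return keep, out

# Hypothetical records: record id -> (Location,)
records = {1: ("San Diego",), 2: ("Los Angeles",), 3: ("Wichita",), 4: ("Kansas City",),
           5: ("Lincoln",), 6: ("Lincoln",), 7: ("Lincoln",)}

print(split_by_mam(records, MAG.get, k=2))
print(split_by_mam(records, MAG.get, k=3))
```

An empty OUT for a given k means MAM is k-anonymous, so the data can be masked by generalization only; for k = 3 in this example, the four records outside the Midwest cluster end up in OUT.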

56 Algorithm GreedyCKA
Input: IM – initial microdata; k – as in k-anonymity;
Output: S = {cl1, cl2, …, clv, clv+1} – a solution for the constrained k-anonymization of IM;
Compute MAM and OUT;
S = ∅;
For each QI-cluster cl from MAM \ OUT {
  // By cl we refer to the entities from IM that are clustered together in MAM.
  S' = Greedy_k-member_Clustering(cl, k); // [Byun]
  S = S ∪ S';
}
v = |S|;
clv+1 = OUT;
End GreedyCKA;
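The control flow of GreedyCKA can be sketched with the clustering step passed in as a function. Everything here is illustrative: `chunk_clustering` is a trivial hypothetical stand-in for Greedy_k-member_Clustering, and the MAM clusters and OUT are assumed to have been computed already.

```python
def chunk_clustering(members, k):
    """Hypothetical stand-in for Greedy_k-member_Clustering: consecutive chunks of k."""
    members = list(members)
    clusters = [members[i:i + k] for i in range(0, len(members), k)]
    if len(clusters) > 1 and len(clusters[-1]) < k:
        leftover = clusters.pop()
        clusters[-1].extend(leftover)    # leftovers join the last full cluster
    return clusters

def greedy_cka(mam_clusters, out, k, cluster_fn=chunk_clustering):
    """mam_clusters: QI-clusters of MAM with at least k entities; out: suppressed tuples."""
    solution = []                        # S = empty set
    for members in mam_clusters:         # each cluster is anonymized on its own,
        solution.extend(cluster_fn(members, k))  # so no value crosses its MAGVal
    solution.append(list(out))           # cl_{v+1} = OUT
    return solution

# Hypothetical record ids grouped by their MAM tuple, plus one suppressed record.
mam_clusters = [[1, 2, 3, 4, 5], [6, 7]]
out = [8]

print(greedy_cka(mam_clusters, out, k=2))
```

Because records from different MAM clusters are never mixed, every final cluster can be generalized without exceeding the MAGVals of its members.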

57 GreedyCKA - Two-Stage Process
Figure labels: Initial microdata IM; QI-clusters in MAM; Suppressed tuples; Final QI-clusters in IM, k = 3.
Stage 1: forming MAM.
Stage 2: apply a k-anonymization algorithm on every MAM cluster with more than k elements.

58 Test Data
Adult dataset from the UC Irvine Machine Learning Repository. [Newman 1998]
QI = {Education-num, Work-class, Marital-status, Occupation, Race, Sex, Age, Native-country}.

59 MAGVals and Generalization Hierarchies
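Figure: value generalization hierarchies for the two constrained attributes (MAGVals are marked with *).
Country hierarchy nodes: Country; *America*, *Africa*, *Asia*, *Europe*; *North_A*, South_A, *Central_A*, *West_E*, East_E, West_A, *East_A*, *North_Af*, South_Af; leaves include USA, Canada, Greece, Italy, South Africa.
Age hierarchy nodes: 0-100; *0-19*, 20-59, 60-100; 0-9, 10-19, *20-29*, *30-39*, *40-49*, *50-59*, *60-69*, *70-100*; leaves 1 to 100.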

60 Information Loss Results

61 Running Time Results

62 Constraint violations in Greedy_k-member_Clustering
k    No of constraint violations for 1 constrained attribute (native_country)    No of constraint violations for 2 constrained attributes (native_country, age)
2    605     2209
3    991     3824
4    1377    5297
5    1657    6163
6    1906    6964
7    2198    7743
8    2354    8417
9    2550    8931
10   2728    9549

63 Number of tuples suppressed by GreedyCKA
k: 2 3 4 5 6 7 8 9 10
No of suppressed tuples for 1 constrained attribute – native_country
No of suppressed tuples for 2 constrained attributes – native_country, age
15 24 28 48 60 81 97 106

64 References
Aggarwal C. (2005), On k-Anonymity and the Curse of Dimensionality, Proceedings of the Very Large Databases (VLDB), 223 – 228.
Aggarwal G., Feder T., Kenthapadi K., Motwani R., Panigrahy R., Thomas D., and Zhu A. (2005), Anonymizing Tables, Proceedings of the 10th International Conference on Database Theory, 246 – 258.
Bayardo R.J., Agrawal R. (2005), Data Privacy through Optimal k-Anonymization, Proceedings of the IEEE ICDE, 217 – 228.
Byun J.W., Kamra A., Bertino E., Li N. (2006), Efficient k-Anonymity using Clustering Technique, CERIAS Tech Report.
LeFevre K., DeWitt D., Ramakrishnan R. (2005), Incognito: Efficient Full-Domain K-Anonymity, Proceedings of the ACM SIGMOD, Baltimore, Maryland, 49 – 60.
LeFevre K., DeWitt D., Ramakrishnan R. (2006), Mondrian Multidimensional K-Anonymity, Proceedings of the IEEE International Conference of Data Engineering, Atlanta, Georgia, 25.
Meyerson A., Williams R. (2004), On the Complexity of Optimal k-Anonymity, Proceedings of the ACM PODS Conference, 223 – 228.

65 References
Miller J., Campan A., Truta T.M. (2008), Constrained K-Anonymity: Privacy with Generalization Boundaries, Proceedings of the Practical Privacy-Preserving Data Mining Workshop (P3DM2008), In Conjunction with SIAM Conference on Data Mining (SDM), Atlanta.
Newman D.J., Hettich S., Blake C.L., Merz C.J. (1998), UCI Repository of Machine Learning Databases.
Samarati P. (2001), Protecting Respondents Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010 – 1027.
Truta T.M., Fotouhi F., Barth-Jones D. (2003), Privacy and Confidentiality Management for the Microaggregation Disclosure Control Method, Proceedings of the Workshop on Privacy and Electronic Society, In Conjunction with 10th ACM CCS, Washington DC, 21 – 30.
Truta T.M., Fotouhi F., Barth-Jones D. (2003), Disclosure Risk Measures for Microdata, Proceedings of the International Conference on Scientific and Statistical Database Management, Cambridge, MA, 15 – 22.
Truta T.M., Fotouhi F., Barth-Jones D. (2004), Disclosure Risk Measures for Sampling Disclosure Control Method, Proceedings of ACM Symposium on Applied Computing,
Truta T.M., Fotouhi F., Barth-Jones D. (2004), Assessing Global Disclosure Risk Measures in Masked Microdata, Proceedings of the Workshop on Privacy and Electronic Society, In Conjunction with 11th ACM CCS, Washington DC, 85 – 93.

66 Questions

