Privacy and k-Anonymity Guy Sagy November 2008 Seminar in Databases (236826)


1 Privacy and k-Anonymity Guy Sagy November 2008 Seminar in Databases (236826)

2 Outline
• Introduction
• k-Anonymity
• Generalization & Suppression
• MinGen – Theoretical Algorithm
• Mondrian – A greedy partition algorithm

3 What is Privacy?
• Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.
• Sharing this collected information is valuable for both research and business, but publishing the data may put personal privacy at risk.
• Objective: maximize data utility while limiting disclosure risk to an acceptable level.
• Note:
  - There is no clear definition of "disclosure" or of "acceptable level".
  - This is not the traditional security of data (e.g. access control, theft, hacking).

4 Example
• For medical research (e.g. genes, infectious diseases), a hospital has person-specific patient data that it wants to publish.
• It wants to publish the data such that:
  - the information remains practically useful;
  - the identity of an individual cannot be determined.
• An adversary might infer the secret/sensitive data from the published database.

5 Example – cont.
• The data contains:
  - Identifiers: {name, ssn}
  - Non-sensitive data: {zip-code, nationality, age}
  - Sensitive data: {medical condition, salary, location}

Column groups: Name is an identifier; Zip, Age and Nationality are non-sensitive; Condition is sensitive.

| # | Name  | Zip   | Age | Nationality | Condition       |
|---|-------|-------|-----|-------------|-----------------|
| 1 | Kumar | 13053 | 28  | Indian      | Heart Disease   |
| 2 | Bob   | 13067 | 29  | American    | Heart Disease   |
| 3 | Ivan  | 13053 | 35  | Canadian    | Viral Infection |
| 4 | Umeko | 13067 | 36  | Japanese    | Cancer          |

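6 Example – cont. [SW02-A]

Published data (non-sensitive: Zip, Age, Nationality; sensitive: Condition):

| # | Zip   | Age | Nationality | Condition       |
|---|-------|-----|-------------|-----------------|
| 1 | 13053 | 28  | Indian      | Heart Disease   |
| 2 | 13067 | 29  | American    | Heart Disease   |
| 3 | 13053 | 35  | Canadian    | Viral Infection |
| 4 | 13067 | 36  | Japanese    | Cancer          |

Voter list:

| # | Name  | Zip   | Age | Nationality |
|---|-------|-------|-----|-------------|
| 1 | John  | 13053 | 28  | American    |
| 2 | Bob   | 13067 | 29  | American    |
| 3 | Chris | 13053 | 23  | American    |

Data leak! Bob's (Zip, Age, Nationality) = (13067, 29, American) matches exactly one published row (row 2). Do we have a privacy violation?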
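The linkage on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionary keys and the `link` helper are made-up names, and the condition of published row 2 is taken from the table on slide 5.

```python
# Published (de-identified) rows and the public voter list from the slide.
published = [
    {"zip": "13053", "age": 28, "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13053", "age": 35, "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]
voters = [
    {"name": "John", "zip": "13053", "age": 28, "nationality": "American"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "American"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "American"},
]
QI = ("zip", "age", "nationality")  # attributes shared by both tables


def link(voters, published, qi=QI):
    """Return (name, condition) for every voter whose QI values match
    exactly one published row (hypothetical helper, not from the slides)."""
    leaks = []
    for v in voters:
        matches = [p for p in published if all(p[a] == v[a] for a in qi)]
        if len(matches) == 1:
            leaks.append((v["name"], matches[0]["condition"]))
    return leaks


print(link(voters, published))
```

Only Bob is re-identified here; John and Chris share no full (Zip, Age, Nationality) combination with any published row.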

7 Example – cont. [SW02-A]
• The Group Insurance Commission (GIC) in Massachusetts sold data on state employees' health that was believed to be anonymous.
• The voter registration list for Cambridge, Massachusetts was sold for $20.
• William Weld was governor of Massachusetts and lived in Cambridge, Massachusetts:
  - Six people had his particular birth date.
  - Three of them were men.
  - He was the only one in his 5-digit ZIP code.

Medical data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter list only: Name, Address, Date registered, Party affiliation, Date last voted
Quasi-identifier (QI), present in both: Zip, Birthdate, Gender

8 Example-2 – AOL (2006)

| Anon ID | Query                                    | Query Time       | Item Rank | Click URL                     |
|---------|------------------------------------------|------------------|-----------|-------------------------------|
| 1326    | konig wheels                             | 18/04/2006 13:29 | 1         | http://www.konigwheels.com    |
| 1326    | jet blue airlines                        | 27/04/2006 15:29 |           |                               |
| 1326    | coats tire equipment                     | 28/04/2006 15:53 |           |                               |
| 1326    | coats tire equipment                     | 03/05/2006 19:15 |           |                               |
| 1326    | verizon wireless                         | 09/05/2006 00:09 |           |                               |
| 1326    | www.crazyradiodeals.com                  | 23/05/2006 18:00 |           |                               |
| 1337    | uslandrecords.com                        | 01/03/2006 11:50 | 1         | http://www.seda-cog.org       |
| 1337    | titlesourcein.com                        | 14/03/2006 15:45 |           |                               |
| 1337    | titlesourceinc                           | 14/03/2006 15:45 | 1         | http://www.titlesourceinc.com |
| 1337    | select business services                 | 14/03/2006 15:51 |           |                               |
| 1337    | select business services title           | 14/03/2006 15:52 |           |                               |
| 1337    | cbc companies                            | 14/03/2006 15:52 | 2         | http://www.cbc-companies.com  |
| 1337    | cbc companies                            | 14/03/2006 15:52 | 3         | http://www.cbc-companies.com  |
| 1337    | national real estate settlement services | 14/03/2006 15:59 | 1         | http://www.realtms.com        |

9 Example2 – cont.

10 Example-3

11 k-Anonymity [SW02-A]
• Change the data in such a way that for each tuple in the resulting table there are at least (k-1) other tuples with the same value for the quasi-identifier. The result is a k-anonymized table.

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 130** | < 40 | *           | Heart Disease   |
| 2 | 130** | < 40 | *           | Heart Disease   |
| 3 | 130** | < 40 | *           | Viral Infection |
| 4 | 130** | < 40 | *           | Cancer          |

This is a 4-anonymized table. Why?
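A minimal sketch of the check behind "Why?": group the rows by their quasi-identifier values and verify that every group has at least k members. The function name and dictionary keys are illustrative choices, not from the slides.

```python
from collections import Counter

QI = ("zip", "age", "nationality")

table = [
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Viral Infection"},
    {"zip": "130**", "age": "< 40", "nationality": "*", "condition": "Cancer"},
]


def is_k_anonymous(rows, qi, k):
    """True if every combination of QI values occurs in at least k rows."""
    counts = Counter(tuple(r[a] for a in qi) for r in rows)
    return all(c >= k for c in counts.values())


print(is_k_anonymous(table, QI, 4))  # all four rows share one QI combination
```

All four rows fall into a single group, so the table is 4-anonymous (and therefore also 2- and 3-anonymous).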

12 K-Anonymity – Formal Definition
• RT: released table
• (A_1, A_2, ..., A_n): attributes
• QI_RT: quasi-identifier
• RT[QI_RT]: projection of RT on QI_RT
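With this notation, the k-anonymity requirement is usually stated as follows (a restatement in the spirit of [SW02-A]; the set-builder form is our own phrasing):

```latex
\[
RT \text{ satisfies } k\text{-anonymity w.r.t. } QI_{RT}
\iff
\forall\, t \in RT:\;
\bigl|\{\, t' \in RT : t'[QI_{RT}] = t[QI_{RT}] \,\}\bigr| \ge k
\]
```

In words: every sequence of values that appears in RT[QI_RT] appears there at least k times.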

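13 K-Anonymity Example [SW02-B]

|    | Country | Birth | Gender | ZIP   | Problem      |
|----|---------|-------|--------|-------|--------------|
| t1 | USA     | 1965  | m      | 02141 | short breath |
| t2 | USA     | 1965  | m      | 02141 | chest pain   |
| t3 | USA     | 1964  | f      | 02138 | obesity      |
| t4 | USA     | 1964  | f      | 02138 | chest pain   |
| t5 | Non-USA | 1964  | m      | 02138 | chest pain   |
| t6 | Non-USA | 1964  | m      | 02138 | obesity      |
| t7 | Non-USA | 1964  | m      | 02138 | short breath |

Example of k-anonymity, where k=2 and QI={Country, Birth, Gender, ZIP}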

14 K-Anonymity – The challenge
• Theorem 1 in [SW02-B] claims: Let RT(A_1,...,A_n) be a table, QI_RT = (A_i,...,A_j) be the quasi-identifier associated with RT, A_i,...,A_j ⊆ A_1,...,A_n, and let RT satisfy k-anonymity. Then each sequence of values in RT[A_x] appears with at least k occurrences in RT[QI_RT], for x = i,...,j.
• Can we use this property to easily build a k-anonymous table? That is, does the converse hold: if each sequence of values in RT[A_x] appears with at least k occurrences, is the table k-anonymous?

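15 K-Anonymity – The challenge – cont.

| # | Zip | Age | Nationality | Condition       |
|---|-----|-----|-------------|-----------------|
| 1 | 1   | 20  | *           | Heart Disease   |
| 2 | 1   | 30  | *           | Heart Disease   |
| 3 | 2   | 20  | *           | Viral Infection |
| 4 | 2   | 30  | *           | Cancer          |

No!!! Every Zip value and every Age value appears twice, yet each (Zip, Age, Nationality) combination appears only once, so the table is not 2-anonymous.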
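The counterexample can be checked mechanically. The sketch below reads the slide's table as Zip in {1, 2} and Age in {20, 30} (an interpretation of the flattened slide) and compares the smallest per-attribute count with the smallest count of a full QI combination; both helper names are illustrative.

```python
from collections import Counter

QI = ("zip", "age", "nationality")

rows = [
    {"zip": "1", "age": 20, "nationality": "*"},
    {"zip": "1", "age": 30, "nationality": "*"},
    {"zip": "2", "age": 20, "nationality": "*"},
    {"zip": "2", "age": 30, "nationality": "*"},
]


def min_count_per_attribute(rows, qi):
    """Smallest number of occurrences of any single-attribute value."""
    return min(min(Counter(r[a] for r in rows).values()) for a in qi)


def min_count_joint(rows, qi):
    """Smallest number of occurrences of any full QI combination."""
    return min(Counter(tuple(r[a] for a in qi) for r in rows).values())


print(min_count_per_attribute(rows, QI), min_count_joint(rows, QI))
```

Each attribute on its own looks "2-anonymous", but the joint count is 1, which is what k-anonymity actually constrains.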

16 How to create k-Anonymity?
• Generalization
  - Replace the original value by a semantically consistent but less specific value.
• Suppression
  - The data is not released at all.
  - Can be viewed as the top level of a generalization hierarchy (the value becomes *).

| # | Zip   | Age  | Nationality | Condition     |
|---|-------|------|-------------|---------------|
| 1 | 130** | < 40 | *           | Heart Disease |
| 2 | 130** | < 40 | *           | Heart Disease |

In this table, 130** and < 40 are generalizations; * is a suppression.

17 Generalization & Hierarchies

ZIP (levels Z_0 to Z_3):
  Z_0 = {13053, 13058, 13063, 13067}
  Z_1 = {1305*, 1306*}
  Z_2 = {130**}
  Z_3 = {*****}
  13053, 13058 → 1305* → 130** → *****
  13063, 13067 → 1306* → 130** → *****

Age:
  28, 29 → < 30 → < 40 → *
  35, 36 → 3* → < 40 → *

Nationality:
  US, Canadian → American → *
  Japanese, Indian → Asian → *
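One simple way to encode such hierarchies is a child-to-parent map per attribute; generalizing a value by some number of levels then means following parent links. This is a sketch with assumed names (`PARENT`, `generalize`), using the values shown on the slide.

```python
# Child -> parent links for each domain generalization hierarchy.
PARENT = {
    "zip": {
        "13053": "1305*", "13058": "1305*", "13063": "1306*", "13067": "1306*",
        "1305*": "130**", "1306*": "130**", "130**": "*****",
    },
    "age": {
        "28": "< 30", "29": "< 30", "35": "3*", "36": "3*",
        "< 30": "< 40", "3*": "< 40", "< 40": "*",
    },
    "nationality": {
        "US": "American", "Canadian": "American",
        "Japanese": "Asian", "Indian": "Asian",
        "American": "*", "Asian": "*",
    },
}


def generalize(attr, value, levels):
    """Climb `levels` steps up the hierarchy of `attr`, stopping at the root."""
    for _ in range(levels):
        value = PARENT[attr].get(value, value)  # the root has no parent
    return value


print(generalize("zip", "13053", 1), generalize("age", "35", 2))
```

Asking for more levels than the hierarchy has simply returns the root, so suppression is the limiting case of generalization.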

18 Generalization & Hierarchies
• The number of generalized tables is: (DGH_i = maximum generalization level of A_i)
• Note: not every generalization yields a k-anonymous table.
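As a worked count, assume every value of an attribute is generalized to the same level (attribute-level generalization; this is an assumption, and cell-level generalization gives a far larger number). Each attribute A_i then has DGH_i + 1 possible levels, and the levels are chosen independently:

```latex
\[
\#\text{tables} \;=\; \prod_{i=1}^{n} \left( DGH_i + 1 \right)
\]
\[
\text{ZIP: } DGH = 3,\quad \text{Age: } DGH = 3,\quad \text{Nationality: } DGH = 2
\;\;\Longrightarrow\;\;
(3+1)(3+1)(2+1) = 48
\]
```

The second line uses the hierarchies of the previous slide (ZIP has levels Z_0 to Z_3, Age has four levels, Nationality has three).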

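19

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 13053 | < 40 | *           | Heart Disease   |
| 2 | 13053 | < 40 | *           | Viral Infection |
| 3 | 13067 | < 40 | *           | Heart Disease   |
| 4 | 13067 | < 40 | *           | Cancer          |

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 130** | < 30 | American    | Heart Disease   |
| 2 | 130** | < 30 | American    | Viral Infection |
| 3 | 130** | 3*   | Asian       | Heart Disease   |
| 4 | 130** | 3*   | Asian       | Cancer          |

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 130** | < 40 | *           | Heart Disease   |
| 2 | 130** | < 40 | *           | Viral Infection |
| 3 | 130** | < 40 | *           | Heart Disease   |
| 4 | 130** | < 40 | *           | Cancer          |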

20 K-minimal Generalizations
• Intuition: the one that does not generalize the data more than needed (generalizing decreases the utility of the published dataset!).
• K-minimal generalization: T_m is said to be a minimal generalization of RT if:
  - T_m satisfies the k-anonymity requirement with respect to QI_RT;
  - ∀ T_z : RT ≤ T_z, T_z ≤ T_m, T_z satisfies the k-anonymity requirement with respect to QI_RT ⇒ T_z = T_m.
  Here T ≤ T' means that T' is a generalization of T.

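21

2-minimal generalizations:

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 13053 | < 40 | *           | Heart Disease   |
| 2 | 13053 | < 40 | *           | Viral Infection |
| 3 | 13067 | < 40 | *           | Heart Disease   |
| 4 | 13067 | < 40 | *           | Cancer          |

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 130** | < 30 | American    | Heart Disease   |
| 2 | 130** | < 30 | American    | Viral Infection |
| 3 | 130** | 3*   | Asian       | Heart Disease   |
| 4 | 130** | 3*   | Asian       | Cancer          |

NOT a 2-minimal generalization:

| # | Zip   | Age  | Nationality | Condition       |
|---|-------|------|-------------|-----------------|
| 1 | 130** | < 40 | *           | Heart Disease   |
| 2 | 130** | < 40 | *           | Viral Infection |
| 3 | 130** | < 40 | *           | Heart Disease   |
| 4 | 130** | < 40 | *           | Cancer          |

There are many k-minimal anonymized tables – which one to pick?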

22 K-minimal Generalizations
• There are many k-minimal generalizations. Which one is preferred?
• There is no clear and "correct" answer:
  - The one that creates the minimum distortion of the data (under a chosen distortion measure)
  - The one that is best under the normalized average equivalence class size metric
  - The one with the minimum suppression
  - The one that best supports the research (least damage to the "interesting" attributes)
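As an illustration of one such measure, the normalized average equivalence class size is commonly defined (following [LDR06]) as C_AVG = (total records / number of equivalence classes) / k. The sketch below applies it to two of the 2-anonymous tables from the earlier slides; the function name and keys are our own.

```python
from collections import Counter

QI = ("zip", "age", "nationality")


def c_avg(rows, qi, k):
    """Normalized average equivalence class size: (records / classes) / k."""
    classes = Counter(tuple(r[a] for a in qi) for r in rows)
    return (len(rows) / len(classes)) / k


# Zip kept, Age and Nationality generalized: two equivalence classes.
zip_kept = [
    {"zip": "13053", "age": "< 40", "nationality": "*"},
    {"zip": "13053", "age": "< 40", "nationality": "*"},
    {"zip": "13067", "age": "< 40", "nationality": "*"},
    {"zip": "13067", "age": "< 40", "nationality": "*"},
]
# Everything generalized: a single equivalence class.
all_generalized = [{"zip": "130**", "age": "< 40", "nationality": "*"}] * 4

print(c_avg(zip_kept, QI, 2), c_avg(all_generalized, QI, 2))
```

A value of 1 is the best possible (every class has exactly k records); the over-generalized table scores 2, matching the intuition that it distorts the data more than needed.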

23 Algorithm for finding minimal generalization [SW02-B]
• Theoretical model (MinGen):
  - Store the set of all possible generalizations of RT over QI in allgens.
  - Store in protected all the tables from allgens that satisfy k-anonymity.
  - Define a comparison measure, score.
  - From protected, choose the table with the best score.
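The four steps can be sketched as a brute-force search. This is a simplified illustration, not the algorithm of [SW02-B]: it generalizes whole attributes to a common level (rather than individual cells), and it uses the sum of generalization levels as a stand-in `score`. The hierarchies and all names are assumptions.

```python
from collections import Counter
from itertools import product

# Child -> parent links (assumed hierarchies, values from the running example).
PARENT = {
    "zip": {"13053": "1305*", "13067": "1306*",
            "1305*": "130**", "1306*": "130**", "130**": "*****"},
    "age": {"28": "< 30", "29": "< 30", "35": "3*", "36": "3*",
            "< 30": "< 40", "3*": "< 40", "< 40": "*"},
}
HEIGHT = {"zip": 3, "age": 3}  # maximum generalization level per attribute
QI = ("zip", "age")


def generalize(attr, value, levels):
    for _ in range(levels):
        value = PARENT[attr].get(value, value)
    return value


def is_k_anonymous(rows, qi, k):
    counts = Counter(tuple(r[a] for a in qi) for r in rows)
    return all(c >= k for c in counts.values())


def mingen(rows, qi, k):
    """Return (levels, table) of the k-anonymous generalization with the lowest score."""
    protected = []
    # allgens: every combination of per-attribute generalization levels.
    for levels in product(*(range(HEIGHT[a] + 1) for a in qi)):
        table = [{a: generalize(a, r[a], l) for a, l in zip(qi, levels)} for r in rows]
        if is_k_anonymous(table, qi, k):
            protected.append((sum(levels), levels, table))  # score = total levels
    if not protected:
        return None
    _, levels, table = min(protected, key=lambda p: (p[0], p[1]))
    return levels, table


rows = [{"zip": "13053", "age": "28"}, {"zip": "13067", "age": "29"},
        {"zip": "13053", "age": "35"}, {"zip": "13067", "age": "36"}]
print(mingen(rows, QI, 2))
```

Even in this reduced form the search enumerates every level combination, which is why the approach is only a theoretical model.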

24 Algorithm for finding minimal generalization
• The search space is exponential.
• The problem is NP-hard!
• We present one proposed algorithm, from [LDR06] (K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006): a multidimensional algorithm (Mondrian).

25 Single Dimensional Partitioning
• A single-dimensional partitioning defines, for each attribute A_i, a set of non-overlapping single-dimensional intervals that cover D_Xi.

Data: Age = 20, 22, 24, 26, 30, 31, 38, 40, 42, 44
Partitioning: Age = 20-24, 26-31, 38-44
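Recoding a value under a single-dimensional partitioning is an interval lookup. A small sketch using the Age intervals from this slide (the helper `to_interval` is an illustrative name):

```python
from bisect import bisect_right

ages = [20, 22, 24, 26, 30, 31, 38, 40, 42, 44]
intervals = [(20, 24), (26, 31), (38, 44)]  # non-overlapping, sorted by lower bound


def to_interval(value, intervals):
    """Return the (lo, hi) interval containing value, or None if none does."""
    starts = [lo for lo, _ in intervals]
    i = bisect_right(starts, value) - 1  # last interval starting at or before value
    if i >= 0 and value <= intervals[i][1]:
        return intervals[i]
    return None


recoded = [to_interval(a, intervals) for a in ages]
print(recoded)
```

The same interval set is applied to every record regardless of its other attributes, which is exactly what distinguishes the single-dimensional model from the multidimensional one.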

26 Single Dimensional Partitioning
[Figure: grid over Age (ticks 20, 24, 26, 31, 38, 44) and Zip Code (ticks 2120, 2129, 2130, 2139, 2140, 2149). Caption: 12 Areas of Partitioning]

27 Multidimensional Partitioning
• Assume all attributes are from a discrete numeric domain (any set of values can be mapped to one).
• The domain of A_i is denoted by D_Xi.
• Each tuple can be represented as (v_1, v_2, ..., v_d) ∈ D_X1 × D_X2 × ... × D_Xd.
• A multidimensional partitioning defines a set of multidimensional regions.

28 Multidimensional Partitioning – cont.
Attributes = {ZipCode, Age}

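29 Multidimensional Partitioning – Why is it good?

Voter Registration Data:

| Name   | Age | Sex    | Zipcode |
|--------|-----|--------|---------|
| Ahmed  | 25  | Male   | 53710   |
| Bob    | 28  | Male   | 53711   |
| Claire | 31  | Female | 90210   |
| Dave   | 19  | Male   | 2174    |
| Evelyn | 40  | Female | 2237    |

Patient Data:

| Age | Sex    | Zipcode | Disease    |
|-----|--------|---------|------------|
| 25  | Male   | 53710   | Flu        |
| 25  | Female | 53712   | Hepatitis  |
| 26  | Male   | 53711   | Bronchitis |
| 27  | Male   | 53710   | Broken Arm |
| 27  | Female | 53712   | AIDS       |
| 28  | Male   | 53711   | Bronchitis |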

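30 Multidimensional Partitioning – cont.

Original patient data:

| Age | Sex    | Zipcode | Disease    |
|-----|--------|---------|------------|
| 25  | Male   | 53710   | Flu        |
| 26  | Male   | 53711   | Bronchitis |
| 27  | Male   | 53710   | Broken Arm |
| 28  | Male   | 53711   | Bronchitis |
| 25  | Female | 53712   | Hepatitis  |
| 27  | Female | 53712   | AIDS       |

Single Dimensional:

| Age   | Sex    | Zipcode  | Disease    |
|-------|--------|----------|------------|
| 25-28 | Male   | 53710-11 | Flu        |
| 25-28 | Male   | 53710-11 | Bronchitis |
| 25-28 | Male   | 53710-11 | Broken Arm |
| 25-28 | Male   | 53710-11 | Bronchitis |
| 25-28 | Female | 53712    | Hepatitis  |
| 25-28 | Female | 53712    | AIDS       |

Multi Dimensional:

| Age   | Sex    | Zipcode  | Disease    |
|-------|--------|----------|------------|
| 25-26 | Male   | 53710-11 | Flu        |
| 25-26 | Male   | 53710-11 | Bronchitis |
| 27-28 | Male   | 53710-11 | Broken Arm |
| 27-28 | Male   | 53710-11 | Bronchitis |
| 25-27 | Female | 53712    | Hepatitis  |
| 25-27 | Female | 53712    | AIDS       |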

31 Finding k-Anonymous Multidimensional Partitioning
• Given a set P of unique (point, count) pairs, with points in d-dimensional space, is there a multidimensional partitioning for P such that:
  - for every region R_i, Σ_{p ∈ R_i} count(p) ≥ k or Σ_{p ∈ R_i} count(p) = 0 (k-anonymity);
  - C_AVG ≤ c for a positive constant c (C_AVG: average number of records in each partition)?
• This problem is NP-complete.
• Proof: reduction from Partition.

32 Mondrian - A Greedy Partitioning Algorithm [LDR06]
[Figure: records plotted by Age and Weight, partitioned for k-anonymity with k = 3]

Mondrian(partition)
  if (no allowable multidimensional cut for partition)
    return φ : partition → summary
  else
    dim ← choose_dimension()
    fs ← frequency_set(partition, dim)
    splitVal ← find_median(fs)
    lhs ← {t ∈ partition : t.dim ≤ splitVal}
    rhs ← {t ∈ partition : t.dim > splitVal}
    return Mondrian(rhs) ∪ Mondrian(lhs)
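The pseudocode can be turned into a short runnable sketch. Several details here are assumptions rather than part of the slide: a cut is "allowable" when both sides keep at least k records, dimensions are tried in order of decreasing value range, and the summary of a region is its (min, max) box. The data are the (Age, Zipcode) pairs of the patient table from the earlier slides.

```python
def mondrian(partition, dims, k):
    """Greedy median partitioning; returns a list of regions (lists of records)."""
    def value_range(d):
        return max(t[d] for t in partition) - min(t[d] for t in partition)

    # Try the widest dimension first (assumed heuristic for choose_dimension).
    for dim in sorted(dims, key=lambda d: -value_range(d)):
        values = sorted(t[dim] for t in partition)
        split_val = values[(len(values) - 1) // 2]  # lower median
        lhs = [t for t in partition if t[dim] <= split_val]
        rhs = [t for t in partition if t[dim] > split_val]
        if len(lhs) >= k and len(rhs) >= k:  # allowable cut
            return mondrian(lhs, dims, k) + mondrian(rhs, dims, k)
    return [partition]  # no allowable cut: this partition becomes a region


def summarize(region, dims):
    """Generalize a region to its bounding box: dim -> (min, max)."""
    return {d: (min(t[d] for t in region), max(t[d] for t in region)) for d in dims}


DIMS = ("age", "zip")
data = [
    {"age": 25, "zip": 53710}, {"age": 25, "zip": 53712},
    {"age": 26, "zip": 53711}, {"age": 27, "zip": 53710},
    {"age": 27, "zip": 53712}, {"age": 28, "zip": 53711},
]

regions = mondrian(data, DIMS, k=2)
for region in regions:
    print(len(region), summarize(region, DIMS))
```

Every recursive call splits only when both halves stay at or above k records, so each returned region is k-anonymous by construction.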

33 Mondrian – Example [LDR06]
Anonymizations for two attributes with a discrete normal distribution (μ = 25, σ = 2)

34 Mondrian Quality
• By definition of k-anonymity, every non-empty region holds at least k records.
• From Theorem 2 in [LDR06]: the maximum number of points in any region R_i is 2d(k-1)+m, where m is the maximum number of copies of any distinct point in P.
• For constant d, m, k: C_AVG ≤ 2 * C_AVG*

35 The maximum number of points in any region (Ri) is 2d*(k-1)+m

36 Piet Mondrian (1872-1944) (*) wikipedia

37 Privacy – Last Example


39 Bibliography
• [SW02-A] "k-Anonymity: A Model for Protecting Privacy", L. Sweeney, 2002
• [SW02-B] "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression", L. Sweeney, 2002
• [LDR06] "Mondrian Multidimensional k-Anonymity", K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006
• http://en.wikipedia.org/wiki/Piet_Mondrian
• Presentations:
  - "Privacy In Databases", B. Aditya Prakash
  - "K-Anonymity and Other Cluster-Based Methods", Ge. Ruan

