Data Mining And Privacy Protection Prepared by: Eng. Hiba Ramadan Supervised by: Dr. Rakan Razouk.


1 Data Mining And Privacy Protection Prepared by: Eng. Hiba Ramadan Supervised by: Dr. Rakan Razouk

2 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

3 Privacy Preserving Data Mining
What is data mining? The non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases [W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, 1992].
What is privacy preserving data mining? The study of achieving data mining goals without sacrificing the privacy of the individuals.

4 Scenario (Information Sharing)
A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification analysis, without sacrificing the privacy of the individuals in the released data.
(figure: data owner releases person-specific data to data recipients)

5 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

6 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing: these techniques study different data transformation methods for preserving privacy.
 Changing the results of Data Mining Applications to preserve privacy: in many cases, the results of data mining applications, such as association rules or classification rules, can themselves compromise the privacy of the data.
 Privacy-Preserving Distributed Data Mining: in many cases, the data may be distributed across multiple sites, and the owners of the data at these different sites may wish to compute a common function without revealing their data to each other.

7 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

8 Randomization Approach Overview
(figure: original records, e.g. "30 | 25K | ...", pass through a Randomizer to produce distorted records, e.g. "65 | 20K | ..."; the distributions of Age and Salary are then reconstructed and fed to data mining algorithms to build a model)

9 Randomization
The method of randomization can be described as follows:
 Let x = {x1, ..., xN} be the original records, each xi drawn from a distribution X.
 We add noise components y1, ..., yN, drawn independently from a known probability distribution fY(y).
 The released, distorted records are x1 + y1, ..., xN + yN.
 In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, the original records cannot be recovered, but the distribution of the original records can be.

10 Randomization (figure only)

11 Reconstruction Problem
 Original values x1, x2, ..., xn are drawn from a probability distribution X (unknown).
 To hide these values, we add y1, y2, ..., yn drawn from a probability distribution Y (known).
 Given:
    the distorted values x1 + y1, x2 + y2, ..., xn + yn
    the probability distribution of Y
 estimate the probability distribution of X.

12 Intuition (Reconstruct single point) (figure only)

13 Intuition (Reconstruct single point) (figure only)

14 Reconstructing the Distribution
 Combine the estimates of where each point came from, over all the points.
 This gives an estimate of the original distribution.

15 Reconstruction
fX^0 := uniform distribution
j := 0   // iteration number
repeat
  fX^{j+1}(a) := (1/n) * sum_{i=1..n} [ fY(wi - a) * fX^j(a) ] / [ ∫ fY(wi - z) * fX^j(z) dz ]   (Bayes' rule, with wi = xi + yi)
  j := j + 1
until (stopping criterion met)
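The iteration above can be sketched in discretized form. This is a minimal sketch, not the original implementation: the domain is cut into bins, the integral becomes a sum over bin centers, and all names (`gaussian_pdf`, `reconstruct`) are chosen here for illustration.

```python
import numpy as np

def gaussian_pdf(t, sigma):
    """Density of the zero-mean Gaussian noise Y."""
    return np.exp(-t**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def reconstruct(w, noise_pdf, bins, iters=100):
    """Estimate f_X from distorted values w_i = x_i + y_i (discretized Bayes iteration)."""
    centers = (bins[:-1] + bins[1:]) / 2.0
    width = bins[1] - bins[0]
    f = np.full(len(centers), 1.0 / (len(centers) * width))  # uniform start
    L = noise_pdf(w[:, None] - centers[None, :])             # L[i, j] = f_Y(w_i - a_j)
    for _ in range(iters):
        denom = (L * f).sum(axis=1, keepdims=True) * width   # ∫ f_Y(w_i - z) f_X(z) dz
        f = (L * f / denom).mean(axis=0)                     # average per-record posteriors
    return centers, f

# Original values drawn around two clusters; only w = x + y is released.
rng = np.random.default_rng(0)
x = rng.choice([0.2, 0.8], size=2000) + rng.normal(0.0, 0.05, 2000)
w = x + rng.normal(0.0, 0.3, size=2000)
centers, f = reconstruct(w, lambda t: gaussian_pdf(t, 0.3), np.linspace(-1.0, 2.0, 61))
```

With enough records the estimate integrates to 1 and recovers aggregate properties of the hidden originals (e.g., their mean), even though each released value is heavily distorted.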

16 Seems to work well! (figure only)

17 Pros & Cons
 One key advantage of the randomization method is that it is relatively simple and does not require knowledge of the distribution of other records in the data.

18 Pros & Cons
 We only have a distribution describing the behavior of X; individual records are not available.
 The distributions are available only along individual dimensions.
 While the approach can certainly be extended to multi-dimensional distributions, density estimation becomes inherently more challenging with increasing dimensionality. Even for modest dimensionalities such as 7 to 10, density estimation becomes increasingly inaccurate and falls prey to the curse of dimensionality.

19 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

20 k-anonymity
 The role of attributes in data:
    explicit identifiers are removed
    quasi-identifiers can be used to re-identify individuals
    sensitive attributes (may not exist!) carry sensitive information

  Name  | Birthdate | Sex    | Zipcode | Disease
  Andre | 21/1/79   | male   | 53715   | Flu
  Beth  | 10/1/81   | female | 55410   | Hepatitis
  Carol | 1/10/44   | female | 90210   | Bronchitis
  Dan   | 21/2/84   | male   | 02174   | Sprained Ankle
  Ellen | 19/4/72   | female | 02237   | AIDS

  (Name is an identifier; Birthdate, Sex, Zipcode are quasi-identifiers; Disease is sensitive)

21 k-anonymity
 Preserve privacy via k-anonymity, proposed by Sweeney and Samarati.
 k-anonymity: intuitively, hide each individual among k-1 others:
    each QI combination of values should appear at least k times in the released data
    sensitive attributes are not considered (going to revisit this...)
 How to achieve this?
    generalization and suppression
    value perturbation is not considered (we should remain truthful to the original values)
 Privacy vs. utility trade-off:
    do not anonymize more than necessary

22 k-anonymity
 Transform each QI value into a less specific form.

Bob's record:
  Name | Age | Sex | Zipcode
  Bob  | 23  | M   | 11000

A generalized table:
  Age | Sex | Zipcode | Disease
  >21 | M   | 1100*   | pneumonia
  >21 | M   | 1100*   | dyspepsia
  >21 | M   | 1100*   | dyspepsia
  >21 | M   | 1100*   | pneumonia
  >61 | F   | 1100*   | flu
  >61 | F   | 1100*   | gastritis
  >61 | F   | 1100*   | flu
  >61 | F   | 1100*   | bronchitis

23 k-anonymity example
Tools for anonymization:
 generalization: publish more general values, i.e., given a domain hierarchy, roll up
 suppression: remove tuples, i.e., do not publish outliers; often the number of suppressed tuples is bounded

Original data:
  Birthdate | Sex    | Zipcode
  21/1/79   | male   | 53715
  10/1/79   | female | 55410
  1/10/44   | female | 90210
  21/2/83   | male   | 02274
  19/4/82   | male   | 02237

2-anonymous data:
  group 1:    */1/79 | person | 5****   (two tuples)
  suppressed: 1/10/44 | female | 90210
  group 2:    */*/8* | male   | 022**   (two tuples)
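The generalization-and-suppression recipe above can be checked mechanically. A minimal sketch, using only the (sex, zipcode) part of the QI for brevity; the helper names and the suppression rule (drop any tuple whose generalized QI combination is still unique) are illustrative, not from the slides.

```python
from collections import Counter

def mask(value, keep):
    """Roll a string value up its hierarchy: keep the first `keep` characters."""
    return value[:keep] + "*" * (len(value) - keep)

def generalize(rows, zip_keep, keep_sex):
    """One full-domain generalization of (sex, zipcode) QI pairs."""
    return [(sex if keep_sex else "person", mask(z, zip_keep)) for sex, z in rows]

def is_k_anonymous(rows, k):
    """Every QI combination must appear at least k times."""
    return all(c >= k for c in Counter(rows).values())

# (sex, zipcode) quasi-identifiers from the slide's original data
rows = [("male", "53715"), ("female", "55410"), ("female", "90210"),
        ("male", "02274"), ("male", "02237")]
assert not is_k_anonymous(rows, 2)                 # raw data: every tuple is unique

g = generalize(rows, zip_keep=1, keep_sex=False)   # sex -> person, zipcode -> d****
counts = Counter(g)
kept = [r for r in g if counts[r] >= 2]            # suppress the outlier (Carol's tuple)
assert is_k_anonymous(kept, 2) and len(kept) == 4
```

As on the slide, one level of generalization plus a single suppressed tuple is enough to make the remaining four tuples 2-anonymous.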

24 generalization lattice
 Assume domain hierarchies exist for all QI attributes (e.g., for zipcode and sex).
 Construct the generalization lattice for the entire QI set (less general at the bottom, more general at the top).
 Objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum-distance vector with k-anonymity.

25 incognito
 The Incognito algorithm generates the set of all possible k-anonymous full-domain generalizations of a table T, with an optional tuple-suppression threshold.
 The algorithm begins by checking single-attribute subsets of the quasi-identifier, and then iterates, checking k-anonymity with respect to increasingly large subsets.

26 incognito
(I) generalization property: if k-anonymity holds at some node of the lattice, then it also holds at any ancestor (more general) node.
(II) subset property: if k-anonymity does not hold for a set of QI attributes, then it does not hold for any of its supersets.
Incognito considers sets of QI attributes of increasing cardinality and prunes nodes in the lattice using the two properties above. (The worked examples refer to lattice nodes shown only in the original figure; the entire lattice, which includes three dimensions, is too complex to show.)
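The subset property gives an apriori-style, bottom-up search. The sketch below illustrates only that pruning idea over QI attribute subsets; it omits Incognito's per-subset generalization lattices, and the function names and toy table are mine.

```python
from collections import Counter
from itertools import combinations

def is_k_anonymous(table, attrs, k):
    """k-anonymity of `table` (list of dicts) w.r.t. the attribute list `attrs`."""
    counts = Counter(tuple(row[a] for a in attrs) for row in table)
    return all(c >= k for c in counts.values())

def k_anonymous_qi_subsets(table, all_attrs, k):
    """Bottom-up search using the subset property: a superset can only be
    k-anonymous if every subset is, so failing subsets prune their supersets."""
    frontier = {frozenset([a]) for a in all_attrs if is_k_anonymous(table, [a], k)}
    results, size = set(frontier), 2
    while frontier:
        nxt = set()
        for combo in combinations(all_attrs, size):
            s = frozenset(combo)
            # every (size-1)-subset must have survived the previous level
            if all(s - {a} in frontier for a in s) and is_k_anonymous(table, sorted(s), k):
                nxt.add(s)
        results |= nxt
        frontier, size = nxt, size + 1
    return results

table = [{"sex": "M", "zip": "11000", "age": "<30"},
         {"sex": "M", "zip": "13000", "age": "<30"},
         {"sex": "F", "zip": "25000", "age": "<30"},
         {"sex": "F", "zip": "30000", "age": "<30"}]
res = k_anonymous_qi_subsets(table, ["sex", "zip", "age"], 2)
assert frozenset(["zip"]) not in res   # zip alone fails, so all its supersets are pruned
assert frozenset(["sex", "age"]) in res
```

Because {zip} already fails 2-anonymity, no combination containing zip is ever tested: that is the subset-property pruning at work.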

27 incognito (lattice-pruning example; figure only)

28 incognito (figure only)

29 seen in the domain space
 Consider the multi-dimensional domain space:
    QI attributes are the dimensions
    tuples are points in this space
    attribute hierarchies partition the dimensions (zipcode hierarchy, sex hierarchy)

30 seen in the domain space
incognito example: 2 QI attributes (sex, zipcode), 7 tuples, hierarchies shown with bold lines. (figures: one partitioning of the space is not 2-anonymous; a more general one is 2-anonymous)

31 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

32 k-anonymity problems
 k-anonymity example:
    homogeneity attack: in the last group, everyone has cancer
    background knowledge: in the first group, Japanese individuals have a low chance of heart disease
 We need to consider the sensitive values.

Data:
  id | Zipcode | Age | Nationality | Disease
  1  | 13053   | 28  | Russian     | Heart Disease
  2  | 13068   | 29  | American    | Heart Disease
  3  | 13068   | 21  | Japanese    | Viral Infection
  4  | 13053   | 23  | American    | Viral Infection
  5  | 14853   | 50  | Indian      | Cancer
  6  | 14853   | 55  | Russian     | Heart Disease
  7  | 14850   | 47  | American    | Viral Infection
  8  | 14850   | 49  | American    | Viral Infection
  9  | 13053   | 31  | American    | Cancer
  10 | 13053   | 37  | Indian      | Cancer
  11 | 13068   | 36  | Japanese    | Cancer
  12 | 13068   | 35  | American    | Cancer

4-anonymous data:
  id | Zipcode | Age | Nationality | Disease
  1  | 130**   | <30 | *           | Heart Disease
  2  | 130**   | <30 | *           | Heart Disease
  3  | 130**   | <30 | *           | Viral Infection
  4  | 130**   | <30 | *           | Viral Infection
  5  | 1485*   | ≥40 | *           | Cancer
  6  | 1485*   | ≥40 | *           | Heart Disease
  7  | 1485*   | ≥40 | *           | Viral Infection
  8  | 1485*   | ≥40 | *           | Viral Infection
  9  | 130**   | 3*  | *           | Cancer
  10 | 130**   | 3*  | *           | Cancer
  11 | 130**   | 3*  | *           | Cancer
  12 | 130**   | 3*  | *           | Cancer

33 l-diversity
 Make sure each group contains well-represented sensitive values:
    protect from homogeneity attacks
    protect from background knowledge
 l-diversity (simplified definition): a group is l-diverse if the most frequent sensitive value appears at most 1/l of the time in the group.

Bob's record:
  Name | Age | Sex | Zipcode
  Bob  | 23  | M   | 11000

A 2-diverse generalized table:
  Age      | Sex | Zipcode        | Disease
  [21, 60] | M   | [10001, 60000] | pneumonia
  [21, 60] | M   | [10001, 60000] | dyspepsia
  [21, 60] | M   | [10001, 60000] | flu
  [21, 60] | M   | [10001, 60000] | pneumonia
  [61, 70] | F   | [10001, 60000] | flu
  [61, 70] | F   | [10001, 60000] | gastritis
  [61, 70] | F   | [10001, 60000] | flu
  [61, 70] | F   | [10001, 60000] | bronchitis
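The simplified definition can be stated directly in code. A small sketch (the function name is mine), checked against the two groups of the 2-diverse table above:

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Simplified l-diversity: in every group, the most frequent
    sensitive value may appear at most 1/l of the time."""
    return all(max(Counter(g).values()) * l <= len(g) for g in groups)

# Sensitive values of the two groups in the 2-diverse generalized table
g1 = ["pneumonia", "dyspepsia", "flu", "pneumonia"]
g2 = ["flu", "gastritis", "flu", "bronchitis"]
assert is_l_diverse([g1, g2], 2)        # most frequent value is 2/4 = 1/2 in each group
assert not is_l_diverse([g1, g2], 3)    # but 2/4 > 1/3, so the table is not 3-diverse
```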

34 anatomy: a fast l-diversity algorithm
 Anatomy is not generalization:
    it separates sensitive values from tuples
    it shuffles sensitive values among groups
 For a given data table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

Data:
  Age | Sex | Zipcode | Disease
  23  | M   | 11000   | pneumonia
  27  | M   | 13000   | flu
  35  | M   | 59000   | dyspepsia
  59  | M   | 12000   | gastritis
  61  | F   | 54000   | dyspepsia
  65  | F   | 25000   | gastritis
  65  | F   | 25000   | flu
  70  | F   | 30000   | bronchitis

Quasi-identifier Table (QIT):
  Age | Sex | Zipcode | Group-ID
  23  | M   | 11000   | 1
  27  | M   | 13000   | 1
  35  | M   | 59000   | 1
  59  | M   | 12000   | 1
  61  | F   | 54000   | 2
  65  | F   | 25000   | 2
  65  | F   | 25000   | 2
  70  | F   | 30000   | 2

Sensitive Table (ST):
  Group-ID | Disease
  1        | dyspepsia
  1        | pneumonia
  1        | flu
  1        | gastritis
  2        | bronchitis
  2        | flu
  2        | gastritis
  2        | dyspepsia

35 anatomy algorithm
 Assign sensitive values to buckets (one bucket per sensitive value).
 Create groups by drawing one tuple from each of the l largest buckets.
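The two steps can be sketched as follows. This is a simplified illustration of the idea, not the published algorithm: `anatomize` is a name chosen here, and residual tuples left when fewer than l non-empty buckets remain are simply dropped, whereas the full algorithm performs a residual assignment.

```python
from collections import defaultdict
import heapq

def anatomize(records, l):
    """Anatomy sketch: records are (qi_tuple, sensitive_value) pairs.
    Returns (QIT, ST). Each group draws one tuple from each of the l
    currently largest buckets, so its l sensitive values are distinct."""
    buckets = defaultdict(list)
    for qi, s in records:
        buckets[s].append(qi)
    heap = [(-len(v), s) for s, v in buckets.items()]   # max-heap on bucket size
    heapq.heapify(heap)
    qit, st, gid = [], [], 0
    while len(heap) >= l:            # residual tuples are ignored in this sketch
        gid += 1
        drawn = [heapq.heappop(heap) for _ in range(l)]
        for neg_size, s in drawn:
            qit.append((*buckets[s].pop(), gid))
            st.append((gid, s))
            if neg_size + 1 < 0:     # bucket still non-empty: push it back
                heapq.heappush(heap, (neg_size + 1, s))
    return qit, st

# The eight records from the previous slide
records = [(("23", "M", "11000"), "pneumonia"), (("27", "M", "13000"), "flu"),
           (("35", "M", "59000"), "dyspepsia"), (("59", "M", "12000"), "gastritis"),
           (("61", "F", "54000"), "dyspepsia"), (("65", "F", "25000"), "gastritis"),
           (("65", "F", "25000"), "flu"), (("70", "F", "30000"), "bronchitis")]
qit, st = anatomize(records, 2)
```

Drawing from the l largest buckets keeps bucket sizes balanced, so all eight tuples end up assigned here, and every group's sensitive values are distinct by construction.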

36 Privacy Preservation
 From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

Bob's record:
  Name | Age | Sex | Zipcode
  Bob  | 23  | M   | 11000

quasi-identifier table (QIT):
  Age | Sex | Zipcode | Group-ID
  23  | M   | 11000   | 1
  27  | M   | 13000   | 1
  35  | M   | 59000   | 1
  59  | M   | 12000   | 1
  61  | F   | 54000   | 2
  65  | F   | 25000   | 2
  65  | F   | 25000   | 2
  70  | F   | 30000   | 2

sensitive table (ST), with value counts per group:
  Group-ID | Disease    | Count
  1        | dyspepsia  | 2
  1        | pneumonia  | 2
  2        | bronchitis | 1
  2        | flu        | 2
  2        | gastritis  | 1

37 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

38 Association Rule Hiding
 Recent years have seen tremendous advances in the ability to perform association rule mining effectively.
 Such rules often encode important target-marketing information about a business.
(figure: the user's database is transformed into a changed database, so that data mining finds the non-sensitive association rules while the sensitive rules stay hidden)

39 Association Rule Hiding
 There are various algorithms for hiding a group of association rules that are characterized as sensitive.
 A rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold.
 Sometimes, sensitive rules should not be disclosed to the public since, among other things, they may be used for inferring sensitive data, or they may provide business competitors with an advantage.
Association rule hiding techniques:
 Distortion-based: modify entries from 1s to 0s.
 Blocking-based: the entry is not modified but is left incomplete; unknown entry values are used to prevent the discovery of association rules.

40 Distortion-based Techniques
Sample database:
  A B C D
  1 1 1 0
  1 0 1 1
  0 0 0 1
  1 1 1 0
  1 0 1 1
Rule A→C has: Support(A→C) = 80%, Confidence(A→C) = 100%.

Distorted database (the distortion algorithm flips some 1s to 0s):
  A B C D
  1 1 1 0
  1 0 0 1
  0 0 0 1
  1 1 1 0
  1 0 0 1
Rule A→C now has: Support(A→C) = 40%, Confidence(A→C) = 50%.

41 Association Rule Hiding Strategies
Transaction database and its binary representation (items A, B, C):
  TID | Items      TID | A B C
  T1  | ABC        T1  | 1 1 1
  T2  | ABC        T2  | 1 1 1
  T3  | ABC        T3  | 1 1 1
  T4  | AB         T4  | 1 1 0
  T5  | A          T5  | 1 0 0
  T6  | AC         T6  | 1 0 1

42 Association Rule Hiding Strategies
For a rule X → Y, suppose we want to lower the value of the ratio
  confidence(X → Y) = support(X ∪ Y) / support(X)
We can either decrease the numerator (the support of X ∪ Y) or increase the denominator (the support of X).

43 Association Rule Hiding Strategies
  support(X → Y) = |{T ∈ D : X ∪ Y ⊆ T}| / N
where N is the number of transactions in the database D.

44 Association Rule Hiding Strategies
Rule AC → B (min_supp = 35%, min_conf = 70%):

  TID | Items      TID | A B C
  T1  | ABC        T1  | 1 1 1
  T2  | ABC        T2  | 1 1 1
  T3  | ABC        T3  | 1 1 1
  T4  | AB         T4  | 1 1 0
  T5  | A          T5  | 1 0 0
  T6  | AC         T6  | 1 0 1

AC → B has Support = 50%, conf = 75%.

Decreasing the rule's support by removing C from T1:
  TID | A B C
  T1  | 1 1 0
  T2  | 1 1 1
  T3  | 1 1 1
  T4  | 1 1 0
  T5  | 1 0 0
  T6  | 1 0 1

Now Support = 33%, conf = 66%: the support falls below min_supp, so the rule is hidden.

45 Association Rule Hiding Strategies
  confidence(X → Y) = support(X ∪ Y) / support(X)
To lower the confidence of a rule we can either:
 decrease the support of the rule, making sure we hide items from the right-hand side of the rule, or
 increase the support of the left-hand side.

46 Association Rule Hiding Strategies
Rule AC → B (min_supp = 33%, min_conf = 70%) has Support = 50%, conf = 75% in the original database (T1..T6 as before).

Hiding an item from the right-hand side: remove B from T1:
  TID | A B C
  T1  | 1 0 1
  T2  | 1 1 1
  T3  | 1 1 1
  T4  | 1 1 0
  T5  | 1 0 0
  T6  | 1 0 1

Now Support = 33% and conf = (2/6) / (4/6) = 50%: the confidence falls below min_conf, so the rule is hidden.

47 Association Rule Hiding Strategies
Rule AC → B (min_supp = 33%, min_conf = 70%) has Support = 50%, conf = 75% in the original database (T1..T6 as before).

Increasing the support of the left-hand side: add C to T5:
  TID | A B C
  T1  | 1 1 1
  T2  | 1 1 1
  T3  | 1 1 1
  T4  | 1 1 0
  T5  | 1 0 1
  T6  | 1 0 1

Now Support = 50% and conf = 60%: the confidence falls below min_conf, so the rule is hidden.
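The numbers on this slide can be reproduced with a few lines. A minimal sketch: transactions are plain Python sets, and `support`/`confidence` are helper names chosen here.

```python
def support(db, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, lhs, rhs):
    return support(db, lhs | rhs) / support(db, lhs)

# Transactions T1..T6 from the slides
db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"},
      {"A", "B"}, {"A"}, {"A", "C"}]

# Rule AC -> B before hiding: support 50%, confidence 75%
assert support(db, {"A", "B", "C"}) == 0.5
assert round(confidence(db, {"A", "C"}, {"B"}), 2) == 0.75

# Increase the support of the left-hand side AC: add C to T5
db[4].add("C")
assert support(db, {"A", "B", "C"}) == 0.5                 # rule support unchanged
assert round(confidence(db, {"A", "C"}, {"B"}), 2) == 0.6  # now below min_conf = 70%
```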

48 Quality of Data
 Sometimes it is dangerous to delete items from the database (e.g., in medical databases), because the false data may create undesirable effects.
 So we have to hide the rules in the database by adding uncertainty, without distorting the database.

49 Blocking-based Techniques
Initial database:
  A B C D
  1 1 1 0
  1 0 1 1
  0 0 0 1
  1 1 1 0
  1 0 1 1

New database (the blocking algorithm replaces some entries with unknown values '?'):
  A B C D
  1 1 1 0
  1 0 ? 1
  ? 0 0 1
  1 1 1 0
  1 0 1 1

50 Outline
 Introduction
 key directions in the field of privacy-preserving data mining
 Privacy-Preserving Data Publishing
    The randomization method
    The k-anonymity model
    l-diversity
 Changing the results of Data Mining Applications to preserve privacy
    association rule hiding
 Privacy-Preserving Distributed Data Mining
    Vertical Partitioning of Data
    Horizontal partitioning

51 Motivation
 Setting:
    data is distributed at different sites
    these sites may be third parties (e.g., hospitals, government bodies) or individuals
 Aim:
    compute the data mining algorithm on the data so that nothing but the output is learned
    that is, carry out a secure computation

52 Vertical Partitioning of Data
Medical records (held at one site):
  TID | Brain Tumor? | Diabetes?
  RPJ | Yes          | Diabetic
  CAC | No Tumor     | No
  PTR | No Tumor     | Diabetic

Cell phone data (held at another site):
  TID | Model | Battery
  RPJ | 5210  | Li/Ion
  CAC | none  | -
  PTR | 3650  | NiCd

Global database view: TID | Brain Tumor? | Diabetes? | Model | Battery

53 Horizontal partitioning
 Two banks hold very similar information about their credit card accounts:
    is the account active?
    is the account delinquent?
    is the account new?
    account balance
 No public sharing of the data is allowed.

54 Privacy-Preserving Distributed Data Mining
Main tools:
 Secure multiparty computation
 Cryptography
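One classic building block for such secure computations is the secure-sum protocol. A sketch, assuming semi-honest, non-colluding parties arranged in a ring; the function and variable names are illustrative.

```python
import random

def secure_sum(private_values, modulus=10**9):
    """Ring-based secure sum: the initiator masks the running total with a
    random value; each party adds its private input to the masked total it
    receives, so no party ever sees another party's raw value. At the end,
    the initiator removes the mask to reveal only the global sum."""
    mask = random.randrange(modulus)
    running = mask
    for v in private_values:          # message passed around the ring
        running = (running + v) % modulus
    return (running - mask) % modulus

# e.g., two banks and a third party jointly counting delinquent accounts
assert secure_sum([120, 45, 300]) == 465
```

This protects against a semi-honest adversary only: colluding neighbors in the ring can subtract their views to recover the value of the party between them, which is why real protocols use secret sharing or cryptographic techniques.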


