Data Mining And Privacy Protection
Prepared by: Eng. Hiba Ramadan
Supervised by: Dr. Rakan Razouk
Outline
- Introduction
- Key directions in the field of privacy-preserving data mining
- Privacy-Preserving Data Publishing
  - The randomization method
  - The k-anonymity model
  - l-diversity
- Changing the results of data mining applications to preserve privacy
  - Association rule hiding
- Privacy-Preserving Distributed Data Mining
  - Vertical partitioning of data
  - Horizontal partitioning of data
Privacy Preserving Data Mining
What is data mining? The non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases [W. Frawley, G. Piatetsky-Shapiro and C. Matheus, 1992].
What is privacy preserving data mining? The study of achieving data mining goals without sacrificing the privacy of the individuals whose data is mined.
Scenario (Information Sharing)
A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification analysis, without sacrificing the privacy of the individuals in the released data.
Data owner → person-specific data → data recipients
Key directions in the field of privacy-preserving data mining
- Privacy-Preserving Data Publishing: techniques that transform the data before publication so that individual records remain private while the data stays useful for analysis.
- Changing the results of data mining applications to preserve privacy: in many cases the results of applications such as association rule or classification rule mining can themselves compromise the privacy of the data.
- Privacy-Preserving Distributed Data Mining: in many cases the data is distributed across multiple sites, and the owners of the data at these different sites may wish to compute a common function without revealing their inputs to one another.
Randomization Approach Overview
[figure: original records (e.g., Age 30 | Salary 25K, Age 50 | Salary 40K, ...) pass through a randomizer, which outputs distorted records (e.g., 65 | 20K, 25 | 60K, ...); the distributions of Age and Salary are then reconstructed and fed to data mining algorithms to build a model]
Randomization
The randomization method can be described as follows. Given records x = {x_1, ..., x_N}, to each record x_i ∈ X we add a noise component y_i drawn independently from a known probability distribution f_Y(y); the released set of distorted records is x_1 + y_1, ..., x_N + y_N.
In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, the individual records cannot be recovered, but the distribution of the original records can.
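A minimal sketch of the distortion step, assuming zero-mean Gaussian noise (the variable names and noise scale are illustrative):

```python
import random

def randomize(records, noise_std=20.0):
    """Distort each numeric value x_i by adding independent noise y_i
    drawn from a known distribution f_Y (here: zero-mean Gaussian).
    Only the sums x_i + y_i are released."""
    return [x + random.gauss(0.0, noise_std) for x in records]

ages = [23, 27, 35, 59, 61, 65, 65, 70]   # original, private values
released = randomize(ages)                # the data owner publishes these
```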
Reconstruction Problem
- Original values x_1, x_2, ..., x_n are drawn from an unknown probability distribution X.
- To hide these values, we add y_1, y_2, ..., y_n drawn from a known probability distribution Y.
- Given the sums x_1 + y_1, x_2 + y_2, ..., x_n + y_n and the probability distribution of Y, estimate the probability distribution of X.
Intuition (Reconstruct single point)
[figures: given an observed distorted value and the known noise distribution f_Y, Bayes' rule yields an estimate of where the original point likely came from]
Reconstructing the Distribution
Combine the estimates of where each point came from, over all the points: this gives an estimate of the original distribution.
Reconstruction
f_X^0 := uniform distribution
j := 0   // iteration number
repeat
  f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X^j(a) / ∫ f_Y(w_i − z) f_X^j(z) dz ]   (Bayes' rule, where w_i = x_i + y_i are the observed values)
  j := j + 1
until (stopping criterion met)
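A sketch of this reconstruction loop on a discretized domain, assuming Gaussian noise and a fixed iteration count in place of a real stopping criterion (names and grid are illustrative):

```python
import math

def gauss_pdf(v, std):
    """Density of the (known) noise distribution f_Y, here Gaussian."""
    return math.exp(-0.5 * (v / std) ** 2) / (std * math.sqrt(2 * math.pi))

def reconstruct(w, grid, noise_std, iters=50):
    """Iterative Bayesian reconstruction of f_X on a discretized grid.
    w: observed distorted values w_i = x_i + y_i."""
    f = [1.0 / len(grid)] * len(grid)              # f_X^0 := uniform
    for _ in range(iters):                         # fixed iteration count
        new_f = [0.0] * len(grid)
        for wi in w:
            # posterior over the grid for this observation (Bayes' rule)
            post = [gauss_pdf(wi - a, noise_std) * fa
                    for a, fa in zip(grid, f)]
            s = sum(post) or 1.0                   # guard against underflow
            for idx, p in enumerate(post):
                new_f[idx] += p / s
        f = [v / len(w) for v in new_f]            # average over all points
    return f

# usage, with the earlier sketch's distorted ages:
# observed = randomize([23, 27, 35, 59, 61, 65, 65, 70])
# estimate = reconstruct(observed, grid=list(range(0, 101, 5)), noise_std=20.0)
```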
Seems to work well!
[figure: the reconstructed distribution closely tracks the original distribution]
Pros & Cons
Pros: one key advantage of the randomization method is that it is relatively simple, and it does not require knowledge of the distribution of other records in the data.
Pros & Cons
Cons:
- We obtain only a distribution describing the behavior of X; individual records are not available.
- The distributions are available only along individual dimensions. While the approach can certainly be extended to multi-dimensional distributions, density estimation becomes inherently more challenging as dimensionality increases. Even for modest dimensionalities such as 7 to 10, density estimation becomes increasingly inaccurate and falls prey to the curse of dimensionality.
k-anonymity
The role of attributes in data:
- explicit identifiers are removed
- quasi-identifiers can be used to re-identify individuals
- sensitive attributes (may not exist!) carry sensitive information

| Name (identifier) | Birthdate (QI) | Sex (QI) | Zipcode (QI) | Disease (sensitive) |
| Andre | 21/1/79 | male | 53715 | Flu |
| Beth | 10/1/81 | female | 55410 | Hepatitis |
| Carol | 1/10/44 | female | 90210 | Bronchitis |
| Dan | 21/2/84 | male | 02174 | Sprained Ankle |
| Ellen | 19/4/72 | female | 02237 | AIDS |
k-anonymity
Preserve privacy via k-anonymity, proposed by Sweeney and Samarati.
- k-anonymity: intuitively, hide each individual among at least k−1 others; each combination of QI values must appear at least k times in the released data.
- Sensitive attributes are not considered (going to revisit this...).
- How to achieve this? Generalization and suppression; value perturbation is not considered (we should remain truthful to the original values).
- Privacy vs. utility tradeoff: do not anonymize more than necessary. A minimal check of the definition is sketched below.
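A minimal sketch of the k-anonymity test itself (not an anonymization algorithm, just the check; column names follow the example tables in these slides):

```python
from collections import Counter

def is_k_anonymous(table, qi_columns, k):
    """A table is k-anonymous if every combination of quasi-identifier
    values appears in at least k records."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in table)
    return all(count >= k for count in groups.values())

released = [
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "pneumonia"},
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "dyspepsia"},
    {"Age": ">61", "Sex": "F", "Zipcode": "1100*", "Disease": "flu"},
    {"Age": ">61", "Sex": "F", "Zipcode": "1100*", "Disease": "gastritis"},
]
print(is_k_anonymous(released, ["Age", "Sex", "Zipcode"], k=2))  # True
```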
k-anonymity
Transform each QI value into a less specific form. A generalized table:

| Age | Sex | Zipcode | Disease |
| >21 | M | 1100* | pneumonia |
| >21 | M | 1100* | dyspepsia |
| >21 | M | 1100* | dyspepsia |
| >21 | M | 1100* | pneumonia |
| >61 | F | 1100* | flu |
| >61 | F | 1100* | gastritis |
| >61 | F | 1100* | flu |
| >61 | F | 1100* | bronchitis |

Adversary's knowledge:
| Name | Age | Sex | Zipcode |
| Bob | 23 | M | 11000 |
k-anonymity example
Tools for anonymization:
- generalization: publish more general values, i.e., given a domain hierarchy, roll up
- suppression: remove tuples, i.e., do not publish outliers; often the number of suppressed tuples is bounded

original data:
| Birthdate | Sex | Zipcode |
| 21/1/79 | male | 5… |
| …/1/79 | female | 5… |
| 1/10/44 | female | 90210 |
| …/2/83 | male | 022… |
| …/4/82 | male | 02237 |

2-anonymous data:
| group 1 | */1/79 | person | 5**** |
| group 1 | */1/79 | person | 5**** |
| suppressed | 1/10/44 | female | 90210 |
| group 2 | */*/8* | male | 022** |
| group 2 | */*/8* | male | 022** |
generalization lattice
Assume domain hierarchies exist for all QI attributes (e.g., zipcode and sex), and construct the generalization lattice for the entire QI set, ordered from less to more general.
Objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum-distance generalization vector that is k-anonymous.
incognito
The Incognito algorithm generates the set of all possible k-anonymous full-domain generalizations of a table T, with an optional tuple suppression threshold. The algorithm begins by checking single-attribute subsets of the quasi-identifier, and then iterates, checking k-anonymity with respect to increasingly large subsets.
incognito
(I) generalization property: if k-anonymity holds at some node, it also holds at any ancestor (more general) node.
(II) subset property: if k-anonymity does not hold for a set of QI attributes, it cannot hold for any of its supersets; equivalently, every subset of a k-anonymous attribute set is itself k-anonymous. E.g., if {A, B} is k-anonymous, then so are {A} and {B}; and if {A} is not k-anonymous, then {A, B} and {A, C} cannot be k-anonymous.
Incognito considers sets of QI attributes of increasing cardinality and prunes nodes in the lattice using the two properties above, as sketched below. (Note: the entire lattice, which includes three dimensions, is too complex to show.)
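An illustrative sketch of the Apriori-style pruning idea behind the subset property (not the full Incognito algorithm, which works over generalization lattice nodes rather than raw attribute sets):

```python
from itertools import combinations

def k_anonymous_subsets(attrs, check):
    """Apriori-style use of the subset property: a QI attribute set is
    tested only if all of its immediate subsets already passed, because
    a superset of a failing set can never be k-anonymous."""
    passed = set()
    for size in range(1, len(attrs) + 1):
        for cand in combinations(sorted(attrs), size):
            subsets_ok = all(sub in passed
                             for sub in combinations(cand, size - 1) if sub)
            if subsets_ok and check(cand):
                passed.add(cand)
    return passed

# usage with the earlier sketch:
# check = lambda qi: is_k_anonymous(table, list(qi), k=2)
# k_anonymous_subsets(["Age", "Sex", "Zipcode"], check)
```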
seen in the domain space
Consider the multi-dimensional domain space: QI attributes are the dimensions, tuples are points in this space, and the attribute hierarchies (e.g., the zipcode hierarchy and the sex hierarchy) partition the dimensions.
seen in the domain space
incognito example: 2 QI attributes (zipcode, sex), 7 tuples, hierarchies shown with bold lines
[figure: partitions of the zipcode × sex domain space; the initial partitioning is not 2-anonymous, while a more generalized one is 2-anonymous]
k-anonymity problems
k-anonymity example:
- homogeneity attack: in the last group, everyone has cancer
- background knowledge attack: in the first group, Japanese individuals have a low chance of heart disease
- we therefore need to consider the sensitive values themselves

data:
| id | Zipcode | Age | Nationality | Disease |
| 1 | 13053 | 28 | Russian | Heart Disease |
| 2 | 13068 | 29 | American | Heart Disease |
| 3 | 13068 | 21 | Japanese | Viral Infection |
| 4 | 13053 | 23 | American | Viral Infection |
| 5 | 14853 | 50 | Indian | Cancer |
| 6 | 14853 | 55 | Russian | Heart Disease |
| 7 | 14850 | 47 | American | Viral Infection |
| 8 | 14850 | 49 | American | Viral Infection |
| 9 | 13053 | 31 | American | Cancer |
| 10 | 13053 | 37 | Indian | Cancer |
| 11 | 13068 | 36 | Japanese | Cancer |
| 12 | 13068 | 35 | American | Cancer |

4-anonymous data:
| id | Zipcode | Age | Nationality | Disease |
| 1 | 130** | <30 | ∗ | Heart Disease |
| 2 | 130** | <30 | ∗ | Heart Disease |
| 3 | 130** | <30 | ∗ | Viral Infection |
| 4 | 130** | <30 | ∗ | Viral Infection |
| 5 | 1485* | ≥40 | ∗ | Cancer |
| 6 | 1485* | ≥40 | ∗ | Heart Disease |
| 7 | 1485* | ≥40 | ∗ | Viral Infection |
| 8 | 1485* | ≥40 | ∗ | Viral Infection |
| 9 | 130** | 3∗ | ∗ | Cancer |
| 10 | 130** | 3∗ | ∗ | Cancer |
| 11 | 130** | 3∗ | ∗ | Cancer |
| 12 | 130** | 3∗ | ∗ | Cancer |
l-diversity
Make sure each group contains well-represented sensitive values:
- protects from homogeneity attacks
- protects from background knowledge attacks
l-diversity (simplified definition): a group is l-diverse if the most frequent sensitive value appears in at most a 1/l fraction of the group. A sketch of this check appears below.

A 2-diverse generalized table:
| Age | Sex | Zipcode | Disease |
| [21, 60] | M | [10001, 60000] | pneumonia |
| [21, 60] | M | [10001, 60000] | dyspepsia |
| [21, 60] | M | [10001, 60000] | flu |
| [21, 60] | M | [10001, 60000] | pneumonia |
| [61, 70] | F | [10001, 60000] | flu |
| [61, 70] | F | [10001, 60000] | gastritis |
| [61, 70] | F | [10001, 60000] | flu |
| [61, 70] | F | [10001, 60000] | bronchitis |

Adversary's knowledge: Bob, Age 23, Sex M, Zipcode 11000.
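A minimal sketch of the simplified (frequency-based) l-diversity check, reusing the grouping idea from the k-anonymity sketch:

```python
from collections import Counter, defaultdict

def is_l_diverse(table, qi_columns, sensitive, l):
    """Simplified l-diversity: in every QI group, the most frequent
    sensitive value accounts for at most a 1/l fraction of the group."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[c] for c in qi_columns)].append(row[sensitive])
    return all(max(Counter(vals).values()) <= len(vals) / l
               for vals in groups.values())

# e.g., in the 2-diverse table above, pneumonia appears 2/4 times
# in the first group, so the 1/2 bound is met.
```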
anatomy
A fast l-diversity algorithm. Anatomy is not generalization: it separates the sensitive values from the tuples and shuffles the sensitive values among groups. For a given data table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

data:
| Age | Sex | Zipcode | Disease |
| 23 | M | 11000 | pneumonia |
| 27 | M | 13000 | flu |
| 35 | M | 59000 | dyspepsia |
| 59 | M | 12000 | gastritis |
| 61 | F | 54000 | dyspepsia |
| 65 | F | 25000 | gastritis |
| 65 | F | 25000 | flu |
| 70 | F | 30000 | bronchitis |

quasi-identifier table (QIT):
| Age | Sex | Zipcode | Group-ID |
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

sensitive table (ST):
| Group-ID | Disease |
| 1 | dyspepsia |
| 1 | pneumonia |
| 1 | flu |
| 1 | gastritis |
| 2 | bronchitis |
| 2 | flu |
| 2 | gastritis |
| 2 | dyspepsia |
anatomy algorithm
- assign sensitive values to buckets
- create groups by drawing one tuple from each of the l currently largest buckets, as sketched below
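A simplified sketch of this group-creation step (the residual-assignment phase of the real algorithm is omitted; names are illustrative):

```python
from collections import defaultdict

def anatomize(table, sensitive, l=2):
    """Sketch of Anatomy's group creation: bucket tuples by sensitive
    value, then repeatedly form a group by drawing one tuple from each
    of the l currently largest buckets, so no sensitive value exceeds
    a 1/l fraction of its group. Leftover tuples are ignored here."""
    buckets = defaultdict(list)
    for row in table:
        buckets[row[sensitive]].append(row)
    qit, st, gid = [], [], 0
    while sum(len(b) > 0 for b in buckets.values()) >= l:
        gid += 1
        largest = sorted(buckets, key=lambda v: len(buckets[v]),
                         reverse=True)[:l]
        for value in largest:
            row = buckets[value].pop()
            qi = {k: v for k, v in row.items() if k != sensitive}
            qit.append({**qi, "Group-ID": gid})
            st.append({"Group-ID": gid, sensitive: value})
    return qit, st

# usage: qit, st = anatomize(data_rows, sensitive="Disease", l=2)
```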
Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

quasi-identifier table (QIT):
| Age | Sex | Zipcode | Group-ID |
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

sensitive table (ST):
| Group-ID | Disease | Count |
| 1 | dyspepsia | 2 |
| 1 | pneumonia | 2 |
| 2 | bronchitis | 1 |
| 2 | flu | 2 |
| 2 | gastritis | 1 |

E.g., knowing that Bob (Age 23, Sex M, Zipcode 11000) falls in group 1, the adversary can only conclude that he has dyspepsia or pneumonia, each with confidence 2/4 = 1/2.
Association Rule Hiding
Recent years have seen tremendous advances in the ability to perform association rule mining effectively. Such rules often encode important target marketing information about a business.
[figure: the user mines a changed database, so that the discovered association rules no longer include the sensitive rules that were hidden]
Association Rule Hiding
There are various algorithms for hiding a group of association rules that is characterized as sensitive. A rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold. Sometimes, sensitive rules should not be disclosed to the public since, among other things, they may be used to infer sensitive data, or they may provide business competitors with an advantage.
Association rule hiding techniques:
- Distortion-based: modify entries from 1s to 0s
- Blocking-based: the entry is not modified, but is left incomplete; the unknown entry values are used to prevent discovery of association rules
Distortion-based Techniques
[figure: a sample database of transactions over items A, B, C, D is passed through a distortion algorithm; in the sample database the rule A → C has Support(A→C) = 80% and Confidence(A→C) = 100%, while in the distorted database it has Support(A→C) = 40% and Confidence(A→C) = 50%]
Association Rule Hiding Strategies
| TID | Items | binary |
| T1 | ABC | 1 1 1 |
| T2 | ABC | 1 1 1 |
| T3 | ABC | 1 1 1 |
| T4 | AB | 1 1 0 |
| T5 | A | 1 0 0 |
| T6 | AC | 1 0 1 |
Association Rule Hiding Strategies
If we want to lower the confidence of a rule X → Y, we must lower the value of the ratio
conf(X → Y) = supp(X ∪ Y) / supp(X),
i.e., either decrease the numerator supp(X ∪ Y) or increase the denominator supp(X).
Association Rule Hiding Strategies
Support(X → Y) = supp(X ∪ Y) = (number of transactions in D containing X ∪ Y) / N, where N is the number of transactions in D. The sketch below computes both measures directly from a transaction list.
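A minimal sketch of both measures, using the six-transaction example from these slides:

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# the six transactions from the table above
D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"},
     {"A", "B"}, {"A"}, {"A", "C"}]
print(support({"A", "B", "C"}, D))        # 0.5  -> Support(AC -> B) = 50%
print(confidence({"A", "C"}, {"B"}, D))   # 0.75 -> conf(AC -> B) = 75%
```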
Association Rule Hiding Strategies
Hiding the rule AC → B (Support = 50%, conf = 75%) with min_supp = 35%, min_conf = 70%:

| TID | Items | before | after |
| T1 | ABC | 1 1 1 | 1 1 0 |
| T2 | ABC | 1 1 1 | 1 1 1 |
| T3 | ABC | 1 1 1 | 1 1 1 |
| T4 | AB | 1 1 0 | 1 1 0 |
| T5 | A | 1 0 0 | 1 0 0 |
| T6 | AC | 1 0 1 | 1 0 1 |

Removing C from T1 gives Support = 33% and conf = 66%, so the rule falls below both thresholds and is hidden.
Association Rule Hiding Strategies
Confidence(X → Y) = supp(X ∪ Y) / supp(X). To lower it we can either:
- decrease the support supp(X ∪ Y), making sure we hide (remove) items from the right-hand side Y of the rule, or
- increase the support of the left-hand side X.
Association Rule Hiding Strategies
The same rule AC → B (Support = 50%, conf = 75%), now with min_supp = 33%, min_conf = 70%:

| TID | Items | before | after |
| T1 | ABC | 1 1 1 | 1 1 0 |
| T2 | ABC | 1 1 1 | 1 1 1 |
| T3 | ABC | 1 1 1 | 1 1 1 |
| T4 | AB | 1 1 0 | 1 1 0 |
| T5 | A | 1 0 0 | 1 0 0 |
| T6 | AC | 1 0 1 | 1 0 1 |

After the distortion, Support = 33% and conf = 66%. Here the support alone (33%) no longer falls below min_supp; the rule is hidden because its confidence (66%) drops below min_conf = 70%.
Association Rule Hiding Strategies
Increasing the support of the left-hand side, with min_supp = 33%, min_conf = 70%:

| TID | Items | before | after |
| T1 | ABC | 1 1 1 | 1 1 1 |
| T2 | ABC | 1 1 1 | 1 1 1 |
| T3 | ABC | 1 1 1 | 1 1 1 |
| T4 | AB | 1 1 0 | 1 1 0 |
| T5 | A | 1 0 0 | 1 0 1 |
| T6 | AC | 1 0 1 | 1 0 1 |

Adding C to T5 raises supp(AC) from 4 to 5 transactions, so conf(AC → B) = 3/5 = 60% < min_conf, while Support stays at 50%. The rule is hidden. A greedy sketch of distortion-based hiding follows.
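A greedy sketch implementing the first strategy above (remove a right-hand-side item from supporting transactions); it reuses support() and confidence() from the earlier sketch and ignores side effects on non-sensitive rules, which a real algorithm would minimize:

```python
def hide_rule(transactions, lhs, rhs, min_supp, min_conf):
    """Greedy distortion: remove an RHS item from supporting
    transactions, one at a time, until the rule's support or
    confidence drops below its threshold."""
    item = next(iter(rhs))                 # an item of the RHS to remove
    for t in transactions:
        if (lhs | rhs) <= t:               # t supports the rule
            t.discard(item)                # flip a 1 to a 0
            if (support(lhs | rhs, transactions) < min_supp or
                    confidence(lhs, rhs, transactions) < min_conf):
                return transactions        # rule is now hidden
    return transactions

# usage: hide AC -> B in the example database (fresh copy of D)
D2 = [set(t) for t in D]
hide_rule(D2, {"A", "C"}, {"B"}, min_supp=0.35, min_conf=0.70)
```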
Quality of Data
Sometimes it is dangerous to delete items from the database (e.g., in medical databases), because the falsified data may have undesirable effects. In such cases, we have to hide the rules by adding uncertainty to the database rather than distorting it.
Blocking-based Techniques
[figure: a blocking algorithm transforms the initial database over items A, B, C, D into a new database in which some 0/1 entries are replaced by "?" (unknown) values]
Motivation
Setting:
- Data is distributed at different sites
- These sites may be third parties (e.g., hospitals, government bodies) or individuals
Aim:
- Compute the data mining algorithm on the data so that nothing but the output is learned
- That is, carry out a secure computation
Vertical Partitioning of Data
Medical Records:
| TID | Brain Tumor? | Diabetes? |
| RPJ | Yes | Diabetic |
| CAC | No Tumor | No |
| PTR | No Tumor | Diabetic |

Cell Phone Data:
| TID | Model | Battery |
| RPJ | 5210 | Li/Ion |
| CAC | none | — |
| PTR | 3650 | NiCd |

Global Database View: TID | Brain Tumor? | Diabetes? | Model | Battery
Horizontal partitioning
- Two banks hold very similar information about their credit card accounts: Is the account active? Is the account delinquent? Is the account new? The account balance.
- No public sharing of the data is allowed.
Privacy-Preserving Distributed Data Mining
The main tools are secure multiparty computation and cryptography: the parties jointly evaluate the mining function while each party's inputs stay private. A classic building block for horizontally partitioned data is the secure sum protocol, sketched below.
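A toy sketch of the secure sum idea (simulated in one process; a real protocol passes the masked running total from site to site and needs hardening against collusion):

```python
import random

def secure_sum(local_values, modulus=10**9):
    """Toy secure-sum protocol: the initiating site masks the running
    total with a random number, each site adds its own local value,
    and the initiator removes the mask at the end. No site ever sees
    another site's individual value, only masked partial sums."""
    mask = random.randrange(modulus)
    total = mask
    for v in local_values:                 # simulated round-trip over sites
        total = (total + v) % modulus
    return (total - mask) % modulus

# e.g., two banks jointly count delinquent accounts without sharing data
print(secure_sum([120, 85]))               # 205
```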