Data Mining And Privacy Protection
Prepared by: Eng. Hiba Ramadan
Supervised by: Dr. Rakan Razouk
Outline
- Introduction
- Key directions in the field of privacy-preserving data mining
- Privacy-Preserving Data Publishing
  - The randomization method
  - The k-anonymity model
  - l-diversity
- Changing the results of data mining applications to preserve privacy
  - Association rule hiding
- Privacy-Preserving Distributed Data Mining
  - Vertical partitioning of data
  - Horizontal partitioning of data
Privacy Preserving Data Mining
What is data mining? The non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases [W. Frawley, G. Piatetsky-Shapiro and C. Matheus, 1992].
What is privacy preserving data mining? The study of achieving data mining goals without sacrificing the privacy of the individuals whose data is mined.
Scenario (Information Sharing)
A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification analysis, without sacrificing the privacy of the individuals in the released data.
Data owner → person-specific data → data recipients
Key directions in the field of privacy-preserving data mining
- Privacy-Preserving Data Publishing: techniques that transform the data before publication so that individual records remain private while the data stays useful for analysis.
- Changing the results of data mining applications to preserve privacy: in many cases the results of applications such as association rule or classification rule mining can themselves compromise the privacy of the data.
- Privacy-Preserving Distributed Data Mining: in many cases the data is distributed across multiple sites, and the owners of the data at these different sites may wish to compute a common function without revealing their inputs to one another.
Randomization Approach Overview
[figure: original records (e.g., Age 30 | Salary 25K, Age 50 | Salary 40K, ...) pass through a randomizer, which outputs distorted records (e.g., 65 | 20K, 25 | 60K, ...); the distributions of Age and Salary are then reconstructed and fed to data mining algorithms to build a model]
Randomization
The randomization method can be described as follows. Given records x = {x_1, ..., x_N}, to each record x_i ∈ X we add a noise component y_i drawn independently from a known probability distribution f_Y(y); the released set of distorted records is x_1 + y_1, ..., x_N + y_N.
In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, the individual records cannot be recovered, but the distribution of the original records can.
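A minimal sketch of the distortion step, assuming zero-mean Gaussian noise (the variable names and noise scale are illustrative):

```python
import random

def randomize(records, noise_std=20.0):
    """Distort each numeric value x_i by adding independent noise y_i
    drawn from a known distribution f_Y (here: zero-mean Gaussian).
    Only the sums x_i + y_i are released."""
    return [x + random.gauss(0.0, noise_std) for x in records]

ages = [23, 27, 35, 59, 61, 65, 65, 70]   # original, private values
released = randomize(ages)                # the data owner publishes these
```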
Reconstruction Problem
- Original values x_1, x_2, ..., x_n are drawn from an unknown probability distribution X.
- To hide these values, we add y_1, y_2, ..., y_n drawn from a known probability distribution Y.
- Given the sums x_1 + y_1, x_2 + y_2, ..., x_n + y_n and the probability distribution of Y, estimate the probability distribution of X.
Intuition (Reconstruct single point)
[figures: given an observed distorted value and the known noise distribution f_Y, Bayes' rule yields an estimate of where the original point likely came from]
Reconstructing the Distribution
Combine the estimates of where each point came from, over all the points: this gives an estimate of the original distribution.
Reconstruction
f_X^0 := uniform distribution
j := 0   // iteration number
repeat
  f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X^j(a) / ∫ f_Y(w_i − z) f_X^j(z) dz ]   (Bayes' rule, where w_i = x_i + y_i are the observed values)
  j := j + 1
until (stopping criterion met)
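A sketch of this reconstruction loop on a discretized domain, assuming Gaussian noise and a fixed iteration count in place of a real stopping criterion (names and grid are illustrative):

```python
import math

def gauss_pdf(v, std):
    """Density of the (known) noise distribution f_Y, here Gaussian."""
    return math.exp(-0.5 * (v / std) ** 2) / (std * math.sqrt(2 * math.pi))

def reconstruct(w, grid, noise_std, iters=50):
    """Iterative Bayesian reconstruction of f_X on a discretized grid.
    w: observed distorted values w_i = x_i + y_i."""
    f = [1.0 / len(grid)] * len(grid)              # f_X^0 := uniform
    for _ in range(iters):                         # fixed iteration count
        new_f = [0.0] * len(grid)
        for wi in w:
            # posterior over the grid for this observation (Bayes' rule)
            post = [gauss_pdf(wi - a, noise_std) * fa
                    for a, fa in zip(grid, f)]
            s = sum(post) or 1.0                   # guard against underflow
            for idx, p in enumerate(post):
                new_f[idx] += p / s
        f = [v / len(w) for v in new_f]            # average over all points
    return f

# usage, with the earlier sketch's distorted ages:
# observed = randomize([23, 27, 35, 59, 61, 65, 65, 70])
# estimate = reconstruct(observed, grid=list(range(0, 101, 5)), noise_std=20.0)
```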
Seems to work well!
[figure: the reconstructed distribution closely tracks the original distribution]
Pros & Cons
Pros: one key advantage of the randomization method is that it is relatively simple, and it does not require knowledge of the distribution of other records in the data.
Pros & Cons
Cons:
- We obtain only a distribution describing the behavior of X; individual records are not available.
- The distributions are available only along individual dimensions. While the approach can certainly be extended to multi-dimensional distributions, density estimation becomes inherently more challenging as dimensionality increases. Even for modest dimensionalities such as 7 to 10, density estimation becomes increasingly inaccurate and falls prey to the curse of dimensionality.
k-anonymity
The role of attributes in data:
- explicit identifiers are removed
- quasi-identifiers can be used to re-identify individuals
- sensitive attributes (may not exist!) carry sensitive information

| Name (identifier) | Birthdate (QI) | Sex (QI) | Zipcode (QI) | Disease (sensitive) |
| Andre | 21/1/79 | male | 53715 | Flu |
| Beth | 10/1/81 | female | 55410 | Hepatitis |
| Carol | 1/10/44 | female | 90210 | Bronchitis |
| Dan | 21/2/84 | male | 02174 | Sprained Ankle |
| Ellen | 19/4/72 | female | 02237 | AIDS |
k-anonymity
Preserve privacy via k-anonymity, proposed by Sweeney and Samarati.
- k-anonymity: intuitively, hide each individual among at least k−1 others; each combination of QI values must appear at least k times in the released data.
- Sensitive attributes are not considered (going to revisit this...).
- How to achieve this? Generalization and suppression; value perturbation is not considered (we should remain truthful to the original values).
- Privacy vs. utility tradeoff: do not anonymize more than necessary. A minimal check of the definition is sketched below.
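A minimal sketch of the k-anonymity test itself (not an anonymization algorithm, just the check; column names follow the example tables in these slides):

```python
from collections import Counter

def is_k_anonymous(table, qi_columns, k):
    """A table is k-anonymous if every combination of quasi-identifier
    values appears in at least k records."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in table)
    return all(count >= k for count in groups.values())

released = [
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "pneumonia"},
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "dyspepsia"},
    {"Age": ">61", "Sex": "F", "Zipcode": "1100*", "Disease": "flu"},
    {"Age": ">61", "Sex": "F", "Zipcode": "1100*", "Disease": "gastritis"},
]
print(is_k_anonymous(released, ["Age", "Sex", "Zipcode"], k=2))  # True
```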
k-anonymity
Transform each QI value into a less specific form. A generalized table:

| Age | Sex | Zipcode | Disease |
| >21 | M | 1100* | pneumonia |
| >21 | M | 1100* | dyspepsia |
| >21 | M | 1100* | dyspepsia |
| >21 | M | 1100* | pneumonia |
| >61 | F | 1100* | flu |
| >61 | F | 1100* | gastritis |
| >61 | F | 1100* | flu |
| >61 | F | 1100* | bronchitis |

Adversary's knowledge:
| Name | Age | Sex | Zipcode |
| Bob | 23 | M | 11000 |
k-anonymity example
Tools for anonymization:
- generalization: publish more general values, i.e., given a domain hierarchy, roll up
- suppression: remove tuples, i.e., do not publish outliers; often the number of suppressed tuples is bounded

original data:
| Birthdate | Sex | Zipcode |
| 21/1/79 | male | 5… |
| …/1/79 | female | 5… |
| 1/10/44 | female | 90210 |
| …/2/83 | male | 022… |
| …/4/82 | male | 02237 |

2-anonymous data:
| group 1 | */1/79 | person | 5**** |
| group 1 | */1/79 | person | 5**** |
| suppressed | 1/10/44 | female | 90210 |
| group 2 | */*/8* | male | 022** |
| group 2 | */*/8* | male | 022** |
generalization lattice
Assume domain hierarchies exist for all QI attributes (e.g., zipcode and sex), and construct the generalization lattice for the entire QI set, ordered from less to more general.
Objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum-distance generalization vector that is k-anonymous.
incognito
The Incognito algorithm generates the set of all possible k-anonymous full-domain generalizations of a table T, with an optional tuple suppression threshold. The algorithm begins by checking single-attribute subsets of the quasi-identifier, and then iterates, checking k-anonymity with respect to increasingly large subsets.
incognito
(I) generalization property: if k-anonymity holds at some node, it also holds at any ancestor (more general) node.
(II) subset property: if k-anonymity does not hold for a set of QI attributes, it cannot hold for any of its supersets; equivalently, every subset of a k-anonymous attribute set is itself k-anonymous. E.g., if {A, B} is k-anonymous, then so are {A} and {B}; and if {A} is not k-anonymous, then {A, B} and {A, C} cannot be k-anonymous.
Incognito considers sets of QI attributes of increasing cardinality and prunes nodes in the lattice using the two properties above, as sketched below. (Note: the entire lattice, which includes three dimensions, is too complex to show.)
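An illustrative sketch of the Apriori-style pruning idea behind the subset property (not the full Incognito algorithm, which works over generalization lattice nodes rather than raw attribute sets):

```python
from itertools import combinations

def k_anonymous_subsets(attrs, check):
    """Apriori-style use of the subset property: a QI attribute set is
    tested only if all of its immediate subsets already passed, because
    a superset of a failing set can never be k-anonymous."""
    passed = set()
    for size in range(1, len(attrs) + 1):
        for cand in combinations(sorted(attrs), size):
            subsets_ok = all(sub in passed
                             for sub in combinations(cand, size - 1) if sub)
            if subsets_ok and check(cand):
                passed.add(cand)
    return passed

# usage with the earlier sketch:
# check = lambda qi: is_k_anonymous(table, list(qi), k=2)
# k_anonymous_subsets(["Age", "Sex", "Zipcode"], check)
```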
seen in the domain space
Consider the multi-dimensional domain space: QI attributes are the dimensions, tuples are points in this space, and the attribute hierarchies (e.g., the zipcode hierarchy and the sex hierarchy) partition the dimensions.
seen in the domain space
incognito example: 2 QI attributes (zipcode, sex), 7 tuples, hierarchies shown with bold lines
[figure: partitions of the zipcode × sex domain space; the initial partitioning is not 2-anonymous, while a more generalized one is 2-anonymous]
k-anonymity problems
k-anonymity example:
- homogeneity attack: in the last group, everyone has cancer
- background knowledge attack: in the first group, Japanese individuals have a low chance of heart disease
- we therefore need to consider the sensitive values themselves

data:
| id | Zipcode | Age | Nationality | Disease |
| 1 | 13053 | 28 | Russian | Heart Disease |
| 2 | 13068 | 29 | American | Heart Disease |
| 3 | 13068 | 21 | Japanese | Viral Infection |
| 4 | 13053 | 23 | American | Viral Infection |
| 5 | 14853 | 50 | Indian | Cancer |
| 6 | 14853 | 55 | Russian | Heart Disease |
| 7 | 14850 | 47 | American | Viral Infection |
| 8 | 14850 | 49 | American | Viral Infection |
| 9 | 13053 | 31 | American | Cancer |
| 10 | 13053 | 37 | Indian | Cancer |
| 11 | 13068 | 36 | Japanese | Cancer |
| 12 | 13068 | 35 | American | Cancer |

4-anonymous data:
| id | Zipcode | Age | Nationality | Disease |
| 1 | 130** | <30 | ∗ | Heart Disease |
| 2 | 130** | <30 | ∗ | Heart Disease |
| 3 | 130** | <30 | ∗ | Viral Infection |
| 4 | 130** | <30 | ∗ | Viral Infection |
| 5 | 1485* | ≥40 | ∗ | Cancer |
| 6 | 1485* | ≥40 | ∗ | Heart Disease |
| 7 | 1485* | ≥40 | ∗ | Viral Infection |
| 8 | 1485* | ≥40 | ∗ | Viral Infection |
| 9 | 130** | 3∗ | ∗ | Cancer |
| 10 | 130** | 3∗ | ∗ | Cancer |
| 11 | 130** | 3∗ | ∗ | Cancer |
| 12 | 130** | 3∗ | ∗ | Cancer |
l-diversity
Make sure each group contains well-represented sensitive values:
- protects from homogeneity attacks
- protects from background knowledge attacks
l-diversity (simplified definition): a group is l-diverse if the most frequent sensitive value appears in at most a 1/l fraction of the group. A sketch of this check appears below.

A 2-diverse generalized table:
| Age | Sex | Zipcode | Disease |
| [21, 60] | M | [10001, 60000] | pneumonia |
| [21, 60] | M | [10001, 60000] | dyspepsia |
| [21, 60] | M | [10001, 60000] | flu |
| [21, 60] | M | [10001, 60000] | pneumonia |
| [61, 70] | F | [10001, 60000] | flu |
| [61, 70] | F | [10001, 60000] | gastritis |
| [61, 70] | F | [10001, 60000] | flu |
| [61, 70] | F | [10001, 60000] | bronchitis |

Adversary's knowledge: Bob, Age 23, Sex M, Zipcode 11000.
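A minimal sketch of the simplified (frequency-based) l-diversity check, reusing the grouping idea from the k-anonymity sketch:

```python
from collections import Counter, defaultdict

def is_l_diverse(table, qi_columns, sensitive, l):
    """Simplified l-diversity: in every QI group, the most frequent
    sensitive value accounts for at most a 1/l fraction of the group."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[c] for c in qi_columns)].append(row[sensitive])
    return all(max(Counter(vals).values()) <= len(vals) / l
               for vals in groups.values())

# e.g., in the 2-diverse table above, pneumonia appears 2/4 times
# in the first group, so the 1/2 bound is met.
```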
anatomy
A fast l-diversity algorithm. Anatomy is not generalization: it separates the sensitive values from the tuples and shuffles the sensitive values among groups. For a given data table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

data:
| Age | Sex | Zipcode | Disease |
| 23 | M | 11000 | pneumonia |
| 27 | M | 13000 | flu |
| 35 | M | 59000 | dyspepsia |
| 59 | M | 12000 | gastritis |
| 61 | F | 54000 | dyspepsia |
| 65 | F | 25000 | gastritis |
| 65 | F | 25000 | flu |
| 70 | F | 30000 | bronchitis |

quasi-identifier table (QIT):
| Age | Sex | Zipcode | Group-ID |
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

sensitive table (ST):
| Group-ID | Disease |
| 1 | dyspepsia |
| 1 | pneumonia |
| 1 | flu |
| 1 | gastritis |
| 2 | bronchitis |
| 2 | flu |
| 2 | gastritis |
| 2 | dyspepsia |
anatomy algorithm
- assign sensitive values to buckets
- create groups by drawing one tuple from each of the l currently largest buckets, as sketched below
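A simplified sketch of this group-creation step (the residual-assignment phase of the real algorithm is omitted; names are illustrative):

```python
from collections import defaultdict

def anatomize(table, sensitive, l=2):
    """Sketch of Anatomy's group creation: bucket tuples by sensitive
    value, then repeatedly form a group by drawing one tuple from each
    of the l currently largest buckets, so no sensitive value exceeds
    a 1/l fraction of its group. Leftover tuples are ignored here."""
    buckets = defaultdict(list)
    for row in table:
        buckets[row[sensitive]].append(row)
    qit, st, gid = [], [], 0
    while sum(len(b) > 0 for b in buckets.values()) >= l:
        gid += 1
        largest = sorted(buckets, key=lambda v: len(buckets[v]),
                         reverse=True)[:l]
        for value in largest:
            row = buckets[value].pop()
            qi = {k: v for k, v in row.items() if k != sensitive}
            qit.append({**qi, "Group-ID": gid})
            st.append({"Group-ID": gid, sensitive: value})
    return qit, st

# usage: qit, st = anatomize(data_rows, sensitive="Disease", l=2)
```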
Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

quasi-identifier table (QIT):
| Age | Sex | Zipcode | Group-ID |
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

sensitive table (ST):
| Group-ID | Disease | Count |
| 1 | dyspepsia | 2 |
| 1 | pneumonia | 2 |
| 2 | bronchitis | 1 |
| 2 | flu | 2 |
| 2 | gastritis | 1 |

E.g., knowing that Bob (Age 23, Sex M, Zipcode 11000) falls in group 1, the adversary can only conclude that he has dyspepsia or pneumonia, each with confidence 2/4 = 1/2.
Association Rule Hiding
Recent years have seen tremendous advances in the ability to perform association rule mining effectively. Such rules often encode important target marketing information about a business.
[figure: the user mines a changed database, so that the discovered association rules no longer include the sensitive rules that were hidden]
Association Rule Hiding
There are various algorithms for hiding a group of association rules that is characterized as sensitive. A rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold. Sometimes, sensitive rules should not be disclosed to the public since, among other things, they may be used to infer sensitive data, or they may provide business competitors with an advantage.
Association rule hiding techniques:
- Distortion-based: modify entries from 1s to 0s
- Blocking-based: the entry is not modified, but is left incomplete; the unknown entry values are used to prevent discovery of association rules
Distortion-based Techniques
[figure: a sample database of transactions over items A, B, C, D is passed through a distortion algorithm; in the sample database the rule A → C has Support(A→C) = 80% and Confidence(A→C) = 100%, while in the distorted database it has Support(A→C) = 40% and Confidence(A→C) = 50%]
Association Rule Hiding Strategies
| TID | Items | binary |
| T1 | ABC | 1 1 1 |
| T2 | ABC | 1 1 1 |
| T3 | ABC | 1 1 1 |
| T4 | AB | 1 1 0 |
| T5 | A | 1 0 0 |
| T6 | AC | 1 0 1 |
Association Rule Hiding Strategies
If we want to lower the confidence of a rule X → Y, we must lower the value of the ratio
conf(X → Y) = supp(X ∪ Y) / supp(X),
i.e., either decrease the numerator supp(X ∪ Y) or increase the denominator supp(X).
Association Rule Hiding Strategies
Support(X → Y) = supp(X ∪ Y) = (number of transactions in D containing X ∪ Y) / N, where N is the number of transactions in D. The sketch below computes both measures directly from a transaction list.
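A minimal sketch of both measures, using the six-transaction example from these slides:

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# the six transactions from the table above
D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"},
     {"A", "B"}, {"A"}, {"A", "C"}]
print(support({"A", "B", "C"}, D))        # 0.5  -> Support(AC -> B) = 50%
print(confidence({"A", "C"}, {"B"}, D))   # 0.75 -> conf(AC -> B) = 75%
```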
Association Rule Hiding Strategies
Hiding the rule AC → B (Support = 50%, conf = 75%) with min_supp = 35%, min_conf = 70%:

| TID | Items | before | after |
| T1 | ABC | 1 1 1 | 1 1 0 |
| T2 | ABC | 1 1 1 | 1 1 1 |
| T3 | ABC | 1 1 1 | 1 1 1 |
| T4 | AB | 1 1 0 | 1 1 0 |
| T5 | A | 1 0 0 | 1 0 0 |
| T6 | AC | 1 0 1 | 1 0 1 |

Removing C from T1 gives Support = 33% and conf = 66%, so the rule falls below both thresholds and is hidden.
Association Rule Hiding Strategies
Confidence(X → Y) = supp(X ∪ Y) / supp(X). To lower it we can either:
- decrease the support supp(X ∪ Y), making sure we hide (remove) items from the right-hand side Y of the rule, or
- increase the support of the left-hand side X.
Association Rule Hiding Strategies
The same rule AC → B (Support = 50%, conf = 75%), now with min_supp = 33%, min_conf = 70%:

| TID | Items | before | after |
| T1 | ABC | 1 1 1 | 1 1 0 |
| T2 | ABC | 1 1 1 | 1 1 1 |
| T3 | ABC | 1 1 1 | 1 1 1 |
| T4 | AB | 1 1 0 | 1 1 0 |
| T5 | A | 1 0 0 | 1 0 0 |
| T6 | AC | 1 0 1 | 1 0 1 |

After the distortion, Support = 33% and conf = 66%. Here the support alone (33%) no longer falls below min_supp; the rule is hidden because its confidence (66%) drops below min_conf = 70%.
Association Rule Hiding Strategies
Increasing the support of the left-hand side, with min_supp = 33%, min_conf = 70%:

| TID | Items | before | after |
| T1 | ABC | 1 1 1 | 1 1 1 |
| T2 | ABC | 1 1 1 | 1 1 1 |
| T3 | ABC | 1 1 1 | 1 1 1 |
| T4 | AB | 1 1 0 | 1 1 0 |
| T5 | A | 1 0 0 | 1 0 1 |
| T6 | AC | 1 0 1 | 1 0 1 |

Adding C to T5 raises supp(AC) from 4 to 5 transactions, so conf(AC → B) = 3/5 = 60% < min_conf, while Support stays at 50%. The rule is hidden. A greedy sketch of distortion-based hiding follows.
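A greedy sketch implementing the first strategy above (remove a right-hand-side item from supporting transactions); it reuses support() and confidence() from the earlier sketch and ignores side effects on non-sensitive rules, which a real algorithm would minimize:

```python
def hide_rule(transactions, lhs, rhs, min_supp, min_conf):
    """Greedy distortion: remove an RHS item from supporting
    transactions, one at a time, until the rule's support or
    confidence drops below its threshold."""
    item = next(iter(rhs))                 # an item of the RHS to remove
    for t in transactions:
        if (lhs | rhs) <= t:               # t supports the rule
            t.discard(item)                # flip a 1 to a 0
            if (support(lhs | rhs, transactions) < min_supp or
                    confidence(lhs, rhs, transactions) < min_conf):
                return transactions        # rule is now hidden
    return transactions

# usage: hide AC -> B in the example database (fresh copy of D)
D2 = [set(t) for t in D]
hide_rule(D2, {"A", "C"}, {"B"}, min_supp=0.35, min_conf=0.70)
```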
Quality of Data
Sometimes it is dangerous to delete items from the database (e.g., in medical databases), because the falsified data may have undesirable effects. In such cases, we have to hide the rules by adding uncertainty to the database rather than distorting it.
Blocking-based Techniques
[figure: a blocking algorithm transforms the initial database over items A, B, C, D into a new database in which some 0/1 entries are replaced by "?" (unknown) values]
Motivation
Setting:
- Data is distributed at different sites
- These sites may be third parties (e.g., hospitals, government bodies) or individuals
Aim:
- Compute the data mining algorithm on the data so that nothing but the output is learned
- That is, carry out a secure computation
Vertical Partitioning of Data
Medical Records:
| TID | Brain Tumor? | Diabetes? |
| RPJ | Yes | Diabetic |
| CAC | No Tumor | No |
| PTR | No Tumor | Diabetic |

Cell Phone Data:
| TID | Model | Battery |
| RPJ | 5210 | Li/Ion |
| CAC | none | — |
| PTR | 3650 | NiCd |

Global Database View: TID | Brain Tumor? | Diabetes? | Model | Battery
Horizontal partitioning
- Two banks hold very similar information about their credit card accounts: Is the account active? Is the account delinquent? Is the account new? The account balance.
- No public sharing of the data is allowed.
Privacy-Preserving Distributed Data Mining
The main tools are secure multiparty computation and cryptography: the parties jointly evaluate the mining function while each party's inputs stay private. A classic building block for horizontally partitioned data is the secure sum protocol, sketched below.
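A toy sketch of the secure sum idea (simulated in one process; a real protocol passes the masked running total from site to site and needs hardening against collusion):

```python
import random

def secure_sum(local_values, modulus=10**9):
    """Toy secure-sum protocol: the initiating site masks the running
    total with a random number, each site adds its own local value,
    and the initiator removes the mask at the end. No site ever sees
    another site's individual value, only masked partial sums."""
    mask = random.randrange(modulus)
    total = mask
    for v in local_values:                 # simulated round-trip over sites
        total = (total + v) % modulus
    return (total - mask) % modulus

# e.g., two banks jointly count delinquent accounts without sharing data
print(secure_sum([120, 85]))               # 205
```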