Privacy preserving data mining
Li Xiong, CS573 Data Privacy and Anonymity



What Is Data Mining?
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Also known as knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence.

Privacy preserving data mining
Support data mining while preserving privacy:
- Sensitive raw data
- Sensitive mining results

Seminal work
- Privacy preserving data mining, Agrawal and Srikant, 2000: centralized data; data randomization (additive noise); decision tree classifier
- Privacy preserving data mining, Lindell and Pinkas, 2000: distributed data mining; secure multi-party computation; decision tree classifier

Input Perturbation
- Reveal the entire database x1, …, xn, but randomize the entries: the user sees x1 + δ1, …, xn + δn
- Add random noise δi to each database entry xi
- For example, if the distribution of the noise has mean 0, the user can still compute the average of the xi

Taxonomy of PPDM algorithms
Data distribution:
- Centralized
- Distributed (privacy preserving distributed data mining)
Approaches:
- Input perturbation: additive noise (randomization), multiplicative noise, generalization, swapping, sampling
- Output perturbation: rule hiding
- Crypto techniques: secure multiparty computation
Data mining algorithms:
- Classification
- Association rule mining
- Clustering

Randomization techniques
- Privacy preserving data mining, Agrawal and Srikant, 2000: seminal work on a decision tree classifier
- Limiting Privacy Breaches in Privacy-Preserving Data Mining, Evfimievski and Gehrke, 2003: refined privacy definition; association rule mining

Randomization Based Decision Tree Learning (Agrawal and Srikant '00)
Basic idea: perturb data with value distortion
- The user provides x_i + r instead of x_i, where r is a random value
  - Uniform: uniform distribution on [-α, α]
  - Gaussian: normal distribution with mean μ = 0 and standard deviation σ
Hypothesis:
- The miner doesn't see the real data and can't reconstruct the real values
- The miner can still reconstruct enough information to build a decision tree for classification
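As a sketch, the two distortion schemes might look like the following (function names and the sample ages are hypothetical, not from the paper):

```python
import random

def distort_uniform(x, alpha):
    """Value distortion: return x + r, with r uniform on [-alpha, alpha]."""
    return x + random.uniform(-alpha, alpha)

def distort_gaussian(x, sigma):
    """Value distortion: return x + r, with r drawn from N(0, sigma^2)."""
    return x + random.gauss(0.0, sigma)

# Each client perturbs its own value; the miner only ever sees the w_i.
random.seed(0)
ages = [23, 35, 41, 58, 67]
randomized = [distort_uniform(a, alpha=20) for a in ages]
```

With mean-zero noise, aggregates such as the average remain approximately computable while individual values are hidden.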

Randomization Approach
A random number is added to each value of Age before it leaves the client: Alice's age 30 becomes 65 (30 + 35). The randomizer turns records such as 50 | 40K | … and 30 | 70K | … into 65 | 20K | … and 25 | 60K | …, and the classification algorithm builds its model from the randomized data only.

Classification
- Classification: predicts categorical class labels (discrete or nominal)
- Prediction (regression): models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Motivating Example for Classification – Fruit Identification

Skin   | Color | Size  | Flesh | Conclusion
Smooth |       | Small | Hard  | Dangerous
Hairy  | Green | Large | Soft  | Safe
Smooth | Red   |       | Soft  | Dangerous
Hairy  | Green | Large | Hard  | Safe
Hairy  | Brown | Large | Hard  | Safe
…      | …     | …     | …     | …

New fruit: Size = Large, Color = Red → Conclusion?

Another Example – Credit Approval

Name   | Age | Income | … | Credit
Clark  | 35  | High   | … | Excellent
Milton | 38  | High   | … | Excellent
Neo    | 25  | Medium | … | Fair
…      | …   | …      | … | …

Classification rule: If age = “ ” and income = high then credit_rating = excellent
Future customers:
- Paul: age = 35, income = high → excellent credit rating
- John: age = 20, income = medium → fair credit rating

Classification—A Two-Step Process (Data Mining: Concepts and Techniques, February 12, 2008)
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects

February 12, 2008Data Mining: Concepts and Techniques14 Training Dataset

Output: A Decision Tree for “buys_computer”

age?
- <=30: student?
    - no: no
    - yes: yes
- 31..40: yes
- >40: credit_rating?
    - excellent: no
    - fair: yes

Algorithm for Decision Tree Induction
Examples: ID3 (Iterative Dichotomiser), C4.5, CART (Classification and Regression Trees)
Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- A test attribute is selected that “best” separates the data into partitions, using a heuristic or statistical measure
- Samples are partitioned recursively based on the selected attributes
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
- There are no samples left
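The greedy top-down procedure can be sketched as follows (an ID3-style illustration on made-up toy data; attribute names and the entropy-based selection are illustrative assumptions, not the slide's exact pseudocode):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, attrs, label):
    """Greedy top-down, recursive, divide-and-conquer induction."""
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:          # stop: all samples in one class
        return labels[0]
    if not attrs:                      # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute whose split minimizes expected entropy.
    def split_info(a):
        parts = Counter(r[a] for r in rows)
        return sum(cnt / len(rows) *
                   entropy([r[label] for r in rows if r[a] == v])
                   for v, cnt in parts.items())
    best = min(attrs, key=split_info)
    # Partition recursively on the selected attribute.
    tree = {}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        tree[v] = build_tree(subset, [a for a in attrs if a != best], label)
    return (best, tree)

rows = [
    {"skin": "hairy", "size": "large", "cls": "safe"},
    {"skin": "hairy", "size": "large", "cls": "safe"},
    {"skin": "smooth", "size": "small", "cls": "dangerous"},
    {"skin": "smooth", "size": "large", "cls": "dangerous"},
]
tree = build_tree(rows, ["skin", "size"], "cls")
```

On this toy data, splitting on "skin" yields pure partitions, so the root tests skin and each branch is a leaf.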

Attribute Selection Measures
Idea: select the attribute that partitions the samples into the most homogeneous groups
Measures:
- Information gain (ID3)
- Gain ratio (C4.5)
- Gini index (CART)

Attribute Selection Measure: Information Gain (ID3)
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = − Σ_i p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · Info(D_j)
Information gain: the difference between the original information requirement and the new information requirement obtained by branching on attribute A:
    Gain(A) = Info(D) − Info_A(D)
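A minimal computation of these quantities (the attribute, class names, and data are made up for illustration):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)."""
    n = len(rows)
    parts = {}
    for r in rows:                     # partition D by the values of A
        parts.setdefault(r[attr], []).append(r[label])
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy([r[label] for r in rows]) - info_a

rows = [
    {"income": "high", "buys": "no"},
    {"income": "high", "buys": "no"},
    {"income": "low", "buys": "yes"},
    {"income": "low", "buys": "yes"},
    {"income": "low", "buys": "no"},
]
gain = info_gain(rows, "income", "buys")
```

Here Info(D) ≈ 0.971 bits and splitting on income leaves ≈ 0.551 bits, so the gain is ≈ 0.42 bits.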

Attribute Selection Measure: Gini index (CART)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 − Σ_j p_j²
where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the gini index of the split is
    gini_A(D) = (|D_1| / |D|) gini(D_1) + (|D_2| / |D|) gini(D_2)
Reduction in impurity:
    Δgini(A) = gini(D) − gini_A(D)
The attribute that provides the smallest gini_A(D) (i.e., the largest reduction in impurity) is chosen to split the node.
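A sketch of both formulas on a hypothetical 9-yes / 5-no data set:

```python
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class labels."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(d1, d2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

g = gini(["yes"] * 9 + ["no"] * 5)        # 1 - (9/14)^2 - (5/14)^2
gs = gini_split(["yes"] * 9, ["no"] * 5)  # both partitions pure
```

A split into pure partitions drives gini_A(D) to 0, the largest possible reduction in impurity.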

Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
- The point with the minimum expected information requirement for A is selected as the split point
Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.
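A sketch of the midpoint search (the ages and labels are hypothetical):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_split_point(values, labels):
    """Try the midpoint of every pair of adjacent sorted values and return
    the (midpoint, info) pair minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                     # identical values yield no midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [l for v, l in pairs if v <= mid]   # A <= split-point
        d2 = [l for v, l in pairs if v > mid]    # A >  split-point
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if best is None or info < best[1]:
            best = (mid, info)
    return best

split, info = best_split_point([25, 32, 41, 58], ["no", "no", "yes", "yes"])
```

The midpoint between 32 and 41 separates the two classes perfectly, so its expected information requirement is 0.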


Randomization Approach Overview
Client side: a random number is added to each attribute (e.g., Alice's age 30 becomes 65 = 30 + 35); records such as 50 | 40K | … and 30 | 70K | … reach the server only in randomized form (65 | 20K | …, 25 | 60K | …).
Server side: reconstruct the distribution of each attribute (Age, Salary), then run the classification algorithm on the reconstructed distributions to build the model.

Original Distribution Reconstruction
- x_1, x_2, …, x_n are the n original data values, drawn from n iid random variables X_1, X_2, …, X_n, each distributed like X
- Using value distortion, the given values are w_1 = x_1 + y_1, w_2 = x_2 + y_2, …, w_n = x_n + y_n
- The y_i are drawn from n iid random variables Y_1, Y_2, …, Y_n, each distributed like Y
- Reconstruction problem: given F_Y and the w_i, estimate F_X

Original Distribution Reconstruction: Method
Using Bayes' theorem for continuous distributions, the estimated density function is refined iteratively:
    f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y(w_i − a) f_X^j(a) ] / ∫ f_Y(w_i − z) f_X^j(z) dz
- The initial estimate for f_X at j = 0 is the uniform distribution
- Repeat the iterative update above
- Stopping criterion: χ² test between successive iterations
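A discretized sketch of this iteration, where the integral becomes a sum over a grid (the grid, noise distribution, data, and fixed iteration count replacing the χ² stopping test are all illustrative assumptions):

```python
import random

def reconstruct(ws, noise_pdf, grid, iters=20):
    """Discretized iterative Bayes update:
    f^{j+1}(a) = (1/n) * sum_i f_Y(w_i - a) f^j(a) / sum_z f_Y(w_i - z) f^j(z)."""
    f = [1.0 / len(grid)] * len(grid)            # j = 0: uniform initial estimate
    for _ in range(iters):
        # Denominator of the update, computed once per observed w_i.
        denoms = [sum(noise_pdf(w - z) * fz for fz, z in zip(f, grid)) for w in ws]
        new = [sum(noise_pdf(w - a) * fa / d
                   for w, d in zip(ws, denoms) if d > 0) / len(ws)
               for fa, a in zip(f, grid)]
        total = sum(new)
        f = [v / total for v in new]             # renormalize on the grid
    return f

# Demo: true ages cluster at 30 and 70; uniform noise on [-10, 10].
random.seed(1)
xs = [30] * 50 + [70] * 50
ws = [x + random.uniform(-10, 10) for x in xs]
grid = list(range(0, 101, 5))
uniform_pdf = lambda d: 0.05 if -10 <= d <= 10 else 0.0
f = reconstruct(ws, uniform_pdf, grid)
mass_30 = sum(p for p, a in zip(f, grid) if 20 <= a <= 40)
mass_70 = sum(p for p, a in zip(f, grid) if 60 <= a <= 80)
```

After a few iterations the estimate concentrates its mass near the two true clusters, even though each individual w_i is heavily perturbed.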

Reconstruction of Distribution

Original Distribution Reconstruction

Original Distribution Reconstruction for Decision Trees
When to reconstruct distributions?
- Global: reconstruct for each attribute once at the beginning; build the decision tree using the reconstructed data
- ByClass: first split the training data by class; reconstruct for each class separately; build the decision tree using the reconstructed data
- Local: first split the training data by class; reconstruct for each class separately; reconstruct again at each node while building the tree

Accuracy vs. Randomization Level

More Results
- Global performs worse than ByClass and Local
- ByClass and Local achieve accuracy within 5% to 15% (absolute error) of the Original accuracy
- Overall, all three are much better than the Randomized accuracy

Privacy level
Is the privacy level sufficiently measured?

How to Measure Privacy Breach
- Weak: no single database entry has been revealed
- Stronger: no single piece of information is revealed (what's the difference from the “weak” version?)
- Strongest: the adversary's beliefs about the data have not changed

Kullback-Leibler Distance
Measures the “difference” between two probability distributions:
    KL(P ‖ Q) = Σ_x P(x) log ( P(x) / Q(x) )
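A minimal sketch of the computation (the two example distributions are made up):

```python
from math import log2

def kl(p, q):
    """KL(P || Q) = sum_x p(x) * log2(p(x) / q(x)).
    Asymmetric, non-negative, and 0 exactly when P = Q."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

same = kl([0.5, 0.5], [0.5, 0.5])   # identical distributions
skew = kl([0.9, 0.1], [0.5, 0.5])   # a belief shifted well away from uniform
```

The distance is 0 for identical distributions and grows as the two distributions diverge, which is what makes it a candidate for quantifying information leakage.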

Privacy of Input Perturbation
X is a random variable, R is the randomization operator, and Y = R(X) is the perturbed database.
Measure the mutual information between the original and randomized databases: the average KL distance between (1) the distribution of X and (2) the distribution of X conditioned on Y = y:
    E_y [ KL( P_{X|Y=y} ‖ P_X ) ]
Intuition: if this distance is small, then Y leaks little information about the actual values of X.
Why is this definition problematic?

Is the randomization sufficient?
Name–age database: Gladys: 85, Doris: 90, Beryl: 82
Age is an integer between 0 and 90. Randomize the database entries by adding random integers between -20 and 20, giving: Gladys: 72, Doris: 110, Beryl: 85.
The randomization operator has to be public (why?). But then a randomized value of 110 can only come from a true age of at least 110 − 20 = 90, so Doris's age is 90!!

Privacy Definitions
Mutual information can be small on average, but an individual randomized value can still leak a lot of information about the original value.
Better: consider some property Q(x)
- The adversary has an a priori probability P_i that Q(x_i) is true
- A privacy breach occurs if revealing y_i = R(x_i) significantly changes the adversary's probability that Q(x_i) is true
- Intuition: the adversary learned something about entry x_i (namely, the likelihood of property Q holding for this entry)

Example
Data: 0 ≤ x ≤ 1000, with p(x = 0) = 0.01 and p(x = k) = 0.00099 for each k = 1, …, 1000.
Reveal y = R(x). Three possible randomization operators R:
- R1(x) = x with prob. 20%; a uniformly random number with prob. 80%
- R2(x) = x + δ mod 1001, δ uniform in [-100, 100]
- R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%
Which randomization operator is better?

Some Properties
Q1(x): x = 0;  Q2(x): x ∉ {200, …, 800}
What are the a priori probabilities, for a given x, that these properties hold?
- Q1(x): 1%;  Q2(x): 40.5%
Now suppose the adversary learns that y = R(x) = 0. What are the posterior probabilities of Q1(x) and Q2(x)?
- If R = R1: Q1(x): 71.6%, Q2(x): 83%
- If R = R2: Q1(x): 4.8%, Q2(x): 100%
- If R = R3: Q1(x): 2.9%, Q2(x): 70.8%
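These posteriors can be reproduced with a direct Bayes computation (a sketch; Q2 is taken as x ∉ {200, …, 800}, which is what matches the stated 40.5% prior):

```python
N = 1001
prior = {x: (0.01 if x == 0 else 0.99 / (N - 1)) for x in range(N)}

# Likelihood P(R(x) = 0 | x) for each operator, given the observed output y = 0.
in_window = lambda x: min(x % N, (N - x) % N) <= 100   # |x - 0| <= 100 (mod 1001)
lik = {
    "R1": {x: (0.2 if x == 0 else 0.0) + 0.8 / N for x in range(N)},
    "R2": {x: (1 / 201 if in_window(x) else 0.0) for x in range(N)},
}
lik["R3"] = {x: 0.5 * lik["R2"][x] + 0.5 / N for x in range(N)}

def posterior(event, l):
    """P(Q(x) | y = 0) by Bayes' rule over the finite domain."""
    total = sum(prior[x] * l[x] for x in range(N))
    return sum(prior[x] * l[x] for x in range(N) if event(x)) / total

q1 = lambda x: x == 0
q2 = lambda x: not 200 <= x <= 800
for name in ("R1", "R2", "R3"):
    print(name, round(posterior(q1, lik[name]), 3), round(posterior(q2, lik[name]), 3))
```

Note in particular that under R2, observing y = 0 forces x into the ±100 window around 0, so Q2(x) holds with certainty.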

Privacy Breaches
R1(x) leaks information about property Q1(x): before seeing R1(x), the adversary thinks the probability of x = 0 is only 1%, but after noticing that R1(x) = 0, the probability that x = 0 is 72%.
R2(x) leaks information about property Q2(x): before seeing R2(x), the adversary thinks the probability of x ∉ {200, …, 800} is 41%, but after noticing that R2(x) = 0, the probability that x ∉ {200, …, 800} is 100%.
The randomization operator should be such that the posterior distribution is close to the prior distribution for any property.

Privacy Breach: Definitions [Evfimievski et al.]
Q(x) is some property; ρ1, ρ2 are probabilities, with ρ1 “very unlikely” and ρ2 “very likely”
- Straight privacy breach: P(Q(x)) ≤ ρ1, but P(Q(x) | R(x) = y) ≥ ρ2
  Q(x) is unlikely a priori, but likely after seeing the randomized value of x
- Inverse privacy breach: P(Q(x)) ≥ ρ2, but P(Q(x) | R(x) = y) ≤ ρ1
  Q(x) is likely a priori, but unlikely after seeing the randomized value of x

How to check for privacy breaches
How to ensure that the randomization operator hides every property?
- There are 2^|X| properties
- Often the randomization operator has to be selected even before the distribution P_X is known (why?)
Idea: look at the operator's transition probabilities
- How likely is x_i to be mapped to a given y?
- Intuition: if all possible values of x_i are equally likely to be randomized to a given y, then revealing y = R(x_i) will not reveal much about the actual value of x_i

Amplification [Evfimievski et al.]
A randomization operator is γ-amplifying for y if, for all x1 and x2 in the domain,
    p[x1 → y] / p[x2 → y] ≤ γ
For given ρ1 < ρ2, no straight or inverse privacy breaches occur if
    ρ2 (1 − ρ1) / ( ρ1 (1 − ρ2) ) > γ

Amplification: Example
- R1(x) = x with prob. 20%; a uniformly random number with prob. 80%
- R2(x) = x + δ mod 1001, δ uniform in [-100, 100]
- R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%
For R3:
    p(x → y) = ½ (1/201 + 1/1001) if y ∈ [x − 100, x + 100] (mod 1001)
    p(x → y) = ½ (0 + 1/1001) otherwise
Worst-case ratio: (1/201 + 1/1001) / (1/1001) = 1 + 1001/201 ≈ 5.98 < 6 (= γ)
Therefore, no straight or inverse privacy breaches will occur with ρ1 = 14%, ρ2 = 50%
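A numeric check of the amplification factor for R3 (a sketch; the breach bound ρ2(1 − ρ1) / (ρ1(1 − ρ2)) is the condition from Evfimievski et al.):

```python
N = 1001

def p_r3(x, y):
    """Transition probability p[x -> y] for R3: with prob 1/2 apply R2
    (add uniform delta in [-100, 100] mod 1001), else output a uniform value."""
    dist = min((x - y) % N, (y - x) % N)
    return 0.5 * (1 / 201 if dist <= 100 else 0.0) + 0.5 / N

# gamma for y = 0 (every y is symmetric): worst ratio p[x1 -> y] / p[x2 -> y].
probs = [p_r3(x, 0) for x in range(N)]
gamma = max(probs) / min(probs)                   # equals 1 + 1001/201, about 5.98

# Breach bound for rho1 = 14%, rho2 = 50%.
rho1, rho2 = 0.14, 0.50
bound = rho2 * (1 - rho1) / (rho1 * (1 - rho2))   # about 6.14
ok = gamma < bound                                # no straight/inverse breaches
```

Because R3 mixes in a uniform output with probability 1/2, every x has nonzero probability of producing every y, which caps the worst-case ratio just under 6.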

Coming up
- Multiplicative noise
- Output perturbation

Example: Information Gain
- Class P: buys_computer = “yes”
- Class N: buys_computer = “no”