Privacy-Preserving Data Mining

Privacy-Preserving Data Mining. Presenter: Li Cao. October 15, 2003

Presentation organization
Associate Data Mining with Privacy
Privacy Preserving Data Mining scheme using random perturbation
Privacy Preserving Data Mining using Randomized Response Techniques
Comparing these two cases

Privacy protection history
Privacy concerns nowadays
Citizens' attitude
Scholars' attitude

Internet users’ attitudes

Privacy value
Filtering to weed out unwanted information
Better search results with less effort
Useful recommendations
Market trends
Example: By analyzing a large number of purchase transaction records together with the customers' age and income, we can learn which kinds of customers prefer a particular style or brand.

Motivation (Introducing Data Mining)
Data Mining's goal: discover knowledge, trends, patterns from large amounts of data.
Data Mining's primary task: developing accurate models about aggregated data without access to precise information in individual data records. (Not only discovering knowledge but also preserving privacy.)

Presentation organization
Associate Data Mining with Privacy
Privacy Preserving Data Mining scheme using random perturbation
Privacy Preserving Data Mining using Randomized Response Techniques
Comparing these two cases

Privacy Preserving Data Mining scheme using random perturbation
Basic idea
Reconstruction procedure
Decision-Tree classification
Three different algorithms

Attribute list of an example

Records of an example

Privacy preserving methods
Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
Value Distortion: return xi + r instead of xi, where r is a random value drawn from (a) a uniform or (b) a Gaussian distribution.
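A minimal sketch of the value-distortion idea, assuming Python with NumPy (the data values and noise scales below are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)

def perturb_uniform(x, half_width):
    # Return xi + r, with r drawn uniformly from [-half_width, +half_width].
    return x + rng.uniform(-half_width, half_width, size=x.shape)

def perturb_gaussian(x, sigma):
    # Return xi + r, with r drawn from a zero-mean Gaussian of standard deviation sigma.
    return x + rng.normal(0.0, sigma, size=x.shape)

ages = np.array([23.0, 35.0, 47.0, 52.0, 61.0])   # toy sensitive values
disguised = perturb_gaussian(ages, sigma=10.0)     # what the data collector actually receives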

Basic idea
Original data -> perturbed data (let users provide a modified value for sensitive attributes)
Estimate the distribution of the original data from the perturbed data
Build classifiers using these reconstructed distributions (decision tree)

Basic Steps

Reconstruction problem
View the n original data values x1, x2, ..., xn of a one-dimensional distribution as realizations of n independent identically distributed (iid) random variables X1, X2, ..., Xn, each with the same distribution as a random variable X. To hide these data values, n independent random variables Y1, Y2, ..., Yn have been used, each with the same distribution as a second random variable Y.

Given x1+y1, x2+y2, ..., xn+yn (where yi is the realization of Yi) and the cumulative distribution function FY for Y, we would like to estimate the cumulative distribution function FX for X. In short: given a cumulative distribution FY and the realizations of n iid random samples X1+Y1, X2+Y2, ..., Xn+Yn, estimate FX.

Reconstruction process
Let the value of Xi+Yi be wi (= xi + yi). Use Bayes' rule to estimate the posterior distribution function F'X1 (given that X1+Y1 = w1) for X1, assuming we know the density functions fX and fY for X and Y respectively.
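One way to write that Bayes step, following the notation above (a LaTeX restatement, not taken verbatim from the slides):

F'_{X_1}(a) \;=\; \frac{\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}{\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}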

To estimate the posterior distribution function F'X given x1+y1, x2+y2, ..., xn+yn, we average the posterior distribution functions of all the Xi.

The corresponding posterior density function f'X is obtained by differentiating F'X. Given a sufficiently large number of samples, f'X will be very close to the real density function fX.
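Combining the averaging step with differentiation gives the density estimate that the iterative algorithm on the next slide refines (again a LaTeX restatement in the same notation):

f'_X(a) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}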

Reconstruction algorithm
fX^0 := uniform distribution
j := 0   // iteration number
repeat
    fX^{j+1}(a) := (1/n) * sum over i of [ fY(wi - a) * fX^j(a) / integral of fY(wi - z) * fX^j(z) dz ]   // Bayes update, using the current estimate in place of the unknown fX
    j := j + 1
until (stopping criterion met)
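A rough, discretized sketch of that iteration in Python with NumPy (the grid, noise density, and convergence threshold are illustrative choices, not part of the algorithm itself):

import numpy as np

def reconstruct(w, f_y, grid, n_iter=200, tol=1e-6):
    # w:    observed perturbed values w_i = x_i + y_i
    # f_y:  density of the noise Y, callable on NumPy arrays
    # grid: equally spaced points covering the assumed support of X
    step = grid[1] - grid[0]
    f_x = np.full(len(grid), 1.0 / (len(grid) * step))   # start from a uniform density
    for _ in range(n_iter):
        new = np.zeros_like(f_x)
        for wi in w:
            lik = f_y(wi - grid)                   # f_Y(w_i - a) evaluated on the grid
            denom = np.sum(lik * f_x) * step       # approximates the integral in the update
            if denom > 0:
                new += lik * f_x / denom           # Bayes update contribution of this sample
        new /= len(w)
        new /= np.sum(new) * step                  # renormalize to a proper density
        if np.max(np.abs(new - f_x)) < tol:        # stop when successive estimates agree
            return new
        f_x = new
    return f_x

# Example use with the Gaussian perturbation from the earlier sketch:
# f_y = lambda v: np.exp(-v**2 / (2 * 10.0**2)) / (10.0 * np.sqrt(2 * np.pi))
# estimate = reconstruct(disguised, f_y, np.linspace(0.0, 100.0, 101))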

Stopping criterion
Either: the observed randomized distribution is close to the result of randomizing the current estimate of the original distribution, or: the difference between successive estimates of the original distribution is very small.

Reconstruction effect

Decision-Tree Classification
Two stages: (1) Growth (2) Prune
Example:

Tree-growth phase algorithm
Partition(Data S)
begin
    if (most points in S are of the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    Use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
end

Choose the best split
Information gain (categorical attributes)
Gini index (continuous attributes)

Gini index calculation
gini(S) = 1 - sum over j of pj^2   (pj is the relative frequency of class j in S)
If a split divides S into two subsets S1 and S2:
gini_split(S) = (|S1|/|S|) * gini(S1) + (|S2|/|S|) * gini(S2)
Note: Calculating this index requires only the distribution of the class values.
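A small sketch of this calculation in plain Python (the class-count representation is an assumption made for illustration):

def gini(class_counts):
    # class_counts: number of records of each class in a set S
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts_s1, counts_s2):
    # Weighted Gini index of a split of S into S1 and S2.
    n1, n2 = sum(counts_s1), sum(counts_s2)
    n = n1 + n2
    return (n1 / n) * gini(counts_s1) + (n2 / n) * gini(counts_s2)

# e.g. gini_split([40, 10], [5, 45]) -- note that only the class distributions are needed.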

When & how the original distributions are reconstructed
Global: Reconstruct the distribution for each attribute once, then run decision-tree classification.
ByClass: For each attribute, first split the training data by class, then reconstruct the distributions separately; then run decision-tree classification.
Local: The same as ByClass, except that instead of doing reconstruction only once, reconstruction is done at each node.

Example (ByClass and Local)

Comparing the three algorithms
            Execution time     Accuracy
Global      Cheapest           Worst
ByClass     Middle             Middle
Local       Most expensive     Best

Presentation Organization
Associate Data Mining with Privacy
Privacy Preserving Data Mining scheme using random perturbation
Privacy Preserving Data Mining using Randomized Response Techniques
Comparing these two cases
Are there any other classification methods available?

Privacy Preserving Data Mining using Randomized Response Techniques
Building the Decision-Tree
Key: Information Gain calculation
Experimental results

Randomized Response
A survey contains a sensitive attribute A. Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other ("have A" vs. "do not have A"). The respondent uses a randomizing device to decide which question to answer. The device is designed in such a way that the probability of choosing the first question is θ.
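A tiny sketch of the randomizing device in Python (the function and variable names are illustrative):

import random

def respond(has_attribute, theta):
    # With probability theta the respondent answers the first (direct) question;
    # otherwise the reversed question, so the reported answer is flipped.
    if random.random() < theta:
        return has_attribute
    return not has_attribute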

To estimate the percentage of people who have attribute A, we can use:
P*(A=yes) = P(A=yes) * θ + P(A=no) * (1 - θ)
P*(A=no)  = P(A=no) * θ + P(A=yes) * (1 - θ)
P*(A=yes): the proportion of "yes" responses observed in the disguised survey data.
P(A=yes): the estimated true proportion of "yes" answers.
Our goal: P(A=yes) and P(A=no).
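A sketch of recovering the true proportion by inverting that pair of equations (plain Python; note the inversion breaks down at θ = 0.5, where the responses carry no information):

def estimate_true_proportion(p_star_yes, theta):
    # Invert P*(A=yes) = P(A=yes)*theta + P(A=no)*(1-theta), with P(A=no) = 1 - P(A=yes).
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 reveals nothing about the true proportion")
    p_yes = (p_star_yes - (1 - theta)) / (2 * theta - 1)
    return min(max(p_yes, 0.0), 1.0)   # clamp: sampling noise can push the estimate outside [0, 1]

# e.g. if 60% of the disguised answers are "yes" and theta = 0.7:
# estimate_true_proportion(0.6, 0.7)  ->  0.75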

Example:
Sensitive attribute: Married?
Two questions:
A?  Yes / No
B?  No / Yes

Decision-Tree (Key: Info Gain)
Entropy(S) = - sum over j = 1..m of Qj * log2(Qj)
Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv| / |S|) * Entropy(Sv)
m: the number of classes
Qj: the relative frequency of class j in S
v: any possible value of attribute A
Sv: the subset of S for which attribute A has value v
|Sv|: the number of elements in Sv
|S|: the number of elements in S

P(E): the proportion of the records in the undisguised data set that satisfy E = true
P*(E): the proportion of the records in the disguised data set that satisfy E = true
Assume the class label is binary. P(E) can be estimated from P*(E) by inverting the randomized-response equations above, so Entropy(S) can be calculated from the estimated class proportions. Similarly, |Sv| is estimated, and finally we get Gain(S, A).
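A simplified sketch of how the gain calculation might use those estimated proportions (plain Python; this compresses the bookkeeping of the actual scheme, which estimates each P(E) from P*(E) first, and the helper names are illustrative):

import math

def entropy(p_yes):
    # Binary entropy of an (estimated) class distribution.
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

def gain(p_class_yes, partitions):
    # p_class_yes: estimated true proportion of the "yes" class in S
    # partitions:  one (weight_v, p_v_yes) pair per value v of attribute A, where
    #              weight_v estimates |Sv|/|S| and p_v_yes the "yes" proportion in Sv,
    #              both recovered from the disguised data (e.g. via estimate_true_proportion)
    h_split = sum(w * entropy(p) for w, p in partitions)
    return entropy(p_class_yes) - h_split

# e.g. gain(0.5, [(0.4, 0.9), (0.6, 0.25)])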

Experimental results

Comparing these two cases
                            Perturbation               Randomized Response
Attribute                   Continuous                 Categorical
Privacy preserving method   Value distortion           Randomized response
Choose attribute to split   Gini index                 Information Gain
Inverse procedure           Reconstruct distribution   Estimate P(E) from P*(E)

Future work
Solve categorical problems with the first scheme
Solve continuous problems with the second scheme
Combine these two schemes to solve some problems
Explore other suitable classification methods