Privacy-Preserving Data Mining

1 Privacy-Preserving Data Mining
Presenter: Li Cao, October 15, 2003

2 Presentation organization
Associate Data Mining with Privacy
Privacy Preserving Data Mining scheme using random perturbation
Privacy Preserving Data Mining using Randomized Response Techniques
Comparing these two cases

3 Privacy protection history
Privacy concerns nowadays
Citizens’ attitude
Scholars’ attitude

4 Internet users’ attitudes

5 Privacy value
Filtering to weed out unwanted information
Better search results with less effort
Useful recommendations
Market trends
...
Example: By analyzing a large number of purchase transaction records together with customers’ age and income, we can learn which kinds of customers prefer a particular style or brand.

6 Motivation (Introducing Data Mining)
Data Mining’s goal: discover knowledge, trends, and patterns from large amounts of data.
Data Mining’s primary task: develop accurate models about aggregated data without access to precise information in individual data records (not only discovering knowledge but also preserving privacy).

7 Presentation organization
Associate Data Mining with Privacy
Privacy Preserving Data Mining scheme using random perturbation
Privacy Preserving Data Mining using Randomized Response Techniques
Comparing these two cases

8 Privacy Preserving Data Mining scheme using random perturbation
Basic idea
Reconstruction procedure
Decision-Tree classification
Three different algorithms

9 Attribute list of an example

10 Records of an example

11 Privacy preserving methods
Value-class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
Value Distortion: return xi + r instead of xi, where r is a random value drawn from (a) a uniform or (b) a Gaussian distribution (a small sketch follows below).
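As a quick illustration of value distortion, here is a minimal sketch; the noise parameters alpha and sigma and the sample data are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def distort_uniform(x, alpha):
    """Value distortion with uniform noise: x_i + r, where r ~ U[-alpha, +alpha]."""
    return x + rng.uniform(-alpha, alpha, size=x.shape)

def distort_gaussian(x, sigma):
    """Value distortion with Gaussian noise: x_i + r, where r ~ N(0, sigma^2)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

# Example: perturb a sensitive numeric attribute (e.g., age) before sharing it
ages = np.array([23.0, 35.0, 41.0, 58.0])
print(distort_uniform(ages, alpha=20.0))
print(distort_gaussian(ages, sigma=10.0))
```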

12 Basic idea
Original data → perturbed data (let users provide a modified value for sensitive attributes)
Estimate the distribution of the original data from the perturbed data
Build classifiers by using these reconstructed distributions (decision tree)

13 Basic Steps

14 Reconstruction problem
View the n original data values x1, x2, …, xn of a one-dimensional distribution as realizations of n independent and identically distributed (iid) random variables X1, X2, …, Xn, each with the same distribution as a random variable X. To hide these data values, n independent random variables Y1, Y2, …, Yn are used, each with the same distribution as another random variable Y.

15 Given x1+y1, x2+y2, …, xn+yn (where yi is the realization of Yi) and the cumulative distribution function FY for Y, we would like to estimate the cumulative distribution function FX for X. In short: given a cumulative distribution FY and the realizations of n iid random samples X1+Y1, X2+Y2, …, Xn+Yn, estimate FX.

16 Reconstruction process
Let the observed value of Xi + Yi be wi (= xi + yi). Use Bayes’ rule to estimate the posterior distribution function F'X1 for X1 (given that X1 + Y1 = w1), assuming we know the density functions fX and fY of X and Y respectively.
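Written out in full, this is a sketch of the standard Bayes-rule expression that the slide’s definitions imply:

```latex
F'_{X_1}(a) =
\frac{\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}
     {\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}
```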

17 To estimate the posterior distribution function F'X given x1+y1, x2+y2, …, xn+yn, we average the posterior distribution functions estimated for each of the Xi.
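Averaging the per-sample posterior distribution functions gives (the same hedged sketch as above):

```latex
F'_X(a) = \frac{1}{n} \sum_{i=1}^{n} F'_{X_i}(a)
        = \frac{1}{n} \sum_{i=1}^{n}
          \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}
               {\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
```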

18 The corresponding posterior density function f'X is obtained by differentiating F'X:
Given a sufficiently large number of samples, f'X will be very close to the real density function fX.
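Differentiating the averaged estimate term by term gives (again a sketch, assuming the integrals are well defined):

```latex
f'_X(a) = \frac{1}{n} \sum_{i=1}^{n}
          \frac{f_Y(w_i - a)\, f_X(a)}
               {\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
```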

19 Reconstruction algorithm
    fX^0 := uniform distribution
    j := 0    // iteration number
    repeat
        update fX^(j+1) from fX^j by applying the posterior-density estimate above
        j := j + 1
    until (stopping criterion met)
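A minimal, discretized Python sketch of this iteration; the evaluation grid, the noise_pdf callable, and the stopping tolerance are illustrative assumptions, not part of the original algorithm statement:

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, grid, max_iter=100, tol=1e-6):
    """Iteratively estimate the original density f_X from perturbed values
    w_i = x_i + y_i, given the density f_Y of the added noise.

    w         : 1-D array of perturbed (observed) values
    noise_pdf : callable returning f_Y evaluated at an array of points
    grid      : discretized support on which f_X is estimated
    """
    dz = grid[1] - grid[0]
    fx = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))  # f_X^0 := uniform
    for _ in range(max_iter):
        fx_new = np.zeros_like(fx)
        for wi in w:
            weights = noise_pdf(wi - grid) * fx        # f_Y(w_i - a) * f_X^j(a)
            denom = weights.sum() * dz                 # integral over z
            if denom > 0:
                fx_new += weights / denom
        fx_new /= len(w)                               # average over the n samples
        if np.abs(fx_new - fx).max() * dz < tol:       # stopping criterion
            return fx_new
        fx = fx_new
    return fx

# Hypothetical usage: Gaussian noise with sigma = 10 added to the data
# from scipy.stats import norm
# fx_hat = reconstruct_distribution(perturbed, norm(0, 10).pdf, np.linspace(0, 100, 200))
```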

20 Stopping Criterion
Either: does the observed randomized distribution match the result of randomizing the current estimate of the original distribution?
Or: the difference between successive estimates of the original distribution is very small.

21 Reconstruction effect

22

23 Decision-Tree Classification
Two stages: (1) growth, (2) prune.
Example:

24 Tree-growth phase algorithm
    Partition(Data S)
    begin
        if (most points in S are of the same class) then
            return;
        for each attribute A do
            evaluate splits on attribute A;
        Use the best split to partition S into S1 and S2;
        Partition(S1);
        Partition(S2);
    end

25 Choose the best split
Information gain (categorical attributes)
Gini index (continuous attributes)

26 Gini index calculation
gini(S) = 1 − Σj pj²   (pj is the relative frequency of class j in S)
If a split divides S into two subsets S1 and S2:
gini_split(S) = (|S1|/|S|)·gini(S1) + (|S2|/|S|)·gini(S2)
Note: calculating this index requires only the distribution of the class values.
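A small sketch of this computation; the class counts in the example call are made up for illustration:

```python
import numpy as np

def gini(class_counts):
    """gini(S) = 1 - sum_j p_j^2, where p_j is the relative frequency of
    class j in S; only the class-value distribution is needed."""
    p = np.asarray(class_counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(counts_s1, counts_s2):
    """Weighted Gini index of a split of S into subsets S1 and S2."""
    n1, n2 = sum(counts_s1), sum(counts_s2)
    n = n1 + n2
    return (n1 / n) * gini(counts_s1) + (n2 / n) * gini(counts_s2)

# Example: S1 holds 30 records of class A and 10 of class B,
#          S2 holds 20 of class A and 40 of class B
print(gini_split([30, 10], [20, 40]))
```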

27

28 When and how the original distributions are reconstructed
Global: reconstruct the distribution for each attribute once, then perform decision-tree classification.
ByClass: for each attribute, first split the training data by class and reconstruct the distributions separately, then perform decision-tree classification.
Local: the same as ByClass, except that instead of doing reconstruction only once, reconstruction is done at each node.

29 Example (ByClass and Local)

30 Comparing the three algorithms
Execution time: Global is cheapest, ByClass is intermediate, Local is most expensive.
Accuracy: Global is worst, ByClass is intermediate, Local is best.

31 Presentation Organization
Associate Data Mining with Privacy
Privacy Preserving Data Mining scheme using random perturbation
Privacy Preserving Data Mining using Randomized Response Techniques
Comparing these two cases
Are there any other classification methods available?

32 Privacy Preserving Data Mining using Randomized Response Techniques
Building the Decision-Tree
Key: Information Gain calculation
Experimental results

33 Randomized Response
A survey contains a sensitive attribute A.
Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other (have A vs. do not have A).
Respondents use a randomizing device to decide which question to answer.
The device is designed in such a way that the probability of choosing the first question is θ.

34 To estimate the percentage of people who have attribute A, we can use:
P'(A=yes) = P(A=yes)·θ + P(A=no)·(1−θ)
P'(A=no) = P(A=no)·θ + P(A=yes)·(1−θ)
P'(A=yes): the proportion of "yes" responses observed in the disguised survey data.
P(A=yes): the estimated true proportion of "yes" answers.
Our goal: P(A=yes) and P(A=no) (a sketch of the inversion follows below).
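Because P(A=no) = 1 − P(A=yes), the two equations reduce to one linear equation that can be inverted directly; here is a minimal sketch (the function name and example numbers are illustrative only):

```python
def estimate_true_yes_proportion(p_obs_yes, theta):
    """Recover P(A=yes) from the observed proportion of 'yes' responses.

    Substituting P(A=no) = 1 - P(A=yes) into
        P'(A=yes) = P(A=yes)*theta + P(A=no)*(1 - theta)
    and solving gives
        P(A=yes) = (P'(A=yes) - (1 - theta)) / (2*theta - 1),
    which requires theta != 0.5.
    """
    if abs(theta - 0.5) < 1e-12:
        raise ValueError("theta = 0.5 makes the true proportion unrecoverable")
    return (p_obs_yes - (1 - theta)) / (2 * theta - 1)

# Example: 40% 'yes' responses observed with theta = 0.7  ->  P(A=yes) = 0.25
print(estimate_true_yes_proportion(0.40, 0.7))
```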

35 Example
Sensitive attribute: Married?
Two related questions with opposite truthful answers:
Question A ("Are you married?"): Yes / No
Question B ("Are you not married?"): No / Yes

36 Decision-Tree (Key: Info Gain)
m: the number of classes (assumed)
Qj: the relative frequency of class j in S
v: any possible value of attribute A
Sv: the subset of S for which attribute A has value v
|Sv|: the number of elements in Sv
|S|: the number of elements in S
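With these symbols, the quantities the slide refers to are the standard entropy and information-gain definitions (reproduced here as a reference sketch):

```latex
\mathrm{Entropy}(S) = -\sum_{j=1}^{m} Q_j \log_2 Q_j,
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```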

37 P(E): the proportion of the records in the undisguised data set that satisfy E = true.
P*(E): the proportion of the records in the disguised data set that satisfy E = true.
Assume the class label is binary; P(E) can then be estimated from P*(E), so Entropy(S) can be calculated.
Similarly, estimate |Sv|. Finally, we obtain Gain(S, A).

38 Experimental results

39 Comparing these two cases
                             Perturbation               Randomized Response
Attribute type               Continuous                 Categorical
Privacy-preserving method    Value distortion           Randomized response
Attribute chosen to split    Gini index                 Information Gain
Inverse procedure            Reconstruct distribution   Estimate P(E) from P'(E)

40 Future work
Solve categorical problems with the first scheme
Solve continuous problems with the second scheme
Combine the two schemes to solve some problems
Explore other suitable classification methods

