Randomization in Privacy-Preserving Data Mining
Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD '00.
The following slides include material from this paper.
Privacy-Preserving Data Mining
Problem: how do we publish data without compromising individual privacy?
Solution: randomization, anonymization
Randomization
Add random noise to the original dataset.
Challenge – is the data still useful for further analysis?
Randomization
Model: data is distorted by adding random noise.
Original data X = {x_1, ..., x_N}; to each record x_i ∈ X a random value y_i from Y = {y_1, ..., y_N} is added, so the released data is Z = {z_1, ..., z_N} with z_i = x_i + y_i.
y_i is a random value:
– Uniform on [-α, +α]
– Gaussian, N(0, σ²)
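A minimal sketch of the additive model in Python (NumPy); the data values and noise parameters below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive values x_i (e.g., ages) -- synthetic, for illustration.
x = rng.normal(loc=35.0, scale=8.0, size=10_000)

# Uniform noise on [-alpha, +alpha]
alpha = 10.0
z_uniform = x + rng.uniform(-alpha, alpha, size=x.shape)

# Gaussian noise N(0, sigma^2)
sigma = 5.0
z_gauss = x + rng.normal(0.0, sigma, size=x.shape)

# Only the z values are released; x stays with the data owner.
print(z_uniform[:5])
print(z_gauss[:5])
```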
Reconstruction
Perturbation hides the data distribution, which must be reconstructed before data mining.
Given:
– z_1 = x_1 + y_1, z_2 = x_2 + y_2, ..., z_n = x_n + y_n
– the probability distribution of Y
Estimate the probability distribution of X. (Clifton, AusDM '11)
Reconstruction algorithm (Bayes' rule to estimate density functions):
1. f_X^0 := uniform distribution
2. Repeat the Bayesian update until a stopping criterion is met:
   f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y(z_i − a) · f_X^j(a) / ∫ f_Y(z_i − t) · f_X^j(t) dt ]
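A minimal sketch of this update on a discretized domain, assuming Gaussian noise with known σ; the grid, data-generating parameters, and fixed iteration count (in place of the paper's stopping criterion) are illustrative choices:

```python
import numpy as np

def reconstruct(z, f_noise, grid, n_iter=50):
    """Iterative Bayesian reconstruction of the density of X from z_i = x_i + y_i."""
    dx = grid[1] - grid[0]
    f = np.full(grid.shape, 1.0 / (dx * len(grid)))  # step 1: uniform prior density
    kernel = f_noise(z[:, None] - grid[None, :])     # kernel[i, a] = f_Y(z_i - a)
    for _ in range(n_iter):                          # step 2: repeated Bayes updates
        denom = kernel @ (f * dx)                    # estimated density of each z_i
        f = f * (kernel / denom[:, None]).mean(axis=0)
        f /= f.sum() * dx                            # renormalize to a density
    return f

rng = np.random.default_rng(0)
x = rng.normal(30.0, 5.0, size=2_000)                # hidden originals (illustrative)
sigma = 4.0
z = x + rng.normal(0.0, sigma, size=x.shape)         # released randomized values

grid = np.linspace(0.0, 60.0, 200)
f_Y = lambda t: np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
f_hat = reconstruct(z, f_Y, grid)                    # estimate of the density of X
```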
[Figure: original, randomized, and reconstructed distributions, for Gaussian noise N(0, 0.25) and uniform noise on (-0.5, 0.5)]
Privacy Metric
If a value x can be estimated to lie in an interval [a, b] with c% confidence, then the width (b − a) defines the amount of privacy at c% confidence.

Confidence    50%          95%           99.9%
Uniform       0.5 × 2α     0.95 × 2α     0.999 × 2α
Gaussian      1.34 × σ     3.92 × σ      6.8 × σ

Example: age in [20, 40]; for 50% privacy (interval width 20 × 0.5 = 10) at 95% confidence with uniform noise, 2α = 20 × 0.5 / 0.95 ≈ 10.5.
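A small worked computation of the slide's example (the helper names are hypothetical, not from the paper):

```python
# Amount of privacy (interval width) at a given confidence, per the table:
def privacy_uniform(conf, two_alpha):      # conf in {0.5, 0.95, 0.999}
    return conf * two_alpha

def privacy_gaussian(factor, sigma):       # factor in {1.34, 3.92, 6.8}
    return factor * sigma

# Age range 20-40: want 50% privacy (width 10) at 95% confidence
# with uniform noise, so solve 0.95 * 2alpha = 10 for 2alpha.
two_alpha = (0.5 * (40 - 20)) / 0.95
print(round(two_alpha, 1))                 # 10.5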
Decision Tree
Training Decision Trees on Randomized Data
– Split points: interval boundaries
– Reconstruction algorithm: Global (reconstruct each attribute once over all records), ByClass (reconstruct each attribute separately per class), Local (like ByClass, but re-reconstruct at each tree node)
– Dataset: synthetic; training set of 100,000 records and test set of 5,000 records, equally split into two classes (see the sketch below)
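A hedged sketch of the experimental setup only: it contrasts a tree trained on original vs. randomized data, and does not implement the paper's Global/ByClass/Local reconstruction-based training; the data-generating rule and tree parameters are assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_split(n):
    # Hypothetical two-class synthetic data: class determined by one attribute.
    X = rng.normal(0.0, 1.0, (n, 5))
    y = (X[:, 0] > 0).astype(int)
    return X, y

X_tr, y_tr = make_split(100_000)   # sizes follow the slide
X_te, y_te = make_split(5_000)

alpha = 1.0                        # uniform noise half-width (assumption)
X_tr_rand = X_tr + rng.uniform(-alpha, alpha, X_tr.shape)

for name, X_fit in [("original", X_tr), ("randomized", X_tr_rand)]:
    clf = DecisionTreeClassifier(max_depth=5).fit(X_fit, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```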
[Figure: decision-tree accuracy comparing original data, randomized data, and the Global, ByClass, and Local reconstruction algorithms]
Extended Work
– '02: proposed a method to quantify information loss via mutual information
– '07: evaluated randomization combined with public information:
   – Gaussian noise is better than uniform
   – Datasets with an inherent cluster pattern improve randomization performance
   – Varying density and outliers decrease performance
Multiplicative Randomization
– Rotation randomization: distort the data with an orthogonal matrix
– Projection randomization: project the high-dimensional dataset into a low-dimensional space
Both preserve Euclidean distances (exactly for rotation, approximately for projection), so they can be combined with distance-based classification (k-NN, SVM) and clustering (k-means); see the sketch below.
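A minimal sketch of rotation randomization with a random orthogonal matrix (NumPy); the dimensions and data are illustrative. Because the transform is an exact rotation, pairwise distances are unchanged, which is why k-NN or k-means results on the distorted data match those on the original:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))        # original records, one per row

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
X_rot = X @ Q                           # rotation-perturbed (released) data

# Orthogonal transforms preserve pairwise Euclidean distances exactly.
i, j = 3, 7
d_before = np.linalg.norm(X[i] - X[j])
d_after = np.linalg.norm(X_rot[i] - X_rot[j])
print(np.isclose(d_before, d_after))    # True
```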
Summary
Pros: noise is independent of the data; can be applied at data-collection time; useful for stream data.
Cons: information loss; curse of dimensionality.
Questions?