Randomization based Privacy Preserving Data Mining Xintao Wu University of North Carolina at Charlotte August 30, 2012.


1 Randomization based Privacy Preserving Data Mining Xintao Wu University of North Carolina at Charlotte August 30, 2012

2 Scope

3 Outline
Part I: Randomization for Numerical Data
 - Additive noise
 - Projection
 - Modeling based
Part II: Randomization for Categorical Data
 - Randomized Response
 - Application to Market Basket Data Analysis

4 Additive Noise Randomization Example

Original data X (one record per row):

 ID   Bal   income  ...  IntP
 1    10k   85k     ...  2k
 2    15k   70k     ...  18k
 3    50k   120k    ...  35k
 4    45k   23k     ...  134k
 ...  ...   ...     ...  ...
 N    80k   110k    ...  15k

Perturbation model: Y = X + E. With attributes (Bal, income, IntP) as rows and records 1-4, N as columns:

 X = | 10  15  50   45   80  |    E = | 7.334  4.199  9.199  6.208  9.048 |
     | 85  70  120  23   110 |        | 3.759  7.537  8.447  7.313  5.692 |
     | 2   18  35   134  15  |        | 0.099  7.939  3.678  1.939  6.318 |

 Y = X + E = | 17.334  19.199  59.199   51.208   89.048  |
             | 88.759  77.537  128.447  30.313   115.692 |
             | 2.099   25.939  38.678   135.939  21.318  |

5 Additive Randomization (Z = X + Y)

Original records such as (50, 40K, ...) and (30, 70K, ...) pass through a randomizer that adds a random number to each value, yielding randomized records (65, 20K, ...) and (25, 60K, ...); e.g., Alice's age 30 becomes 65 (30 + 35). The miner reconstructs the distribution of Age and the distribution of Salary from the randomized data and feeds them to a classification algorithm to build a model. [R. Agrawal and R. Srikant, SIGMOD 00]

6 Reconstruction Problem

Original values x1, x2, ..., xn are drawn from an unknown probability distribution X. To hide them, we add values y1, y2, ..., yn drawn from a known probability distribution Y. Given x1+y1, x2+y2, ..., xn+yn and the probability distribution of Y, estimate the probability distribution of X.

7 Distribution Reconstruction Algorithms

 - Bootstrapping algorithm: converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS 01)
 - Extension to the multivariate case (Domingo-Ferrer et al., PSD 04)
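The reconstruction step can be sketched as an iterative Bayesian update over discretized bins: repeatedly re-weight each bin by the average posterior probability that a record's hidden value fell there. The bimodal data, uniform noise, bin grid, and iteration count below are all illustrative assumptions, not the published algorithm's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: true X is a bimodal mixture, noise Y is uniform on [-1, 1].
n = 4000
x = np.where(rng.random(n) < 0.5, rng.normal(2, 0.3, n), rng.normal(5, 0.3, n))
y = rng.uniform(-1, 1, n)
z = x + y                          # only z and the distribution of Y are public

def f_y(v):
    """Known noise density: uniform on [-1, 1]."""
    return np.where(np.abs(v) <= 1, 0.5, 0.0)

# Discretize the support of X into bins; start from a uniform guess of f_X.
centers = np.linspace(0, 7, 36)
p = np.full(len(centers), 1.0 / len(centers))

for _ in range(20):                # iterative Bayesian (EM-style) update
    lik = f_y(z[:, None] - centers[None, :]) * p           # n x bins
    lik = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)   # posterior per record
    p = lik.mean(axis=0)                                   # updated estimate of f_X

# Mass should re-concentrate near the true modes at 2 and 5.
print(centers[np.argmax(p)])
```

After a few iterations the estimate sharpens from the smeared distribution of z back toward the two true modes, which is the sense in which this family of algorithms "converges" to the maximum likelihood estimate.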

8 Works well

9 Individual Value Reconstruction (Additive Noise)

Methods:
 - Spectral Filtering, Kargupta et al., ICDM 03
 - PCA, Huang, Du, and Chen, SIGMOD 05
 - SVD, Guo, Wu, and Li, PKDD 06
All aim to remove noise by projecting onto a lower dimensional space.

10 Individual Reconstruction Algorithm

 - Apply EVD to the covariance matrix of the perturbed data Up = U + V (Up: perturbed, U: original, V: noise).
 - Using some published information about V, extract the first k principal components: λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, where λe is the noise eigenvalue level and e1, e2, ···, ek are the corresponding eigenvectors.
 - Qk = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X.
 - Find the orthogonal projection of Up onto X and take it as the estimated data set.

11 EVD

A = Q Λ Q^T, where Λ is the diagonal matrix of eigenvalues λk and the columns ek of Q (rows ek^T of Q^T) are the corresponding eigenvectors.

12 Why It Works

 - Original data are correlated, so the signal concentrates along the first few principal vectors.
 - Noise is uncorrelated, so its energy spreads roughly evenly over all components.
 - Projecting the perturbed data (original signal + noise) onto the first principal components (e.g., a 1-d estimation of 2-d data) therefore retains most of the signal while discarding most of the noise.

13 SVD Reconstruction

Input: Ỹ, a given perturbed data set; E, a noise data set
Output: X̂, a reconstructed data set
BEGIN
 1 Apply SVD on Ỹ to get Ỹ = U Σ V^T
 2 Apply SVD on E and let σe be its largest singular value
 3 Determine the first k components of Ỹ by σk ≥ σe > σk+1
 4 Reconstruct the data as X̂ = Σ_{i=1..k} σi ui vi^T
END

14 Additive Noise vs. Projection

Additive perturbation (Y = X + E; X original, E noise, Y perturbed) is not safe:
 - Spectral Filtering Technique, H. Kargupta et al., ICDM 03
 - PCA Based Technique, Huang et al., SIGMOD 05
 - SVD Based Technique & Bound Analysis, Guo et al., SAC 06, PKDD 06
How about projection based perturbation (Y = R X; R a transformation matrix)?
 - Projection models
 - Vulnerabilities
 - Potential attacks

15 Rotation Randomization Example

Y = R X, with R R^T = R^T R = I. For the same data X as before (attributes as rows, records as columns):

 R = | 0.3333   0.6667   0.6667  |
     | -0.6667  0.6667   -0.3333 |
     | -0.6667  -0.3333  0.6667  |

 X = | 10  15  50   45   80  |
     | 85  70  120  23   110 |
     | 2   18  35   134  15  |

 Y = R X = | 61.33   63.67   110.00  119.67  63.33  |
           | 49.33   30.67   55.00   -59.33  -31.67 |
           | -33.67  -21.33  -30.00  51.67   -51.67 |

16 Rotation Approach (R is orthonormal)

When R is an orthonormal matrix (R^T R = R R^T = I):
 - Vector length: |Rx| = |x|
 - Euclidean distance: |Rxi - Rxj| = |xi - xj|
 - Inner product: <Rxi, Rxj> = <xi, xj>
Many clustering and classification methods are invariant to this rotation perturbation.
 - Classification, Chen and Liu, ICDM 05
 - Distributed data mining, Liu and Kargupta, TKDE 06
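These invariants are easy to verify numerically. In this sketch a random orthonormal R is obtained via QR decomposition (an illustrative choice; any orthonormal matrix works):

```python
import numpy as np

rng = np.random.default_rng(2)

# A random orthonormal R from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = rng.normal(size=(3, 10))     # 3 attributes (rows) x 10 records (columns)
Y = R @ X                        # rotation-perturbed data

# Length, pairwise distance, and inner product are all preserved.
print(np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0)))
print(np.allclose(np.linalg.norm(Y[:, 0] - Y[:, 1]),
                  np.linalg.norm(X[:, 0] - X[:, 1])))
print(np.allclose(Y[:, 0] @ Y[:, 1], X[:, 0] @ X[:, 1]))
```

Because these geometric quantities survive the perturbation, any miner (or attacker) working only with distances and inner products sees exactly the same structure as in the original data.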

17 Example (figure: a 2-d data set and its rotation under R, with R R^T = R^T R = I)

18 Weakness of Rotation

If an attacker has known information (a sample of original records and the perturbed data), a regression / known sample attack can estimate R and recover the original data.

19 General Linear Transformation

Y = R X + E
 - When R = I: Y = X + E (additive noise model)
 - When R R^T = R^T R = I and E = 0: Y = R X (rotation model)
 - In general, R can be an arbitrary matrix. For example (with the same X as before):

 R = | 4.751  2.429  2.282 |    E = | 7.334  4.199  9.199  6.208  9.048 |
     | 1.156  4.457  0.093 |        | 3.759  7.537  8.447  7.313  5.692 |
     | 3.034  3.811  4.107 |        | 0.099  7.939  3.678  1.939  6.318 |

 Y = R X + E = | 265.95  286.63  475.68  581.71  520.53 |
               | 394.30  338.49  569.58  174.22  277.79 |
               | 362.55  394.11  665.37  776.46  463.08 |

20 Is Y = R X + E Safe?

R can be an arbitrary matrix, so the regression based attack won't work. How about a direct attack using noisy ICA?
 - Y = R X + E (general linear transformation model)
 - X = A S + N (noisy ICA model)
The two models have the same form, so noisy ICA techniques can potentially be applied to recover X.

21 Scope (Part II)

 ssn  name  zip    race   ...  age  Sex  Bal  income  ...  IntP
 1    ...   28223  Asian  ...  20   M    10k  85k     ...  2k
 2    ...   28223  Asian  ...  30   F    15k  70k     ...  18k
 3    ...   28262  Black  ...  20   M    50k  120k    ...  35k
 4    ...   28261  White  ...  26   M    45k  23k     ...  134k
 ...
 N    ...   28223  Asian  ...  20   M    80k  110k    ...  15k

69% of individuals are unique on zip and birth date; 87% with zip, birth date, and gender. Related work: k-anonymity, l-diversity, SDC, etc. Our approach: Randomized Response.

22 Randomized Response [Stanley Warner, JASA 1965]

A: cheated in the exam; Ā: didn't cheat in the exam.

Purpose: estimate the proportion πA of population members that cheated in the exam.

Procedure: a randomization device asks each respondent "Do you belong to A?" with probability p and "Do you belong to Ā?" with probability 1 − p. The interviewer sees only the "yes"/"no" answer, not which question was asked.

Then P(yes) = πA p + (1 − πA)(1 − p), and an unbiased estimate of πA is
 π̂A = (λ̂ − (1 − p)) / (2p − 1), p ≠ 1/2,
where λ̂ is the observed proportion of "yes" answers.
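Warner's procedure can be simulated in a few lines; the true proportion, device probability p, and sample size below are made-up illustration values:

```python
import random

random.seed(0)
n, p, pi_true = 100_000, 0.7, 0.3   # hypothetical: 30% cheated, device prob p = 0.7

yes = 0
for _ in range(n):
    cheated = random.random() < pi_true
    if random.random() < p:          # device asked: "Do you belong to A?"
        yes += cheated
    else:                            # device asked: "Do you belong to not-A?"
        yes += not cheated

lam = yes / n                        # observed proportion of "yes" answers
# P(yes) = pi*p + (1-pi)*(1-p)  =>  unbiased estimate of pi:
pi_hat = (lam - (1 - p)) / (2 * p - 1)
print(pi_hat)
```

No individual answer reveals whether that respondent cheated, yet the aggregate estimate lands close to the true 30%.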

23 Matrix Expression

RR can be expressed in matrix form (0: No, 1: Yes):

 | λ0 |  =  | p    1-p |  | π0 |
 | λ1 |     | 1-p  p   |  | π1 |

i.e., λ = P π. An unbiased estimate of π is π̂ = P^{-1} λ̂.

24 One Polychotomous Attribute

 - A sensitive attribute has t classes.
 - A respondent belonging to the i-th class reports class j with probability p_ij, set by the randomization device.
 - E.g., with the device below (row j, column i gives the probability of reporting j given true class i), a respondent belonging to the 2nd category reports category 3 with probability 0.15.

      i=1    i=2    i=3    i=4
 j=1  0.60   0.20   0.00   0.10
 j=2  0.20   0.50   0.20   0.10
 j=3  0.15   0.15   0.70   0.30
 j=4  0.05   0.15   0.10   0.50

(Each column sums to 1.)

25 Vector Response

 - π: the vector of true proportions in the population
 - λ: the vector of observed proportions in the survey
 - P: the randomization device set by the interviewer
 - λ = P π

With the device P above and π = (0.10, 0.30, 0.20, 0.40)':

 λ = P π = (0.16, 0.25, 0.32, 0.27)'
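The λ = P π relation and its inversion can be checked numerically with the slide's device and proportions (a NumPy sketch; `np.linalg.solve` plays the role of P^{-1}):

```python
import numpy as np

# The slide's device: P[j, i] = probability of reporting j when the true class is i.
P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])
pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions

lam = P @ pi                    # expected observed proportions
pi_hat = np.linalg.solve(P, lam)  # recovery via pi = P^{-1} lambda
print(lam, pi_hat)
```

The forward map reproduces the slide's λ = (0.16, 0.25, 0.32, 0.27)', and inverting it recovers π exactly; with a finite survey one would plug the observed λ̂ in place of λ.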

26 Analysis

The dispersion matrix of π̂ decomposes into two parts: the dispersion of the corresponding regular (non-randomized) survey estimate, plus a nonnegative definite component representing the dispersion added by the RR experiment (expressed via a diagonal matrix with elements λi).

27 Extension to Multiple Attributes

 - m sensitive attributes; attribute j has t_j categories.
 - π denotes the vector of true proportions over all category combinations, arranged lexicographically. E.g., if m = 2, t1 = 2 and t2 = 3, π has 2 × 3 = 6 elements (π11, π12, π13, π21, π22, π23).
 - Simultaneous model: consider all attributes as one compounded variable and apply the regular vector response RR technique.
 - Sequential model: randomize the attributes one by one; the compound device is P = P1 ⊗ P2 ⊗ ··· ⊗ Pm, where ⊗ stands for the Kronecker product.

28 Kronecker Product Example

For A (2×2) and any B:

 A ⊗ B = | a11·B  a12·B |
         | a21·B  a22·B |

e.g.  | 1 2 | ⊗ | 0 1 |  =  | 0 1 0 2 |
      | 3 4 |   | 1 0 |     | 1 0 2 0 |
                            | 0 3 0 4 |
                            | 3 0 4 0 |
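A minimal sketch of the sequential model's compound device, assuming two hypothetical column-stochastic devices P1 (2 categories) and P2 (3 categories); the entries are invented for illustration:

```python
import numpy as np

P1 = np.array([[0.8, 0.3],
               [0.2, 0.7]])          # 2-category device (columns sum to 1)
P2 = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.6, 0.1],
               [0.1, 0.2, 0.8]])     # 3-category device (columns sum to 1)

P = np.kron(P1, P2)                  # 6 x 6 device over the compounded variable
print(P.shape)
print(np.allclose(P.sum(axis=0), 1)) # columns still sum to 1: a valid device
```

The Kronecker product of column-stochastic matrices is itself column-stochastic, so the compound device is valid and its rows/columns line up with the lexicographic ordering of category combinations used on the previous slide.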

29 Analysis

Similarly, the dispersion matrix can be decomposed into two parts: one corresponding to the regular survey estimation, the other to the components of dispersion associated with the RR experiment.

30 Outline (Part II)

Randomization for Categorical Data
 - Randomized Response model
 - To what extent does it affect mining results?
 - To what extent does it protect privacy?
Application in market basket 0-1 data analysis
 - Data swapping
 - Frequent itemset or rule hiding
 - Inverse frequent itemset mining (SDM 05, ICDM 05)
 - Item randomization (PKDD 07, PAKDD 08)

31 Market Basket Data

 TID  milk  sugar  bread  ...  cereals
 1    1     0      1      ...  1
 2    0     1      1      ...  1
 3    1     0      0      ...  1
 4    1     1      1      ...  0
 ...
 N    0     1      1      ...  0

(1: presence, 0: absence)

 - Association rule (R. Agrawal, SIGMOD 1993): A => B with support s and confidence c.
 - Other measures in MBA: correlation, lift, interest, etc.
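Support and confidence as defined above can be computed directly on a toy 0-1 table; the five transactions below are invented for illustration:

```python
# Rows are transactions; columns are items (milk, sugar, bread, cereals).
data = [
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]

def support(rows, items):
    """Fraction of transactions containing every item in `items`."""
    return sum(all(r[i] for i in items) for r in rows) / len(rows)

# Rule: milk (item 0) => cereals (item 3)
s_ab = support(data, [0, 3])        # support of {milk, cereals}
conf = s_ab / support(data, [0])    # confidence = s(A and B) / s(A)
print(s_ab, conf)
```

Here {milk, cereals} appears in 2 of 5 transactions (support 0.4), and 2 of the 3 milk transactions also contain cereals (confidence 2/3).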

32 Item Perturbation

 Original Data                          Randomized Data
 TID  milk sugar bread ... cereals      TID  milk sugar bread ... cereals
 1    1    0     1     ... 1            1    0    1     1     ... 1
 2    0    1     1     ... 1            2    1    1     1     ... 0
 3    1    0     0     ... 1            3    1    1     1     ... 1
 4    1    1     1     ... 0            4    0    0     1     ... 1
 ...                                    ...
 N    0    1     1     ... 0            N    1    1     0     ... 1

Individual privacy is preserved!

33 Research Problems

 - How does randomization affect the accuracy of discovered association rules?
 - How does it affect the accuracy of other measures?
Two scenarios:
 - Known distortion probability
 - Unknown distortion probability

34 Motivation Example

A: milk, B: cereals. The data owners hold the original data with joint proportion vector π = (0.415, 0.043, 0.183, 0.359)' over the four A/B combinations; the randomized data seen by the data miners has observed proportions λ = (0.368, 0.097, 0.218, 0.316)'.

 Original 2x2 table (with margins):    Randomized 2x2 table (with margins):
  0.415  0.043 | 0.458                  0.368  0.097 | 0.465
  0.183  0.359 | 0.542                  0.218  0.317 | 0.537
  0.598  0.402                          0.586  0.414

From λ the miners compute the unbiased estimate π̂ = (0.427, 0.031, 0.181, 0.362)'; the true confidence of the rule is 0.662 and the estimated one is 0.671. We can get the estimate, but how accurate can we be?

35 Motivation

(Figure: original vs. estimated support values for several itemsets, with values such as 31.5, 35.9, 36.3, 22.1, 12.3, 23.8, and the frequent / not-frequent threshold marked.) Rule 6 is falsely recognized as frequent from its estimated value. With lower and upper bounds on each estimate, such errors can be avoided: an itemset is reported as a frequent set with high confidence only when its whole interval clears the threshold, and otherwise as a frequent set without confidence.

36 Accuracy on Support s

 - Estimate of support: ŝ, obtained from π̂ = P^{-1} λ̂
 - Variance of support: Var(ŝ), derived from the multinomial covariance of λ̂
 - Interquantile range (normal distribution): e.g., ŝ = 0.362 with 95% range [0.346, 0.378]
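A sketch of this computation, under stated assumptions: the device is taken to be a Warner-style flip with p = 0.9 applied independently to items A and B (so the joint device is a Kronecker product), the sample size n is invented, and λ̂ is borrowed from the earlier motivation example (adjusted to sum to 1). Since π̂ = P^{-1} λ̂ is linear in the multinomial proportions λ̂, cov(π̂) = P^{-1} cov(λ̂) P^{-T}:

```python
import numpy as np

# Assumed Warner-style device per item, p = 0.9; joint device over (11, 10, 01, 00).
p = 0.9
W = np.array([[p, 1 - p], [1 - p, p]])
P = np.kron(W, W)

n = 10_000                                          # hypothetical sample size
lam_hat = np.array([0.368, 0.097, 0.218, 0.317])    # observed proportions

Pinv = np.linalg.inv(P)
pi_hat = Pinv @ lam_hat          # estimated true proportions
s_hat = pi_hat[0]                # estimated support of {A, B} (category 11)

# Multinomial covariance of lam_hat, propagated through the linear map P^{-1}.
cov_lam = (np.diag(lam_hat) - np.outer(lam_hat, lam_hat)) / n
cov_pi = Pinv @ cov_lam @ Pinv.T
se = np.sqrt(cov_pi[0, 0])

lo, hi = s_hat - 1.96 * se, s_hat + 1.96 * se   # normal 95% interquantile range
print(round(s_hat, 3), round(lo, 3), round(hi, 3))
```

Note how P^{-1} inflates the variance: the privacy the device buys is paid for with a wider interval around ŝ.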

37 Accuracy on Confidence c

 - Estimate of confidence of A => B: ĉ = ŝ(A ∪ B) / ŝ(A)
 - Variance of confidence: derived from the distribution F(w) of the ratio of the two support estimates
 - Interquantile range: the ratio distribution has no simple closed form, so a loose range is derived from Chebyshev's theorem

38 Chebyshev's Theorem

Let X be a random variable with expected value μ and variance σ². Then for any real k > 0:
 P(|X − μ| ≥ kσ) ≤ 1/k²
This gives a lower bound (1 − 1/k²) on the proportion of data within k standard deviations of the mean, with no distributional assumption.
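The bound is easy to check empirically; the exponential sample below is an arbitrary illustrative choice (the theorem assumes nothing about the distribution):

```python
import random

random.seed(3)

# Empirical check of Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2.
xs = [random.expovariate(1.0) for _ in range(100_000)]
mu = sum(xs) / len(xs)
var = sum((x - mu) ** 2 for x in xs) / len(xs)
sigma = var ** 0.5

k = 2.0
outside = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
print(outside, "<=", 1 / k**2)
```

For this skewed sample the true tail mass is far below the 1/k² = 0.25 bound, which is why Chebyshev-based interquantile ranges for confidence are loose.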

39 Accuracy on Confidence c (cont.)

Based on the theorem, the loose interquantile range can be approximated as [ĉ − kσ̂, ĉ + kσ̂], covering at least 1 − 1/k² of the probability mass. For the previous example with items A and B, this yields the 95% interquantile range of ĉ.

40 Accuracy vs. Varying p for an Individual Rule

(Figure: (a) support and (b) confidence accuracy as the randomization parameter p varies, for the rule G => H (support 35.9%, confidence 66.2%) from the COIL data set; x-axis: p.)

41 Accuracy Bounds

With an unknown distribution, the Chebyshev theorem gives only loose bounds. (Figure: support bounds vs. varying p for G => H.)

42 Accuracy Bounds for Other Measures

43 Future Work

 - Conduct accuracy vs. disclosure analysis for general categorical data.
 - Develop a general randomization framework which combines additive noise for numerical data with randomized response for categorical data.
 - Build a prototype system for real world applications: various query, analysis, and mining tasks; complex privacy requirements in different scenarios, e.g. non-confidential correlated attributes and potential combinations (Dividends + Wages + Interests = Total Income).

