SAC'06, April 23-27, 2006, Dijon, France
On the Use of Spectral Filtering for Privacy Preserving Data Mining
Songtao Guo (UNC Charlotte), Xintao Wu (UNC Charlotte)
Privacy Laws and Regulations
- HIPAA for health care
- California State Bill 1386
- Gramm-Leach-Bliley Act for financial data
- COPPA for children's online privacy
- PIPEDA 2000
- European Union Directive 95/46/EC
Mining vs. Privacy
- Data mining: the goal is summary results (e.g., classification, clustering, association rules) derived from the data distribution
- Individual privacy: individual values in the database must not be disclosed, or at least no close estimate should be derivable by attackers
- Privacy Preserving Data Mining (PPDM): how to "perturb" the data so that we can still build a good data mining model (data utility) while preserving individual privacy at the record level (privacy)?
Outline
- Additive Randomization
- Distribution Reconstruction
  - Bayesian method (Agrawal & Srikant, SIGMOD 2000)
  - EM method (Agrawal & Aggarwal, PODS 2001)
- Individual Value Reconstruction
  - Spectral Filtering (Kargupta et al., ICDM 2003)
  - PCA technique (Huang, Du & Chen, SIGMOD 2005)
- Error Bound Analysis for Spectral Filtering
  - Upper bound
- Conclusion and Future Work
Additive Randomization
To hide sensitive data by randomly modifying the data values with additive noise: publish y_i = x_i + r_i instead of x_i.
- Privacy preserving aims at making each original value x_i hard to estimate from the released y_i
- Utility preserving aims at keeping the aggregate characteristics (the distribution of X) unchanged or recoverable
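A minimal sketch of additive randomization with NumPy. The data values, noise variance, and sample size here are illustrative, not from the paper; the point is that individual values are masked while moments of X remain recoverable from Y = X + R when the noise distribution is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive data: n records of one attribute (hypothetical values).
x = rng.normal(loc=50.0, scale=10.0, size=10_000)

# Additive randomization: publish y = x + r, where r is drawn
# independently of x (here zero-mean Gaussian with known variance).
sigma = 20.0
r = rng.normal(loc=0.0, scale=sigma, size=x.shape)
y = x + r

# Individual values are masked, but aggregates survive:
# E[Y] = E[X], and Var[Y] = Var[X] + sigma^2, so Var[X] is recoverable.
print(round(y.mean(), 1))            # close to 50.0
print(round(y.var() - sigma**2, 1))  # close to Var[X] = 100
```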
Distribution Reconstruction
The original density distribution can be reconstructed effectively given the perturbed data and the noise distribution (Agrawal & Srikant, SIGMOD 2000), assuming independent random noise with a known distribution f_R:

  f_X^0 := uniform distribution
  j := 0  // iteration number
  repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1}^n [ f_R(y_i − a) f_X^j(a) / ∫ f_R(y_i − z) f_X^j(z) dz ]
    j := j + 1
  until (stopping criterion met)

It cannot reconstruct individual values.
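The iterative Bayes update above can be sketched on a discretized domain. This is an illustrative implementation, not the authors' code: the uniform source distribution, noise level, grid, and iteration count are all assumptions, and the integral is approximated by a Riemann sum over the grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Perturbed observations y_i = x_i + r_i, with known noise density f_R.
n, sigma = 2000, 0.5
x = rng.uniform(2.0, 4.0, size=n)       # "unknown" original distribution
y = x + rng.normal(0.0, sigma, size=n)

def noise_pdf(t):
    return np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Discretize the domain of X and start from a uniform estimate f_X^0.
grid = np.linspace(0.0, 6.0, 121)
da = grid[1] - grid[0]
f = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))

# Iterate: f^{j+1}(a) = (1/n) sum_i f_R(y_i - a) f^j(a) / ∫ f_R(y_i - z) f^j(z) dz
for _ in range(20):
    w = noise_pdf(y[:, None] - grid[None, :]) * f[None, :]  # shape (n, grid)
    f = (w / (w.sum(axis=1, keepdims=True) * da)).mean(axis=0)

# The estimated density should concentrate on [2, 4], where X actually lives.
inside = f[(grid >= 2) & (grid <= 4)].sum() * da
print(inside > 0.7)  # True: far above the 1/3 a flat density on [0, 6] gives
```

Note that the recovered object is the *distribution* of X; nothing here identifies which x_i produced which y_i, which is exactly why point-wise reconstruction needs a different tool.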
Individual Value Reconstruction
Spectral Filtering (Kargupta et al., ICDM 2003), given perturbed data Ũ = U + V:
1. Apply EVD to the covariance matrix of the perturbed data: Ã = Ũ Ũ^T = Q Λ Q^T
2. Using some published information about V (e.g., its variance), extract the first k components of Ã as the principal components: e_1, …, e_k are the corresponding eigenvectors, and Q_k = [e_1 … e_k] forms an orthonormal basis of a subspace
3. Find the orthogonal projection of Ũ onto span(Q_k)
4. Get the estimated data set: Û = Q_k Q_k^T Ũ

PCA Technique: Huang, Du and Chen, SIGMOD 2005
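The four steps above can be sketched in NumPy. The data sizes, the low-rank construction of U, and the factor-of-2 eigenvalue threshold for choosing k are my assumptions for illustration (the paper discusses choosing k separately); for i.i.d. noise, E[V V^T] = m σ² I motivates comparing eigenvalues against m σ².

```python
import numpy as np

rng = np.random.default_rng(2)

# U: n attributes x m records with strong correlation (rank-3 structure).
n, m, k_true = 10, 5000, 3
U = rng.normal(size=(n, k_true)) @ rng.normal(size=(k_true, m))

# Perturbed data U~ = U + V with i.i.d. Gaussian noise of known variance.
sigma = 0.5
V = rng.normal(0.0, sigma, size=(n, m))
Ut = U + V

# 1. EVD of the perturbed covariance matrix A~ = U~ U~^T.
lam, Q = np.linalg.eigh(Ut @ Ut.T)  # ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]      # sort descending

# 2. Keep the k components whose eigenvalues clearly exceed the noise
#    level m * sigma^2 (the factor 2 is an assumed, conservative margin).
k = int(np.sum(lam > 2 * m * sigma**2))
Qk = Q[:, :k]                       # orthonormal basis of the signal subspace

# 3./4. Orthogonal projection of U~ onto span(Qk) gives the estimate U^.
Uhat = Qk @ (Qk.T @ Ut)

rel_err = np.linalg.norm(Uhat - U) / np.linalg.norm(U)
baseline = np.linalg.norm(Ut - U) / np.linalg.norm(U)
print(rel_err < baseline)  # True: filtering beats using the raw U~ as estimate
```

The projection discards the noise energy lying outside the k-dimensional signal subspace, which is why the attacker's point-wise estimate improves as the data become more correlated.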
Motivation
- Previous work on individual value reconstruction was only empirical
- The relationship between estimation accuracy and the noise was not clear
Two questions:
- Attacker's question: how close is the estimate obtained via Spectral Filtering to the original data?
- Data owner's question: how much noise should be added to preserve privacy at a given tolerated level?
Our Work
- Investigate the explicit relationship between estimation accuracy and the noise
- Derive an upper bound on ||Û − U||_F in terms of the noise V
- The upper bound determines how close the estimate achieved by attackers can be to the original data
- It poses a serious threat of privacy breach
Preliminary
F-norm and 2-norm:
  ||A||_F = (Σ_{i,j} a_{ij}²)^{1/2}    ||A||_2 = max_{x≠0} ||Ax||_2 / ||x||_2
Some properties:
- ||A||_2 ≤ ||A||_F ≤ √rank(A) · ||A||_2
- ||A||_2 = √λ_max(A^T A), the square root of the largest eigenvalue of A^T A
- If A is symmetric, then ||A||_2 = λ_max(A), the largest eigenvalue of A
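These norm identities are easy to verify numerically; a quick NumPy check on a random matrix (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))

fro = np.sqrt((A**2).sum())                       # ||A||_F from the definition
two = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())  # sqrt(lambda_max(A^T A))

print(np.isclose(fro, np.linalg.norm(A)))         # True: matches built-in F-norm
print(np.isclose(two, np.linalg.norm(A, 2)))      # True: matches spectral norm
print(two <= fro <= np.sqrt(np.linalg.matrix_rank(A)) * two)  # True

# Symmetric (here PSD) case: ||S||_2 equals the largest eigenvalue of S.
S = A @ A.T
print(np.isclose(np.linalg.norm(S, 2), np.linalg.eigvalsh(S).max()))  # True
```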
Matrix Perturbation
- Traditional matrix perturbation theory: how the derived perturbation E affects the covariance matrix A, i.e., Ã = A + E
- Our scenario: how the primary perturbation V affects the data matrix U, i.e., Ũ = U + V
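The link between the two views is plain algebra: expanding Ã = (U + V)(U + V)^T shows that the primary perturbation V induces a derived perturbation E = U V^T + V U^T + V V^T on A = U U^T. A small numeric check (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
U = rng.normal(size=(5, 100))  # original data matrix
V = rng.normal(size=(5, 100))  # additive noise matrix

A  = U @ U.T                   # original covariance-style matrix A = U U^T
At = (U + V) @ (U + V).T       # perturbed matrix from U~ = U + V

# Derived perturbation from expanding (U + V)(U + V)^T:
E = U @ V.T + V @ U.T + V @ V.T
print(np.allclose(At, A + E))  # True: A~ = A + E exactly
```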
Error Bound Analysis
Prop 1. Let the covariance matrix of the perturbed data be Ã = Ũ Ũ^T. Given A = U U^T and Ũ = U + V, the derived perturbation is E = Ã − A = U V^T + V U^T + V V^T.
Prop 2. Each eigenvalue shift is bounded by the largest eigenvalue of E, i.e., |λ̃_i − λ_i| ≤ ||E||_2, and the perturbation of the principal subspace is governed by the eigengap λ_k − λ_{k+1}.
Theorem
Given a data set U and a noise set V, we have the perturbed data set Ũ = U + V. Let Û be the estimate obtained from Spectral Filtering; then ||Û − U||_F is bounded above in terms of ||E||, where E is the derived perturbation on the original covariance matrix A = U U^T. (Proof is skipped.)
Special Cases
- When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and known variance
- When the noise is completely correlated with the data
Experimental Results
Artificial data set: 35 correlated variables, 30,000 tuples
Experimental Results
Scenarios of noise addition:
- Case 1: i.i.d. Gaussian noise N(0, COV), where COV = diag(σ², …, σ²)
- Case 2: independent Gaussian noise N(0, COV), where COV = c · diag(σ₁², …, σ_n²)
- Case 3: correlated Gaussian noise N(0, COV), where COV = c · Σ_U (or c · A)
Measures: absolute error ||Û − U||_F and relative error ||Û − U||_F / ||U||_F
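The three noise scenarios can be generated as follows. This is an illustrative sketch (data dimensions, σ², and c are placeholder values, and the Frobenius-norm error measures are the standard definitions assumed above, not quoted from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 8, 1000
U = rng.normal(size=(n, m))     # stand-in data: n attributes x m tuples
Sigma_U = np.cov(U)             # sample covariance of the data
sigma2, c = 0.5, 0.1

zero = np.zeros(n)
# Case 1: i.i.d. Gaussian noise, COV = diag(sigma^2, ..., sigma^2)
V1 = rng.multivariate_normal(zero, sigma2 * np.eye(n), size=m).T
# Case 2: independent Gaussian noise, COV = c * diag(sigma_1^2, ..., sigma_n^2)
V2 = rng.multivariate_normal(zero, c * np.diag(np.diag(Sigma_U)), size=m).T
# Case 3: correlated Gaussian noise, COV = c * Sigma_U
V3 = rng.multivariate_normal(zero, c * Sigma_U, size=m).T

def errors(Uhat, U):
    """Absolute error ||U^ - U||_F and relative error ||U^ - U||_F / ||U||_F."""
    abs_err = np.linalg.norm(Uhat - U)
    return abs_err, abs_err / np.linalg.norm(U)

abs_e, rel_e = errors(U + V1, U)   # error of the raw perturbed data itself
print(V1.shape == (n, m) and abs_e > 0)  # True
```

Case 3 is the interesting one for the bound analysis: noise shaped like Σ_U hides inside the signal subspace, so spectral filtering separates it far less cleanly than in Case 1.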
Determining k
Determine k in Spectral Filtering according to matrix perturbation theory. Our heuristic approach: choose k as the number of eigenvalues of the perturbed covariance matrix that exceed the eigenvalue level contributed by the noise.
Effect of varying k (case 1): N(0, COV), where COV = diag(σ², …, σ²)
[Table: relative error and ||V||_F for k = 1…5 across noise levels σ²; * marks the smallest error, occurring at k = 4]

Effect of varying k (case 2): N(0, COV), where COV = c · diag(σ₁², σ₂², …, σ_n²)
[Table: relative error and ||V||_F for k = 1…5 across values of c; * marks the smallest error per setting, mostly at k = 4]

Effect of varying k (case 3): N(0, COV), where COV = c · Σ_U
[Table: relative error and ||V||_F for k = 1…5 across values of c; * marks the smallest error per setting, mostly at k = 4]
Effect of varying noise
[Figure: reconstruction quality for σ² = 0.1, 0.5, 1.0; ||V||_F / ||U||_F = 87.8%]
Effect of covariance matrix
[Figure: reconstruction quality for Cases 1, 2, and 3; ||V||_F / ||U||_F = 39.1%]
Conclusion
- Spectral-filtering-based techniques have been investigated as a major means of point-wise data reconstruction
- We present an upper bound which enables attackers to determine how close their estimated data is to the original data
Future Work
- A lower bound, representing the best estimate the attacker can achieve using Spectral Filtering; data owners could use it to determine how much noise should be added to preserve privacy
- Bound analysis at the point-wise level
Acknowledgement
- NSF Grant CCR IIS
- Personnel: Xintao Wu, Songtao Guo, Ling Guo
- More Info
Questions? Thank you!