SAC'06, April 23-27, 2006, Dijon, France
On the Use of Spectral Filtering for Privacy Preserving Data Mining
Songtao Guo (UNC Charlotte), Xintao Wu (UNC Charlotte)
Privacy Laws and Regulations
- HIPAA for health care
- California State Bill 1386
- Gramm-Leach-Bliley Act for financial data
- COPPA for children's online privacy
- PIPEDA 2000
- European Union Directive 95/46/EC
Mining vs. Privacy
- Data mining: the goal is summary results (e.g., classification, clustering, association rules) derived from the data distribution
- Individual privacy: individual values in the database must not be disclosed, or at least no close estimate should be derivable by attackers
- Privacy Preserving Data Mining (PPDM): how to "perturb" the data so that we can still build a good data mining model (data utility) while preserving individual privacy at the record level (privacy)?
Outline
- Additive Randomization
- Distribution Reconstruction
  - Bayesian method (Agrawal & Srikant, SIGMOD 2000)
  - EM method (Agrawal & Aggarwal, PODS 2001)
- Individual Value Reconstruction
  - Spectral Filtering (Kargupta et al., ICDM 2003)
  - PCA technique (Huang, Du & Chen, SIGMOD 2005)
- Error Bound Analysis for Spectral Filtering
  - Upper bound
- Conclusion and Future Work
Additive Randomization
To hide sensitive data by randomly modifying the data values with additive noise: publish y_i = x_i + r_i instead of x_i.
- Privacy preserving aims at making each original value x_i hard to estimate from the released y_i
- Utility preserving aims at keeping the aggregate characteristics (the distribution of X) unchanged or recoverable
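A minimal sketch of additive randomization with NumPy. The data values, noise variance, and sample size here are illustrative, not from the paper; the point is that individual values are masked while moments of X remain recoverable from Y = X + R when the noise distribution is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive data: n records of one attribute (hypothetical values).
x = rng.normal(loc=50.0, scale=10.0, size=10_000)

# Additive randomization: publish y = x + r, where r is drawn
# independently of x (here zero-mean Gaussian with known variance).
sigma = 20.0
r = rng.normal(loc=0.0, scale=sigma, size=x.shape)
y = x + r

# Individual values are masked, but aggregates survive:
# E[Y] = E[X], and Var[Y] = Var[X] + sigma^2, so Var[X] is recoverable.
print(round(y.mean(), 1))            # close to 50.0
print(round(y.var() - sigma**2, 1))  # close to Var[X] = 100
```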
Distribution Reconstruction
The original density distribution can be reconstructed effectively given the perturbed data and the noise distribution (Agrawal & Srikant, SIGMOD 2000), assuming independent random noise with a known distribution f_R:

  f_X^0 := uniform distribution
  j := 0  // iteration number
  repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1}^n [ f_R(y_i − a) f_X^j(a) / ∫ f_R(y_i − z) f_X^j(z) dz ]
    j := j + 1
  until (stopping criterion met)

It cannot reconstruct individual values.
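The iterative Bayes update above can be sketched on a discretized domain. This is an illustrative implementation, not the authors' code: the uniform source distribution, noise level, grid, and iteration count are all assumptions, and the integral is approximated by a Riemann sum over the grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Perturbed observations y_i = x_i + r_i, with known noise density f_R.
n, sigma = 2000, 0.5
x = rng.uniform(2.0, 4.0, size=n)       # "unknown" original distribution
y = x + rng.normal(0.0, sigma, size=n)

def noise_pdf(t):
    return np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Discretize the domain of X and start from a uniform estimate f_X^0.
grid = np.linspace(0.0, 6.0, 121)
da = grid[1] - grid[0]
f = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))

# Iterate: f^{j+1}(a) = (1/n) sum_i f_R(y_i - a) f^j(a) / ∫ f_R(y_i - z) f^j(z) dz
for _ in range(20):
    w = noise_pdf(y[:, None] - grid[None, :]) * f[None, :]  # shape (n, grid)
    f = (w / (w.sum(axis=1, keepdims=True) * da)).mean(axis=0)

# The estimated density should concentrate on [2, 4], where X actually lives.
inside = f[(grid >= 2) & (grid <= 4)].sum() * da
print(inside > 0.7)  # True: far above the 1/3 a flat density on [0, 6] gives
```

Note that the recovered object is the *distribution* of X; nothing here identifies which x_i produced which y_i, which is exactly why point-wise reconstruction needs a different tool.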
Individual Value Reconstruction
Spectral Filtering (Kargupta et al., ICDM 2003), given perturbed data Ũ = U + V:
1. Apply EVD to the covariance matrix of the perturbed data: Ã = Ũ Ũ^T = Q Λ Q^T
2. Using some published information about V (e.g., its variance), extract the first k components of Ã as the principal components: e_1, …, e_k are the corresponding eigenvectors, and Q_k = [e_1 … e_k] forms an orthonormal basis of a subspace
3. Find the orthogonal projection of Ũ onto span(Q_k)
4. Get the estimated data set: Û = Q_k Q_k^T Ũ

PCA Technique: Huang, Du and Chen, SIGMOD 2005
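The four steps above can be sketched in NumPy. The data sizes, the low-rank construction of U, and the factor-of-2 eigenvalue threshold for choosing k are my assumptions for illustration (the paper discusses choosing k separately); for i.i.d. noise, E[V V^T] = m σ² I motivates comparing eigenvalues against m σ².

```python
import numpy as np

rng = np.random.default_rng(2)

# U: n attributes x m records with strong correlation (rank-3 structure).
n, m, k_true = 10, 5000, 3
U = rng.normal(size=(n, k_true)) @ rng.normal(size=(k_true, m))

# Perturbed data U~ = U + V with i.i.d. Gaussian noise of known variance.
sigma = 0.5
V = rng.normal(0.0, sigma, size=(n, m))
Ut = U + V

# 1. EVD of the perturbed covariance matrix A~ = U~ U~^T.
lam, Q = np.linalg.eigh(Ut @ Ut.T)  # ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]      # sort descending

# 2. Keep the k components whose eigenvalues clearly exceed the noise
#    level m * sigma^2 (the factor 2 is an assumed, conservative margin).
k = int(np.sum(lam > 2 * m * sigma**2))
Qk = Q[:, :k]                       # orthonormal basis of the signal subspace

# 3./4. Orthogonal projection of U~ onto span(Qk) gives the estimate U^.
Uhat = Qk @ (Qk.T @ Ut)

rel_err = np.linalg.norm(Uhat - U) / np.linalg.norm(U)
baseline = np.linalg.norm(Ut - U) / np.linalg.norm(U)
print(rel_err < baseline)  # True: filtering beats using the raw U~ as estimate
```

The projection discards the noise energy lying outside the k-dimensional signal subspace, which is why the attacker's point-wise estimate improves as the data become more correlated.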
Motivation
- Previous work on individual value reconstruction was only empirical
- The relationship between estimation accuracy and the noise was not clear
Two questions:
- Attacker's question: how close is the estimate obtained via Spectral Filtering to the original data?
- Data owner's question: how much noise should be added to preserve privacy at a given tolerated level?
Our Work
- Investigate the explicit relationship between estimation accuracy and the noise
- Derive an upper bound on ||Û − U||_F in terms of the noise V
- The upper bound determines how close the estimate achieved by attackers can be to the original data
- It poses a serious threat of privacy breach
Preliminary
F-norm and 2-norm:
  ||A||_F = (Σ_{i,j} a_{ij}²)^{1/2}    ||A||_2 = max_{x≠0} ||Ax||_2 / ||x||_2
Some properties:
- ||A||_2 ≤ ||A||_F ≤ √rank(A) · ||A||_2
- ||A||_2 = √λ_max(A^T A), the square root of the largest eigenvalue of A^T A
- If A is symmetric, then ||A||_2 = λ_max(A), the largest eigenvalue of A
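These norm identities are easy to verify numerically; a quick NumPy check on a random matrix (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))

fro = np.sqrt((A**2).sum())                       # ||A||_F from the definition
two = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())  # sqrt(lambda_max(A^T A))

print(np.isclose(fro, np.linalg.norm(A)))         # True: matches built-in F-norm
print(np.isclose(two, np.linalg.norm(A, 2)))      # True: matches spectral norm
print(two <= fro <= np.sqrt(np.linalg.matrix_rank(A)) * two)  # True

# Symmetric (here PSD) case: ||S||_2 equals the largest eigenvalue of S.
S = A @ A.T
print(np.isclose(np.linalg.norm(S, 2), np.linalg.eigvalsh(S).max()))  # True
```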
Matrix Perturbation
- Traditional matrix perturbation theory: how the derived perturbation E affects the covariance matrix A, i.e., Ã = A + E
- Our scenario: how the primary perturbation V affects the data matrix U, i.e., Ũ = U + V
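The link between the two views is plain algebra: expanding Ã = (U + V)(U + V)^T shows that the primary perturbation V induces a derived perturbation E = U V^T + V U^T + V V^T on A = U U^T. A small numeric check (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
U = rng.normal(size=(5, 100))  # original data matrix
V = rng.normal(size=(5, 100))  # additive noise matrix

A  = U @ U.T                   # original covariance-style matrix A = U U^T
At = (U + V) @ (U + V).T       # perturbed matrix from U~ = U + V

# Derived perturbation from expanding (U + V)(U + V)^T:
E = U @ V.T + V @ U.T + V @ V.T
print(np.allclose(At, A + E))  # True: A~ = A + E exactly
```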
Error Bound Analysis
Prop 1. Let the covariance matrix of the perturbed data be Ã = Ũ Ũ^T. Given A = U U^T and Ũ = U + V, the derived perturbation is E = Ã − A = U V^T + V U^T + V V^T.
Prop 2. Each eigenvalue shift is bounded by the largest eigenvalue of E, i.e., |λ̃_i − λ_i| ≤ ||E||_2, and the perturbation of the principal subspace is governed by the eigengap λ_k − λ_{k+1}.
Theorem
Given a data set U and a noise set V, we have the perturbed data set Ũ = U + V. Let Û be the estimate obtained from Spectral Filtering; then ||Û − U||_F is bounded above in terms of ||E||, where E is the derived perturbation on the original covariance matrix A = U U^T. (Proof is skipped.)
Special Cases
- When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and known variance
- When the noise is completely correlated with the data
Experimental Results
Artificial data set: 35 correlated variables, 30,000 tuples
Experimental Results
Scenarios of noise addition:
- Case 1: i.i.d. Gaussian noise N(0, COV), where COV = diag(σ², …, σ²)
- Case 2: independent Gaussian noise N(0, COV), where COV = c · diag(σ₁², …, σ_n²)
- Case 3: correlated Gaussian noise N(0, COV), where COV = c · Σ_U (or c · A)
Measures: absolute error ||Û − U||_F and relative error ||Û − U||_F / ||U||_F
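The three noise scenarios can be generated as follows. This is an illustrative sketch (data dimensions, σ², and c are placeholder values, and the Frobenius-norm error measures are the standard definitions assumed above, not quoted from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 8, 1000
U = rng.normal(size=(n, m))     # stand-in data: n attributes x m tuples
Sigma_U = np.cov(U)             # sample covariance of the data
sigma2, c = 0.5, 0.1

zero = np.zeros(n)
# Case 1: i.i.d. Gaussian noise, COV = diag(sigma^2, ..., sigma^2)
V1 = rng.multivariate_normal(zero, sigma2 * np.eye(n), size=m).T
# Case 2: independent Gaussian noise, COV = c * diag(sigma_1^2, ..., sigma_n^2)
V2 = rng.multivariate_normal(zero, c * np.diag(np.diag(Sigma_U)), size=m).T
# Case 3: correlated Gaussian noise, COV = c * Sigma_U
V3 = rng.multivariate_normal(zero, c * Sigma_U, size=m).T

def errors(Uhat, U):
    """Absolute error ||U^ - U||_F and relative error ||U^ - U||_F / ||U||_F."""
    abs_err = np.linalg.norm(Uhat - U)
    return abs_err, abs_err / np.linalg.norm(U)

abs_e, rel_e = errors(U + V1, U)   # error of the raw perturbed data itself
print(V1.shape == (n, m) and abs_e > 0)  # True
```

Case 3 is the interesting one for the bound analysis: noise shaped like Σ_U hides inside the signal subspace, so spectral filtering separates it far less cleanly than in Case 1.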
Determining k
Determine k in Spectral Filtering according to matrix perturbation theory. Our heuristic approach: choose k as the number of eigenvalues of the perturbed covariance matrix that exceed the eigenvalue level contributed by the noise.
Effect of varying k (case 1): N(0, COV), where COV = diag(σ², …, σ²)
[Table: relative error and ||V||_F for k = 1…5 across noise levels σ²; * marks the smallest error, occurring at k = 4]

Effect of varying k (case 2): N(0, COV), where COV = c · diag(σ₁², σ₂², …, σ_n²)
[Table: relative error and ||V||_F for k = 1…5 across values of c; * marks the smallest error per setting, mostly at k = 4]

Effect of varying k (case 3): N(0, COV), where COV = c · Σ_U
[Table: relative error and ||V||_F for k = 1…5 across values of c; * marks the smallest error per setting, mostly at k = 4]
Effect of varying noise
[Figure: reconstruction quality for σ² = 0.1, 0.5, 1.0; ||V||_F / ||U||_F = 87.8%]
Effect of covariance matrix
[Figure: reconstruction quality for Cases 1, 2, and 3; ||V||_F / ||U||_F = 39.1%]
Conclusion
- Spectral-filtering-based techniques have been investigated as a major means of point-wise data reconstruction
- We present an upper bound which enables attackers to determine how close their estimated data is to the original data
Future Work
- A lower bound, representing the best estimate the attacker can achieve using Spectral Filtering; data owners could use it to determine how much noise should be added to preserve privacy
- Bound analysis at the point-wise level
Acknowledgement
- NSF Grant CCR IIS
- Personnel: Xintao Wu, Songtao Guo, Ling Guo
- More Info
Questions? Thank you!