1 Deriving Private Information from Randomized Data. Zhengli Huang, Wenliang (Kevin) Du, Biao Chen. Syracuse University.

2 Privacy-Preserving Data Mining: data collection → data disguising → central database → data mining (classification, association rules, clustering).

3 Random Perturbation: Original Data X + Random Noise R = Disguised Data Y.
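Below is a minimal NumPy sketch of this additive-perturbation model; the record count, attribute correlation structure, and noise level are made-up illustrative values, not parameters from the paper. Later sketches reuse the same synthetic setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not taken from the paper): n records, m correlated attributes
# that share a common hidden factor.
n, m = 2000, 8
common = rng.normal(size=(n, 1))
X = common + 0.3 * rng.normal(size=(n, m))

sigma = 1.0                                   # assumed noise standard deviation
R = rng.normal(scale=sigma, size=(n, m))      # random noise R
Y = X + R                                     # disguised data released for mining
```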

4 How Secure Is Random Perturbation?

5 A Simple Observation: we cannot perturb the same value multiple times. If we do, the original value can be estimated. Let t be the original value and the disguised data be t + R_1, t + R_2, …, t + R_m. Let Z = [(t + R_1) + … + (t + R_m)] / m; then E(Z) = t.
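As a quick numerical illustration of this observation (with made-up values for t, m, and the noise scale), averaging many independent perturbations of the same value recovers that value, since the zero-mean noise averages out:

```python
import numpy as np

rng = np.random.default_rng(0)

t = 42.0                          # the original value (made up for illustration)
m = 10_000                        # number of times the same value is perturbed
disguised = t + rng.normal(scale=5.0, size=m)   # t + R_1, ..., t + R_m

z = disguised.mean()              # Z = [(t + R_1) + ... + (t + R_m)] / m
print(z)                          # close to 42.0, since E(Z) = t
```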

6 This looks familiar… The data set is (x, x, x, x, x, x, x, x); random perturbation gives (x + r_1, x + r_2, …, x + r_m). We know this is NOT safe. Observation: the data set is highly correlated.

7 Let's Generalize! Data set: (x_1, x_2, x_3, …, x_m). If the correlation among data attributes is high, can we use it to improve our estimation from the disguised data?

8 Data Reconstruction (DR): from the disguised data Y and the distribution of the random noise, reconstruct data X'. What is the difference between X' and the original data X?

9 Reconstruction Algorithms: Principal Component Analysis (PCA); Bayes estimate method.

10 PCA-Based Data Reconstruction

11 PCA-Based Reconstruction: squeeze the disguised information down to the reconstructed information, incurring some information loss.

12 How? Observation: the original data are correlated, while the noise is not. Principal Component Analysis is useful for lossy compression.

13 PCA Introduction: the main use of PCA is to reduce the dimensionality while retaining as much information as possible. The 1st principal component contains the greatest amount of variation; the 2nd contains the next largest amount.

14 For the Original Data: they are correlated. If we remove 50% of the dimensions, the actual information loss might be less than 10%.

15 For the Random Noise: it is not correlated, and its variance is distributed evenly across all directions. If we remove 50% of the dimensions, the noise loss should be about 50%.
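A small sketch, on synthetic data rather than the paper's, of the contrast drawn on slides 14 and 15: the variance of correlated data concentrates in a few principal components, while the variance of uncorrelated noise spreads evenly across all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 8

# Correlated data (attributes share a common factor) vs. uncorrelated noise;
# both are synthetic and purely illustrative.
X = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, m))
R = rng.normal(scale=1.0, size=(n, m))

def top_half_variance(A):
    """Fraction of total variance captured by the top m/2 principal components."""
    eigvals = np.linalg.eigvalsh(np.cov(A, rowvar=False))[::-1]   # descending order
    return eigvals[: A.shape[1] // 2].sum() / eigvals.sum()

print(f"data : {top_half_variance(X):.2f}")   # close to 1.0: variance concentrates
print(f"noise: {top_half_variance(R):.2f}")   # close to 0.5: variance spreads evenly
```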

16 PCA-Based Reconstruction: Original Data X → Disguised Data → PCA compression → de-compression → Reconstructed Data.
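The pipeline above, sketched in NumPy on synthetic data; the data model and the number of retained components k are illustrative assumptions. PCA is run on the disguised data, only the top components are kept, and the result is mapped back to the original space, discarding most of the noise while keeping most of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 2000, 8, 2              # k = number of principal components kept (assumed)

# Illustrative correlated data X and its disguised version Y = X + R.
X = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, m))
Y = X + rng.normal(scale=1.0, size=(n, m))

# PCA on the disguised data: project onto the top-k components, then map back.
mu = Y.mean(axis=0)
Yc = Y - mu
_, eigvecs = np.linalg.eigh(np.cov(Yc, rowvar=False))   # ascending eigenvalues
V = eigvecs[:, -k:]                                      # top-k eigenvectors
X_hat = Yc @ V @ V.T + mu                                # compress, then de-compress

print("error of Y    :", np.mean((Y - X) ** 2))          # ≈ full noise variance
print("error of X_hat:", np.mean((X_hat - X) ** 2))      # noticeably smaller
```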

17 Bayes-Estimation-Based Data Reconstruction

18 A Different Perspective: given the disguised data Y and the random-noise distribution, which possible X is the most likely?

19 The Problem Formulation: for each possible X there is a probability P(X | Y). Find the X such that P(X | Y) is maximized. How do we compute P(X | Y)?

20 The Power of the Bayes Rule: computing P(X|Y) directly is difficult, but Bayes' rule gives P(X|Y) = P(Y|X) · P(X) / P(Y).

21 Computing P(X | Y): P(X|Y) = P(Y|X) · P(X) / P(Y). P(Y|X): remember Y = X + R, so it follows from the known noise distribution. P(Y): a constant (we don't care). How to get P(X)? This is where the correlation can be used: assume a multivariate Gaussian distribution whose parameters are unknown.

22 Multivariate Gaussian Distribution: each variable is Gaussian with mean μ_i; the mean vector is μ = (μ_1, …, μ_m) and the covariance matrix is Σ. Both μ and Σ can be estimated from Y, so we can get P(X).
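A sketch of how μ and Σ can be estimated from the disguised data alone, assuming the additive noise is zero mean, independent of the data, and has a known variance (the synthetic data and noise level are illustrative): the sample mean of Y estimates μ, and subtracting the noise covariance from the sample covariance of Y estimates Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 2000, 8, 1.0        # sigma: noise std. deviation, assumed known

X = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, m))   # illustrative data
Y = X + rng.normal(scale=sigma, size=(n, m))                   # disguised data

# With Y = X + R, R zero-mean and independent of X:
#   E[Y] = mu   and   Cov(Y) = Sigma + sigma^2 * I,
# so both parameters of the Gaussian model for X come from Y alone.
mu_hat = Y.mean(axis=0)
Sigma_hat = np.cov(Y, rowvar=False) - sigma ** 2 * np.eye(m)

print(np.abs(Sigma_hat - np.cov(X, rowvar=False)).max())   # small estimation error
```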

23 Bayes-Estimate-Based Data Reconstruction: the original X is randomized into the disguised data Y; the estimated X is the one that maximizes P(X|Y), i.e., maximizes P(Y|X) · P(X).
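Under the multivariate-Gaussian assumption the posterior P(X|Y) is itself Gaussian, so the X that maximizes P(Y|X) · P(X) has a closed form (the posterior mean). A sketch on the same illustrative synthetic setup as above, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 2000, 8, 1.0

X = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, m))   # illustrative data
Y = X + rng.normal(scale=sigma, size=(n, m))                   # disguised data

# Gaussian model for X, estimated from the disguised data as on slide 22.
mu = Y.mean(axis=0)
Sigma = np.cov(Y, rowvar=False) - sigma ** 2 * np.eye(m)

# With X ~ N(mu, Sigma) and R ~ N(0, sigma^2 I), P(X|Y) is Gaussian, so the X that
# maximizes P(Y|X) * P(X) is the posterior mean:
#   X_hat = mu + Sigma (Sigma + sigma^2 I)^{-1} (y - mu)
W = Sigma @ np.linalg.inv(Sigma + sigma ** 2 * np.eye(m))
X_hat = mu + (Y - mu) @ W.T

print("error of Y    :", np.mean((Y - X) ** 2))      # ≈ sigma^2
print("error of X_hat:", np.mean((X_hat - X) ** 2))  # smaller: closer to the original X
```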

24 Evaluation

25 Increasing the Number of Attributes

26 Increasing Eigenvalues of the Non-Principal Components

27 How to improve Random Perturbation?

28 Observation from PCA: how do we make it difficult to squeeze out the noise? Make the correlation of the noise similar to that of the original data, so the noise concentrates on the same principal components as the original data X. How do we get the correlation of X?
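A sketch of the correlated-noise idea; the covariance is estimated from X itself purely for illustration, since how the data owner obtains it is the open question this slide raises. Gaussian noise is shaped with the Cholesky factor of the target covariance so that Cov(R) is proportional to Cov(X).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 8

X = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, m))   # illustrative data

# Shape the noise so that Cov(R) is proportional to Cov(X); the data covariance is
# taken from X itself here only for illustration.
scale = 1.0                                    # assumed noise-to-data variance ratio
L = np.linalg.cholesky(scale * np.cov(X, rowvar=False))
R = rng.normal(size=(n, m)) @ L.T              # correlated noise, Cov(R) ≈ scale * Cov(X)
Y = X + R

# The noise now concentrates on the same principal components as X, so projecting
# onto the top components no longer filters it out.
```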

29 Improved Randomization

30 Conclusion and Future Work: When does randomization fail? When the data correlation is high. Can it be cured? Use correlated noise that mimics the original data. Still unknown: is the correlated-noise approach really better, and can other information affect privacy?