Privacy preserving data mining – multiplicative perturbation techniques Li Xiong CS573 Data Privacy and Anonymity

Outline
- Review and critique of randomization approaches (additive noise)
- Multiplicative data perturbations: rotation perturbation, geometric data perturbation, random projection
- Comparison

Additive noise (randomization)
- Reveal the entire database, but randomize the entries: random noise δ_i is added to each database entry x_i, so the database x_1, ..., x_n is published as x_1 + δ_1, ..., x_n + δ_n
- For example, if the noise distribution has mean 0, the user can still compute the average of the x_i
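A minimal sketch of additive perturbation in numpy: mean-zero noise is added entry-wise, and aggregate statistics such as the column mean survive (the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(20, 70, size=10_000)        # e.g., an Age column
noise = rng.normal(0, 15, size=x.shape)     # mean-0 noise, distribution published
x_pub = x + noise                           # x_i + δ_i is what gets revealed

# With mean-0 noise, the user can still estimate the average of the x_i
print(f"true mean: {x.mean():.2f}, from published data: {x_pub.mean():.2f}")
```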

Learning a decision tree on randomized data
- Pipeline (figure): original records (e.g., Age 30, Salary 70K) -> Randomizer -> randomized records (e.g., Age 65, Salary 20K) -> reconstruct the distributions of Age and Salary -> classification algorithm -> model
- A random number is added to each attribute: e.g., Alice's age 30 becomes 65 (30 + 35)
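A hedged sketch of the distribution-reconstruction step, following the iterative Bayesian procedure of Agrawal and Srikant, discretized over bins; this is an illustrative re-implementation, not the authors' code:

```python
import numpy as np
from scipy.stats import norm

def reconstruct_distribution(w, noise_std, bins, iters=50):
    """Estimate the density of X over `bins` from perturbed values w = x + noise."""
    centers = (bins[:-1] + bins[1:]) / 2
    fx = np.full(len(centers), 1.0 / len(centers))     # start from uniform
    # Likelihood f_noise(w_j - a) for every record j and bin center a
    like = norm.pdf(w[:, None] - centers[None, :], scale=noise_std)
    for _ in range(iters):
        post = like * fx                               # Bayes numerator
        post /= post.sum(axis=1, keepdims=True)        # posterior per record
        fx = post.mean(axis=0)                         # updated density estimate
    return fx

# Example: recover the Age distribution from heavily perturbed values
rng = np.random.default_rng(7)
age = rng.uniform(20, 70, size=5000)
w = age + rng.normal(0, 15, size=age.shape)
fx = reconstruct_distribution(w, noise_std=15, bins=np.linspace(0, 100, 51))
```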

Summary of additive perturbation
Benefits:
- Easy to apply: the noise is applied separately to each data point (record)
- Low cost
- Can be used for both the web model (each user i sends x_i + δ_i to the web application instead of the private value x_i) and the corporate model

Additive perturbation: privacy
- The noise distribution needs to be published
- The column distribution is disclosed
- Subject to data value attacks! (On the Privacy Preserving Properties of Random Data Perturbation Techniques, Kargupta, 2003a)

The spectral filtering technique can be used to estimate the original data

The spectral filtering technique can perform poorly when there is an inherent random component in the original data

Randomization: data utility
- Only preserves the column distribution
- Existing data mining algorithms need to be redesigned/modified
- Limited data mining applications: decision tree and naïve Bayes classifiers

Randomization approaches
- Privacy guarantee and data utility/model accuracy: it is difficult to balance the two factors
- Low data utility
- Subject to attacks

More thoughts about perturbation
1. Preserve privacy
- Hide the original data: it should not be easy to estimate the original values from the perturbed data
- Protect against data reconstruction techniques, where the attacker has prior knowledge about the published data
2. Preserve data utility for tasks
- Single-dimensional properties (column distribution, etc.): decision tree, Bayesian classifier
- Multi-dimensional properties (covariance matrix, distance, etc.): SVM classifier, kNN classification, clustering

Multiplicative perturbations: preserving multidimensional data properties
- Geometric data perturbation (GDP) [Chen '07]: rotation perturbation, translation perturbation, noise addition
- Random projection perturbation (RPP) [Liu '06]

Chen, K. and Liu, L. Towards attack-resilient geometric data perturbation. SDM, 2007.
Liu, K., Kargupta, H., and Ryan, J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. TKDE, 2006.

Rotation perturbation: G(X) = R * X
- R (m x m): an orthonormal matrix (R^T R = R R^T = I)
- X (m x n): the original data set, n m-dimensional data points as columns
- G(X) (m x n): the rotated data set
Key features:
- Preserves Euclidean distances and inner products of data points
- Preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space
- Example (figure): a 3 x 3 rotation matrix applied to a small table with attributes age, rent, and tax
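A minimal sketch of rotation perturbation in numpy, assuming records are columns of X; the random orthonormal R is drawn by QR-decomposing a Gaussian matrix (matrix sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 100                        # 3 attributes, 100 records
X = rng.normal(size=(m, n))          # original data, one record per column

# Draw a random orthonormal matrix R via QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
assert np.allclose(R.T @ R, np.eye(m))

G = R @ X                            # rotated (perturbed) data

# Euclidean distance between any two records is preserved
d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_pert = np.linalg.norm(G[:, 0] - G[:, 1])
assert np.isclose(d_orig, d_pert)
```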

Illustration of multiplicative data perturbation: preserving distances while perturbing each individual dimension

Data properties
- A model is invariant to geometric perturbation if distance plays an important role: class/cluster membership and decision boundaries are determined by distances between points, not by their concrete locations
- 2D example (figure): under rotation and translation, the classification boundary between Class 1 and Class 2 is unchanged; under distance perturbation (noise addition), it is only slightly changed

Applicable DM algorithms: models "invariant" to GDP
- All Euclidean-distance-based clustering algorithms
- Classification algorithms: k-nearest neighbors, kernel methods, linear classifiers, support vector machines
- Most regression models
- And potentially more ...

When to use multiplicative data perturbation
- The data owner releases G(X) = RX + T + D to a service provider/data user, who runs the mining algorithm F on G(X) and returns the mined models/patterns
- Good for the corporate model or dataset publishing
- Major issue!! Curious service providers/data users may try to break G(X)
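A minimal sketch of the full geometric data perturbation G(X) = RX + T + D, combining the rotation, translation, and noise components (parameter choices such as the noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 100
X = rng.normal(size=(m, n))                  # original data, records as columns

R, _ = np.linalg.qr(rng.normal(size=(m, m))) # rotation component
t = rng.normal(size=(m, 1))                  # random translation vector
T = np.tile(t, (1, n))                       # same shift applied to every record
D = rng.normal(scale=0.1, size=(m, n))       # distance-perturbation noise

G = R @ X + T + D                            # data released to the service provider
```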

Attacks! Three levels of attacker knowledge
- Knows nothing -> naïve estimation
- Knows column distributions -> independent component analysis (ICA)
- Knows specific points (original points and their images in the perturbed data) -> distance inference

Attack 1: naïve estimation
- Estimate the original points purely from the perturbed data
- If "random rotation" only is used, the intensity of the perturbation matters: points around the origin barely move (figure: classification boundaries for Class 1 and Class 2 under rotations of different intensity)

Countering naïve estimation
- Maximize the intensity, based on a formal analysis of "rotation intensity"; the Fast_Opt algorithm in GDP is a method to maximize it
- Add a "random translation" T to hide the origin and increase the difficulty of the attack: the attacker needs to estimate R first in order to find T

Attack 2: ICA-based attacks
- Independent component analysis (ICA) tries to separate R and X from Y = R * X

Characteristics of ICA
1. The ordering of dimensions is not preserved
2. The intensity (value range) is not preserved
Conditions for an effective ICA attack:
1. Knowing the column distributions
2. Knowing the value ranges
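A hedged sketch of an ICA-based attack using scikit-learn's FastICA: given only Y = R * X, the attacker tries to recover the independent source columns. As noted above, the components come back in arbitrary order and scale, so known column distributions/ranges would still be needed to match and rescale them:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
m, n = 3, 2000
X = rng.laplace(size=(m, n))                 # ICA needs non-Gaussian sources
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
Y = R @ X                                    # all the attacker observes

ica = FastICA(n_components=m, random_state=0)
X_est = ica.fit_transform(Y.T).T             # recovered components: rows are
                                             # unordered, unscaled copies of X's rows
```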

Countering the ICA attack
- Weaknesses of the ICA attack: it needs a certain amount of knowledge and cannot effectively handle dependent columns
- In reality, most datasets have correlated columns
- We can find an optimal rotation perturbation that maximizes the difficulty of ICA attacks

Attack 3: distance-inference attack
- With rotation/translation perturbation only, the attacker may know a set of original points and their mapping to the perturbed data (figure: original vs. perturbed data, with a known point and its image)

How the attack is done
- Knowing points and their images, find the exact images of the known points: enumerate candidate pairs by matching distances (less effective for large data; we assume the pairs are successfully identified)
- Estimation:
1. Cancel the random translation T from the pairs (x, x')
2. Calculate R from the pairs: Y = RX, so R = Y X^-1 (Y X^+ with the pseudo-inverse when X is not square)
3. Calculate T with R and the known pairs
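A minimal sketch of the estimation step: given k known (original, image) pairs with Y_k = R X_k + t, center both sides to cancel the translation, solve for R with a pseudo-inverse, then recover t (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 3, 10
X_k = rng.normal(size=(m, k))                # known original points
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
t = rng.normal(size=(m, 1))
Y_k = R @ X_k + t                            # their known images

# 1. Cancel the translation by centering both point sets
Xc = X_k - X_k.mean(axis=1, keepdims=True)
Yc = Y_k - Y_k.mean(axis=1, keepdims=True)

# 2. Solve Yc = R_est @ Xc via the pseudo-inverse
R_est = Yc @ np.linalg.pinv(Xc)

# 3. Recover the translation from R_est and the pairs
t_est = (Y_k - R_est @ X_k).mean(axis=1, keepdims=True)

assert np.allclose(R_est, R) and np.allclose(t_est, t)
```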

Countering distance inference: noise addition
- The noise brings enough variance into the estimation of R and T
- Can the noise be easily filtered? Filtering would require knowing the noise distribution and the distribution of RX + T; neither distribution is published, however
- Note: this is very different from the attacks on additive noise perturbation [Kargupta03]

Attackers with more knowledge?
- What if attackers know a large number of original records? They can accurately estimate the covariance matrix, column distributions, column ranges, etc. of the original data, and methods such as PCA can be used
- What do we do? Stop releasing any kind of data

Benefits of geometric data perturbation
- The privacy guarantee and the data utility/model accuracy are decoupled
- Applicable to many DM algorithms: distance-based clustering; classification (linear, kNN, kernel, SVM, ...)
- Makes optimization and balancing easier: model accuracy is almost fully preserved, so we optimize privacy only

A randomized perturbation optimization algorithm
- Start with a random rotation; the goal is to pass tests on simulated attacks
- Not simply random: a hill-climbing method (see the sketch below)
1. Iteratively determine R: test against naïve estimation (Fast_Opt) and against ICA (2nd level) to find a better rotation R
2. Append a random translation component
3. Append an appropriate noise component
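A hedged sketch of the hill-climbing loop, under stated assumptions: the scoring function attack_resilience is a hypothetical placeholder for the simulated naïve-estimation and ICA tests (the actual Fast_Opt details live in the GDP paper and are not reproduced here):

```python
import numpy as np

def random_rotation(m, rng):
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    return Q

def attack_resilience(R, X):
    # Hypothetical placeholder: a real implementation would run the
    # simulated naïve-estimation and ICA attacks against R @ X and
    # return the resulting minimum privacy guarantee (higher is better).
    return float(np.abs(R @ X - X).std(axis=1).min())

def optimize_rotation(X, iters=100, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    best_R = random_rotation(m, rng)
    best = attack_resilience(best_R, X)
    for _ in range(iters):
        # Candidate: an orthonormal matrix near the current best
        cand, r = np.linalg.qr(best_R + step * rng.normal(size=(m, m)))
        cand = cand * np.sign(np.diag(r))     # fix QR column-sign ambiguity
        score = attack_resilience(cand, X)
        if score > best:                      # hill climbing: keep improvements
            best_R, best = cand, score
    return best_R
```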

Privacy guarantee: GDP
- Against naïve estimation and ICA-based attacks, using only the random rotation and translation components (R*X + T)
- (Figure compares three cases: the worst perturbation with no optimization, a perturbation optimized for naïve estimation only, and a perturbation optimized for both attacks)

Privacy guarantee: GDP
- Against distance-inference attacks, using all three components (R*X + T + D), with Gaussian noise D ~ N(0, σ^2)
- Assume the (original, image) pairs are identified by the attacker: with no noise addition the privacy guarantee is 0, but the privacy guarantee is considerably high already at a small perturbation (σ = 0.1)

Data utility: GDP with noise addition
- Noise addition vs. model accuracy, with noise N(0, σ^2)
- Boolean data is more sensitive to distance perturbation

Random projection perturbation
- A random projection maps a set of data points from a high-dimensional space to a lower-dimensional subspace: F(X) = P * X
- X is an m x n matrix: n m-dimensional data points as columns
- P is a k x m random matrix, k <= m
- Johnson-Lindenstrauss lemma: there is a random projection F() such that, for a small ε < 1, (1 - ε)||x - y|| <= ||F(x) - F(y)|| <= (1 + ε)||x - y||, i.e., distances are approximately preserved
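A minimal sketch of random projection perturbation: P is a k x m Gaussian matrix scaled by 1/sqrt(k), a standard construction for which the Johnson-Lindenstrauss bound holds with high probability (the dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 50, 200, 20                    # project 50 dimensions down to 20
X = rng.normal(size=(m, n))

P = rng.normal(size=(k, m)) / np.sqrt(k) # scaled Gaussian projection matrix
FX = P @ X                               # perturbed, lower-dimensional data

# Distortion of one pairwise distance; close to 1 for most pairs
d = np.linalg.norm(X[:, 0] - X[:, 1])
d_proj = np.linalg.norm(FX[:, 0] - FX[:, 1])
print(f"distance distortion: {d_proj / d:.3f}")
```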

Data utility: RPP
- Reduced number of dimensions vs. model accuracy (figures: kNN classifiers and SVMs)

Random projection vs. geometric perturbation
- Privacy preservation: both are subject to similar kinds of attacks; RPP is more resilient to distance-based attacks
- Utility preservation (model accuracy): in GDP, R and T exactly preserve distances while the effect of D needs experimental evaluation; RPP only approximately preserves distances, trading the number of projected dimensions against utility

Coming up Output perturbation Cryptographic protocols

Methodology of attack analysis
- An attack is an estimate of the original data: original O = (x_1, x_2, ..., x_n) vs. estimate P = (x'_1, x'_2, ..., x'_n)
- How similar are these two series? One effective method is to evaluate the variance/standard deviation of their difference [Rakesh00]: Var(P - O) or std(P - O), where P is the estimate and O the original

Two multi-column privacy metrics
- q_i: the privacy guarantee for column i, q_i = std(P_i - O_i), where O_i are the normalized original column values and P_i the estimated column values
- Min privacy guarantee: the weakest link over all columns, min{ q_i, i = 1..d }
- Avg privacy guarantee: the overall privacy guarantee, (1/d) Σ_i q_i
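A minimal sketch of the two metrics, assuming O and P hold the normalized original and estimated values with one data column per row (the function name is illustrative):

```python
import numpy as np

def privacy_guarantees(O, P):
    """q_i = std(P_i - O_i) per column; returns (min, avg) guarantees."""
    q = (P - O).std(axis=1)
    return q.min(), q.mean()

rng = np.random.default_rng(5)
O = rng.normal(size=(4, 1000))                 # 4 normalized columns
P = O + rng.normal(scale=0.2, size=O.shape)    # a hypothetical attack's estimate
min_pg, avg_pg = privacy_guarantees(O, P)
print(f"min PG: {min_pg:.3f}, avg PG: {avg_pg:.3f}")
```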