Multiplicative Data Perturbations

Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random projection  Understanding Distance preservation Perturbation-invariant models  Attacks Privacy Evaluation Model Background knowledge and attack analysis Attack-resilient optimization  Comparison

Summary on additive perturbations
 Problems
  - Weak to various attacks
    - The noise distribution must be published
    - The column distribution is known
  - Data mining algorithms must be developed or revised to utilize the perturbed data
    - So far, only the decision tree and naïve Bayes classifiers have been adapted to additive perturbation
 Benefits
  - Can be applied to both the Web model and the corporate model
  - Low cost

More thoughts about perturbation
1. Preserve privacy
  - Hide the original data: it should not be easy to estimate the original values from the perturbed data
  - Protect from data reconstruction techniques: the attacker has prior knowledge about the published data
2. Preserve data utility for tasks
  - Single-dimensional information: column data distribution, etc.
  - Multi-dimensional information: covariance matrix, distances, etc.

For most PP approaches…
[Figure: trade-off between privacy guarantee and data utility/model accuracy]
 Difficult to balance the two factors
 Subject to attacks
 May need new DM algorithms: randomization, cryptographic approaches

Multiplicative perturbations
 Geometric data perturbation (GDP)
  - Rotation data perturbation
  - Translation data perturbation
  - Noise addition
 Random projection perturbation (RPP)

Definition of Geometric Data Perturbation
 G(X) = R*X + T + D
  - R: random rotation
  - T: random translation
  - D: random noise, e.g., Gaussian noise
 Characteristics: R and T preserve distances exactly; D slightly perturbs distances
[Example figure: a small (ID, age, rent, tax) table transformed as G(X) = R*X + T + D]
 Each component has its use in enhancing resilience to attacks!
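
The definition above maps directly to a few lines of NumPy. A minimal sketch, assuming records are stored column-wise as in the slides; the function name, the QR-based rotation, and the default noise level sigma are illustrative choices, not the authors' implementation.

```python
import numpy as np

def geometric_perturbation(X, sigma=0.1, rng=None):
    """G(X) = R*X + T + D for a d x n matrix X (d attributes, n records)."""
    rng = np.random.default_rng(rng)
    d, n = X.shape
    # R: a random orthogonal (rotation) matrix from the QR decomposition
    # of a Gaussian matrix -- it preserves all pairwise distances.
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))
    # T: a random translation, one offset per attribute; also distance-preserving.
    T = rng.standard_normal((d, 1))
    # D: Gaussian noise that slightly perturbs distances.
    D = rng.normal(0.0, sigma, size=(d, n))
    return R @ X + T + D
```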

Benefits of Geometric Data Perturbation
 Privacy guarantee and data utility/model accuracy are decoupled, which makes optimization and balancing easier: model accuracy is almost fully preserved, so we optimize privacy only
 Applicable to many DM algorithms
  - Distance-based clustering
  - Classification: linear, KNN, kernel, SVM, …
 Resilient to attacks (the result of attack research)

Definition of Random Projection Perturbation
 F(X) = P*X
  - X is an m×n matrix: m rows (attributes) and n columns (records)
  - P is a k×m random matrix, k <= m
 Johnson-Lindenstrauss Lemma: there is a random projection F() and a small number e < 1 such that
    (1-e)·||x-y|| <= ||F(x)-F(y)|| <= (1+e)·||x-y||
  i.e., distance is approximately preserved.
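
As an illustration (not from the slides), a minimal random projection sketch with an empirical check of the Johnson-Lindenstrauss bound; the 1/sqrt(k) scaling is a standard choice that keeps squared distances approximately unbiased under projection.

```python
import numpy as np

def random_projection(X, k, rng=None):
    """F(X) = P*X for an m x n matrix X (m attributes, n records), k <= m."""
    rng = np.random.default_rng(rng)
    m, _ = X.shape
    # Gaussian entries scaled by 1/sqrt(k): squared distances are
    # approximately preserved in expectation.
    P = rng.standard_normal((k, m)) / np.sqrt(k)
    return P @ X

# Empirical check of the distance bound on one pair of points.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))            # two records in 100 dimensions
Y = random_projection(X, k=50, rng=1)
print(np.linalg.norm(X[:, 0] - X[:, 1]))     # original distance
print(np.linalg.norm(Y[:, 0] - Y[:, 1]))     # approximately the same
```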

Comparison between GDP and RPP
 Privacy preservation
  - Subject to similar kinds of attacks
  - RPP is more resilient to distance-based attacks
 Utility preservation (model accuracy)
  - GDP preserves distances well
  - RPP only approximately preserves distances, so model accuracy is not guaranteed

Illustration of multiplicative data perturbation
 Preserving distances while perturbing each individual dimension

A model “invariant” to GDP…
 If distance plays an important role: class/cluster membership and decision boundaries are determined by distances, not by the concrete locations of the points
 2D example: rotation and translation leave the classification boundary intact; distance perturbation (noise addition) changes it only slightly
[Figure: panels showing Class 1/Class 2 and their classification boundary under rotation + translation and under noise addition]

Applicable DM algorithms
 Modeling methods that depend on Euclidean geometric properties
 Models “invariant” to GDP (a quick numerical check follows)
  - All Euclidean-distance-based clustering algorithms
  - Classification algorithms: k nearest neighbors, kernel methods, linear classifiers, support vector machines
  - Most regression models
  - And potentially more…
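
A quick sanity check of this invariance (an assumed setup, using scikit-learn and the Iris data as a stand-in dataset): a KNN classifier scores the same on the original and on rotated-plus-translated data, because all pairwise distances are unchanged.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # here each row is one record
rng = np.random.default_rng(0)
d = X.shape[1]
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation
t = rng.standard_normal(d)                          # random translation
X_pert = X @ Q.T + t                                # rotate + translate, no noise

knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X, y).mean())       # accuracy on original data
print(cross_val_score(knn, X_pert, y).mean())  # same, up to floating point
```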

When to use multiplicative data perturbation
[Figure: the data owner publishes G(X) = RX + T + D to the service provider/data user, who mines models/patterns F(G(X), …) and applies F to newly perturbed records G(X_new)]
 Good for the corporate model or dataset publishing
 Major issue: curious service providers/data users may try to break G(X)

Major issue: attacks!
 Many existing PP methods turn out to be much less effective once attacks are considered
  - Example: the various data reconstruction attacks on the random noise addition approach [Huang05][Guo06]
 Prior knowledge: the service provider Y has prior knowledge about X’s domain, and nothing stops Y from using it to infer information from the sanitized data

Knowledge used to attack GDP
 Three levels of knowledge
  - Know nothing → naïve estimation
  - Know column distributions → Independent Component Analysis (ICA)
  - Know specific points (original points and their images in the perturbed data) → distance inference

Methodology of attack analysis
 An attack is an estimate of the original data
  - Original O = (x_1, x_2, …, x_n) vs. estimate P = (x'_1, x'_2, …, x'_n)
  - How similar are these two series?
 One effective method is to evaluate the variance/standard deviation of the difference [Rakesh00]: Var(P−O) or std(P−O), where P is the estimated and O the original data

Two multi-column privacy metrics
 q_i: privacy guarantee for column i
  - q_i = std(P_i − O_i), where O_i are the normalized original column values and P_i the estimated column values
 Min privacy guarantee: the weakest link of all columns, min{ q_i, i = 1..d }
 Avg privacy guarantee: the overall privacy guarantee, (1/d) Σ_i q_i
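
A minimal sketch of these two metrics, assuming the normalized original and estimated data are held in d x n arrays (one row per column of the original table); the function name is illustrative.

```python
import numpy as np

def privacy_guarantees(O, P):
    """O, P: d x n arrays of normalized original / estimated values."""
    q = np.std(P - O, axis=1)         # q_i = std(P_i - O_i) for column i
    return q.min(), q.mean()          # min and average privacy guarantees
```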

Attack 1: naïve estimation
 Estimate original points purely from the perturbed data
 If using “random rotation” only
  - The intensity of the perturbation matters
  - Points around the origin stay close to their original positions, since rotation fixes the origin
[Figure: 2D examples of Class 1/Class 2 with classification boundaries under rotations of different intensity]

Counter naïve estimation
 Maximize intensity
  - Based on a formal analysis of “rotation intensity”
  - Method to maximize intensity: the Fast_Opt algorithm in GDP
 “Random translation” T
  - Hides the origin and increases the difficulty of attacking: the attacker needs to estimate R first in order to find T

Attack 2: ICA-based attacks
 Independent Component Analysis (ICA) tries to separate R and X from Y = R*X
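
A hedged sketch of the idea using scikit-learn's FastICA on synthetic data: given only Y = R*X, the attacker tries to separate out the independent rows of X. The recovered components come back in arbitrary order and scale, which is exactly the limitation listed on the next slide.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# X: three independent, non-Gaussian attributes (rows) over 1000 records.
X = rng.laplace(size=(3, 1000))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random rotation
Y = R @ X                                          # the published data

# FastICA expects one sample per row, hence the transposes.
ica = FastICA(n_components=3, random_state=0)
X_est = ica.fit_transform(Y.T).T   # recovered sources, up to order and scale
```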

Characteristics of ICA
1. Ordering of dimensions is not preserved
2. Intensity (value range) is not preserved
Conditions for an effective ICA attack
1. Knowing the column distributions
2. Knowing the value ranges

Counter ICA attack
 Weaknesses of the ICA attack
  - Needs a certain amount of knowledge
  - Cannot effectively handle dependent columns
 In reality…
  - Most datasets have correlated columns
  - We can find an optimal rotation perturbation that maximizes the difficulty of ICA attacks

Attack 3: distance-inference attack
 With only rotation/translation perturbation, an attacker who knows a set of original points and their mapping can recover the transformation…
[Figure: original vs. perturbed data with a known point and its image]

How is the attack done…
 Knowing points and their images, find the exact images of the known points
  - Enumerate pairs by matched distances (less effective for large data); we assume pairs are successfully identified
 Estimation
  1. Cancel the random translation T from the pairs (x, x')
  2. Calculate R from the pairs: Y = RX → R = Y*X⁻¹
  3. Calculate T with R and the known pairs
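
A minimal sketch of the estimation step, assuming the attacker has already matched known original points X_k (one column per point) with their images Y_k; differencing against a reference pair cancels T, and a pseudo-inverse recovers R exactly when no noise was added.

```python
import numpy as np

def estimate_R_T(X_k, Y_k):
    """X_k, Y_k: d x m matrices of matched original points and their images."""
    # Differencing against the first pair cancels the translation T.
    Xc = X_k[:, 1:] - X_k[:, [0]]
    Yc = Y_k[:, 1:] - Y_k[:, [0]]
    # Solve Yc = R @ Xc; exact if D = 0 and the differences span the space.
    R = Yc @ np.linalg.pinv(Xc)
    T = Y_k[:, [0]] - R @ X_k[:, [0]]
    return R, T
```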

Counter distance-inference: noise addition
 Noise brings enough variance into the estimation of R and T
  - The attacker now has to use regression to estimate R, and then use the approximate R to estimate T, which increases the uncertainty
 Can the noise be easily filtered?
  - Filtering requires knowing the noise distribution and the distribution of RX + T; neither distribution is published
  - Note: this is very different from the attacks on additive noise data perturbation [Kargupta03, Huang05]
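
A quick check (assumed setup, reusing estimate_R_T from the previous sketch) that the noise component does its job: the attacker's estimate of R degrades as the noise level grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 10
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # true rotation
T = rng.standard_normal((d, 1))                    # true translation
X_k = rng.standard_normal((d, n))                  # points known to the attacker

for sigma in (0.0, 0.1, 0.5):
    Y_k = R @ X_k + T + rng.normal(0, sigma, (d, n))   # perturbed images
    R_est, _ = estimate_R_T(X_k, Y_k)
    print(sigma, np.linalg.norm(R_est - R))        # error grows with sigma
```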

Attackers with more knowledge?
 What if attackers know a large number of original records?
  - They can accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data
  - Methods such as PCA and AK_ICA can then be used
 What do we do? If that much original information has already been released, the only remedy is to stop releasing any kind of data at all

A randomized perturbation optimization algorithm
 Start with a random rotation
 Goal: passing tests on simulated attacks
 Not simply random: a hill-climbing method (sketched below)
 1. Iteratively determine R
   - Test on naïve estimation (Fast_Opt)
   - Test on ICA (2nd level) → find a better rotation R
 2. Append a random translation component
 3. Append an appropriate noise component
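
A hedged sketch of the hill-climbing idea only; attack_score is a placeholder for the simulated naïve-estimation and ICA tests, and the proposal step (composing with a small random rotation) is one plausible way to search the space, not the Fast_Opt algorithm itself.

```python
import numpy as np

def attack_score(R, X):
    # Placeholder: treat the rotated data itself as the attacker's naive
    # estimate, and score privacy by the weakest column's std of difference.
    return np.std(R @ X - X, axis=1).min()

def optimize_rotation(X, iters=200, step=0.1, rng=None):
    rng = np.random.default_rng(rng)
    d = X.shape[0]
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random start
    best = attack_score(R, X)
    for _ in range(iters):
        # Propose a nearby rotation by composing with a small random one.
        S, _ = np.linalg.qr(np.eye(d) + step * rng.standard_normal((d, d)))
        cand = S @ R
        s = attack_score(cand, X)
        if s > best:             # hill climbing: keep only improving moves
            R, best = cand, s
    return R
```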

Comparison of the methods
 Privacy preservation
  - In general, RPP should be better than GDP
  - For GDP, the effect of the ICA and distance-inference attacks needs experimental evaluation
 Utility preservation
  - GDP: R and T preserve distances exactly; the effect of D needs experimental evaluation
  - RPP: the number of perturbed dimensions vs. utility needs experimental evaluation
 Datasets: 12 datasets from the UCI Data Repository

Privacy guarantee: GDP
 In terms of naïve estimation and ICA-based attacks
 Uses only the random rotation and translation (R*X + T) components
[Figure: privacy guarantees for the worst perturbation (no optimization), the perturbation optimized for naïve estimation only, and the perturbation optimized for both attacks]

Privacy guarantee: GDP
 In terms of distance-inference attacks
 Uses all three components (R*X + T + D); noise D: Gaussian N(0, σ²)
 Assuming pairs of (original, image) points are identified by attackers
  - With no noise addition (σ = 0), the privacy guarantee is 0
  - The privacy guarantee is already considerably high at a small perturbation (σ = 0.1)

Data utility: GDP with noise addition
 Noise addition vs. model accuracy, noise: N(0, σ²)
 Boolean data is more sensitive to distance perturbation

Data utility: RPP
 Reduced number of dimensions vs. model accuracy
[Figure: two panels, KNN classifiers and SVMs]
