Multiplicative Data Perturbations. Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random.

Multiplicative Data Perturbations

Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random projection  Understanding Distance preservation Perturbation-invariant models  Attacks Privacy Evaluation Model Background knowledge and attack analysis Attack-resilient optimization  Comparison

Summary on additive perturbations  problems Weak to various attacks  Need to publish noise distribution  The column distribution is known Need to develop/revise data mining algorithms in order to utilize perturbed data  So far, we have only seen that decision tree and naïve bayes classifier can utilize additive perturbation.  Benefits Can be applied to both the Web model and the collaborative data pooling model Low cost

More thoughts about perturbation 1. Preserve Privacy Hide the original data  not easy to estimate the original values from the perturbed data Protect from data reconstruction techniques  The attacker has prior knowledge on the published data 2. Preserve Data Utility for Tasks Single-dimensional info  column data distribution, etc. Multi-dimensional info  Cov matrix, distance, etc

For most PP approaches… Privacy guarantee Data Utility/ Model accuracy ? Privacy guarantee Data utility/ Model accuracy Difficult to balance the two factors Subject to attacks May need new DM algorithms: randomization, cryptographic approaches

Multiplicative perturbations  Geometric data perturbation (GDP) Rotation data perturbation + Translation data perturbation + Noise addition  Random projection perturbation(RPP)  Sketch approach

Definition of Geometric Data Perturbation  G(X) = R*X + T + D R: random rotation T: random translation D: random noise, e.g., Gaussian noise Characteristics: R&T preserving distance, D slightly perturbing distance Example: ID001002 age3025 rent13501000 tax42303320 12 18 30 -.4.29 -1.7-1.1.131.2 ID001002 age1158943 rent31432424 tax-2919-2297.83-.40.40.2.86.46.53.30-.79 = * + + * Each component has its use to enhance the resilience to attacks!

Benefits of Geometric Data Perturbation Privacy guarantee Data Utility/ Model accuracy decoupled Make optimization and balancing easier! -Almost fully preserving model accuracy - we optimize privacy only Applicable to many DM algorithms -Distance-based Clustering -Classification: linear, KNN, Kernel, SVM,… Resilient to Attacks -the result of attack research

Limitations  Multiplicative perturbations are mostly used in outsourcing Cloud computing Can be applied to multiparty collaborative computing in same cases  Web model does not fit – perturbation parameters cannot be published

Definition of Random Projection Perturbation  F(X) = P*X X is m*n matrix: m columns and n rows P is a k*m random matrix, k <= m  Johnson-Lindenstrauss Lemma There is a random projection F() with e is a small number <1, so that (1-e)||x-y||<=||F(x)-F(y)||<=(1+e)||x-y|| i.e. distance is approximately preserved.

Comparison between GDP and RPP  Privacy preservation Subject to similar kinds of attacks RPP is more resilience to distance-based attacks  Utility preservation(model accuracy) GDP preserves distances well RPP approximately preserves distances  Model accuracy is not guaranteed

Illustration of multiplicative data perturbation Preserving distances while perturbing each individual dimensions

A Model “invariant” to GDP …  If distance plays an important role Class/cluster members and decision boundaries are correlated in terms of distance, not the concrete locations Classification boundary Class 1 Class 2 Classification boundary Class 1 Class 2 Rotation and translation Class 1 Class 2 Slightly changed Classification boundary Distance perturbation (Noise addition) 2D Example:

Applicable DM algorithms  Modeling methods that depend on Euclidean geometric properties  Models “invariant” to GDP all Euclidean distance based clustering algorithms Classification algorithms  K Nearest Neighbors  Kernel methods  Linear classifier  Support vector machines  Most regression models  And potentially more …

When to Use Multiplicative Data Perturbation Data Owner Service Provider/data user G(X)=RX+T+D Mined models/patterns G(X) F(G(X), ) Apply F to G(X new ) Good for the outsourcing model. Major issue!! curious service providers/data users try to break G(X)

Major issue: attacks!  Many existing Privacy Preserving methods are found not so effective when attacks are considered Ex: various data reconstruction algorithms to the random noise addition approach [Huang05][Guo06]  Prior knowledge Service provider Y has “PRIOR KNOWLEDGE” about X’s domain and nothing stops Y from using it to infer information in the sanitized data

Knowledge used to attack GDP  Three levels of knowledge Know nothing  naïve estimation Know column distributions  Independent Component Analysis Know specific input-output records (original points and their images in perturbed data)  distance inference

Methodology of attack analysis  An attack is an estimate of the original data Original O(x 1, x 2,…, x n ) vs. estimate P(x’ 1, x’ 2,…, x’ n ) How similar are these two series? One of the effective methods is to evaluate the MSE of the estimation – VAR(P-O) or STD(P-O)

Two multi-column privacy metrics q i : privacy guarantee for column i q i = std(P i –O i ), O i normalized column values, P i estimated column values Min privacy guarantee: the weakest link of all columns  min { q i, i=1..d} Avg privacy guarantee: overall privacy guarantee  1/d q i

Alternative metric  Based on Agarawal’s information theoretic measure: loss of privacy PI=1- 2 -I(X; X^), X^ is the estimation of X I(X; X^) = H(X) – H(X|X^) = H(X) – H(estimation error)  Exact estimation H(X|X^) =0, PI = 1-2 -H(X)  Random estimation I(X; X^) = 0, PI=0 Already normalized for different columns

Attack 1: naïve estimation  Estimate original points purely based on the perturbed data If using “random rotation” only  Intensity of perturbation matters  Points around origin Classification boundary Class 1 Class 2 Classification boundary Class 1 Class 2 Classification boundary Class 1 Class 2 X Y

Counter naïve estimation  Maximize intensity Based on formal analysis of “rotation intensity” Method to maximize intensity  Fast_Opt algorithm in GDP  “Random translation” T Hide origin Increase difficulty of attacking!  Need to estimate R first, in order to find out T

Attack 2: ICA based attacks  Independent Component Analysis (ICA) Try to separate R and X from Y= R*X

Characteristics of ICA 1. Ordering of dimensions is not preserved. 2. Intensity (value range) is not preserved Conditions of effective ICA-attack 1.Knowing column distribution 2.Knowing value range.

Counter ICA attack  Weakness of ICA attack Need certain amount of knowledge Cannot effectively handle dependent columns  In reality… Most datasets have correlated columns We can find optimal rotation perturbation  maximizing the difficulty of ICA attacks

Original Perturbed Known point image Attack 3: distance-inference attack If with only rotation/translation perturbation, when the attacker knows a set of original points and their mapping…

How is the Attack done …  Knowing points and their images … find exact images of the known points  Enumerate pairs by matched distances … Less effective for large data …  we assume pairs are successfully identified Estimation 1. Cancel random translation T from pairs (x, x’) 2. calculate R with pairs: Y=RX  R = Y*X -1 3. calculate T with R and known pairs

Counter distance-inference: Noise addition  Noise brings enough variance in estimation of R and T Now the attacker has to use regression to estimate R Then, use approximate R to estimate T  increase uncertainty  Regression 1. Cancel random translation T from pairs (x, x’) 2. estimate R with pairs: Y=RX  R = (Y*X T )(X*X T ) -1 3. Use the estimated R and known pairs to estimate T

Discussion  Can the noise be easily filtered? Need to know noise distribution, distribution of RX + T, Both distributions are not published, however. Attack analysis will be different from that for noise addition data perturbation Will PCA based noise filtering [Huang05] be effective?  What are the best estimation that the attacker can get?  If we treat the attack problem as a learning problem -- Minimum variance of error for the learner Higher bound of “loss of privacy”

Attackers with more knowledge?  What if attackers know a large amount of original records? Able to accurately estimate covariance matrix, column distribution, and column range, etc., of the original data Methods PCA, AK_ICA, …,etc can be used What do we do? If you have released so much original information… Stop releasing data anymore

A randomized perturbation optimization algorithm Start with a random rotation Goal: passing tests on simulated attacks Not simply random – a hillclimbing method 1. Iteratively determine R - Test on naïve estimation (Fast_opt) - Test on ICA (2 nd level)  find a better rotation R 2. Append a random translation component 3. Append an appropriate noise component

Comparison on methods  Privacy preservation In general, RPP should be better than GDP Evaluate the effect of attacks for GDP  ICA and distance perturbation need experimental evaluation  Utility preservation GDP:  R and T exactly preserve distances,  The effect of D needs experimental evaluation RPP  # of perturbed dimensions vs. utility  Datasets 12 datasets from UCI Data Repository

Privacy guarantee:GDP  In terms of naïve estimation and ICA-based attacks  Use only the random rotation and translation (R*X+T) components Worst perturbation (no optimization) Optimized for Naïve estimation only Optimized perturbation for both attacks

Privacy guarantee:GDP  In terms of distance inference attacks Use all three components (R*X +T+D) Noise D : Gaussian N(0,  2 ) Assume pairs of (original, image) are identified by attackers  no noise addition, privacy guarantee =0 Considerably high PG around small perturbation =0.1

Data utility: GDP with noise addition  Noise addition vs. model accuracy - noise: N(0, 0.1 2 )

Data Utility: RPP  Reduced # of dims vs. model accuracy KNN classifiersSVMs

Perceptrons

Multiplicative Data Perturbations. Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random.

Similar presentations

Presentation on theme: "Multiplicative Data Perturbations. Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Multiplicative Data Perturbations. Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random.

Similar presentations

Presentation on theme: "Multiplicative Data Perturbations. Outline  Introduction  Multiplicative data perturbations Rotation perturbation Geometric Data Perturbation Random."— Presentation transcript:

Similar presentations

About project

Feedback