Additive Data Perturbation: The Basic Problem and Techniques

Outline  Motivation  Definition  Privacy metrics  Distribution reconstruction methods  Privacy-preserving data mining with additive data perturbation  Summary

Motivation  Web-based computing  Observations Only a few sensitive attributes need protection Allow individual user to perform protection with low cost Some data mining algorithms work on distribution instead of individual records Web Apps data user 1 Private info

 Definition of dataset A column-by-row table Each row is a record, or a vector Each column represents an attribute We also call it multidimensional data (Figure: two records in a 3-attribute dataset A, B, C; each row is a 3-dimensional record)

Additive perturbation  Definition Z = X + Y, where X is the original value, Y is random noise, and Z is the perturbed value The data Z and the parameters of Y are published  e.g., Y is Gaussian N(0,1)  History Used in statistical databases to protect sensitive attributes (late 80s to 90s)  Benefits Allows distribution reconstruction Allows individual users to do the perturbation  The noise distribution must be published, however
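A minimal sketch of the scheme just defined, with hypothetical data: each user adds noise drawn from a published distribution before releasing a value, so individual values are masked while aggregates survive.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(20.0, 60.0, size=10_000)   # hypothetical original sensitive values
sigma = 5.0
Y = rng.normal(0.0, sigma, size=X.shape)   # noise; the distribution N(0, sigma^2) is published
Z = X + Y                                  # only Z (and sigma) are released

# Each Z value hides its X, but in aggregate E[Z] = E[X] and
# var(Z) = var(X) + sigma^2 (approximately, over samples).
```

The choice of uniform data and Gaussian noise here is purely illustrative; any noise distribution with published parameters fits the definition.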

Applications in data mining  Distribution reconstruction algorithms Rakesh’s algorithm Expectation-Maximization (EM) algorithm  Column-distribution based algorithms Decision tree Naïve Bayes classifier

Major issues  Privacy metrics  Preserving information Distribution reconstruction algorithms Loss of information A tradeoff between loss of information and privacy

Privacy metrics for additive perturbation  Variance/confidence based definition  Mutual information based definition

Variance/confidence based definition  Method Based on the attacker’s view: value estimation  The attacker knows the perturbed data and the noise distribution  No other prior knowledge Estimation method: given a perturbed value Z, the confidence interval is the range around Z that contains the real value with c% probability Y has zero mean and standard deviation σ, and σ is the important factor, i.e., var(Z − X) = σ^2 Given Z, X lies in the range Z ± ε with c% confidence, where ε grows with σ We often ignore the confidence c% and use σ alone to represent the difficulty of value estimation
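For Gaussian noise the interval half-width ε has a closed form, which a short sketch (function name is illustrative) makes concrete: ε = q · σ, where q is the standard-normal quantile at (1 + c)/2.

```python
from statistics import NormalDist

def interval_half_width(sigma: float, c: float) -> float:
    """Half-width eps of the c-confidence interval around Z for Y ~ N(0, sigma^2)."""
    q = NormalDist().inv_cdf((1.0 + c) / 2.0)   # standard-normal quantile
    return q * sigma

# At 95% confidence eps ~= 1.96 * sigma, so larger sigma means more privacy.
eps = interval_half_width(sigma=5.0, c=0.95)
```

Doubling σ doubles ε at every confidence level, which is why the slide reduces the metric to σ alone.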

Problem with the var/conf metric  No knowledge about the original data is incorporated Knowledge about the original data distribution  Range of original values, etc.  will be discovered by distribution reconstruction in additive perturbation  may be known in advance in some applications Other prior knowledge may enable more types of attacks  Privacy evaluation needs to incorporate these attacks

 Mutual information based method: incorporates the original data distribution Concept: uncertainty  entropy  difficulty of estimation  the amount of privacy Intuition: knowing the perturbed data Z and the distribution of the noise Y, how much is the uncertainty of X reduced?  If Z and Y do not help in estimating X, all uncertainty of X is preserved: privacy = 1  Otherwise: 0 <= privacy < 1

Some information theory  Definition of mutual information Entropy h(A)  evaluates the uncertainty of A  h(A) = − sum_{a in A} p(a) log p(a)  Harder to estimate  higher entropy  Among distributions on the same interval, the uniform has the largest entropy Conditional entropy: h(A|B) = sum_{b in B} p(b) h(A|B=b)  If we know the random variable B, how much uncertainty of A remains?  If B is not independent of A, the uncertainty of A is reduced (B helps explain A), i.e., h(A|B) < h(A) Mutual information I(A;B) = h(A) − h(A|B)  the information brought by B in estimating A  Note: I(A;B) = I(B;A)

 Inherent privacy of a random variable Use a uniform variable as the reference (the maximum case); the inherent privacy is denoted 2^h(A)  MI based privacy metric P(A|B) = 1 − 2^−I(A;B) defines the lost privacy I(A;B) = 0  B does not help estimate A  privacy is fully preserved and the lost privacy P(A|B) = 0 I(A;B) > 0  0 < P(A|B) < 1  Calculation for additive perturbation: I(X;Z) = h(Z) − h(Z|X) = h(Z) − h(Y), since p(X+Y|X) = p(Y)
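When both the data X and the noise Y are Gaussian, the quantities above have closed forms (h(N(0, s²)) = ½ log₂(2πe·s²), and Z = X + Y is Gaussian with variance var(X) + var(Y)), so the lost-privacy metric can be sketched directly; the Gaussian-data assumption and function names are illustrative.

```python
import math

def gaussian_entropy_bits(var: float) -> float:
    """Differential entropy (bits) of N(0, var)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * var)

def lost_privacy(var_x: float, var_y: float) -> float:
    """P(X|Z) = 1 - 2^(-I(X;Z)), with I(X;Z) = h(Z) - h(Y) for Z = X + Y."""
    i_xz = gaussian_entropy_bits(var_x + var_y) - gaussian_entropy_bits(var_y)
    return 1.0 - 2.0 ** (-i_xz)

# More noise relative to the data -> smaller I(X;Z) -> less privacy is lost.
p_small_noise = lost_privacy(var_x=100.0, var_y=1.0)
p_large_noise = lost_privacy(var_x=100.0, var_y=100.0)
```

Here I(X;Z) reduces to ½ log₂((var(X) + var(Y)) / var(Y)), so the metric depends only on the noise-to-data variance ratio.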

Distribution reconstruction  Problem: Z = X + Y Know the noise distribution F_Y Know the perturbed values z1, z2, …, zn Estimate the distribution F_X  Basic methods Rakesh’s method: Bayes estimation EM estimation: maximum likelihood

Rakesh’s algorithm (paper 10)  Find the distribution P(X | X+Y)  three key points to understand it Bayes rule:  P(X|X+Y) = P(X+Y|X) P(X) / P(X+Y) Conditional probability:  f_{X+Y}(X+Y = w | X = x) = f_Y(w − x) The estimate at a point a averages over all n samples, using the current estimate of f_X:  f_X'(a) = (1/n) sum_i f_Y(z_i − a) f_X(a) / ∫ f_Y(z_i − t) f_X(t) dt

 The iterative algorithm Start from an initial estimate of f_X (e.g., uniform) and repeatedly apply the update above Stop criterion: the difference between two consecutive f_X estimates is small

Make it more efficient…  Partition the range of X into bins  Discretize the previous formula: replace each point x by m(x), the midpoint of the bin that x is in, and replace the integral by a sum over bins t weighted by the bin length L_t
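The discretized iteration can be sketched end to end; the data, bin grid, and Gaussian noise below are illustrative assumptions, not part of the original algorithm statement.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(40.0, 5.0, size=n)          # hypothetical original data
sigma = 4.0
Z = X + rng.normal(0.0, sigma, size=n)     # the released perturbed values

edges = np.linspace(10.0, 70.0, 61)        # bins over the assumed range of X
mids = 0.5 * (edges[:-1] + edges[1:])      # m(x): bin midpoints
w = edges[1] - edges[0]                    # L_t: bin length (uniform here)

def noise_pdf(y):                          # f_Y for N(0, sigma^2)
    return np.exp(-y ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# fy[j, i] = f_Y(z_j - m_i): likelihood of observing z_j if X sat in bin i
fy = noise_pdf(Z[:, None] - mids[None, :])

theta = np.full(mids.size, 1.0 / (mids.size * w))    # start from uniform
for _ in range(100):
    post = fy * theta                                # Bayes-rule numerator per bin
    post /= post.sum(axis=1, keepdims=True) * w      # normalize each sample's posterior
    new_theta = post.mean(axis=0)                    # average over all samples
    if np.abs(new_theta - theta).sum() * w < 1e-6:   # stop criterion: small change
        theta = new_theta
        break
    theta = new_theta
# theta now approximates the density of X over the bins
```

The reconstructed density integrates to one and should peak near the true center of X, even though no individual X value is recoverable.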

 Weakness of Rakesh’s algorithm No convergence proof We don’t know whether the iteration gives the globally optimal result

EM algorithm  Use discretized bins to approximate the distribution of X The density (height) of bin i is denoted θ_i, so f_X(x) = sum_i θ_i I_i(x), where I_i(x) is an indicator function: I_i(x) = 1 if x is in bin i For a specific x, f_X(x) returns the θ_i of the bin containing x

 Maximum Likelihood Estimation (MLE) method x1, x2, …, xn are independent and identically distributed Joint distribution  f(x1, x2, …, xn | θ) = f(x1|θ) · f(x2|θ) · … · f(xn|θ) MLE principle:  Find the θ that maximizes f(x1, x2, …, xn | θ)  Equivalent to maximizing log f(x1, x2, …, xn | θ) = sum_i log f(xi|θ)
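The MLE principle can be illustrated with a toy grid search (illustrative data; a closed form exists for this case): for i.i.d. N(θ, 1) samples, maximizing the summed log-likelihood over the mean θ recovers the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=2000)        # i.i.d. samples, true mean 3.0

thetas = np.linspace(0.0, 6.0, 601)        # candidate values of theta
# sum_i log f(x_i | theta) for N(theta, 1), dropping the constant term
loglik = np.array([-0.5 * np.sum((x - t) ** 2) for t in thetas])
theta_hat = thetas[loglik.argmax()]        # the MLE, up to grid resolution
```

The argmax lands (within the 0.01 grid spacing) on the sample mean, the textbook MLE for a Gaussian mean with known variance.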

 Basic idea of the EM algorithm Q(θ, θ^) is the expected log-likelihood function  θ is the vector of bin densities (θ1, θ2, …, θk), and θ^ is the previous estimate of θ EM algorithm 1. Initialize θ^ to the uniform distribution 2. In each iteration, find the θ that maximizes Q(θ, θ^) based on the previous estimate θ^ and the perturbed values z; note that a record x in bin i produces zj exactly when zj − upper(i) <= Y <= zj − lower(i)
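A hedged sketch of this EM iteration, assuming Gaussian noise and treating X as uniform within each bin so that Pr(z | X in bin i) can be computed from the noise CDF over [z − upper(i), z − lower(i)]; data and bin choices are illustrative.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
sigma = 4.0
X = rng.normal(40.0, 5.0, size=4000)       # hypothetical original data
Z = X + rng.normal(0.0, sigma, size=4000)  # released perturbed values

edges = np.linspace(10.0, 70.0, 31)        # bin boundaries: lower(i)..upper(i)
w = edges[1] - edges[0]
vec_erf = np.vectorize(erf)

def noise_cdf(y):                          # CDF of N(0, sigma^2)
    return 0.5 * (1.0 + vec_erf(y / (sigma * sqrt(2.0))))

# lik[j, i] = Pr(Z = z_j | X in bin i): the noise must satisfy
# z_j - upper(i) <= Y <= z_j - lower(i)
lik = (noise_cdf(Z[:, None] - edges[None, :-1])
       - noise_cdf(Z[:, None] - edges[None, 1:])) / w

theta = np.full(edges.size - 1, 1.0 / ((edges.size - 1) * w))  # uniform start
for _ in range(100):
    post = lik * theta                       # E-step: posterior bin memberships
    post /= post.sum(axis=1, keepdims=True)
    theta = post.mean(axis=0) / w            # M-step: re-estimate bin densities
```

Each M-step is exactly the "samples' average contribution" to each bin described on the next slide, divided by the bin width to turn counts into a density.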

Understanding it If Z = X + Y and Y is the noise N(0, σ^2), then knowing Z = z means X = z − Y, so X ranges over z minus the possible values of Y The parameters θ are estimated from many z samples: each θ_i accumulates the samples’ average contribution to bin i

 The EM algorithm has good properties A unique globally optimal solution θ^ converges to the MLE solution

Evaluating loss of information  The information that additive perturbation wants to preserve: the column distribution  First metric The difference between the estimated and the original distribution
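One simple instance of this first metric (the function name and choice of L1 distance are illustrative): the absolute difference between the original and reconstructed bin densities, which is 0 for a perfect match and 2 for distributions with disjoint support.

```python
import numpy as np

def distribution_error(orig_density, est_density, bin_width):
    """L1 distance between two binned densities with a common bin width."""
    diff = np.abs(np.asarray(orig_density, float) - np.asarray(est_density, float))
    return float(diff.sum() * bin_width)

err_same = distribution_error([0.5, 0.5], [0.5, 0.5], bin_width=1.0)  # identical
err_off = distribution_error([1.0, 0.0], [0.0, 1.0], bin_width=1.0)   # disjoint
```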

Evaluating loss of information  Ultimate utility metric: modeling quality  e.g., the accuracy of the classifier, if used for classification modeling Evaluation method  Compare the accuracy of the classifier trained on the original data  with the accuracy of the classifier trained on the reconstructed distribution

Data Mining with Additive Perturbation  Example: decision tree  A brief introduction to the decision tree algorithm There are many versions… We use a version that works on continuous attributes

 Split evaluation gini(S) = 1 − sum_j p_j^2  p_j is the relative frequency of class j in S gini_split(S) = (n1/n)·gini(S1) + (n2/n)·gini(S2) The smaller, the better  Procedure Get the distribution of each attribute Scan through each bin of the attribute and calculate the gini_split index  Problem: how to determine p_j when only perturbed data are available The reconstruction algorithm applies
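The split criterion above is straightforward to compute from class counts; a short sketch (function names illustrative):

```python
def gini(class_counts):
    """gini(S) = 1 - sum_j p_j^2 over the class relative frequencies p_j."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(left_counts, right_counts):
    """Size-weighted gini of the two partitions; smaller is better."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

# A pure split scores 0; a maximally mixed two-class split scores 0.5.
pure = gini_split([10, 0], [0, 10])
mixed = gini_split([5, 5], [5, 5])
```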

 An approximate method to determine p_j The original domain is partitioned into m bins Reconstruction gives a distribution over the bins  n1, n2, …, nm records per bin Sort the perturbed data by the target attribute and assign the records sequentially to the bins according to that distribution Read the class labels associated with the records in each bin  Errors happen because we use perturbed values to determine the bin identification of each record
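The sort-and-assign heuristic just described can be sketched as follows; the data, labels, and reconstructed counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(40.0, 6.0, size=100)          # perturbed attribute values
labels = rng.integers(0, 2, size=100)        # class labels travel with records

bin_counts = [20, 30, 30, 20]                # reconstructed per-bin counts n_1..n_m
order = np.argsort(z)                        # sort records by perturbed value
bin_of_record = np.empty(z.size, dtype=int)
start = 0
for i, c in enumerate(bin_counts):           # hand records out bin by bin
    bin_of_record[order[start:start + c]] = i
    start += c
# The class frequencies p_j within each bin can now be read off from `labels`,
# e.g., labels[bin_of_record == 0] for the first bin.
```

Records near a bin boundary may land in the wrong bin because their perturbed value, not their original value, determined the ordering; this is exactly the error source the slide notes.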

When to reconstruct the distribution  Global: calculate once  By class: calculate once per class  Local: by class at each node  Empirical studies show that By-class and Local are more effective

Problems with these studies  Privacy evaluation Didn’t consider attack methods  Methods that reconstruct the original data values  Mostly drawn from signal processing  Loss of information (or utility) Negatively related to privacy Not directly related to modeling quality  Accuracy of distribution reconstruction vs. accuracy of the classifier?

Summary  We discussed the basic methods of additive perturbation Definition Privacy metrics Distribution reconstruction  The privacy evaluation here is not complete Attacks are covered in the next class