Publishing Naive Bayesian Classifiers: Privacy without Accuracy Loss. Authors: Barzan Mozafari and Carlo Zaniolo. Speaker: Hongwei Tian.

Presentation transcript:

1 Publishing Naive Bayesian Classifiers: Privacy without Accuracy Loss Authors: Barzan Mozafari and Carlo Zaniolo Speaker: Hongwei Tian

2 Outline –Motivation –Brief background on NBC –Privacy breach for views –Transformation from unsafe views to safe views –Extension for arbitrary prior distributions –Experiments –Conclusion

3 Motivation PPDM methods seek to achieve the benefits of data mining without compromising the privacy of the individuals in the data. PPDM methods can intervene at three phases: –data collection phase –data publishing phase –data mining phase

4 Motivation Privacy breaches when publishing NBCs –Bob knows that Alice lives on Westwood and that she is in her 40s –Bob's prior belief that Alice earns 70K was 5/7 = 71% –After seeing the views, Bob infers that Alice earns a 70K salary with probability 1/10 × (4/5 + 4×3/4 + 5×1) = 88%
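The arithmetic of the example can be checked directly. The figures below are the ones quoted on the slide; reading the posterior as an average of Pr[salary = 70K] over the ten equally likely database instances consistent with the published views is an assumption, since the slide's table did not survive transcription.

```python
from fractions import Fraction as F

# Figures quoted on the slide; the averaging interpretation is an
# assumption (the slide's table is not in the transcript).
prior = F(5, 7)
posterior = F(1, 10) * (F(4, 5) + 4 * F(3, 4) + 5 * F(1, 1))
print(f"prior     = {float(prior):.0%}")      # 71%
print(f"posterior = {float(posterior):.0%}")  # 88%
```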

5 Motivation Publishing better views –Bob's posterior belief becomes 1/6 × (2/3 + 1/2 + …) = 78% –a 71%-to-78% change is safer than 71%-to-88%

6 Motivation Achieve the same classification results –For the same test input, the NBC built on V1 predicts the class label 50K, because 5/7×1/5×1/5 < 2/7×1/2×1/2 –The prediction from the second classifier (built on V2) is again 50K, because 3/5×1/3×1/3 < 2/5×1/2×1/2
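Both comparisons are easy to verify. The sketch below scores each class as prior × product of conditional likelihoods, using exactly the fractions quoted on the slide.

```python
from fractions import Fraction as F

# NBC scores (class prior x product of per-attribute likelihoods)
# for the two published viewsets, using the slide's fractions.
scores_v1 = {"70K": F(5, 7) * F(1, 5) * F(1, 5),
             "50K": F(2, 7) * F(1, 2) * F(1, 2)}
scores_v2 = {"70K": F(3, 5) * F(1, 3) * F(1, 3),
             "50K": F(2, 5) * F(1, 2) * F(1, 2)}

for name, scores in (("V1", scores_v1), ("V2", scores_v2)):
    print(name, "predicts", max(scores, key=scores.get))  # both: 50K
```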

7 Motivation NBC has proved to be one of the most effective classifiers in practice and in theory. Given an unsafe NBC, it is possible to find an equivalent one that is safer to publish. The objective is to determine whether a set of NBC-enabling views is safe to publish and, if not, how to find a secure database that produces the same NBC model while satisfying the privacy requirements.

8 Brief Background on NBC The original database T is an instance of a relation $R(A_1, \dots, A_n, C)$, where C is the class attribute. In order to build an NBC, the only views that need to be published are the class priors and the per-attribute conditional frequencies. Equivalently, one can instead publish the following counts: for all attributes $A_j$, attribute values a, and class values c, $$N_c = |\{t \in T : t.C = c\}|, \qquad P^j_{a,c} = |\{t \in T : t.A_j = a \wedge t.C = c\}|.$$

9 Brief Background on NBC Using these counts, we can express the NBC's probability estimate as follows: for every class value c and every test input $(a_1, \dots, a_n)$, $$\Pr[C = c \mid a_1, \dots, a_n] \;\propto\; \frac{N_c}{|T|} \prod_{j=1}^{n} \frac{P^j_{a_j,c}}{N_c},$$ and the NBC's prediction is the class value c that maximizes this score.
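A minimal sketch of how the published counts are built and used for prediction; the toy table and its column values are made up for illustration and are not the paper's running example.

```python
from collections import Counter, defaultdict
from fractions import Fraction as F

# Toy training table: (A_1, A_2, C); the rows are made up.
rows = [("Westwood", "40s", "70K"), ("Westwood", "30s", "50K"),
        ("Venice",   "40s", "70K"), ("Venice",   "20s", "50K"),
        ("Westwood", "40s", "70K")]

# The published counts: N_c per class, P^j_{a,c} per (attribute, value, class).
N = Counter(r[-1] for r in rows)
P = defaultdict(Counter)
for *attrs, c in rows:
    for j, a in enumerate(attrs):
        P[j][(a, c)] += 1

def predict(attrs):
    # Score each class by (N_c / |T|) * prod_j (P^j_{a_j,c} / N_c).
    def score(c):
        s = F(N[c], len(rows))
        for j, a in enumerate(attrs):
            s *= F(P[j][(a, c)], N[c])
        return s
    return max(N, key=score)

print(predict(("Westwood", "40s")))  # -> 70K on this toy table
```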

10 Privacy Breach for Views Prior and posterior knowledge: the attacker's prior belief is $\Pr[t.C = c \mid t.I = I_0]$ and the posterior belief is $\Pr[t.C = c \mid t.I = I_0 \wedge V(T) = V_0]$, where –Quasi-identifier I: the subset of attributes whose values $I_0$ the attacker already knows for the victim tuple t –Family of all table instances: $\mathcal{T}$ –All instances satisfying the given views: $\mathcal{T}_{V_0} = \{T \in \mathcal{T} : V(T) = V_0\}$

11 Privacy Breach for Views For a given table T, publishing $V(T) = V_0$ causes a privacy breach with respect to a pair of given constants $0 < L_1 < L_2 < 1$ if either of the following holds: the prior belief is at most $L_1$ while the posterior belief is at least $L_2$ (an upward breach), or the symmetric downward condition holds. For example, 0.5-to-0.8 does not satisfy the privacy requirement $L_1 = 0.51$ and $L_2 = 0.8$, but 0.5-to-0.78 does.
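A small checker makes the definition concrete. The upward condition matches the slide's example; the downward condition is written in the standard symmetric form, which is an assumption since the slide's formulas were lost in transcription.

```python
def is_breach(prior, posterior, L1, L2):
    # Upward breach: a belief that was at most L1 rises to at least L2.
    upward = prior <= L1 and posterior >= L2
    # Downward breach (assumed symmetric form; the slide's formula is lost).
    downward = prior >= 1 - L1 and posterior <= 1 - L2
    return upward or downward

print(is_breach(0.50, 0.80, L1=0.51, L2=0.80))  # True:  0.5-to-0.8 breaches
print(is_breach(0.50, 0.78, L1=0.51, L2=0.80))  # False: 0.5-to-0.78 is safe
```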

12 Privacy Breach for Views Two simplifying assumptions: –assume a uniform distribution over the database instances –assume a uniform distribution over the class values

13 Privacy Breach for Views Let $I_0$ be the value of a given quasi-identifier I, and let $V_0$ be the value of a given view V(T). If there exist some $m_1, m_2 > 0$ such that for all $T \in \mathcal{T}_{I_0}$: $$m_1 \le \Pr[V(T) = V_0] \le m_2,$$ then for any c and any pair of $L_1, L_2 > 0$, publishing $V_0$ will not cause any privacy breaches w.r.t. $L_1$ and $L_2$, provided that the following amplification criterion holds: $$\frac{m_2}{m_1} \le \frac{L_2\,(1 - L_1)}{L_1\,(1 - L_2)}.$$
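The criterion can be checked mechanically. The bound below is the standard amplification bound of Evfimievski et al.; treating it as the slide's exact constant is an assumption, since the formula itself did not survive transcription.

```python
def amplification_bound(L1, L2):
    # Largest tolerated ratio m2/m1 (standard amplification bound;
    # assumed to match the slide's lost formula).
    return L2 * (1 - L1) / (L1 * (1 - L2))

def safe_to_publish(m1, m2, L1, L2):
    return m2 / m1 <= amplification_bound(L1, L2)

print(amplification_bound(0.05, 0.5))                # 19.0
print(safe_to_publish(0.01, 0.15, L1=0.05, L2=0.5))  # True: ratio 15 <= 19
```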

14 Privacy Breach for Views For a given quasi-identifier $I = I_0$, a given view $V(T) = V_0$ is safe to publish against any $L_1$-to-$L_2$ privacy breaches if there exists an amplification ratio $\gamma$ such that the following conditions hold: $\gamma$ satisfies the amplification criterion above, and for all $T, T' \in \mathcal{T}_{I_0}$: $\Pr[V(T) = V_0] \le \gamma \cdot \Pr[V(T') = V_0]$. One can –select the largest possible $\gamma$ for a given $(L_1, L_2)$ –recast the privacy goal as that of checking/enforcing the second condition

15 Privacy Breach for Views With respect to a given $I_0$ as the value of a quasi-identifier I, and a given amplification ratio $\gamma$, the viewset (P, N) is safe to publish if, for all attributes $A_j$, attribute values a, and class values c, the corresponding $\gamma$-amplification conditions on the counts $P^j_{a,c}$ and $N_c$ hold.

16 Privacy Breach for Views Two observations –All quasi-identifiers that have the same cardinality (i.e., number of attributes) can be blocked at the same time, since the conditions are functions of |I|, and not of I or $I_0$. –All privacy breaches for all quasi-identifiers of any cardinality can be blocked by simply blocking the one with the largest cardinality, namely n, because the safety conditions for |I| = n imply those for every smaller cardinality.

17 Privacy Breach for Views With respect to a given amplification ratio $\gamma$, the viewset (P, N) is safe to publish if, for all attributes $A_j$, attribute values a, and class values c, the $\gamma$-amplification conditions of the previous slide hold with the largest quasi-identifier cardinality, |I| = n.

18 Transformation from unsafe views to safe views NBC-Equivalence: let f and f′ be two functions that map each element of the viewset (the counts $P^j_{a,c}$ and $N_c$) to a non-negative real number. We call f and f′ NBC-equivalent if the classifiers built from them make the same prediction on every test input.
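A one-line derivation, in the count notation introduced earlier, shows the equivalence that the transformation below relies on: raising every count to the same power α > 0 yields an NBC-equivalent viewset.

```latex
\mathrm{score}(c) \;=\; \frac{N_c}{|T|}\prod_{j=1}^{n}\frac{P^j_{a_j,c}}{N_c}
\;\propto\; N_c^{\,1-n}\prod_{j=1}^{n}P^j_{a_j,c},
\qquad
\text{so applying } x \mapsto x^{\alpha} \text{ to every count gives }
\mathrm{score}'(c) \propto \mathrm{score}(c)^{\alpha}.
```

Since $x \mapsto x^{\alpha}$ is strictly increasing for α > 0, the argmax over class values, and hence every prediction, is unchanged.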

19 Transformation from unsafe views to safe views Transformation algorithm –Input: V, the given viewset consisting of the counts $P^j_{a,c}$ and $N_c$; the amplification ratio $\gamma$ –Description (a sketch appears below): Step 1: Replace all counts that are 0 with non-zero values. Step 2: Scale down all counts to new rational numbers that satisfy the given amplification ratio. Step 3: Adjust the numbers so that the viewset is consistent again. Step 4: Normalize the numbers or turn them into integers. –Output: V′ (1) Raising all the counts to the same power does not change the classification; (2) in other words, the set of NBC-equivalent viewsets is closed under exponentiation. Example: the counts 100 and 16 can be replaced by 10 and 4 (raising both to the power 1/2).
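A minimal sketch of the four steps, under the assumption that "scaling down" means raising every count to a common power α < 1 (which observation (1) licenses); the paper's actual algorithm handles Steps 3 and 4 more carefully than the coarse rounding here.

```python
import math
from fractions import Fraction

def transform(counts, gamma):
    # Step 1: replace zero counts with a small non-zero value.
    counts = [c if c > 0 else 1 for c in counts]
    # Step 2: choose alpha so the largest count ratio meets gamma:
    # (max/min)^alpha <= gamma  =>  alpha = log(gamma) / log(max/min).
    ratio = max(counts) / min(counts)
    alpha = 1.0 if ratio <= gamma else math.log(gamma) / math.log(ratio)
    scaled = [c ** alpha for c in counts]
    # Steps 3-4: adjust and integerize; here only a coarse rounding to
    # rationals (the paper keeps the viewset exactly consistent).
    return [Fraction(s).limit_denominator(1000) for s in scaled]

print(transform([100, 16, 0], gamma=10))  # -> counts 10, 4, 1 (ratio now 10)
```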

20 Extension for arbitrary prior distributions See a tiny example.

21 Experiments Adult dataset containing 32,561 tuples. The attributes used were Age, Years of education, Work hours per week, and Salary. Comparison: an NBC trained on the k-anonymized data vs. an NBC trained on the output of the Safety Views Transformation.

22 Conclusion –Reformulated privacy breaches for view publishing –Presented sufficient conditions that are easy to check/enforce –Provided algorithms that guarantee the privacy of the individuals who provided the training data, and incur zero accuracy loss in terms of building an NBC