1 Limiting Privacy Breaches in Privacy Preserving Data Mining In Proceedings of the 22nd ACM SIGACT – SIGMOD – SIGART Symposium on Principles of Database.

Presentation transcript:

1 Limiting Privacy Breaches in Privacy Preserving Data Mining. In Proceedings of the 22nd ACM SIGACT – SIGMOD – SIGART Symposium on Principles of Database Systems, San Diego, CA, June 2003 (PODS 2003). Alexandre Evfimievski (Cornell University), Johannes Gehrke (Cornell University), Ramakrishnan Srikant (IBM Almaden Research Center)

2 Introduction Two broad approaches in privacy-preserving data mining: – secure multi-party computation approach – randomization approach – building classification models over randomized data – discovering association rules over randomized data

3 Introduction Privacy: We must ensure that the randomization is sufficient for preserving privacy. E.g., randomize age x_i by adding r_i (drawn uniformly from the segment [-50, 50]). If the server receives age 120 from a user, then the server has learned that the real age of the user is >= 70.
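
The age example can be checked with a few lines of Python. This is only a sketch of the server's reasoning: the function name `feasible_age_range` and the bounds `min_age=0` and `max_age=150` are assumptions made for illustration, not part of the slides.

```python
def feasible_age_range(reported, spread=50, min_age=0, max_age=150):
    """Ages x consistent with a report y = x + r, where r lies in [-spread, spread].

    min_age and max_age are assumed plausibility bounds (not from the slides).
    """
    low = max(min_age, reported - spread)
    high = min(max_age, reported + spread)
    return low, high


# A report of 120 forces the true age to be at least 70.
print(feasible_age_range(120))
# A report of 40 only tells the server that the age is at most 90.
print(feasible_age_range(40))
```

Reports near the edge of the plausible range leak the most, which is why additive uniform noise alone may not be sufficient.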

4 Introduction Two approaches for quantifying how privacy-preserving a randomization method is: – information theory – privacy breaches

5 Overview The Model: N clients C_1, …, C_N are connected to one server; each C_i has private information x_i. To ensure privacy, each C_i sends a modified version y_i of x_i to the server. The server collects the modified information and recovers the statistical properties.

6 Overview Assumptions: x_i ∈ V_X, where V_X is a finite set. Each x_i is chosen independently at random according to the same fixed probability distribution p_X (which is not private).

7 Overview Randomization: randomization operator R(x). y_i is an instance of R(x_i) and is sent to the server. The set of all possible outputs of R(x) is denoted by V_Y; V_Y is a finite set. For all x ∈ V_X and y ∈ V_Y, the probability that R(x) outputs y is denoted by p[x → y] = P[R(x) = y].

8 Outline: Refined Definition of Privacy Breaches; Amplification; Itemset Randomization; Compression of Randomized Transactions; Worst-Case Information

9 Privacy breaches Each possible value x of C_i's private information has probability p_X(x). Define a random variable X such that P[X = x] = p_X(x). The randomized value y_i is an instance of a random variable Y such that P[Y = y | X = x] = P[R(x) = y]. The joint distribution of X and Y is P[X = x, Y = y] = p_X(x) · P[R(x) = y].

10 Privacy breaches Any property Q(x), Q : V_X → {true, false}

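11 Privacy breaches Example: x between 0 and 1000.
1. R_1(x) = x with probability 20%; otherwise (80%) a value chosen uniformly at random.
2. R_2(x) = x + ξ (mod 1001), with ξ chosen uniformly from {-100, …, 100}.
3. R_3(x) = R_2(x) with probability 50%; otherwise (50%) a value chosen uniformly at random.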

12 Privacy breaches 1% → 71.6%; 40.5% → 100%

13 Privacy breaches Some property has a very low prior probability but becomes likely once we learn that R(X) = y: 1% → 71.6%. Some property has a probability far from 100% but becomes almost 100% probable: 40.5% → 100%.
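
The jump from prior to posterior can be reproduced with Bayes' rule. The sketch below assumes a prior that is not stated in the transcript (P[X = 0] = 0.01 and P[X = k] = 0.00099 for k = 1, …, 1000) and assumes that "uniformly" means uniform over {0, …, 1000}. The names `r1`, `r2`, `r3` and `posterior` are illustrative. Under these assumptions the results come out close to the figures on the slides.

```python
N = 1001  # values 0..1000

# Assumed prior (not given in the transcript): 0 is ten times likelier than any other value.
prior = [0.01] + [0.00099] * 1000


def r1(x, y):
    """P[R_1(x) = y]: keep x with probability 0.2, otherwise draw uniformly."""
    return 0.2 * (x == y) + 0.8 / N


def r2(x, y):
    """P[R_2(x) = y]: add noise drawn uniformly from {-100..100}, modulo 1001."""
    d = (y - x) % N
    return 1 / 201 if d <= 100 or d >= N - 100 else 0.0


def r3(x, y):
    """P[R_3(x) = y]: R_2 with probability 0.5, otherwise draw uniformly."""
    return 0.5 * r2(x, y) + 0.5 / N


def posterior(prior, channel, q, y):
    """P[Q(X) | R(X) = y] by Bayes' rule."""
    joint = [prior[x] * channel(x, y) for x in range(N)]
    total = sum(joint)
    return sum(j for x, j in enumerate(joint) if q(x)) / total


is_zero = lambda x: x == 0
outside = lambda x: not (200 <= x <= 800)

print("prior P[X = 0]              :", prior[0])
print("posterior under R_1, y = 0  :", posterior(prior, r1, is_zero, 0))
print("prior P[X not in 200..800]  :", sum(p for x, p in enumerate(prior) if outside(x)))
print("posterior under R_2, y = 0  :", posterior(prior, r2, outside, 0))
```

The two operators fail in different ways: R_1 makes a rare value likely, while R_2 turns a moderately likely property into a certainty.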

14 Privacy breaches Let ρ_1, ρ_2 be two probabilities such that ρ_1 corresponds to our intuitive notion of "very unlikely" whereas ρ_2 corresponds to "likely"
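
The formal definition (Def. 1) did not survive in the transcript. Based on the wording used on the later slides ("a ρ_1-to-ρ_2 privacy breach" with respect to a property Q), it can be stated roughly as follows; the exact side conditions are an assumption.

```latex
% Sketch of Def. 1, assuming 0 < \rho_1 < \rho_2 < 1 and P[Y = y] > 0.
% There is a \rho_1-to-\rho_2 privacy breach with respect to property Q
% if, for some y \in V_Y,
\[
  P[Q(X)] \le \rho_1
  \quad\text{and}\quad
  P[Q(X) \mid Y = y] \ge \rho_2 .
\]
```

For instance, with ρ_1 = 1% and ρ_2 = 50%, the change 1% → 71.6% from the earlier example counts as a breach.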

15 Outline: Refined Definition of Privacy Breaches; Amplification; Itemset Randomization; Compression of Randomized Transactions; Worst-Case Information

16 Amplification Using Def. 1 to check for privacy breaches: 1. There are 2^|V_X| possible properties; check them all? 2. Without knowing p_X, how can we use Def. 1?

17 Amplification

18 Amplification
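
The formulas on the amplification slides were lost in extraction. As a hedged reconstruction from the underlying paper, writing p[x → y] for P[R(x) = y], the central definition and statement read approximately as follows.

```latex
% Reconstruction, not taken from the transcript itself.
% R is at most \gamma-amplifying for y \in V_Y if
\[
  \forall x_1, x_2 \in V_X : \quad
  \frac{p[x_1 \to y]}{p[x_2 \to y]} \le \gamma .
\]
% If R is at most \gamma-amplifying for y and
\[
  \frac{\rho_2 \,(1 - \rho_1)}{\rho_1 \,(1 - \rho_2)} > \gamma ,
\]
% then revealing R(X) = y causes neither an upward \rho_1-to-\rho_2
% nor a downward \rho_2-to-\rho_1 privacy breach, for any property Q
% and any prior distribution p_X.
```

This addresses both questions raised on slide 16: the condition involves only the operator R, so it requires neither enumerating the properties nor knowing p_X.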

19 Amplification Proof: Assume that for some property Q(x) we have a ρ_1-to-ρ_2 privacy breach

20 Amplification

21 Amplification
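
As a numerical illustration of amplification, the sketch below computes the worst-case ratio γ = max over x_1, x_2 of P[R(x_1) = y] / P[R(x_2) = y] for the operators R_1 and R_2 of the earlier example, then applies the test ρ_2(1 − ρ_1) / (ρ_1(1 − ρ_2)) > γ. The condition is taken from the underlying paper rather than from the transcript, and all function names are hypothetical.

```python
N = 1001  # values 0..1000, as in the earlier example


def r1(x, y):
    """P[R_1(x) = y]: keep x with probability 0.2, otherwise draw uniformly."""
    return 0.2 * (x == y) + 0.8 / N


def r2(x, y):
    """P[R_2(x) = y]: add noise drawn uniformly from {-100..100}, modulo 1001."""
    d = (y - x) % N
    return 1 / 201 if d <= 100 or d >= N - 100 else 0.0


def amplification(channel, xs, y):
    """Smallest gamma such that the operator is at most gamma-amplifying for y."""
    probs = [channel(x, y) for x in xs]
    lo, hi = min(probs), max(probs)
    return float("inf") if lo == 0 else hi / lo


def breach_ruled_out(gamma, rho1, rho2):
    """True if no rho1-to-rho2 breach can occur for a gamma-amplifying operator."""
    return rho2 * (1 - rho1) / (rho1 * (1 - rho2)) > gamma


g1 = amplification(r1, range(N), 0)
g2 = amplification(r2, range(N), 0)
print("gamma for R_1:", g1)
print("gamma for R_2:", g2)
print("R_1 rules out 1% -> 75%:", breach_ruled_out(g1, 0.01, 0.75))
print("R_1 rules out 1% -> 70%:", breach_ruled_out(g1, 0.01, 0.70))
```

R_2 assigns probability zero to some inputs, so its ratio is unbounded and the test can never certify it, which matches the 40.5% → 100% breach seen earlier.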

22 Outline: Refined Definition of Privacy Breaches; Amplification; Itemset Randomization; Compression of Randomized Transactions; Worst-Case Information

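23 Itemset Randomization Assume that all transactions have the same size m and that each transaction is an independent instance. Select-a-size (with parameters 0 < ρ < 1 and a probability distribution p[0], …, p[m] over {0, 1, …, m}):
1. Select an integer j at random from {0, 1, …, m}, where p[j] = P[j is chosen]. (Probability of this step: p[j].)
2. Select j items from t uniformly at random and put them into t', so that |t ∩ t'| = j. (Probability: 1 / C(m, j), where C(m, j) is the binomial coefficient.)
3. For each item a ∉ t, toss a coin with P[head] = ρ; if it comes up heads, add a to t'. (Probability: ρ^(m'−j) (1−ρ)^(n−m−(m'−j)).)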
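
The three steps of select-a-size translate almost directly into code. The following is a minimal sketch, not the authors' implementation; the function name `select_a_size`, its argument layout, and the optional `rng` parameter are assumptions made for illustration.

```python
import random


def select_a_size(t, items, rho, p, rng=random):
    """Randomize transaction t (a collection of items drawn from `items`).

    rho : probability of inserting each item that is not in t
    p   : list of m + 1 weights, p[j] = P[j items of t are kept]
    rng : source of randomness (the random module or a random.Random instance)
    """
    t = list(t)
    m = len(t)
    if len(p) != m + 1:
        raise ValueError("p must have exactly m + 1 entries")

    # Step 1: choose how many original items to keep.
    j = rng.choices(range(m + 1), weights=p)[0]

    # Step 2: keep j items of t, chosen uniformly at random.
    kept = set(rng.sample(t, j))

    # Step 3: add each item outside t independently with probability rho.
    t_set = set(t)
    added = {a for a in items if a not in t_set and rng.random() < rho}

    return kept | added


rng = random.Random(0)
t = {1, 2, 3}
universe = range(1, 21)
print(select_a_size(t, universe, rho=0.1, p=[0.1, 0.2, 0.3, 0.4], rng=rng))
```

Passing a seeded `random.Random` makes a run repeatable for testing; a real deployment would presumably use an unseeded or cryptographic source.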

24 Itemset Randomization Denote t' = R(t), m' = |t'|, j = |t ∩ t'|, n = |I|
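
With this notation, the probability that select-a-size turns t into a particular t' is the product of the three step probabilities annotated on the previous slide. The formula below is derived from those annotations, since the slide's own formula did not survive extraction.

```latex
% Transition probability of select-a-size, with
% m = |t|, m' = |t'|, j = |t \cap t'|, n = |I|.
\[
  P[R(t) = t']
  \;=\;
  \frac{p[j]}{\binom{m}{j}}
  \;\cdot\;
  \rho^{\,m' - j}\,(1 - \rho)^{\,n - m - (m' - j)} .
\]
```

The first factor covers choosing j and the specific j kept items; the second covers the n − m coin tosses for items outside t, of which m' − j came up heads.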

25 Itemset Randomization

26 Itemset Randomization Frequent?? Try to keep more items of t in t'. Given ρ, focus on the p[j]'s: maximize the following expectation

27 Itemset Randomization Select parameters ρ and to select ρ and j*

28 Outline: Refined Definition of Privacy Breaches; Amplification; Itemset Randomization; Compression of Randomized Transactions; Worst-Case Information

29 Compressing randomized transactions Randomized transactions are large - Network resource - Lots of memory

30 Compressing randomized transactions A (Seed, n, q, ρ)-pseudorandom generator is a function G : Seed × {1, …, n} → {0, 1} that has the following properties: – ∀ i : P[G(ξ, i) = 1 | ξ ∈_r Seed] = ρ, where ξ ∈_r Seed means that ξ is drawn uniformly at random from Seed – ∀ 1 ≤ i_1 < … < i_q ≤ n : G(ξ, i_1), G(ξ, i_2), …, G(ξ, i_q) are statistically independent

31 Compressing randomized transactions We are going to represent a randomized transaction by a seed ξ ∈ Seed. G(ξ, i) = 1 means that item i belongs to the randomized transaction. There is a mapping τ from seeds to transactions: τ(ξ) = { item i | G(ξ, i) = 1 }. The set Seed consists of Boolean strings {0, 1}^k, with k << n.
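
The mapping τ from a short seed to a full transaction can be sketched as below. Important assumption: the generator `G` here is a stand-in built on SHA-256. It is deterministic and empirically close to bias ρ, but it is not a proven (Seed, n, q, ρ)-pseudorandom generator in the sense of the previous slide. The names `G` and `tau` follow the slides; passing `rho` as an explicit argument is an illustrative choice.

```python
import hashlib


def G(seed: bytes, i: int, rho: float) -> int:
    """Stand-in generator: returns 1 for item i with probability about rho.

    Uses SHA-256 as a heuristic source of bits; q-wise independence is NOT proven.
    """
    digest = hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform value in [0, 1)
    return 1 if u < rho else 0


def tau(seed: bytes, n: int, rho: float) -> set:
    """Expand a seed into the transaction { i | G(seed, i) = 1 } over items 1..n."""
    return {i for i in range(1, n + 1) if G(seed, i, rho) == 1}


seed = bytes.fromhex("a3f1c2d4")  # k = 32 bits here, far fewer than n
transaction = tau(seed, n=10_000, rho=0.1)
print(len(transaction), "items encoded by a", len(seed) * 8, "bit seed")
```

The client transmits only the seed; the server expands it with the same `G`, which is what saves network and memory resources.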

32 Compressing randomized transactions Another randomization operator, similar to select-a-size, has parameters 0 < ρ < 1 and a probability distribution p[j] over {0, 1, …, m}. Given a transaction t and a (Seed, n, q, ρ)-pseudorandom generator with q ≥ m (the size of t), the operator generates the seed ξ = R'(t) in three steps

33 Compressing randomized transactions
1. Select an integer j at random from {0, 1, …, m}, where p[j] = P[j is chosen].
2. Select j items from t uniformly at random and put them into t'; without loss of generality, assume that t[1], t[2], …, t[j] are selected.
3. Select a random seed ξ ∈ Seed such that G(ξ, t[i]) = 1 for i ≤ j and G(ξ, t[i]) = 0 for j < i ≤ m.

34 Outline: Refined Definition of Privacy Breaches; Amplification; Itemset Randomization; Compression of Randomized Transactions; Worst-Case Information

35 Worst-Case information X is a random variable; Y = R(X) is a random variable. The mutual information I(X; Y) is a measure of disclosure: the smaller I(X; Y), the better the privacy. KL(p_1 || p_2) is the Kullback-Leibler distance between the distributions p_1(x) and p_2(x) of two random variables.
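
The formula for I(X; Y) is missing from the transcript. The standard identity below expresses mutual information as an average of Kullback-Leibler distances, which is presumably the form the slide used, given that it introduces KL right after.

```latex
% Standard definitions (not copied from the slide).
\[
  KL(p_1 \,\|\, p_2) \;=\; \sum_{x} p_1(x) \,\log \frac{p_1(x)}{p_2(x)} ,
\]
\[
  I(X; Y) \;=\; \sum_{y \in V_Y} P[Y = y] \;
  KL\bigl(p_{X \mid Y = y} \,\big\|\, p_X\bigr) .
\]
```

Because I(X; Y) averages over y, a rare output y with a very large KL term contributes little to the sum, which motivates looking at the worst case over y instead.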

36 Worst-Case information E.g., V_X = {0, 1}, P[X = 0] = P[X = 1] = 1/2, Y_1 = R_1(X), Y_2 = R_2(X). P[Y_1 = x | X = x] = 0.6, P[Y_1 = 1−x | X = x] = 0.4. P[Y_2 = e | X = x] = 0.9999, P[Y_2 = x | X = x] = 99·10^-6, P[Y_2 = 1−x | X = x] = 1·10^-6. I(X; Y_2) << I(X; Y_1)??
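
The comparison can be verified numerically. In the sketch below, the value P[Y_2 = e | X = x] = 0.9999 is an assumption obtained from the requirement that the probabilities sum to 1; the function names are hypothetical.

```python
from math import log2


def mutual_information(prior, channel):
    """I(X; Y) in bits, for prior[x] = P[X = x] and channel[x][y] = P[Y = y | X = x]."""
    ys = {y for row in channel.values() for y in row}
    p_y = {y: sum(prior[x] * channel[x].get(y, 0.0) for x in prior) for y in ys}
    total = 0.0
    for x in prior:
        for y, p_y_given_x in channel[x].items():
            joint = prior[x] * p_y_given_x
            if joint > 0:
                total += joint * log2(p_y_given_x / p_y[y])
    return total


def posterior(prior, channel, x, y):
    """P[X = x | Y = y] by Bayes' rule."""
    p_y = sum(prior[v] * channel[v].get(y, 0.0) for v in prior)
    return prior[x] * channel[x].get(y, 0.0) / p_y


prior = {0: 0.5, 1: 0.5}
R1 = {x: {x: 0.6, 1 - x: 0.4} for x in (0, 1)}
# 0.9999 is assumed from normalization: 1 - 99e-6 - 1e-6.
R2 = {x: {"e": 0.9999, x: 99e-6, 1 - x: 1e-6} for x in (0, 1)}

i1 = mutual_information(prior, R1)
i2 = mutual_information(prior, R2)
print("I(X; Y_1) =", i1, "bits")
print("I(X; Y_2) =", i2, "bits")
print("P[X = 0 | Y_1 = 0] =", posterior(prior, R1, 0, 0))
print("P[X = 0 | Y_2 = 0] =", posterior(prior, R2, 0, 0))
```

R_2 looks far safer by mutual information, yet on the rare occasions when it does not output e it reveals X almost with certainty. An average-case measure hides this, which is the point of the slide's question marks.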

37 Worst-Case information

38 Worst-Case information Revealing R(X) = y for some y causes a ρ_1-to-ρ_2 privacy breach. Revealing R(X) = y for some y causes a ρ_2-to-ρ_1 privacy breach.

39 Conclusion: a new definition of privacy breaches; a general approach: amplification; compressing long randomized transactions by using pseudorandom generators; defined several new information-theoretic measures

40 Future work: continuous distributions; the tradeoff between privacy and accuracy; combining the randomization and secure multi-party computation approaches