Privacy Preserving Mining of Association Rules Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke IBM Almaden Research Center.

Slides:

Advertisements

Similar presentations

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.

Advertisements

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Ulams Game and Universal Communications Using Feedback Ofer Shayevitz June 2006.

Estimation of Means and Proportions

Huffman Codes and Asssociation Rules (II) Prof. Sin-Min Lee Department of Computer Science.

Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,

Data Mining (Apriori Algorithm)DCS 802, Spring DCS 802 Data Mining Apriori Algorithm Spring of 2002 Prof. Sung-Hyuk Cha School of Computer Science.

Privacy Preserving Association Rule Mining in Vertically Partitioned Data Reporter ： Ximeng Liu Supervisor: Rongxing Lu School of EEE, NTU

ESTIMATION AND HYPOTHESIS TESTING

Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.

Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.

Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.

Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,

An architecture for Privacy Preserving Mining of Client Information Jaideep Vaidya Purdue University This is joint work with Murat.

4/3/01CS632 - Data Mining1 Data Mining Presented By: Kevin Seng.

Privacy Preserving Market Basket Data Analysis Ling Guo, Songtao Guo, Xintao Wu University of North Carolina at Charlotte.

SAC’06 April 23-27, 2006, Dijon, France On the Use of Spectral Filtering for Privacy Preserving Data Mining Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte.

Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.

Evaluating Hypotheses

1 A DATA MINING APPROACH FOR LOCATION PREDICTION IN MOBILE ENVIRONMENTS* by Gökhan Yavaş Feb 22, 2005 *: To appear in Data and Knowledge Engineering, Elsevier.

Fast Algorithms for Association Rule Mining

Mining Association Rules

Privacy Preserving OLAP Rakesh Agrawal, IBM Almaden Ramakrishnan Srikant, IBM Almaden Dilys Thomas, Stanford University.

7-2 Estimating a Population Proportion

Copyright © 2010, 2007, 2004 Pearson Education, Inc. Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by.

Privacy-Preserving Data Mining Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA Published in: ACM SIGMOD.

An Experimental Study of Association Rule Hiding Techniques Emmanuel Pontikakis* Dept. of Computer Engineering and Informatics.

Basic Data Mining Techniques

NGDM’02 1 Efficient Data-Reduction Methods for On-line Association Rule Mining H. Bronnimann B. ChenM. Dash, Y. Qiao, P. ScheuermannP. Haas Polytechnic.

Slide 1 Copyright © 2004 Pearson Education, Inc..

Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.

Chap 20-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 20 Sampling: Additional Topics in Sampling Statistics for Business.

Population All members of a set which have a given characteristic. Population Data Data associated with a certain population. Population Parameter A measure.

Chapter 7 Estimates and Sample Sizes

Secure Incremental Maintenance of Distributed Association Rules.

PARAMETRIC STATISTICAL INFERENCE

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.

Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.

Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )

Data Mining: Potentials and Challenges Rakesh Agrawal IBM Almaden Research Center.

1 Chapter 6 Estimates and Sample Sizes 6-1 Estimating a Population Mean: Large Samples / σ Known 6-2 Estimating a Population Mean: Small Samples / σ Unknown.

Sections 7-1 and 7-2 Review and Preview and Estimating a Population Proportion.

Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Section 7-1 Review and Preview.

Randomization in Privacy Preserving Data Mining Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00 the following slides include.

Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.

Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.

Chapter 8 : Estimation.

Mining Quantitative Association Rules in Large Relational Tables ACM SIGMOD Conference 1996 Authors: R. Srikant, and R. Agrawal Presented by: Sasi Sekhar.

Association Rule Mining

Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.

1 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Example: In a recent poll, 70% of 1501 randomly selected adults said they believed.

1 Limiting Privacy Breaches in Privacy Preserving Data Mining In Proceedings of the 22 nd ACM SIGACT – SIGMOD – SIFART Symposium on Principles of Database.

Elsayed Hemayed Data Mining Course

Lecture 3: MLE, Bayes Learning, and Maximum Entropy

Chapter 7 Estimates and Sample Sizes 7-1 Overview 7-2 Estimating a Population Proportion 7-3 Estimating a Population Mean: σ Known 7-4 Estimating a Population.

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,

The inference and accuracy We learned how to estimate the probability that the percentage of some subjects in the sample would be in a given interval by.

Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),

1 Maintaining Data Privacy in Association Rule Mining Speaker: Minghua ZHANG Oct. 11, 2002 Authors: Shariq J. Rizvi Jayant R. Haritsa VLDB 2002.

Chapter 6 Sampling and Sampling Distributions

Security in Outsourcing of Association Rule Mining

Privacy-Preserving Data Mining

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

Stratified Sampling for Data Mining on the Deep Web

Privacy preserving cloud computing

Lecture Slides Elementary Statistics Twelfth Edition

Presentation transcript:

Privacy Preserving Mining of Association Rules Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke IBM Almaden Research Center Cornell University

Data Mining and Privacy The primary task in data mining: development of models about aggregated data. Can we develop accurate models without access to precise information in individual data records?

Data Mining and Privacy The primary task in data mining: development of models about aggregated data. Can we develop accurate models without access to precise information in individual data records? Answer: yes, by randomization. –R. Agrawal, R. Srikant “Privacy Preserving Data Mining,” SIGMOD 2000 –for numerical attributes, classification How about association rules?

Randomization Overview Recommendation Service Alice Bob B. Spears, baseball, cnn.com, … J.S. Bach, painting, nasa.gov, … Chris B. Marley, camping, linux.org, …

Recommendation Service Alice Bob J.S. Bach, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … B. Spears, baseball, cnn.com, … B. Spears, baseball, cnn.com, … B. Marley, camping, linux.org, … B. Marley, camping, linux.org, … B. Spears, baseball, cnn.com, … J.S. Bach, painting, nasa.gov, … Chris B. Marley, camping, linux.org, … Randomization Overview

Recommendation Service Associations Recommendations Alice Bob J.S. Bach, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … B. Spears, baseball, cnn.com, … B. Spears, baseball, cnn.com, … B. Marley, camping, linux.org, … B. Marley, camping, linux.org, … B. Spears, baseball, cnn.com, … J.S. Bach, painting, nasa.gov, … Chris B. Marley, camping, linux.org, … Randomization Overview

Recommendation Service Associations Recommendations Alice Bob Metallica, painting, nasa.gov, … Metallica, painting, nasa.gov, … B. Spears, soccer, bbc.co.uk, … B. Spears, soccer, bbc.co.uk, … B. Marley, camping, microsoft.com … B. Marley, camping, microsoft.com … B. Spears, baseball, cnn.com, … J.S. Bach, painting, nasa.gov, … Support Recovery Chris B. Marley, camping, linux.org, … Randomization Overview

Associations Recap A transaction t is a set of items (e.g. books) All transactions form a set T of transactions Any itemset A has support s in T if Itemset A is frequent if s  s min If A  B, then supp (A)  supp (B).

Associations Recap A transaction t is a set of items (e.g. books) All transactions form a set T of transactions Any itemset A has support s in T if Itemset A is frequent if s  s min If A  B, then supp (A)  supp (B). Example: –20% transactions contain beer, –5% transactions contain beer and diapers; –Then: confidence of “beer  diapers” is 5/20 = 0.25 = 25%.

The Problem How to randomize transactions so that –we can find frequent itemsets –while preserving privacy at transaction level?

Talk Outline Introduction Privacy Breaches Our Solution Experiments Conclusion

Uniform Randomization Given a transaction, –keep item with 20% probability, –replace with a new random item with 80% probability.

Example: {x, y, z} 1% have {x, y, z} 5% have {x, y}, {x, z}, or {y, z} only 10 M transactions of size 10 with 10 K items: 94% have one or zero items of {x, y, z}

Example: {x, y, z} 1% have {x, y, z} 5% have {x, y}, {x, z}, or {y, z} only 10 M transactions of size 10 with 10 K items: 94% have one or zero items of {x, y, z} Uniform randomization: How many have {x, y, z} ?

Example: {x, y, z} 1% have {x, y, z} 5% have {x, y}, {x, z}, or {y, z} only 10 M transactions of size 10 with 10 K items: 94% have one or zero items of {x, y, z} 0.008% 800 ts % 16 trans. less than % 2 transactions Uniform randomization: How many have {x, y, z} ? /10, at most 0.2 (9/10,000) 2

Example: {x, y, z} 1% have {x, y, z} 5% have {x, y}, {x, z}, or {y, z} only 10 M transactions of size 10 with 10 K items: 94% have one or zero items of {x, y, z} 0.008% 800 ts. 97.8% % 16 trans. 1.9% less than % 2 transactions 0.3% Uniform randomization: How many have {x, y, z} ? /10, at most 0.2 (9/10,000) 2

Example: {x, y, z} Given nothing, we have only 1% probability that {x, y, z} occurs in the original transaction Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one. This is what we call a privacy breach. Uniform randomization preserves privacy “on average,” but not “in the worst case.”

Privacy Breaches Suppose: –t is an original transaction; –t’ is the corresponding randomized transaction; –A is a (frequent) itemset. Definition: Itemset A causes a privacy breach of level  (e.g. 50%) if, for some item z  A, –Assumption: no external information besides t’.

Talk Outline Introduction Privacy Breaches Our Solution Experiments Conclusion

Our Solution Insert many false items into each transaction Hide true itemsets among false ones “Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” G.K. Chesterton

Our Solution Insert many false items into each transaction Hide true itemsets among false ones Can we still find frequent itemsets while having sufficient privacy? “Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” G.K. Chesterton

Definition of cut-and-paste Given transaction t of size m, construct t’ : a, b, c, u, v, w, x, y, z t = t’ =

Definition of cut-and-paste Given transaction t of size m, construct t’ : –Choose a number j between 0 and K m (cutoff); a, b, c, u, v, w, x, y, z t = t’ = j = 4

Definition of cut-and-paste Given transaction t of size m, construct t’ : –Choose a number j between 0 and K m (cutoff); –Include j items of t into t’ ; a, b, c, u, v, w, x, y, z t = b, v, x, z t’ = j = 4

Definition of cut-and-paste Given transaction t of size m, construct t’ : –Choose a number j between 0 and K m (cutoff); –Include j items of t into t’ ; –Each other item is included into t’ with probability p m. The choice of K m and p m is based on the desired level of privacy. a, b, c, u, v, w, x, y, z t = b, v, x, z t’ = œ, å, ß, ξ, ψ, €, א, ъ, ђ, … j = 4

Partial Supports To recover original support of an itemset, we need randomized supports of its subsets. Given an itemset A of size k and transaction size m, A vector of partial supports of A is –Here s k is the same as the support of A. –Randomized partial supports are denoted by

Transition Matrix Let k = |A|, m = |t|. Transition matrix P = P (k, m) connects randomized partial supports with original ones: Randomized supports are distributed as a sum of multinomial distributions.

The Unbiased Estimators Given randomized partial supports, we can estimate original partial supports:

The Unbiased Estimators Given randomized partial supports, we can estimate original partial supports: Covariance matrix for this estimator: To estimate it, substitute s l with (s est ) l. –Special case: estimators for support and its variance

Class of Randomizations Our analysis works for any randomization that satisfies two properties: –A per-transaction randomization applies the same procedure to each transaction, using no information about other transactions; –An item-invariant randomization does not depend on any ordering or naming of items.

Class of Randomizations Our analysis works for any randomization that satisfies two properties: –A per-transaction randomization applies the same procedure to each transaction, using no information about other transactions; –An item-invariant randomization does not depend on any ordering or naming of items. Both uniform and cut-and-paste randomizations satisfy these two properties.

Apriori Let k = 1, candidate sets = all 1-itemsets. Repeat: 1.Count support for all candidate sets 2.Output the candidate sets with support  s min 3.New candidate sets = all (k + 1) -itemsets s.t. all their k -subsets are candidate sets with support  s min 4.Let k = k + 1 Stop when there are no more candidate sets.

The Modified Apriori Let k = 1, candidate sets = all 1-itemsets. Repeat: 1.Estimate support and variance (σ 2 ) for all candidate sets 2.Output the candidate sets with support  s min 3.New candidate sets = all (k + 1) -itemsets s.t. all their k -subsets are candidate sets with support  s min - σ 4.Let k = k + 1 Stop when there are no more candidate sets, or the estimator’s precision becomes unsatisfactory.

Privacy Breach Analysis How many added items are enough to protect privacy? –Have to satisfy Pr [z  t | A  t’] <  (  no privacy breaches) –Select parameters so that it holds for all itemsets. –Use formula ( ):

Privacy Breach Analysis How many added items are enough to protect privacy? –Have to satisfy Pr [z  t | A  t’] <  (  no privacy breaches) –Select parameters so that it holds for all itemsets. –Use formula ( ): Parameters are to be selected in advance! –Construct a privacy-challenging test: an itemset whose all subsets have maximum possible support. –Enough to know maximal support of an itemset for each size.

Graceful Tradeoff Want more precision or more privacy? –Adjust privacy breach level –A small relaxation of privacy restrictions results in a small increase in precision of estimators.

Talk Outline Introduction Privacy Breaches Our Solution Experiments –Support recovery vs. parameters –Real-life data Conclusion

Lowest Discoverable Support LDS is s.t., when predicted, is 4  away from zero. Roughly, LDS is proportional to |t| = 5,  = 50%

LDS vs. Breach Level |t| = 5, |T| = 5 M Reminder: breach level is the limit on Pr [z  t | A  t’]

Talk Outline Introduction Privacy Breaches Our Solution Experiments –Support recovery vs. parameters –Real-life data Conclusion

Real datasets: soccer, mailorder Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests. –11 K items (HTMLs), 6.5 M transactions –Available at Mailorder is a purchase dataset from a certain on-line store –Products are replaced with their categories –96 items (categories), 2.9 M transactions A small fraction of transactions are discarded as too long. –longer than 10 (for soccer) or 7 (for mailorder)

Modified Apriori on Real Data Itemset Size True Itemsets True Positives False Drops False Positives Itemset Size True Itemsets True Positives False Drops False Positives Soccer: s min = 0.2%   0.07% for 3-itemsets Mailorder: s min = 0.2%   0.05% for 3-itemsets Breach level = 50%. Inserted 20-50% items to each transaction.

False Drops False Positives Size<  Size<  Size<  Size<  Soccer Mailorder Pred. supp%, when true supp  0.2% True supp%, when pred. supp  0.2%

Actual Privacy Breaches Verified actual privacy breach levels The breach probabilities are counted in the datasets for frequent and near-frequent itemsets. If maximum supports were estimated correctly, even worst-case breach levels fluctuated around 50% –At most 53.2% for soccer, –At most 55.4% for mailorder.

Talk Outline Introduction Privacy Breaches Our Solution Experiments Conclusion

Summary Privacy breaches: identified problem and provided a solution for controlling breaches Derived estimators of support and variance for a class of randomization operators Algorithm for discovering associations in randomized data Validated on real-life datasets Can find associations while preserving privacy at the level of individual transactions

Future Work Control of more general privacy breaches –What about other properties of transactions, for example item z  t breach caused by A  t’ =  ? –What about external information? Theoretical limits of discoverability for a given privacy breach level –How to compute theoretical limits? –How to attain them by an algorithm?

Thank You!

BACK-UPS

Our Solution: Example Old set-up: –Given 10,000 items, 10 M transactions of size 10 –100,000 transactions (1%) contain A = {x, y, z} In addition to uniform randomization with p = 80%, insert 500 new random items to each transaction. –~ 800 transactions contain {x, y, z} before and after; –Roughly (10 M) (500 / 10,000) 3 = 1250 transactions contain none before and full {x, y, z} after. Presence of {x, y, z} in a randomized transaction now says little about the original transaction.

Privacy Breach Analysis GIVEN: itemset A, and item z  A WANTED: Assume that partial supports are probabilities: Define: Then we have:

Limiting Privacy Breaches We want to make sure that always But we do not know supports in advance. Solution: For each itemset size k, give “privacy- challenging” test values to. –It is an itemset whose subsets have maximum supports –We need to estimate maximum support values prior to randomization

LDS vs. Transaction Size  = 50%, |T| = 5 M Too long transactions cannot be used for prediction

Related Work R. Agrawal, R. Srikant “Privacy Preserving Data Mining,” SIGMOD 2000: Each client has a numerical attribute x i Client i sends x i + y i, where y i = random offset, with known distribution Server reconstructs the distribution of original attributes (~ EM algorithm) The distribution is then used for classification –Numerical attributes only

Related Work Y. Lindell and B. Pinkas “Privacy Preserving Data Mining,” Crypto 2000 J. Vaidya and C. Clifton “Privacy Preserving Association Rule Mining in Vertically Partitioned Data” …

Privacy Concern Popular press: –“The End of Privacy”, “The Death of Privacy” Government directives: –European directive on privacy protection (Oct 98) –Canadian Personal Information Protection Act (Jan 2001) Surveys of Web users: –17% fundamentalists, 56% pragmatic majority, 27% marginally concerned (April 99) –82% said having privacy would matter (July 99)