1 Maintaining Data Privacy in Association Rule Mining. Speaker: Minghua ZHANG, Oct. 11, 2002. Authors: Shariq J. Rizvi and Jayant R. Haritsa, VLDB 2002.

2 Content: Background; Problem framework; MASK -- distortion part; MASK -- mining part; Performance; Conclusion

3 Background In data mining, the accuracy of the input data is very important for obtaining valuable mining results. However, in real life there are many reasons why data may be inaccurate. One example is that users deliberately provide wrong information to protect their privacy – age, income, illness, etc. Problem: how can we protect user privacy while still obtaining accurate mining results?

4 Background (cont'd) Privacy and accuracy are contradictory in nature, so a compromise is more feasible – satisfactory (not 100%) privacy together with satisfactory (not 100%) accuracy. This paper studies the problem in the context of mining association rules.

5 Overview of the Paper The authors propose a scheme --- MASK (Mining Associations with Secrecy Konstraints). Major idea of MASK – Apply a simple probabilistic distortion to the original data; the distortion can be done at the user's machine. – The miner tries to obtain accurate mining results given the following inputs: the distorted data, and a description of the distortion procedure.

6 Problem Framework Database model – Each customer transaction is a record in the database. – A record is a fixed-length sequence of 1's and 0's. E.g., for market-basket data: – length of the record: the total number of items sold by the market; – 1: the corresponding item was bought in the transaction; – 0: otherwise. – The database can be regarded as a two-dimensional boolean matrix.
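For concreteness, a minimal Python sketch (my illustration, not from the paper) that encodes item-list transactions as the fixed-length boolean matrix described above; the toy catalogue and transactions are made up.

```python
import numpy as np

def to_boolean_matrix(transactions, num_items):
    """Encode item-list transactions as a 0/1 matrix: one row per
    transaction, one column per item sold by the market."""
    matrix = np.zeros((len(transactions), num_items), dtype=np.uint8)
    for row, items in enumerate(transactions):
        matrix[row, list(items)] = 1   # 1 = item bought in this transaction
    return matrix

# Hypothetical toy data: 3 transactions over a catalogue of 5 items.
T = to_boolean_matrix([{0, 2}, {1, 2, 4}, {3}], num_items=5)
```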

7 Problem Framework (cont'd) The matrix is very sparse, so why not use item-lists? – Because the data will be distorted, and after the distortion it will not be as sparse as the original (true) data. Mining objective: find frequent itemsets – itemsets whose support (frequency of appearance) in the database is larger than a threshold.

8 Background; Problem framework; MASK --- distortion part; MASK --- mining part; Performance; Conclusion

9 MASK --- Distortion Part Distortion Procedure – Represent a customer record by a random vector. – Original record: X = {X_i}, where X_i = 0 or 1. – Distorted record: Y = {Y_i}, where Y_i = 0 or 1. Y_i = X_i (with probability p); Y_i = 1 - X_i (with probability 1 - p).
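A minimal Python sketch of this distortion step (an illustration, not the authors' code): each bit of the true record is kept with probability p and flipped with probability 1 - p, independently.

```python
import numpy as np

def mask_distort(true_matrix, p, rng=None):
    """MASK-style distortion: keep each 0/1 entry with probability p,
    flip it with probability 1 - p, independently per entry."""
    rng = rng or np.random.default_rng()
    flip = rng.random(true_matrix.shape) >= p   # True where the bit gets flipped
    return np.where(flip, 1 - true_matrix, true_matrix)

# e.g. D = mask_distort(T, p=0.9), with T the toy matrix from the earlier sketch.
```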

10 Quantifying Privacy Privacy metric – The probability of reconstructing the true data. – Consider each individual item: with what probability can a given 1 or 0 in the true matrix database be reconstructed? Calculating the reconstruction probability – Let s_i = Prob(a random customer C bought the i-th item) = the true support of item i. – The probability of correctly reconstructing a '1' of a random item i is: R_1(p, s_i) = s_i p^2 / (s_i p + (1 - s_i)(1 - p)) + s_i (1 - p)^2 / (s_i (1 - p) + (1 - s_i) p)

11 Reconstruction Probability Reconstruction probability of a '1' across all items: R_1(p) = (Σ_i s_i R_1(p, s_i)) / (Σ_i s_i). Let s_0 = the average support of an item. Replacing s_i by s_0, we get: R_1(p) = s_0 p^2 / (s_0 p + (1 - s_0)(1 - p)) + s_0 (1 - p)^2 / (s_0 (1 - p) + (1 - s_0) p)
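A small Python helper (an illustrative sketch, not from the paper) that evaluates these expressions: r1_item is the per-item form R_1(p, s_i) and r1_avg is the support-weighted average across items.

```python
def r1_item(p, s):
    """R_1(p, s): probability of correctly reconstructing a '1' for an
    item with true support s, under distortion parameter p."""
    return (s * p**2 / (s * p + (1 - s) * (1 - p))
            + s * (1 - p)**2 / (s * (1 - p) + (1 - s) * p))

def r1_avg(p, supports):
    """R_1(p) averaged across items, weighted by their true supports."""
    return sum(s * r1_item(p, s) for s in supports) / sum(supports)

# e.g. r1_item(0.9, 0.01) is small: a '1' of a sparse item is hard to reconstruct.
```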

12 Reconstruction Probability (cont'd) Relationship between R_1(p) and p, s_0. Observations: – R_1(p) is high when p is near 0 or 1, and it is lowest when p = 0.5. – The curves become flatter as s_0 decreases.

13 Privacy Measure The reconstruction probability of a '0' – R_0(p) = a similar function of p and s_0. The total reconstruction probability – R(p) = a R_1(p) + (1 - a) R_0(p), where a is a weight parameter. Privacy – P(p) = (1 - R(p)) × 100
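A sketch of the privacy measure, reusing r1_item from the previous block. The slide does not spell out R_0(p), so r0_item below assumes it is the symmetric counterpart of R_1 with the roles of '0' and '1' swapped; that is an assumption on my part.

```python
def r0_item(p, s):
    """Assumed symmetric counterpart of R_1: probability of correctly
    reconstructing a '0' for an item with true support s."""
    q = 1 - s
    return (q * p**2 / (q * p + s * (1 - p))
            + q * (1 - p)**2 / (q * (1 - p) + s * p))

def privacy(p, s0, a):
    """P(p) = (1 - R(p)) * 100 with R(p) = a*R_1(p) + (1-a)*R_0(p),
    using the average-support approximation s_i ~ s0."""
    R = a * r1_item(p, s0) + (1 - a) * r0_item(p, s0)
    return (1 - R) * 100

# e.g. privacy(0.9, 0.01, a=1.0) -- all weight on the privacy of 1's.
```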

14 Privacy Measure (cont'd) Privacy vs. p. Observations: – For a given value of s_0, the curve shape is fixed; the value of a determines the absolute value of privacy. – The privacy is nearly constant over a large range of p, which provides flexibility in choosing a p that minimizes the error in the later mining part. (Figure: P(p) for s_0 = 0.01)

15 Background; Problem framework; MASK --- distortion part; MASK --- mining part; Performance; Conclusion

16 MASK --- Mining Part How can the miner estimate accurate itemset supports from a distorted database? – Remember that the miner knows the value of p. Outline: estimating 1-itemset supports; estimating n-itemset supports; the whole mining process.

17 Estimating 1-itemset Supports Symbols: – T: the original (true) matrix; D: the distorted matrix; – i: a random item; – C_1^T and C_0^T: the number of 1's and 0's in column i of T; – C_1^D and C_0^D: the number of 1's and 0's in column i of D. From the distortion method, we have (in expectation) – C_1^D = C_1^T p + C_0^T (1 - p) – C_0^D = C_0^T p + C_1^T (1 - p) Let C^T = [C_1^T, C_0^T], C^D = [C_1^D, C_0^D], and M = [[p, 1-p], [1-p, p]]; then C^D = M C^T, so C^T = M^{-1} C^D.
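A NumPy sketch of this 1-itemset estimator (illustrative, using the 2x2 matrix M written above): count the 1's and 0's in one column of the distorted matrix D and solve the linear system to estimate the true counts.

```python
import numpy as np

def estimate_item_support(D, item, p):
    """Estimate the true support of one item from the distorted matrix D,
    using C^T = M^{-1} C^D with M = [[p, 1-p], [1-p, p]]."""
    c1_d = int(D[:, item].sum())
    c0_d = D.shape[0] - c1_d
    M = np.array([[p, 1 - p],
                  [1 - p, p]])
    c1_t, c0_t = np.linalg.solve(M, np.array([c1_d, c0_d], dtype=float))
    return c1_t / D.shape[0]   # estimated true (fractional) support

# e.g. estimate_item_support(mask_distort(T, 0.9), item=2, p=0.9)
```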

18 Estimating n-itemset Supports Still use C^T = M^{-1} C^D to estimate supports. Define – C_k^T: the number of records in T whose entries for the given itemset have the binary form of k. E.g., for a 3-itemset containing the first 3 items: – C^T has 2^3 = 8 entries; – C_3^T is the number of records in T of the form {0,1,1,…}. M_{i,j} = Prob(a record counted in C_j^T becomes a record counted in C_i^D). – E.g., M_{7,3} = p^2 (1 - p) (C_3^T -> C_7^D, i.e. C_011^T -> C_111^D).
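A sketch (my illustration, consistent with the M_{7,3} example above) of how the 2^n x 2^n matrix can be built: entry M[i][j] is (1 - p) raised to the number of bit positions where i and j differ, times p raised to the number where they agree.

```python
import numpy as np

def transition_matrix(n, p):
    """M[i][j] = Prob(true combination j is distorted into combination i)
    for an n-itemset, with each bit independently kept with probability p."""
    size = 2 ** n
    M = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            differing = bin(i ^ j).count("1")   # Hamming distance between i and j
            M[i, j] = (1 - p) ** differing * p ** (n - differing)
    return M

# Sanity check against the slide's example: M[7][3] should equal p^2 * (1 - p).
# M = transition_matrix(3, 0.9); assert abs(M[7, 3] - 0.9**2 * 0.1) < 1e-12
```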

19 Mining Process Similar to the Apriori algorithm. Difference: – E.g., when counting supports of 2-itemsets, Apriori only needs to count the number of records that have value '1' for both items, i.e. of form "11". MASK has to keep track of all 4 combinations 00, 01, 10 and 11 for the corresponding items: C_{2^n - 1}^T is estimated from C_0^D, C_1^D, …, C_{2^n - 1}^D. MASK therefore requires more time and space than Apriori. – Some optimizations (omitted).

20 Background; Problem framework; MASK --- distortion part; MASK --- mining part; Performance; Conclusion

21 Performance Data sets – Synthetic database: 1,000,000 records; 1000 items; s_0 = 0.01 – Real dataset: click-stream data of a retailer's web site; 600,000 records; about 500 items; s_0 = 0.005

22 Performance (cont'd) Error Metrics – Right class, wrong support: for infrequent itemsets the error does not matter; for frequent itemsets it is measured by the Support Error. – Wrong class: measured by the Identity Error, split into false positives (itemsets reported as frequent that are actually infrequent) and false negatives (truly frequent itemsets that are missed).
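A sketch of how such metrics might be computed. The exact formulas are not on the slide, so the definitions below (mean relative support deviation over the truly frequent itemsets, and false positives/negatives as fractions of the truly frequent set) are my assumptions.

```python
def support_error(true_sup, est_sup, frequent):
    """Assumed support-error metric: mean relative deviation (%) of the
    estimated support over the truly frequent itemsets."""
    errs = [abs(est_sup[x] - true_sup[x]) / true_sup[x] for x in frequent]
    return 100.0 * sum(errs) / len(errs)

def identity_errors(true_frequent, mined_frequent):
    """Assumed identity-error metrics: false positives and false negatives
    as percentages of the number of truly frequent itemsets."""
    F, R = set(true_frequent), set(mined_frequent)
    return 100.0 * len(R - F) / len(F), 100.0 * len(F - R) / len(F)
```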

23 Performance (cont'd) Parameters – sup = 0.25%, 0.5% – p = 0.9, 0.7 – a = 1: only concerned with the privacy of 1's – r = 0%, 10%: since coverage may be more important than precision, a smaller support threshold is used to mine the distorted database: support used to mine D = sup × (1 - r). E.g., sup = 0.25% and r = 10% give a mining threshold of 0.225%.

24 Performance (cont'd) Synthetic dataset – Experiment 1: p = 0.9 (privacy 85%), sup = 0.25%. [Result tables by itemset level (|F| and error figures) for r = 0% and r = 10%; values not shown.]

25 Performance (cont'd) Synthetic dataset – Experiment 2: p = 0.9 (privacy 85%), sup = 0.5%. [Result tables by itemset level (|F| and error figures) for r = 0% and r = 10%; values not shown.]

26 Performance (cont'd) Synthetic dataset – Experiment 3: p = 0.7 (privacy 96%), sup = 0.25%, r = 10%. [Result table by itemset level (|F| and error figures); values not shown.]

27 Performance (cont'd) Real database – Experiment 1: p = 0.9 (privacy 89%), sup = 0.25%. [Result tables by itemset level (|F| and error figures) for r = 0% and r = 10%; values not shown.]

28 Performance (cont'd) Real database – Experiment 2: p = 0.9 (privacy 89%), sup = 0.5%. [Result tables by itemset level (|F| and error figures) for r = 0% and r = 10%; values not shown.]

29 Performance (cont'd) Real database – Experiment 3: p = 0.7 (privacy 97%), sup = 0.25%, r = 10%. [Result table by itemset level (|F| and error figures); values not shown.]

30 Performance (cont'd) Summary – Good privacy and good accuracy can be achieved at the same time by careful selection of p. – In the experiments, p around 0.9 is the best choice: a smaller p leads to large errors in the mining results, while a larger p greatly reduces the privacy.

31 Conclusion This paper studies the problem of achieving satisfactory privacy and satisfactory accuracy simultaneously in association rule mining. A probabilistic distortion of the true data is proposed. Privacy is measured by a formula that is a function of p and s_0.

32 Conclusion (cont'd) A mining process is put forward to estimate the real supports from the distorted database. Experimental results show that there is a small window of p (near 0.9) that achieves good accuracy (90%+) and privacy (80%+) at the same time.

33 Related Works On preventing sensitive rules from being inferred by the miner (output privacy) – Y. Saygin, V. Verykios and C. Clifton, "Using Unknowns to Prevent Discovery of Association Rules", ACM SIGMOD Record, vol. 30, no. 4, 2001 – M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim and V. Verykios, "Disclosure Limitation of Sensitive Rules", Proc. of IEEE Knowledge and Data Engineering Exchange Workshop, Nov. 1999

34 Related Works On input data privacy in distributed databases – J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data", KDD 2002 – M. Kantarcioglu and C. Clifton, "Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data", Proc. of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002

35 Related Works Privacy-preserving mining in the context of classification rules – D. Agrawal and C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms", PODS 2001 A recent paper that also appeared in 2002 – A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, "Privacy Preserving Mining of Association Rules", KDD 2002

36 ?

37 More Information Distortion procedure – Y_i = X_i XOR r_i', where r_i' is the complement of r_i, and r_i is a random variable with distribution f(r) = Bernoulli(p) (0 <= p <= 1).

38 More Information Reconstruction error bounds (1-itemsets) – With probability P_E(m, p, (2p-1)ε/2) × P_E(n, p, (2p-1)ε/2), the estimation error is less than ε, where n is the real support count of the item and m = dbsize - n; P_E(n, p, δ) = Σ_{r = np-δ}^{np+δ} C(n, r) p^r (1-p)^{n-r}
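A sketch of the binomial tail term P_E using SciPy (my illustration; symbol names follow the reconstruction above).

```python
import math
from scipy.stats import binom

def p_e(n, p, delta):
    """P_E(n, p, delta): probability that a Binomial(n, p) count falls
    within +/- delta of its mean n*p (the sum written on the slide)."""
    mean = n * p
    lo = max(0, math.ceil(mean - delta))
    hi = min(n, math.floor(mean + delta))
    return float(binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p))

# The slide's bound then multiplies two such terms:
# p_e(m, p, (2*p - 1) * eps / 2) * p_e(n, p, (2*p - 1) * eps / 2)
```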

39 Reconstruction probability of a '1' in a random item i – s_i = the true support of item i = Pr(a random customer C bought the i-th item); X_i = the original entry for item i; Y_i = the distorted entry for item i. – The probability of correct reconstruction of a '1' in a random item i is: R_1(p, s_i) = Pr{Y_i=1 | X_i=1} × Pr{X_i=1 | Y_i=1} + Pr{Y_i=0 | X_i=1} × Pr{X_i=1 | Y_i=0} = s_i p^2 / (s_i p + (1 - s_i)(1 - p)) + s_i (1 - p)^2 / (s_i (1 - p) + (1 - s_i) p)
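One way to read this formula is as the success rate of an adversary who, after seeing the distorted bit, guesses '1' with probability equal to the posterior Pr{X_i=1 | Y_i}; under that reading (an assumption on my part), the Monte Carlo sketch below should agree with the closed form r1_item from earlier.

```python
import numpy as np

def empirical_r1(s, p, n=1_000_000, seed=0):
    """Monte Carlo check of R_1(p, s): distort one item column, let the
    adversary guess '1' with probability Pr{X=1 | Y}, and measure how
    often the true 1's are reconstructed correctly."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < s                      # true bits with support s
    y = np.where(rng.random(n) < p, x, ~x)     # MASK distortion with parameter p
    post1 = s * p / (s * p + (1 - s) * (1 - p))        # Pr{X=1 | Y=1}
    post0 = s * (1 - p) / (s * (1 - p) + (1 - s) * p)  # Pr{X=1 | Y=0}
    guess_one = rng.random(n) < np.where(y, post1, post0)
    return float((guess_one & x).sum() / x.sum())

# empirical_r1(0.01, 0.9) should be close to r1_item(0.9, 0.01).
```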