Security in Outsourced Association Rule Mining. Agenda: Introduction, Approximate randomized technique, Encryption, Summary and future work.


Security in Outsourced Association Rule Mining

Agenda
- Introduction
- Approximate randomized technique
- Encryption
- Summary and future work

Introduction
- Data mining in a company
  - Learn about the past activities of its customers
  - Make strategic decisions
- Types of data mining
  - Association rule mining
  - Clustering
  - Classification

Association rules
- "X => Y": if a transaction contains itemset X, the transaction will probably also contain itemset Y
- Support: number of supporting transactions (those that contain both X and Y)
- Confidence: proportion of the transactions containing X that also contain Y
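The two measures above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy baskets are assumptions made here, not part of the slides.

```python
from typing import FrozenSet, Iterable, List

def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def confidence(transactions: List[FrozenSet[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Share of the transactions containing X that also contain Y."""
    supp_x = support(transactions, x)
    if supp_x == 0:
        return 0.0  # rule is undefined without any X transaction; report 0
    return support(transactions, frozenset(x) | frozenset(y)) / supp_x

# Hypothetical toy data, chosen only for illustration.
baskets = [frozenset(b) for b in (
    ["milk", "bread"], ["milk", "bread", "eggs"], ["milk"], ["bread"])]

print(support(baskets, {"milk", "bread"}))       # 2
print(confidence(baskets, {"milk"}, {"bread"}))  # 2 of the 3 milk baskets
```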

Performing data mining
- Build the application
  - Development cost? Time?
- Buy software
  - Fits the requirements? Maintenance?
- Outsource

Concerns in outsourcing
- Output
  - Execution assurance
  - Correctness
- Security
  - Privacy of records
  - Information of the company
[Diagram labels: Company, DB, Data Miner]

Approximate randomized technique

Approximate solution
- "Privacy Preserving Mining of Association Rules", SIGKDD 2002
- Authors: Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke

Problem formulation
- Let the set of transactions be T = {t_1, t_2, …, t_N}
- Transform T to T' = {t'_1, t'_2, …, t'_N}
- Mine in T'
- Privacy breaches
  - Itemset A causes a privacy breach of level p if, for some item a in A, P[a in t_i | A in t'_i] >= p

Select-a-size randomization
- For each transaction t_i in T
  - Let m = length of t_i
  - Randomly (non-uniformly) select an integer j from [0, m]
  - Copy j items of t_i, chosen uniformly at random, to t'_i
  - For every item a not in t_i, add a to t'_i with a given probability p_m
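The steps above can be sketched as follows. The slides do not say which non-uniform distribution is used for j, so this sketch takes it as an input; `size_weights`, `all_items` and the function name are assumptions made for illustration.

```python
import random
from typing import Sequence, Set

def select_a_size(t: Set[str], all_items: Set[str],
                  size_weights: Sequence[float], p_m: float,
                  rng: random.Random) -> Set[str]:
    """Randomize one transaction t into t'.

    size_weights[j] is the (assumed) relative weight of keeping exactly
    j original items, for j = 0..m where m = len(t).
    """
    m = len(t)
    if len(size_weights) != m + 1:
        raise ValueError("size_weights needs one weight per j in [0, m]")
    # Non-uniform choice of j from [0, m].
    j = rng.choices(range(m + 1), weights=size_weights, k=1)[0]
    # Copy j items of t, chosen uniformly at random.
    t_prime = set(rng.sample(sorted(t), j))
    # Each item outside t is added independently with probability p_m.
    for a in all_items:
        if a not in t and rng.random() < p_m:
            t_prime.add(a)
    return t_prime
```

With `p_m = 0` the output is always a subset of the original transaction; raising `p_m` trades accuracy for privacy by adding more foreign items.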

Run on real data
- Privacy breach of level <= 50%: P[a in t_i | A in t'_i] <= 50%
- Accuracy = # true positives / # found itemsets
- Set 1
  [Table, values not recovered. Columns: Itemset Size, True Itemsets, True Positives, False Drops, False Positives, Accuracy (%)]

Accuracy
- Set 2
  [Table, values not recovered. Columns: Itemset Size, True Itemsets, True Positives, False Drops, False Positives, Accuracy (%)]

Problems
- Estimated counts of large itemsets vary
  - Lower accuracy of association rules
- The "beer and diapers" story
  - Customers who buy diapers also tend to buy beer
  - Some strange rules are hard to believe
- Wrong decisions are expensive
  - Supermarket: layout design
  - Health center: identifying a new disease

Security concerns
- Individual transactions are protected
- Private association rules can still be estimated by other parties
  - Adversaries may act on the association rules they find

Encryption

Problem formulation
- Let the set of transactions be T = {t_1, t_2, …, t_N}
- I is the entire set of items
  - Every t_i is a subset of I
- Transform T to T' = {t'_1, t'_2, …, t'_N}
- A third party mines T' and gets AR'
- Transform AR' to AR

Architecture
[Diagram labels: DB, Transformer, Association Rules, Association Rules, Mappings]

Encryption
- To protect a message, simple encryption can be applied
  - "GOOD DOG" can be encrypted as "PLLX XLP"
- Association rule encryption
  - 752 => 891 ?
  - Milk => Bread
- Transaction encryption
  - ?

Simple scheme
- Encryption
  - For every transaction t_i
    - For every item x in t_i, add f(x) to t'_i, where f is a bijective function
- Decryption
  - For every association rule r_i
    - For every item y in r_i, replace y by f^-1(y)
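A minimal sketch of the simple scheme, assuming f is a random one-to-one assignment of integer codes to items. The helper names and the choice of integer codes are illustrative assumptions, not taken from the slides.

```python
import random
from typing import Dict, List, Set, Tuple

def make_bijection(items: Set[str], rng: random.Random) -> Dict[str, int]:
    """Assign each item a distinct integer code (a bijective f)."""
    codes = list(range(1, len(items) + 1))
    rng.shuffle(codes)
    return dict(zip(sorted(items), codes))

def encrypt(transactions: List[Set[str]], f: Dict[str, int]) -> List[Set[int]]:
    """Replace every item x of every transaction by f(x)."""
    return [{f[x] for x in t} for t in transactions]

def decrypt_rule(rule: Tuple[Set[int], Set[int]],
                 f: Dict[str, int]) -> Tuple[Set[str], Set[str]]:
    """Replace every code y in an encrypted rule by f^-1(y)."""
    inverse = {code: item for item, code in f.items()}
    lhs, rhs = rule
    return {inverse[y] for y in lhs}, {inverse[y] for y in rhs}
```

Because f is one-to-one, every encrypted item keeps the frequency of the original item, which is what makes this scheme easy to attack.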

Problems with simple encryption
- It is easy to crack
  - "PLLX XLP": 26P3 combinations (3 distinct letters out of 26), with at least one vowel
  - Association rules: # Bread > # Car (item frequencies are preserved)
  - The number of association rules and the number of large itemsets are disclosed
- Solution
  - Use a more complex scheme

Fake items
- Probability of correctly guessing a single mapping = 1 / |I|
- Randomly add some fake items to each transaction
  - Decreases the above probability to 1 / (|I| + |F|)

One-to-n Mapping
- Originally, we use a "one-to-one" mapping: one item -> one item
  - A -> 1; B -> 2; C -> 3
- We form a "one-to-n" mapping
  - A -> 1, 4, 5; B -> 2; C -> 3, 5
- This greatly increases the number of possible mappings of an item
  - C(|I|+|F|, 1) + C(|I|+|F|, 2) + … + C(|I|+|F|, |F|)

Example transformation
- Mapping f: A -> 1, 4, 5; B -> 2; C -> 3, 5
- T  = {A} {B} {C} {A, B} {A, C} {B, C} {A, B, C}
- T' = {1, 4, 5} {2} {3, 5} {1, 2, 4, 5} {1, 3, 4, 5} {2, 3, 5} {1, 2, 3, 4, 5}
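The example can be reproduced by taking the union of f(x) over the items of each transaction. The sketch below hard-codes the mapping from this slide; the function name is an assumption. Note that {A, C} maps to f(A) U f(C) = {1, 3, 4, 5}.

```python
from typing import Dict, List, Set

# One-to-n mapping from the slide.
f: Dict[str, Set[int]] = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}

def map_transaction(t: Set[str], f: Dict[str, Set[int]]) -> Set[int]:
    """Union of f(x) over all items x in the transaction."""
    out: Set[int] = set()
    for x in t:
        out |= f[x]
    return out

T: List[Set[str]] = [{"A"}, {"B"}, {"C"}, {"A", "B"},
                     {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
T_prime = [map_transaction(t, f) for t in T]

for t, tp in zip(T, T_prime):
    print(sorted(t), "->", sorted(tp))
```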

Limitation on the mapping f
- For any item x, there must not exist items y_1, y_2, …, y_k (all different from x) such that f(x) is a subset of f(y_1) U f(y_2) U … U f(y_k)
- Consider an example: A -> 1, 2; B -> 2, 3; C -> 3, 4
  - AC -> 1, 2, 3, 4
  - ABC -> 1, 2, 3, 4
  - AC and ABC cannot be told apart, because f(B) is a subset of f(A) U f(C)

Limitation on the mapping f
- For any item x: f(x) - U_{i in I, i != x} f(i) != empty
- Every item must map to something unique
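The condition above is easy to check mechanically. A small sketch, with hypothetical helper names, that computes the part of f(x) not shared with any other item and validates a mapping:

```python
from typing import Dict, Set

def unique_part(f: Dict[str, Set[int]], x: str) -> Set[int]:
    """f(x) minus the union of f(i) over all other items i."""
    others: Set[int] = set()
    for item, codes in f.items():
        if item != x:
            others |= codes
    return f[x] - others

def is_valid_mapping(f: Dict[str, Set[int]]) -> bool:
    """True if every item maps to at least one code no other item uses."""
    return all(unique_part(f, x) for x in f)

good = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}
bad = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}   # f(B) is covered by f(A) U f(C)

print(is_valid_mapping(good))  # True
print(is_valid_mapping(bad))   # False
```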

Mapping generation: Item Extend
- Initialize every item to map to something unique, taken from I'
- For every item x in IE
  - Randomly pick some mappings
  - Extend each picked mapping by x

Example run
- A -> 1
- B -> 2
- C -> 3
- IE = {4, 5}

Considering item 4
- Before: A -> 1; B -> 2; C -> 3
- Pick A
- After: A -> 1, 4; B -> 2; C -> 3

Considering item 5
- Before: A -> 1, 4; B -> 2; C -> 3
- Pick A, C
- After: A -> 1, 4, 5; B -> 2; C -> 3, 5
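The Item Extend run above can be sketched as follows. The slides only say "randomly pick some mappings", so the per-mapping pick probability (`pick_prob`) and the fallback to one random mapping when none is picked are assumptions of this sketch, as is the use of consecutive integers for I' and IE.

```python
import random
from typing import Dict, Set

def item_extend(items: Set[str], n_extend: int, rng: random.Random,
                pick_prob: float = 0.5) -> Dict[str, Set[int]]:
    """Generate a one-to-n mapping with the Item Extend idea."""
    # Core step: every item gets its own unique code 1..|I| (the set I').
    f = {x: {k} for k, x in enumerate(sorted(items), start=1)}
    first = len(f) + 1
    # Extension step: each code in IE is appended to some randomly picked mappings.
    for code in range(first, first + n_extend):
        picked = [x for x in f if rng.random() < pick_prob]
        if not picked:                      # assumed fallback: use at least one
            picked = [rng.choice(sorted(f))]
        for x in picked:
            f[x].add(code)
    return f

mapping = item_extend({"A", "B", "C"}, n_extend=2, rng=random.Random(1))
print(mapping)
```

Since the core codes are never shared, every generated mapping keeps at least one unique code per item.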

Item Extend
- Every item must map to something unique
  - Say 1 is unique to f(A): supp_T(A) = supp_T'(1)
- For a transaction t without item A
  - Add a subset of the unique mapping set to t' with some probability
  - {1, 4} is the unique mapping set in f(A), so {}, {1}, {4} or {1, 4} may be added
- Mapping: A -> 1, 4, 5; B -> 2; C -> 3, 5

Fake items again
- Now, every item in t'_i must be in some mapping
- Randomly add some fake items from F to each transaction
- Mapping f: I -> I' U IE U F
  - I': core "unique" items
  - IE: expanding items
  - F: fake items

Basic transformation framework
- For each transaction t
  - For each item x in t: add f(x) to t'
  - For each item i in I - t: add a random subset of the unique mapping set of f(i) to t'
  - For each fake item in F: toss a biased coin for that item and add it to t' on heads (the probability should differ from item to item)
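A hedged sketch of this framework follows. It deviates from the slides in one labeled way: for an absent item it adds only a proper subset of the unique mapping set, an assumption made here so that an absent item can never look fully present in t'. The per-item fake probabilities (`fake_probs`) and the 0.5 inclusion chance are also illustrative assumptions.

```python
import random
from typing import Dict, Set

def unique_set(f: Dict[str, Set[int]], x: str) -> Set[int]:
    """Codes of f(x) that no other item maps to."""
    others: Set[int] = set()
    for item, codes in f.items():
        if item != x:
            others |= codes
    return f[x] - others

def random_proper_subset(s: Set[int], rng: random.Random) -> Set[int]:
    """Random subset of s that always leaves at least one element out."""
    items = sorted(s)
    if not items:
        return set()
    dropped = rng.choice(items)
    return {c for c in items if c != dropped and rng.random() < 0.5}

def transform(t: Set[str], f: Dict[str, Set[int]],
              fake_probs: Dict[int, float], rng: random.Random) -> Set[int]:
    t_prime: Set[int] = set()
    for x in t:                               # real items: add f(x)
        t_prime |= f[x]
    for i in f:                               # absent items: add noise codes
        if i not in t:
            t_prime |= random_proper_subset(unique_set(f, i), rng)
    for fake, p in fake_probs.items():        # fake items: biased coin each
        if rng.random() < p:
            t_prime.add(fake)
    return t_prime
```

Under the proper-subset assumption, counting the transactions of T' that contain all of f(x) gives back the support of x in T.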

Recovering association rules
- Given an encrypted rule r': X => Y in AR'
- If there exist i_1, i_2, …, i_m in I such that f(i_1) U f(i_2) U … U f(i_m) = X
- And there exist j_1, j_2, …, j_n in I such that f(j_1) U f(j_2) U … U f(j_n) = X U Y
- Then r: {i_1, i_2, …, i_m} => {j_1, j_2, …, j_n} - {i_1, i_2, …, i_m} is a rule in AR
- Otherwise, the rule is not correct

Example
- Mapping f: A -> 1, 4, 5; B -> 2; C -> 3, 5
- Given
  - 1 => 4 (rejected)
  - 2 => 1, 5 (rejected)
  - 2 => 1, 3, 5 (rejected)
  - 2 => 1, 3, 4, 5 (B => AC)
  - 2, 3, 5 => 1, 4 (BC => A)
    - {2, 3, 5} decodes to BC
    - {1, 2, 3, 4, 5} decodes to ABC
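The recovery test can be sketched in Python and run on the five rules of this example. The helper names are assumptions; the decoding relies on the requirement that every item has at least one unique code, so the set of items whose mappings fit inside a code set is the only possible decoding.

```python
from typing import Dict, Optional, Set, Tuple

def decode(codes: Set[int], f: Dict[str, Set[int]]) -> Optional[Set[str]]:
    """Items whose mappings union to exactly `codes`, or None if impossible."""
    items = {x for x, fx in f.items() if fx <= codes}
    covered: Set[int] = set()
    for x in items:
        covered |= f[x]
    return items if items and covered == codes else None

def recover_rule(lhs: Set[int], rhs: Set[int],
                 f: Dict[str, Set[int]]) -> Optional[Tuple[Set[str], Set[str]]]:
    """Map an encrypted rule X => Y back to original items, or reject it."""
    x_items = decode(lhs, f)
    xy_items = decode(lhs | rhs, f)
    if x_items is None or xy_items is None:
        return None                      # rejected: not a valid encrypted rule
    return x_items, xy_items - x_items

f = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}
print(recover_rule({1}, {4}, f))              # None (rejected)
print(recover_rule({2}, {1, 3, 4, 5}, f))     # B => A, C
print(recover_rule({2, 3, 5}, {1, 4}, f))     # B, C => A
```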

Correctness
- Proposition
  - For any items x, y, where f is the transformation mapping
    - supp_T(x) = supp_T'(f(x))
    - supp_T(x U y) = supp_T'(f(x) U f(y))
  - For any itemsets X, Y, where F is the transformation mapping
    - supp_T(X) = supp_T'(F(X))
    - supp_T(X U Y) = supp_T'(F(X) U F(Y))
- No false drops and no false positives

Summary
- Generation of mappings
  - One-to-n mappings
  - Item Extend
- Transformation of transactions
  - Mapping f(x)
  - Subsets of the unique mapping set
  - Fake items
- Recovering association rules
  - Reverse mappings and filtering

Test run
- # items = 1k, |T| = 1k
- Without transformation
  - One rule
  - Time: 8 s
- Item Extend
  - 147 rules
  - Total time: 26 s
  - Mapping generation and transformation: 219 ms

Future Work
- Define the parameters of the problem
  - Size of IE
  - Size of F
- Give a clear measure of security
- Give a clear measure of overhead
- Correctness of association rules
  - Query execution proof
  - Result verification

The End

Choosing probability
- A uniform distribution, or any fixed distribution, gives patterns that may be easily identified
- Random probability distribution
  - {}: 70%, {1}: 5%, {4}: 15%, {1, 4}: 20%
  - Storage: needs additional storage

Algorithm for transformation
- Transformation is the most costly process
- Execution time is linear in the database size |T|
- Should be as fast as possible

Optimization
- Mapping retrieval
  - For an item x, use a hash table h to retrieve the mapping h(x)
- Adding fake items
  - First randomly determine the number of items to add (according to the probability of adding items)
  - Then randomly pick that many items from the set (non-uniform distribution)
  - Gives a much shorter runtime on average

Choice of mapped items
- Mapped items: 1, 2, …, |I| + |IE| + |F| * (1 + δ)
- Acceptable as long as it is not easy to identify I', IE and F
- One way is to use a random permutation of the first |I| + |IE| + |F| natural numbers
  - The first |I| numbers are mapped to I'
  - The next |IE| numbers are IE

Cut and paste randomization
- One case of select-a-size randomization
- The way to select j
  - Given an integer K_m > 0
  - Randomly choose j in [0, K_m]
  - If j > m, set j = m
- Overall input parameters: K_m, p_m
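The selection of j can be sketched in a couple of lines. The slide says only "randomly"; this sketch assumes a uniform draw over the integers 0..K_m, and the function name is hypothetical.

```python
import random

def choose_j(m: int, k_m: int, rng: random.Random) -> int:
    """Pick how many original items to keep for a transaction of length m."""
    j = rng.randint(0, k_m)   # assumed uniform over 0..K_m, both ends included
    return min(j, m)          # "if j > m, set j = m"
```

Under the uniform assumption, and when K_m >= m, the whole transaction is kept (j = m) with probability (K_m - m + 1) / (K_m + 1), so short transactions are more likely to survive intact than long ones.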

Effects on support
- Support of A in T' (A' denotes that A is absent)
  - A in t, and A is not replaced
  - A' in t, and A is randomly added
- Support of AB in T'
  - AB in t, and neither A nor B is replaced
  - AB' in t, and B is randomly added
  - A'B in t, and A is randomly added
  - A'B' in t, and both A and B are randomly added

Estimating original support
- Let x = support of A in T and y = support of A in T'
  - x * P(A remains in original transaction) + (|DB| - x) * p_m = y
- Support of AB in T, from
  - Support of AB in T'
  - Support of AB', A'B in T'
  - Support of A'B' in T'
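For the single-item case, the equation can be solved for the original support x. Writing q for P(A remains in original transaction), a symbol introduced here only for brevity, and assuming q differs from p_m:

```latex
\begin{align*}
x\,q + (|DB| - x)\,p_m &= y \\
x\,(q - p_m) &= y - |DB|\,p_m \\
x &= \frac{y - |DB|\,p_m}{q - p_m}, \qquad q \neq p_m .
\end{align*}
```

The observed count y is itself random, so the value obtained this way is an estimate of x rather than the exact support.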

Apriori property
- Suppose m = 2 for all t in T
- |T| = 10, I = {A, B}
- p_m = 0, j = 1
- Support of B in T': supp_T'(B) = 0, so E(supp_T(B)) = 0
- supp_T'(A) = 10
- supp_T'(AB) = 0
- E(supp_T(AB)) = supp_T'(A) * 1 = 10

Apriori property
- An itemset that is expected to be large may have a subset that is expected to be small
- But generally the supports of subsets are not too small
- Instead of using the support threshold to filter out all small candidates, use a smaller value

Apriori algorithm
- Generate candidate sets
- Scan the database for counts
- Recover the predicted support
- Discard candidates with support smaller than the candidate limit
- Save for output the candidates with support >= support threshold
- Apriori_gen(remaining candidates)

Candidate limit
- A high value
  - Increases the number of false drops
  - Poor correctness
- A small value
  - Increases the number of candidate sets
  - High running time
- Experiment
  - Support threshold: s_min
  - Estimated standard deviation: δ
  - s_min - δ is found to be a good value
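The two thresholds can be illustrated with a small filter. This is a sketch under the reading that candidates below the candidate limit s_min - δ are discarded, and only those at or above s_min are reported; the function name and the sample numbers are hypothetical.

```python
from typing import Dict, Tuple

Itemset = Tuple[str, ...]

def split_candidates(predicted: Dict[Itemset, float], s_min: float,
                     delta: float) -> Tuple[Dict[Itemset, float], Dict[Itemset, float]]:
    """Return (candidates kept for the next Apriori pass, itemsets to output)."""
    limit = s_min - delta                       # relaxed candidate limit
    remaining = {c: s for c, s in predicted.items() if s >= limit}
    output = {c: s for c, s in remaining.items() if s >= s_min}
    return remaining, output

# Hypothetical predicted supports.
predicted = {("A",): 50.0, ("B",): 38.0, ("C",): 20.0}
remaining, output = split_candidates(predicted, s_min=40.0, delta=5.0)
print(sorted(remaining))  # [('A',), ('B',)]
print(sorted(output))     # [('A',)]
```

Here B is below the support threshold but stays a candidate, so supersets of B can still be generated in the next pass.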

Other applications
- Outsourced transaction database (secure) storage
- Outsourced association rule mining using data streams
- Secure distributed association rule mining with a third-party miner

Outsourced database with association rule mining service
[Diagram labels: DB, Transformer, Association Rules, Association Rules, Mappings, Transactions, Query]