Security in Outsourced Association Rule Mining


1 Security in Outsourced Association Rule Mining

2 Agenda
• Introduction
• Approximate randomized technique
• Encryption
• Summary and future work

3 Introduction
• Data mining in a company
  - Learn about the past activities of its customers
  - Make strategic decisions
• Types of data mining
  - Association rule mining
  - Clustering
  - Classification

4 Association rules
• “X => Y”
  - If a transaction contains itemset X, it will probably also contain itemset Y
  - Support: number of transactions that support the rule (contain both X and Y)
  - Confidence: proportion of the transactions containing X that also contain Y
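The two measures on this slide can be computed directly from a list of transactions. The following is a minimal sketch; the function names and the small `baskets` data set are illustrative assumptions, not part of the original presentation.

```python
def support(transactions, itemset):
    """Count the transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical toy data, used only to exercise the two functions.
baskets = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread", "eggs"}]
rule_support = support(baskets, {"milk", "bread"})
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
```

Here the rule milk => bread is supported by two of the four baskets, and two of the three baskets containing milk also contain bread.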

5 Performing data mining
• Build an application
  - Development cost? Time?
• Buy software
  - Does it fit the requirements? Maintenance?
• Outsource

6 Concerns in outsourcing
• Output
  - Execution assurance
  - Correctness
• Security
  - Privacy of records
  - Information of the company
[Diagram: Company DB, Data Miner]

7 Approximate randomized technique

8 Approximate solution
• Privacy Preserving Mining of Association Rules
  - SIGKDD 2002
  - Authors: Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke

9 Problem formulation
• Let the set of transactions be $T = \{t_1, t_2, \ldots, t_N\}$
• Transform $T$ to $T' = \{t'_1, t'_2, \ldots, t'_N\}$
• Mine in $T'$
• Privacy breaches
  - Itemset $A$ causes a privacy breach of level $p$ if, for some item $a \in A$, $P[a \in t_i \mid A \subseteq t'_i] \ge p$

10 Select-a-size randomization
• For each transaction $t_i$ in $T$
  - Let $m$ be the length of $t_i$
  - Randomly (non-uniformly) select an integer $j$ from $[0, m]$
  - Copy $j$ items of $t_i$, chosen uniformly at random, to $t'_i$
  - For every item $a$ not in $t_i$, add $a$ to $t'_i$ with a given probability $p_m$
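The per-transaction steps of this slide can be sketched as follows. The slide does not give the distribution used to pick $j$, so the sketch takes it as a caller-supplied list `size_weights` (an assumption); the function name, argument names, and the use of a seeded `random.Random` are likewise illustrative.

```python
import random


def select_a_size(t, all_items, size_weights, p_m, rng):
    """Randomize one transaction t.

    size_weights[j] is the (assumed, caller-supplied) weight of keeping
    exactly j of the original items, for j = 0..len(t).
    """
    original = sorted(t)
    m = len(original)
    # Step 1: pick j non-uniformly from [0, m].
    j = rng.choices(range(m + 1), weights=size_weights)[0]
    # Step 2: copy j items of t, chosen uniformly at random.
    kept = set(rng.sample(original, j))
    # Step 3: add every item outside t independently with probability p_m.
    added = {a for a in sorted(all_items) if a not in t and rng.random() < p_m}
    return kept | added
```

With `p_m = 0` and all weight on `j = m`, the transaction is returned unchanged; with `p_m = 1` and all weight on `j = 0`, it is replaced by its complement.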

11 Run on real data
• Privacy breach of level <= 50%: $P[a \in t_i \mid A \subseteq t'_i] \le 50\%$
• Accuracy = # true positives / # found itemsets
• Set 1

Itemset size   True itemsets   True positives   False drops   False positives   Accuracy
1              65              65               0             0                 100%
2              228             212              16            28                88%
3              22              18               4             5                 78%

12 Accuracy
• Set 2

Itemset size   True itemsets   True positives   False drops   False positives   Accuracy
1              266             254              12            31                89%
2              217             195              22            45                81%
3              48              43               5             26                62%

13 Problems
• Estimated counts of large itemsets vary
  - Lower accuracy of association rules
• The "beer and diapers" story
  - Customers who buy diapers also tend to buy beer
  - Some strange rules are hard to believe
• Wrong decisions are expensive
  - Supermarket: layout design
  - Health center: identifying a new disease

14 Security concerns
• Individual transactions are protected
• Private association rules can still be estimated by other parties
  - An adversary may act on the association rules found

15 Encryption

16 Problem formulation
• Let the set of transactions be $T = \{t_1, t_2, \ldots, t_N\}$
• $I$ is the entire set of items
  - Every $t_i$ is a subset of $I$
• Transform $T$ to $T' = \{t'_1, t'_2, \ldots, t'_N\}$
• A third party mines $T'$ and gets $AR'$
• Transform $AR'$ to $AR$

17 Architecture
[Diagram: DB, Transformer, Mappings, Association Rules]

18 Encryption
• To protect a message, simple encryption can be applied
  - “GOOD DOG” can be encrypted as “PLLX XLP”
• Association rule encryption
  - 752 => 891? Milk => Bread
• Transaction encryption?

19 Simple scheme
• Encryption
  - For every transaction $t_i$ and every item $x$ in $t_i$: add $f(x)$ to $t'_i$, where $f$ is a bijective function
• Decryption
  - For every association rule $r_i$ and every item $y$ in $r_i$: replace $y$ by $f^{-1}(y)$
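A minimal sketch of this one-to-one scheme is shown below. The slide only requires $f$ to be a bijection; drawing it as a random permutation of the integers 1..|I| is an assumption, as are all function names.

```python
import random


def make_bijection(items, rng):
    """Assign each item a distinct integer code (a random permutation)."""
    codes = list(range(1, len(items) + 1))
    rng.shuffle(codes)
    return dict(zip(sorted(items), codes))


def encrypt_transactions(transactions, f):
    """Replace every item x of every transaction by f(x)."""
    return [{f[x] for x in t} for t in transactions]


def decrypt_rule(rule, f):
    """Map both sides of an encrypted rule back through the inverse of f."""
    inverse = {code: item for item, code in f.items()}
    lhs, rhs = rule
    return {inverse[y] for y in lhs}, {inverse[y] for y in rhs}
```

The data owner keeps `f` private, sends the encrypted transactions to the miner, and applies `decrypt_rule` to each rule that comes back.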

20 Problems with simple encryption
• It is easy to crack
  - “PLLX XLP”: $^{26}P_3$ combinations, with at least one vowel
  - Association rules: # Bread > # Car
  - # association rules and # large itemsets are disclosed
• Solution
  - Use a more complex scheme

21 Fake items
• Probability of correctly guessing a single mapping = $1 / |I|$
• Randomly add some fake items to each transaction
  - Decreases the above probability to $1 / (|I| + |F|)$

22 One-to-n mapping
• Originally, the mapping is “one-to-one”: one item → one item
  - A → 1; B → 2; C → 3
• We form a “one-to-n” mapping
  - A → 1, 4, 5; B → 2; C → 3, 5
• Greatly increases the number of possible mappings of an item
  - $\binom{|I|+|F|}{1} + \binom{|I|+|F|}{2} + \cdots + \binom{|I|+|F|}{|F|}$

23 Example transformation
• Mapping f: A → 1, 4, 5; B → 2; C → 3, 5
• T = {A} {B} {C} {A, B} {A, C} {B, C} {A, B, C}
• T’ = {1, 4, 5} {2} {3, 5} {1, 2, 4, 5} {1, 3, 4, 5} {2, 3, 5} {1, 2, 3, 4, 5}
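As a sketch, the one-to-n mapping can be applied to a transaction by taking the union of the images of its items. The mapping and the transactions below are the ones on this slide; the names `F_MAP`, `transform`, and `T_prime` are illustrative.

```python
# Mapping from the slide: A -> 1, 4, 5; B -> 2; C -> 3, 5
F_MAP = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}


def transform(t, f):
    """Union of f(x) over all items x in transaction t."""
    return set().union(*(f[x] for x in t))


T = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
T_prime = [transform(t, F_MAP) for t in T]
```

Because code 5 belongs to both f(A) and f(C), it appears only once in the image of {A, C}.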

24 Limitation on the mapping f
• For any item $x$, there must not exist items $y_1, y_2, \ldots, y_k$ (all different from $x$) such that $f(x) \subseteq f(y_1) \cup f(y_2) \cup \cdots \cup f(y_k)$
• Consider an example
  - A → 1, 2; B → 2, 3; C → 3, 4
  - AC → 1, 2, 3, 4
  - ABC → 1, 2, 3, 4
  - AC and ABC cannot be told apart, because $f(B) \subseteq f(A) \cup f(C)$

25 Limitation on the mapping f
• For any item $x$: $f(x) \setminus \bigcup_{i \in I,\, i \ne x} f(i) \ne \emptyset$
• Every item must map to something unique
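The condition can be checked mechanically: for each item, remove every code that some other item also maps to, and require that something is left. The sketch below uses illustrative function names; the two test mappings are the invalid A/B/C example and the A → 1, 4, 5 mapping from the slides.

```python
def unique_codes(f, x):
    """Codes in f(x) that appear in no other item's image."""
    others = set().union(*(img for y, img in f.items() if y != x))
    return f[x] - others


def is_valid_mapping(f):
    """True if every item keeps at least one code of its own."""
    return all(unique_codes(f, x) for x in f)


bad = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}   # f(B) is covered by f(A) and f(C)
good = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}
```

For `good`, the unique codes of A are {1, 4}: code 5 is shared with C.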

26 Mapping generation: Item Extend
• Initialize every item to map to a unique item in $I'$
• For every item $x$ in $IE$
  - Randomly pick some mappings
  - Extend each picked mapping by $x$

27 Example run
• A → 1
• B → 2
• C → 3
• IE = {4, 5}

28 Considering item 4
• Before: A → 1; B → 2; C → 3
• Pick A
• After: A → 1, 4; B → 2; C → 3

29 Considering item 5
• Before: A → 1, 4; B → 2; C → 3
• Pick A, C
• After: A → 1, 4, 5; B → 2; C → 3, 5
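The Item Extend procedure walked through in this example run can be sketched as below. The slides say only that "some" mappings are picked for each extending item; picking between 1 and `max_picks` mappings uniformly at random is an assumption of this sketch, as are the function and parameter names. It also assumes a non-empty item set and extend codes that do not collide with the core codes 1..|I|.

```python
import random


def item_extend(items, extend_codes, rng, max_picks=2):
    """Build a one-to-n mapping: one core code per item, then extend."""
    # Initialization: every item maps to its own core code 1..|I|.
    f = {x: {code} for code, x in enumerate(sorted(items), start=1)}
    for code in extend_codes:
        # Assumed policy: pick 1..max_picks existing mappings at random.
        k = rng.randint(1, min(max_picks, len(f)))
        for x in rng.sample(sorted(f), k):
            f[x].add(code)  # extend the picked mapping by this code
    return f
```

Since the core code of an item is never given to another item, every item keeps at least one code of its own regardless of which mappings are picked.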

30 Item Extend
• Every item must map to something unique
  - Say 1 is unique to f(A)
  - Then $\mathrm{supp}_T(A) = \mathrm{supp}_{T'}(1)$
• For a transaction t without item A
  - Add a subset of the unique mapping set to t’ with some probability
  - {1, 4} is the unique mapping set in f(A), so {}, {1}, {4}, or {1, 4} may be added
• Mapping f: A → 1, 4, 5; B → 2; C → 3, 5

31 Fake items again
• Now, every item in $t'_i$ must be in some mapping
• Randomly add some fake items from $F$ to each transaction
• Mapping $f: I \to I' \cup IE \cup F$
  - $I'$: core “unique” items
  - $IE$: expanding items
  - $F$: fake items

32 Basic transformation framework
• For each transaction t
  - For each item x in t: add f(x) to t’
  - For each item i in I − t: add a random subset of the unique mapping set of f(i) to t’
  - For each fake item in F: toss a biased coin and add the item to t’ on heads (the probability should differ from item to item)
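The three steps of the framework can be sketched for a single transaction as follows. The slides do not fix how the random subset of a unique mapping set is drawn; including each unique code independently with probability `noise_prob` is an assumption, as are all names. `fake_probs` maps each fake item to its own coin bias.

```python
import random


def unique_codes(f, x):
    """Codes in f(x) that appear in no other item's image."""
    others = set().union(*(img for y, img in f.items() if y != x))
    return f[x] - others


def transform_transaction(t, f, fake_probs, noise_prob, rng):
    out = set()
    # Step 1: items present in t contribute their full image.
    for x in t:
        out |= f[x]
    # Step 2: absent items contribute a random subset of their unique codes
    # (assumed policy: each code independently with probability noise_prob).
    for i in sorted(set(f) - set(t)):
        for code in sorted(unique_codes(f, i)):
            if rng.random() < noise_prob:
                out.add(code)
    # Step 3: each fake item is added by its own biased coin.
    for fake, prob in sorted(fake_probs.items()):
        if rng.random() < prob:
            out.add(fake)
    return out
```

Setting `noise_prob` and every fake probability to 0 reduces the sketch to the plain one-to-n mapping.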

33 Recovering association rules
• Given an encrypted rule $r': X \Rightarrow Y$ in $AR'$
• If there exist $i_1, i_2, \ldots, i_m \in I$ with $\bigcup_{k=1}^{m} f(i_k) = X$
• and there exist $j_1, j_2, \ldots, j_n \in I$ with $\bigcup_{k=1}^{n} f(j_k) = X \cup Y$
• then $r: \{i_1, i_2, \ldots, i_m\} \Rightarrow \{j_1, j_2, \ldots, j_n\} \setminus \{i_1, i_2, \ldots, i_m\}$ is a rule in $AR$
• Otherwise, the rule is not correct

34 Example
• Mapping f: A → 1, 4, 5; B → 2; C → 3, 5
• Given
  - 1 => 4 (rejected)
  - 2 => 1, 5 (rejected)
  - 2 => 1, 3, 5 (rejected)
  - 2 => 1, 3, 4, 5 (B => AC)
  - 2, 3, 5 => 1, 4 (BC => A)
• {2, 3, 5} decodes to BC
• {1, 2, 3, 4, 5} decodes to ABC
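The recovery test can be sketched as follows. To decode a set of codes, the sketch collects every item whose whole image lies inside the set and accepts only if those images cover the set exactly; this particular search strategy and the function names are assumptions, not taken from the slides. The mapping is the one used in this example.

```python
F_MAP = {"A": {1, 4, 5}, "B": {2}, "C": {3, 5}}


def decode_itemset(codes, f):
    """Return the items whose images union to exactly `codes`, else None."""
    items = {x for x, img in f.items() if img <= codes}
    covered = set().union(*(f[x] for x in items))
    return items if items and covered == codes else None


def recover_rule(lhs, rhs, f):
    """Decode X => Y into an original rule, or None if it must be rejected."""
    x_items = decode_itemset(set(lhs), f)
    xy_items = decode_itemset(set(lhs) | set(rhs), f)
    if x_items is None or xy_items is None:
        return None
    return x_items, xy_items - x_items
```

For instance, `recover_rule({2}, {1, 5}, F_MAP)` is rejected because no combination of item images equals {1, 2, 5}.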

35 Correctness
• Proposition
  - For any items $x$, $y$, where $f$ is the transformation mapping:
    $\mathrm{supp}_T(x) = \mathrm{supp}_{T'}(f(x))$ and $\mathrm{supp}_T(x \cup y) = \mathrm{supp}_{T'}(f(x) \cup f(y))$
  - For any itemsets $X$, $Y$, where $F$ is the transformation mapping:
    $\mathrm{supp}_T(X) = \mathrm{supp}_{T'}(F(X))$ and $\mathrm{supp}_T(X \cup Y) = \mathrm{supp}_{T'}(F(X) \cup F(Y))$
• No false drops and no false positives

36 Summary
• Generation of mappings
  - One-to-n mappings
  - Item Extend
• Transformation of transactions
  - Mapping f(x)
  - Subsets of the unique mapping set
  - Fake items
• Recovering association rules
  - Reverse mappings and filtering

37 Test run
• # Items = 1k, |T| = 1k
• Without transformation
  - One rule
  - Time: 8s
• Item Extend
  - 147 rules
  - Total time: 26s
  - Mapping generation and transformation: 219ms

38 Future work
• Define the parameters of the problem
  - Size |IE|
  - Size |F|
• Give a clear measure of security
• Give a clear measure of overhead
• Correctness of association rules
  - Query execution proof
  - Result verification

39 The End

40 Choosing probability
• A uniform distribution, or any fixed distribution, gives patterns that may be easily identified
• Random probability distribution
  - {}: 70%, {1}: 5%, {4}: 15%, {1, 4}: 20%
  - Storage: needs additional storage

41 Algorithm for transformation
• Transformation is the most costly process
• Execution time is linear in the database size |T|
• Should be as fast as possible

42 Optimization
• Mapping retrieval
  - For an item x, use a hash table to retrieve the mapping, h(x)
• Adding fake items
  - First randomly determine the number of items to add (according to the probability of adding items)
  - Then randomly pick that many items from the set (non-uniform distribution)
  - Gives a much shorter runtime on average

43 Choice of mapped items
• Candidate codes: 1, 2, …, $(|I| + |IE| + |F|) \cdot (1 + \delta)$
• Acceptable as long as it is not easy to identify $I'$, $IE$, $F$
• One way is to use a random permutation of the first $|I| + |IE| + |F|$ natural numbers
  - The first $|I|$ numbers are mapped to $I'$
  - The next $|IE|$ numbers are $IE$
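The random-permutation idea can be sketched as below: shuffle the first |I| + |IE| + |F| natural numbers and cut the result into three pools. Giving the remaining numbers to F is an assumption (the slide names only the first two pools), and the function and parameter names are illustrative.

```python
import random


def assign_code_pools(n_items, n_extend, n_fake, rng):
    """Split a random permutation of 1..n into pools for I', IE and F."""
    total = n_items + n_extend + n_fake
    codes = list(range(1, total + 1))
    rng.shuffle(codes)
    core = codes[:n_items]                      # first |I| numbers: I'
    extend = codes[n_items:n_items + n_extend]  # next |IE| numbers: IE
    fake = codes[n_items + n_extend:]           # assumed: the rest is F
    return core, extend, fake
```

After the shuffle, the numeric value of a code says nothing about which pool it belongs to.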

44 Cut and paste randomization
• One case of select-a-size randomization
• The way to select j
  - Given an integer $K_m > 0$
  - Randomly choose $j$ in $[0, K_m]$
  - If $j > m$, set $j = m$
• Overall input parameters
  - $K_m$
  - $p_m$
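The clipping rule for j induces a specific distribution over [0, m]. Assuming j is drawn uniformly from [0, K_m] before clipping (the slide says only "randomly"), that distribution can be tabulated exactly, as in this sketch with an illustrative function name.

```python
from fractions import Fraction


def cut_and_paste_weights(m, k_m):
    """P(j = 0), ..., P(j = m) when j is uniform on [0, k_m], then clipped to m."""
    weights = [Fraction(0)] * (m + 1)
    for j in range(k_m + 1):
        # Every draw above m is folded onto m.
        weights[min(j, m)] += Fraction(1, k_m + 1)
    return weights
```

For example, with m = 2 and K_m = 4 the draws 2, 3 and 4 all end up as j = 2, so keeping both items has probability 3/5.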

45 Effects on support
• Support of A in T’
  - A in t, and A is not replaced
  - A’ in t (A absent), and A is randomly added
• Support of AB in T’
  - AB in t, and neither A nor B is replaced
  - AB’ in t, and B is randomly added
  - A’B in t, and A is randomly added
  - A’B’ in t, and both A and B are randomly added

46 Estimating original support
• Let $x$ be the support of A in T and $y$ the support of A in T’
  - $x \cdot P(\text{A remains in original transaction}) + (|DB| - x) \cdot p_m = y$
• Support of AB in T, estimated from
  - Support of AB in T’
  - Support of AB’, A’B in T’
  - Support of A’B’ in T’
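For a single item, the equation on this slide is linear in $x$ and can be solved directly: writing $q$ for $P(\text{A remains in original transaction})$, it gives $x = (y - |DB| \cdot p_m) / (q - p_m)$. The sketch below does exactly that; the function name and the parameter name `p_keep` (for $q$) are illustrative assumptions.

```python
def estimate_original_support(y, db_size, p_keep, p_m):
    """Solve x * p_keep + (db_size - x) * p_m = y for x."""
    if p_keep == p_m:
        # The observed support carries no information about x in this case.
        raise ValueError("p_keep and p_m must differ")
    return (y - db_size * p_m) / (p_keep - p_m)
```

As a check on the arithmetic, with |DB| = 1000, x = 200, q = 0.8 and p_m = 0.1 the expected observed support is 200 · 0.8 + 800 · 0.1 = 240, and solving with y = 240 returns 200 up to floating-point rounding.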

47 Apriori property
• Suppose $m = 2$ for all $t$ in $T$
• $|T| = 10$, $I = \{A, B\}$
• $p_m = 0$, $j = 1$
• Support of B in T’
  - $\mathrm{supp}_{T'}(B) = 0$
  - $E(\mathrm{supp}_T(B)) = 0$
• $\mathrm{supp}_{T'}(A) = 10$
• $\mathrm{supp}_{T'}(AB) = 0$
• $E(\mathrm{supp}_T(AB)) = \mathrm{supp}_{T'}(A) \cdot 1 = 10$

48 Apriori property
• An itemset that is expected to be large may have a subset that is expected to be small
• But generally the support of the subsets is not too small
• Instead of using the support threshold to filter out all small candidates, use a smaller value

49 Apriori algorithm
• Generate candidate sets
• Scan the database for counts
• Recover the predicted support
• Discard candidates with support smaller than the candidate limit
• Save for output the candidates with support >= the support threshold
• Apriori_gen(remaining candidates)

50 Candidate limit
• A high value
  - Increases the number of false drops
  - Poor correctness
• A small value
  - Increases the number of candidate sets
  - High running time
• Experiment
  - Support threshold: $s_{min}$
  - Estimated s.d.: $\delta$
  - $s_{min} - \delta$ is found to be a good value
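The two-threshold filtering step can be sketched as follows, using $s_{min} - \delta$ as the candidate limit. The function name, the dictionary representation of estimated supports, and the toy numbers in the usage note are assumptions made for illustration.

```python
def filter_candidates(estimated_support, s_min, delta):
    """Split candidates into those kept for the next pass and those output.

    estimated_support maps each candidate itemset to its recovered support.
    """
    candidate_limit = s_min - delta
    # Kept for candidate generation: anything not below the relaxed limit.
    survivors = {c for c, s in estimated_support.items() if s >= candidate_limit}
    # Reported as large itemsets: only those meeting the real threshold.
    output = {c for c in survivors if estimated_support[c] >= s_min}
    return survivors, output
```

With estimated supports of 120, 95 and 60, s_min = 100 and δ = 10, the first two candidates survive to the next generation step but only the first is output.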

51 Other applications
• Outsourced transaction database (secure) storage
• Outsourced association rule mining using data streams
• Secure distributed association rule mining with a third-party miner

52 Outsourced database with association rule mining service
[Diagram: DB, Transformer, Mappings, Transactions, Query, Association Rules]

