Presentation is loading. Please wait.

Presentation is loading. Please wait.

Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.

Similar presentations


Presentation on theme: "Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation."— Presentation transcript:

1 Privacy-preserving rule mining

2 Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation - publishing  Encryption - outsourcing Distributed multiparty  Cryptographic protocols – collaborative mining  Hiding sensitive rules

3 Association Rule Mining  Transactional datasets A transaction t = {a,b,c,…} a,b,c,… are called items The length of transaction = # of items Transaction length can vary  Equivalent representation The set of all items : I A transaction t can be transformed to a boolean vector in length of |I|.

4 Association Rule Mining  Rule mining Goal: find the frequent itemset  Some itemset, e.g.{a,b,c}, appears frequently, higher than certain support.  Rules can be derived from the itemset: A  B {a,b,c} is frequent, then ab  c, a  bc, … Metrics  Support = # of occurrences of itemset/ total # of transactions i.e., prob (A,B)  Confidence = # of occurrences of itemset/# of occurrences of left (of the rule) I.e. the conditional prob: Pr (B|A)

5  Example E.g. a,b appears 100 times together, while abc appears 50 times together in total 5000 transactions  Support of abc = 50/5000 = 0.01  Confidence of ab  c = 50/100 = 0.5

6 Algorithms  Apriori observation: if itemset A is a part of B, then support(A) >= support (B) Steps in finding frequent itemsets:  Starting from single-item set, pruning the itemsets that have support < threshold  When we have a set of k-itemsets, we expand it to k+1-itemsets, and check their supports.  Using data structures like hash tree to speed up the counting process

7  Algorithm Generating rules with confidence threshold  Confidence (A  B) = P(B|A) = support(AB)/support(A)

8 Centralized approaches  Two types of methods (Categorical data) Perturbation Encryption

9 Perturbation  Papers 1."Privacy Preserving Mining of Association Rules," SIGKDD 2002. 2."Maintaining data privacy in association rule mining”, VLDB 02 3.A framework for high-accuracy privacy- preserving mining, ICDE05

10  Basic ideas Consider a transaction as a boolean bit vector Perturb each bit with certain method  Paper 1: randomly select j items from t, then for rest of all items, with prob p to be selected  Paper 2: each bit has the prob p to be original, 1-p to be flipped  Paper 3: unify the methods with perturbation matrix

11 The key is…  After you perturb the data, you should still be able to find the supported rules correctly.  The accuracy is traded off by the intensity of perturbation (p)

12 Methods discovering the original support  Paper 1 using the correlation between “partial support” to find the original support  Concept of partial support  Prob of the length change of matched parts The size of t : m, the size of itemset A: k

13 Some results  Let s i be sup i (A) and s i ’ be sup i’ (A) The matrix P and D are defined with only p[l  l’] 1. 2. From 1, we can estimate the original support From 2, we can estimate the reliability (variance) of the support Estimation (which is related to perturbation rate p)

14 Privacy  Given an itemset A in perturbed transaction t’ What is the probability of an item a, really in the itemset A, i.e.,

15 Tradeoff between utility and privacy Lowest discoverable support: distinguishable from zero (consider the variance of support estimation)

16 Encryption method  Paper: Security in Outsourcing of Association Rule Mining, VLDB2007  Substitution encryption 1-1 substitution a  1, b  2, … 1-n substitution a  {1,10}, b  {2,11,12},…  Problem: 1-1 substitution is weak Arbitrary 1-n substitution does not work  Cannot recover original rules from the rules from the substituted items.

17 The basic idea  Fake items Original n items, additional m fake items  Define “admissible 1-n mapping” Arbitrary 1-n mapping may result in irreversible results  E.g., a  {1,2}, b  {2}, c  {3}  If we find frequent itemset {1,2,3} in the substituted set, {ac} or {abc}, which one is the right original itemset? Admissible 1-n mapping  For each mapping, there should be at least one unique substitute item in the mapped result, which does not appear in other mapping  E.g., a  {1,2}, b  {2}, c  {3} breaks the definition while a  {1,2}, b  {2,4}, c  {3} is admissible

18 Recovering rules  When we use admissible mappings We are able to reverse the discovered rules on substituted set. E.g., if we find {1,2,4} is a frequent set check all mappings: {1,2}  a, {2,4}  b  {1,2,4}  {ab}

19 cost  Additional cost Generating item mapping Generating transaction transformation  significant  Cost of rule mining Both the # of items and the average length of transaction is increased, thus the total cost will increase

20 Features of encryption method  Rules can be accurately recovered  A tradeoff between cost and privacy Privacy is better preserved with more fake items More fake items will result in higher additional cost.  The VLDB07 paper weaknesses No strong proof on security does not mention any attacks – there are possibly attacks on this scheme

21 Distributed datasets  Perturbation works, but existing encryption protocols do not work might need some protocols to make the encryption method applicable  Horizontally or vertically partitioned Paper1: "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," Paper2: "Privacy preserving association rule mining in vertically partitioned data," Shared features: using the cryptographic methods to construct protocols

22 Hiding sensitive rules  When publishing data for rule mining, the rule itself can be sensitive too.  Basic methods Decrease the support  Support = # of occurrences of itemset/ total # of transactions Decrease the confident  Confidence = # of occurrences of itemset/# of occurrences of left(rule)

23 discussion on rule hiding  Need sufficient amount of computational cost at the data owner side  You should know what rules are sensitive in advance! So only necessary for the case you have to share the data  When hiding sensitive rules, other rules might be damaged

24 Summary  Two methods for single party rule mining Perturbation and encryption  Distributed multiparty can use protocols and perturbation method  Hiding sensitive rules is also important in some cases


Download ppt "Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation."

Similar presentations


Ads by Google