Presentation is loading. Please wait.

Presentation is loading. Please wait.

SECURED OUTSOURCING OF FREQUENT ITEMSET MINING Hana Chih-Hua Tai Dept. of CSIE, National Taipei University.

Similar presentations


Presentation on theme: "SECURED OUTSOURCING OF FREQUENT ITEMSET MINING Hana Chih-Hua Tai Dept. of CSIE, National Taipei University."— Presentation transcript:

1 SECURED OUTSOURCING OF FREQUENT ITEMSET MINING Hana Chih-Hua Tai Dept. of CSIE, National Taipei University

2 OUTLINE Preliminary – Frequent ItemSet Mining Motivation Privacy Model – K-Support Anonymity Algorithm Performance Studies Conclusion 2

3 OUTLINE Preliminary – Frequent ItemSet Mining Motivation 3

4 FREQUENT ITEMSET MINING (FIM) Discover what happened frequently 4 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine When threshold set as 3 (=60%), {wine} and {cigar} are frequent. When threshold set as 2 (=40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.

5 FREQUENT ITEMSET MINING (FIM) Discover what happened frequently Frequent itemset mining (FIM) 5 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine When threshold set as 3 (=60%), {wine} and {cigar} are frequent. When threshold set as 2 (=40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.

6 THE NEEDS OF OUTSOURCING FIM For those who lack of expertise in FIM and/or computing resources, they have the need of outsourcing the mining tasks to a professional third party. 6 Data Owner Mining Services Provider (Cloud Computing)

7 THE NEEDS OF OUTSOURCING FIM For those who lack of expertise in FIM and/or computing resources, they have the need of outsourcing the mining tasks to a professional third party. 7 Data Owner Mining Services Provider (Cloud Computing) Privacy?!

8 THE RISKS OF OUTSOURCING FIM Encryption/decryption method is believed as the possible solution. 8 Mining Services Provider (Cloud Computing) Data Owner

9 THE RISKS OF OUTSOURCING FIM Encryption/decryption method is believed as the possible solution. 9 Mining Services Provider (Cloud Computing) Data Owner How to achieve the encryption and decryption? Privacy protected Correct mining results Reasonable overhead

10 THE RISKS OF OUTSOURCING FIM 10 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine Trans. IDItems 1a 2a, c 3c, d 4a, b, c 5a, b, d Encrypt

11 THE RISKS OF OUTSOURCING FIM Top frequency attack Wine is the most frequent item  ‘a’ is ‘wine’ Approximate support attack The support of cigar is about 55%~60%  ‘c’ is ‘cigar’ 11 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine Trans. IDItems 1a 2a, c 3c, d 4a, b, c 5a, b, d Encrypt

12 THE RISKS OF OUTSOURCING FIM Top frequency attack Wine is the most frequent item  ‘a’ is ‘wine’ Approximate support attack The support of cigar is about 55%~60%  ‘c’ is ‘cigar’ 12 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine Trans. IDItems 1a 2a, c 3c, d 4a, b, c 5a, b, d Encrypt The Risks of Outsourcing FIM The support information about the frequent itemsets can be utilized to effectively reveal the raw data as well as the sensitive information from the anonymized transactions. T. Mielik¨ainen. Privacy problems with anonymized transaction databases. In Proc. of Discovery Science, 2004. The support information about the frequent itemsets can be utilized to effectively reveal the raw data as well as the sensitive information from the anonymized transactions. T. Mielik¨ainen. Privacy problems with anonymized transaction databases. In Proc. of Discovery Science, 2004.

13 RELATED WORKS Encrypt each real items by a one-many mapping function. Wong, W. K., Cheung, D. W., Hung, E., Kao, B., Mamoulis, N.: Security in Outsourcing of Association Rule Mining. In: Proc. of VLDB, 2007. However, it does not try to anonymize the support information. Recently it is cracked. Molloy, I., Li, N., Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. In: Proc. of ICDM, 2009. 13

14 OUTLINE Preliminary – Frequent ItemSet Mining Motivation Privacy Model – K-Support Anonymity 14

15 K-SUPPORT ANONYMITY & ANONYMIZATION For every sensitive item, there are at least k-1 other items of the same support. The probability of an item being correctly re-identified is limited to 1/k, even when the precise support information is known. Given a transactional database T, encrypt T into E(T) such that There exist a decryption function D such that MiningResult(T, Δ)= D (MiningResult(E(T), Δ)), for any minimal support Δ. E(T) is k-support anonymous. 15

16 SOLUTION 1: A NAÏVE APPROACH For each set of real items of the same support, add enough fake items randomly into transactions to make the fake items as frequent as real ones. 16 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine For k = 3, 16 additional items are required. For k = 3, 16 additional items are required. 4 x 2 = 8 (e, f) for wine 3 x 2 = 6 (g, h) for cigar 2 x 1 = 2 (i) for beer and tea Items a, e, g, h, i a, c, e, f, h, i c, d, e, f, g a, b, c, f, h a, b, d, e, f, g

17 A NAÏVE SOLUTION For each set of real items of the same support, add enough fake items randomly into transactions to make the fake items as frequent as real ones. 17 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine For k = 3, 16 additional items are required. For k = 3, 16 additional items are required. 4 x 2 = 8 (e, f) for wine 3 x 2 = 6 (g, h) for cigar 2 x 1 = 2 (i) for beer and tea Items a, e, g, h, i a, c, e, f, h, i c, d, e, f, g a, b, c, f, h a, b, d, e, f, g There could be too large storage overhead when k is large.

18 GENERALIZED FIM Discover all frequent items across concept levels, given a taxonomy indicating the hierarchical concepts between items 18 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine beer tea alcoholic wine beverage all prod. cigar When threshold set as 3 (=60%), {wine}, {cigar}, {alcoholic}, {beverage} and {all prod.} are frequent. {beverage, cigar} are also frequent.

19 GENERALIZED FIM Discover all frequent items across concept levels, given a taxonomy indicating the hierarchical concepts between items 19 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine beer tea alcoholic wine beverage all prod. cigar When threshold set as 3 (=60%), {wine}, {cigar}, {alcoholic}, {beverage} and {all prod.} are frequent. {beverage, cigar} are also frequent. 1. The support of a parent node comes from the supports of it child nodes. 2. Only lead nodes need to appear in the transactions. 1. The support of a parent node comes from the supports of it child nodes. 2. Only lead nodes need to appear in the transactions.

20 OUTLINE Preliminary – Frequent ItemSet Mining Motivation Privacy Model – K-Support Anonymity Algorithm 20

21 ANONYMIZATION: OVERVIEW For storage efficiency, we suggest to convert FIM to GFIM. 21 Pseudo Taxonomy Generation in the Encryption Encrypt Transaction Data Transaction Data Frequent Itemsets Pseudo Taxonomy Transaction Data Encrypted Decrypt Frequent Itemsets Data OwnerThird Party Generalized Frequent Itemset Mining

22 wine {e, f, j} cigar {b, c, d} beer and tea {a, g, h} ANONYMIZATION: STORAGE EFFICIENCY In GFIM, items can be at multiple levels of a taxonomy and only the items at leaf level need to appear in the database. 22 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine Encrypt with k=3 4 additional items required a f f beer wine j j cigar e e b i k cd gh tea Trans. ID Items 1c, d, g 2b, d, g 3b, h 4a, b, c 5a, c, d, h 2 2 1 1 1 1

23 In GFIM, items can be at multiple levels of a taxonomy and only the items at leaf level need to appear in the database. wine {e, f, j} cigar {b, c, d} beer and tea {a, g, h} ANONYMIZATION: STORAGE EFFICIENCY 23 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine Encrypt with k=3 4 additional items required a f f beer wine j j cigar e e b i k cd gh tea Trans. ID Items 1c, d, g 2b, d, g 3b, h 4a, b, c 5a, c, d, h 2 2 1 1 1 1 Small storage overhead compared to the naïve method.

24 ANONYMIZATION: EASY DECRYPTION The real frequent itemsets can be obtained by filtering out patterns containing any fake item in 1 scan of the returned results. 24 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine min_sup = 2 a f beer wine j cigar e b i k cd gh tea Trans. IDItems 1c, d, g 2b, d, g 3b, h 4a, b, c 5a, c, d, h Results = {{beer}, {cigar}, {wine}, {tea}, {beer, wine}, {cigar, wine}} Results = {a, b, c, d, e, f, g, h, i, j, k, ac, af, bf, ce, …}

25 a f beer wine j cigar e b i k cd gh tea ANONYMIZATION: EASY DECRYPTION The real frequent itemsets can be obtained by filtering out patterns containing any fake item in 1 scan of the returned results. 25 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine min_sup = 2 Trans. IDItems 1c, d, g 2b, d, g 3b, h 4a, b, c 5a, c, d, h Results = {{beer}, {cigar}, {wine}, {tea}, {beer, wine}, {cigar, wine}} Results = {a, b, c, d, e, f, g, h, i, j, k, ac, af, bf, ce, …} The data owner can obtain the real results in 1 scan of the returned itemsets.

26 ANONYMIZATION: ENCRYPTION 26 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine Encrypt with k=3 a f beer wine j cigar e b i k cd gh tea Trans. ID Items 1c, d, g 2b, d, g 3b, h 4a, b, c 5a, c, d, h The problem is how to build the taxonomy and encrypt T for k-support anonymity.

27 ANONYMIZATION: ENCRYPTION 1: Generalization of the Mining Task To generate a pseudo taxonomy that can (a) conserve the correct and complete mining results, (b) facilitate k-support anonymization. 2: Anonymization with Taxonomy Tree To encrypt T for k-support anonymity with the help of the constructed taxonomy tree. 27

28 1: GENERALIZATION OF THE MINING TASK Build a k-bud tree of T All real items at the leaf level The number of nodes in three categories is equal to or greater than k Let x M denote the most frequent real item in T A > = { v | sup(v) > sup(x M ) and v is leaf}, A = = { v | sup(v) = sup(x M )}, and A < = { v | sup(v) < sup(x M ) < sup(u), where u is the parent node of v }. 2 4 (beer) (wine) 2 (cigar) 4 3 5 5 (tea) 3-bud tree Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 28

29 1: GENERALIZATION OF THE MINING TASK 29 beer cigar beer cigar wine tea 3 groups Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine

30 1: GENERALIZATION OF THE MINING TASK 30 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 4 (beer)(wine) 2 (cigar) 4 3 3 subtrees (tea)

31 1: GENERALIZATION OF THE MINING TASK 31 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 (tea)(beer) 2 4 (wine) (cigar) 4 3 5 Iteratively connect a subtree which sup(root) ≧ sup(wine) with the other subtree

32 1: GENERALIZATION OF THE MINING TASK 32 Trans. IDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 4 (beer) (wine) 2 (cigar) 4 3 5 5 (tea) 3 bud-tree

33 2: ANONYMIZATION WITH TAXONOMY TREE Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity Insertion Split Increase 33

34 2: ANONYMIZATION WITH TAXONOMY TREE Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity Insertion (Ex.) Split Increase 34 u p p v q q u v v p: the node with target support q: randomly select sup(p) – sup(v) transactions from T(u) – T(v) T(x) is the set of transactions containing the item x. sup(v) < target-sup < sup(u) sup(u) and sup(v) should not be changed.

35 TIDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 4 (beer) (wine) 2 (cigar) 4 3 5 5 (tea) 3-bud tree 4 4 5 22 (tea) insertion Items wine, p1 cigar, wine, p1 cigar, tea beer, cigar, wine beer, tea, wine p1 x y 35 2: ANONYMIZATION WITH TAXONOMY TREE For wine

36 2: ANONYMIZATION WITH TAXONOMY TREE Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity Insertion Split (Ex.) Increase 36 v q q p p v p: randomly select target-sup transactions from T(v) q: T(p) = T(v) – T(q) T(x) is the set of transactions containing the item x. target-sup < sup(v) sup(v) should not be change. Split operation can raise up leaf nodes to internal nodes!

37 TIDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 4 (beer) (wine) 2 (cigar) 4 3 5 5 (tea) 3-bud tree 4 4 5 22 (tea) 4 (wine) 5 5 3 3 1 insertion split Items wine, p1 cigar, wine, p1 cigar, tea beer, cigar, wine beer, tea, wine Items p1, p2 cigar, p1, p3 cigar, tea beer, cigar, p2 beer, tea, p2 p1 p2p3 x y 37 2: ANONYMIZATION WITH TAXONOMY TREE For wineFor cigar

38 2: ANONYMIZATION WITH TAXONOMY TREE Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity Insertion Split Increase (Ex.) 38 u v u v v sup(v) < target-sup randomly select target-sup – sup(v) transactions from T(u) – T(v) sup(v) should not be changed. So, Increase operation is applicable only on node that does not belong to any anonymous group!

39 TIDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 4 (beer) (wine) 2 (cigar) 4 3 5 5 (tea) 3-bud tree 4 4 5 22 (tea) 4 (wine) 5 5 3 3 1 insertion 4 (wine) 5 5 3 3 split increase Items wine, p1 cigar, wine, p1 cigar, tea beer, cigar, wine beer, tea, wine Items p1, p2 cigar, p1, p3 cigar, tea beer, cigar, p2 beer, tea, p2 Items p1, p2, p3 cigar, p1, p3 cigar, tea beer, cigar, p2 beer, tea, p2, p3 p1 p2p3 x y 39 2: ANONYMIZATION WITH TAXONOMY TREE For wineFor cigar

40 TIDItems 1wine 2cigar, wine 3cigar, tea 4beer, cigar, wine 5beer, tea, wine 2 4 (beer) (wine) 2 (cigar) 4 3 5 5 (tea) 3-bud tree TIDItems 1c, d, g 2b, d, g 3b, h 4a, b, c 5a, c, d, h 3-support anonymity 4 4 5 22 (tea) 4 (wine) 5 5 3 3 1 insertion 2 4 a (beer) f (wine) 4 b (cigar) 4 3 5 5 33 22 h (tea) cd g e ij k 4 (wine) 5 5 3 3 split increase Items wine, p1 cigar, wine, p1 cigar, tea beer, cigar, wine beer, tea, wine Items p1, p2 cigar, p1, p3 cigar, tea beer, cigar, p2 beer, tea, p2 Items p1, p2, p3 cigar, p1, p3 cigar, tea beer, cigar, p2 beer, tea, p2, p3 p1 p2p3 x y 40 2: ANONYMIZATION WITH TAXONOMY TREE For wineFor cigar

41 OUTLINE Preliminary – Frequent ItemSet Mining Motivation Privacy Model – K-Support Anonymity Algorithm Performance Studies Conclusion 41

42 PERFORMANCE STUDIES Data sets Retail dataset 88162 transactions with 2117 different items T10I1kD100k dataset 100k transactions with 1000 different items Security Against precise item support attacks Against precise itemset support attacks Storage overhead Execution efficiency 42

43 SECURITY Against precise item support attacks Item accuracy: The ratio of items being re-identified DB accuracy: The avg. ratio of items in a transaction being re-identified 43 (a) Retail dataset(b) T10I1kD100k dataset

44 SECURITY Against precise itemset support attacks Item accuracy: The ratio of items being re-identified DB accuracy: The avg. ratio of items in a transaction being re-identified 44 (a) Retail dataset(b) T10I1kD100k dataset

45 STORAGE OVERHEAD & EXECUTION EFFICIENCY 45 (a) Retail dataset(b) T10I1kD100k dataset(a) Retail dataset(b) T10I1kD100k dataset

46 SUMMARY We proposed k-support anonymity to enhance the privacy protection in outsourcing of frequent itemset mining (FIM). For storage efficiency, we transformed FIM to GFIM, and proposed a taxonomy-based anonymization algorithm. Our method allows the data owner to obtain the real frequent itemsets in 1 scan of the returned results. Experimental results on both real and synthetic data sets showed that our method can achieve very good privacy protection with moderate storage overhead. 46

47 Q & A


Download ppt "SECURED OUTSOURCING OF FREQUENT ITEMSET MINING Hana Chih-Hua Tai Dept. of CSIE, National Taipei University."

Similar presentations


Ads by Google