Secure Incremental Maintenance of Distributed Association Rules.


Agenda
  Introduction
  Secure Technologies
  Problem Definition
  Our algorithm
  Experiments
  Conclusions

Introduction
  Association Rules
    – A means to identify patterns and trends
  Secure Distributed Association Rules
    – Privacy is a concern
    – Usage of some information is restricted
  Maintenance of environment
    – Association rules with more sites
    – Use past results to reduce workload

Secure Data Mining
  Approach 1: Data Obfuscation
    – Association rules from modified data
    – Simple algorithms but may get false rules
  Approach 2: Secure Protocols
    – Complex communication
    – Difficult and costly algorithms but get accurate rules
    – Balance between cost and privacy

Secure Technologies
  Secure Sum
    – There are n sites
    – Each site holds a private number
    – Compute the sum of a group of sites
  Secure Union
    – There are n sites
    – Each site holds a private set of items
    – Compute the union of sets

Secure Sum Example
  [Figure: Site 1, Site 2, ... pass a running sum masked by a random number R; upper bound 40, all arithmetic mod 40]
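The Secure Sum idea can be sketched in a few lines. This is a minimal single-process simulation, not the presenters' implementation: the function name `secure_sum`, the `rng` parameter, and the sample values are assumptions made for illustration, and the true total is assumed to be smaller than the modulus (the "upper bound").

```python
import random


def secure_sum(private_values, modulus, rng=random):
    """Simulate a ring-based secure sum (illustrative sketch).

    Assumes 0 <= sum(private_values) < modulus.
    """
    # The first site draws a random mask R that only it knows.
    mask = rng.randrange(modulus)
    running = mask
    # Each site adds its private value to the masked running total
    # and passes the result on; it only ever sees a masked number.
    for value in private_values:
        running = (running + value) % modulus
    # Back at the first site, the mask is removed to obtain the sum.
    return (running - mask) % modulus


total = secure_sum([7, 13, 9], modulus=40)
```

Because every intermediate total is shifted by a uniformly random mask modulo the upper bound, a semi-honest site learns nothing about the others' values from the number it receives, provided sites do not collude.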

Secure Technologies
  Secure Comparison
    – Two sites
    – One site holds a number a, the other holds a number b
    – Check whether a >= b without letting anyone know the values of a and b

Problem Definition
  There are n old sites
    – The association rules over these sites are already known
  There are r new sites
    – The association rules must be updated for the new environment
  Privacy must be maintained as well

Privacy?
  What to protect?
  Different requirements in different situations
  Basic requirements
    – Protect individual transactions
    – Protect individual site information: local large itemsets, counts for itemsets
  Secure Multi-party computation
    – The process reveals no useful information beyond what can be derived from a site's own input and the final result

Algorithms
  Secure Incremental Maintenance of Distributed Association Rules (SIMDAR)
    – Mining association rules with the basic privacy level
  More Secure Incremental Maintenance of Distributed Association Rules (MSIMDAR)
    – Mining association rules under the definition of Secure Multi-party computation

SIMDAR: What do we know? (Assumptions)
  The original large itemsets L_k are available
  The total count for each old large itemset is known
  All sites follow a semi-honest model
    – They follow the rules, but may try to guess others' information from the data they receive (intermediate messages)
  No collusion among any sites
    – Sites do not exchange intermediate information

Algorithm - SIMDAR
  To find the large itemsets
    – Generate the candidate sets
    – Count the candidates
    – Sum the counts
    – Check for large itemsets
  Check if an association rule holds
    – Easy with counts available

Generate the candidates
  C_1 = I
  For C_k,
    – Each new site generates its own candidate set from its own (k-1)-itemsets that are locally large and globally large
  Secure Union to find the candidate sets from the new sites
  Union with L_k
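The candidate-generation step can be sketched as follows. This is a hedged illustration only: the helper names (`apriori_gen`, `global_candidates`) are hypothetical, a plain set union stands in for the Secure Union protocol, and the slide's wording is read as "each new site uses the (k-1)-itemsets that are both locally large at that site and globally large", which is an assumption. The base case C_1 = I is not covered.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Standard Apriori join-and-prune on a set of (k-1)-itemsets."""
    large_prev = {frozenset(s) for s in large_prev}
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            union = a | b
            # Keep a k-itemset only if all of its (k-1)-subsets are large.
            if len(union) == k and all(
                frozenset(sub) in large_prev
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def global_candidates(site_local_large, global_large_prev, old_large_k):
    """Build C_k from the new sites' candidates plus the old L_k."""
    global_prev = {frozenset(s) for s in global_large_prev}
    per_site = [
        apriori_gen({frozenset(s) for s in local} & global_prev)
        for local in site_local_large
    ]
    # Plain union used here as a stand-in for Secure Union.
    merged = set().union(*per_site)
    return merged | {frozenset(s) for s in old_large_k}


c2 = global_candidates(
    site_local_large=[[{"a"}, {"b"}], [{"b"}, {"c"}]],
    global_large_prev=[{"a"}, {"b"}, {"c"}],
    old_large_k=[{"a", "c"}],
)
```

In this toy run the first new site proposes {a, b}, the second proposes {b, c}, and {a, c} enters the candidate set only because it was already large in the old result.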

Summing on candidates
  Partition into 2 groups
    – P_k: in L_k
    – Q_k: not in L_k
  For P_k, the original count is already known, so we only add the counts from the new sites using Secure Sum (no scan on old sites)

Summing Count for Q_k
  First sum the counts over the new sites
  If the itemset is large in the new sites, send it to the old sites for a scan
  Otherwise, prune it: an itemset that is large in neither the old sites nor the new sites cannot be large overall
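The P_k / Q_k counting logic can be sketched as below. It is a simplified, non-private simulation: the function name, the parameters, and the `scan_old_sites` callback are hypothetical, Python's `sum` stands in for Secure Sum, and the support threshold `s` is given as a fraction.

```python
def update_large_itemsets(candidates, old_counts, new_site_counts,
                          scan_old_sites, s, old_size, new_size):
    """Return {itemset: total count} for itemsets large over all sites.

    old_counts      : counts of the old large itemsets L_k (already known)
    new_site_counts : one {itemset: count} dict per new site
    scan_old_sites  : callback that counts an itemset at the old sites
    """
    large = {}
    for x in candidates:
        # Stand-in for Secure Sum over the new sites.
        new_total = sum(site.get(x, 0) for site in new_site_counts)
        if x in old_counts:
            # P_k: the old count is known, no scan of the old sites.
            total = old_counts[x] + new_total
        elif new_total >= s * new_size:
            # Q_k and large in the new sites: the old sites must scan.
            total = scan_old_sites(x) + new_total
        else:
            # Q_k and not large in the new sites: prune.
            continue
        if total >= s * (old_size + new_size):
            large[x] = total
    return large


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
scanned = []


def scan_old_sites(itemset):
    scanned.append(itemset)
    return {b: 4}.get(itemset, 0)


result = update_large_itemsets(
    candidates=[a, b, c],
    old_counts={a: 6},
    new_site_counts=[{a: 2, b: 4}, {a: 3, b: 3, c: 1}],
    scan_old_sites=scan_old_sites,
    s=0.5, old_size=10, new_size=10,
)
```

Only b triggers a scan of the old sites here: a is resolved from the stored count, and c is pruned because it is not large in the new sites.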

Information Protected by SIMDAR
  Individual transactions
    – We never access the individual transactions of other sites
  Large itemsets of a specific site
    – They are input to Secure Union
  Count of each itemset on each site
    – They are input to Secure Sum

MSIMDAR: for a higher privacy level
  Final result: global association rules
  Input: site database
  Other information should be protected
  Cannot reveal large itemsets?
    – Costly checking
    – We treat the large itemsets as part of the result

MSIMDAR
  Target: global large itemsets and association rules
  Useful information revealed by SIMDAR
    – Total counts of itemsets
    – Original large itemsets, revealed to the new sites
    – New candidates at the new sites, revealed to the old sites
  Add fake itemsets to hide the actually supported itemsets

MSIMDAR
  Hiding the total count of an itemset
    – Do we really need to find out the total count?
  Protect the large itemsets of the original results
    – Use a more complex protocol

MSIMDAR – Adding
  Total excess count
    – X.excess = X.count – s% |DB|
  Instead of summing X.count_i, we sum the excess counts X.excess_i
    – Even if revealed, the excess discloses neither the count nor the database size
  Checking for large itemsets after Secure Sum
    – S_a (the first site) holds the random key R_x
    – S_b (the last site) holds (X.count – s% |DB| + R_x)
    – Secure Comparison between S_a and S_b
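The excess-count check can be sketched as follows. This is an illustration under stated assumptions, not the presenters' protocol: the names are hypothetical, the excess is scaled by 100 so that a percentage threshold stays in integer arithmetic, the masking is done without a modulus for readability, and an ordinary `>=` stands in for the Secure Comparison between S_a and S_b.

```python
import random


def scaled_excess(count, db_size, s_percent):
    """100 * (count - s% * |DB|), kept as an integer."""
    return 100 * count - s_percent * db_size


def is_globally_large(site_counts, site_db_sizes, s_percent, rng=random):
    """Decide whether an itemset is large without exposing its total count."""
    # S_a (the first site) draws the random key R_x and keeps it.
    r_x = rng.randrange(10**9)
    running = r_x
    # Each site adds its local excess to the masked running total.
    for count, db_size in zip(site_counts, site_db_sizes):
        running += scaled_excess(count, db_size, s_percent)
    # S_b (the last site) now holds total excess + R_x.
    # X is large exactly when the total excess is non-negative,
    # i.e. when S_b's value is at least S_a's key.
    return running >= r_x  # stand-in for Secure Comparison


large = is_globally_large([30, 5], [100, 100], s_percent=15)
small = is_globally_large([10, 5], [100, 100], s_percent=15)
edge = is_globally_large([20, 10], [100, 100], s_percent=15)
```

The comparison yields only one bit (large or not), which is the reason for summing excess counts rather than raw counts.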

Storage
  The counting information can be reused, and it will be needed in the future
    – Checking for association rules requires counting information
    – Prepare for the next update

Storage
  Commonly used method
    – Each site holds its own information: the count for each itemset and the database size
    – The total count must be recalculated each time

Storage
  We first sum the total database size |DB| using Secure Sum
    – S_u (the first site) holds the key of the secure sum, R_t
    – S_v (the last site) gets the sum |DB| + R_t
  For each itemset X, we also store
    – The protecting key R_x
    – The protected excess count X.excess + R_x

Reusing the count
  Checking association rules
    – A.count – c% B.count > 0
  Can be derived from six stored numbers
    – N1 + (-1)N2 + (-c%)N3 + (c%)N4 + (1-c%)s%N5 + (c%-1)s%N6
    – N1 = A.excess + R_a
    – N2 = R_a
    – N3 = B.excess + R_b
    – N4 = R_b
    – N5 = |DB| + R_t
    – N6 = R_t
    – This follows from X.count = X.excess + s% |DB|, with |DB| = N5 – N6
  Secure sum and secure comparison
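The six-number combination can be checked numerically. The sketch below uses the coefficients obtained by substituting X.count = X.excess + s% |DB| into A.count – c% B.count, which gives (N1 – N2) – c%(N3 – N4) + (1 – c%) s% (N5 – N6). The function names and sample figures are illustrative assumptions, and exact fractions are used so that no rounding hides an error.

```python
from fractions import Fraction
import random


def rule_margin(n1, n2, n3, n4, n5, n6, s, c):
    """A.count - c * B.count rebuilt from the six stored numbers."""
    return (n1 - n2) - c * (n3 - n4) + (1 - c) * s * (n5 - n6)


def stored_numbers(a_count, b_count, db_size, s, rng=random):
    """Produce N1..N6 with fresh random keys R_a, R_b, R_t."""
    r_a, r_b, r_t = (rng.randrange(10**6) for _ in range(3))
    a_excess = a_count - s * db_size
    b_excess = b_count - s * db_size
    return (a_excess + r_a, r_a, b_excess + r_b, r_b, db_size + r_t, r_t)


s = Fraction(10, 100)   # support threshold s% = 10%
c = Fraction(70, 100)   # confidence threshold c% = 70%
numbers = stored_numbers(a_count=40, b_count=50, db_size=200, s=s)
margin = rule_margin(*numbers, s=s, c=c)
```

With these figures the margin is 40 – 0.7 × 50 = 5, so the rule holds. In the protocol itself the six numbers sit at different sites, which is why the slide calls for Secure Sum and Secure Comparison rather than a direct evaluation like this one.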

Avoiding new sites knowing past results
  Generating the candidates is similar, except that an old site joins the Secure Union process
  For counting, two old sites join
  Define:
    – P_k = L_k intersect C_k
    – Q_k = C_k – P_k
    – Note that the new sites should not be able to distinguish P_k from Q_k

Adding counts in new site

[Diagram: Adding for P_k. Old sites A and B, new sites; labels: Random Key, Sum, Protected excess, Secure Compare A]

[Diagram: Adding for Q_k. Old sites A and B, new sites; labels: 0, Sum, 0, Secure Compare A]

New site pruning
  The new sites send the count to an old site to continue
  For P_k, we get the final excess count
    – The comparison tells whether the itemset is large over all sites
  For Q_k, we get the excess count in the new sites
    – The comparison tells whether the itemset is large in the new sites

Experiments
  3 programs
    – With privacy but no maintenance (SEC)
    – No privacy but maintenance (MAN)
    – With privacy and maintenance (MSIMDAR)
  Environment
    – P4 1.7GHz under Linux
    – Each site is simulated by an individual computer
  Measure
    – CPU time

[Figures: experimental results by DB size, by support, and by ratio]

Analysis
  Processing time at the new sites is much longer
    – About 3 to 5 times that of the old sites
  Cost overhead due to the secure algorithm
    – At old sites, on average 10% of the total cost
    – At new sites, on average 6% of the total cost
    – Both shares decrease as the database size increases

Conclusion
  We have proposed algorithms to solve the maintenance problem at different privacy levels
    – All give a more efficient solution than simply ignoring the past results
  As the number of sites is most likely to increase
    – The load on old sites will be low relative to new sites
    – High entrance cost but low maintenance cost

End