Privacy-Preserving Databases and Data Mining Yücel SAYGIN

Privacy and data mining. There are two aspects of data mining when we look at it from a privacy perspective: being able to mine the data without seeing the actual data, and protecting the privacy of people against the misuse of data.

How can we protect sensitive knowledge against data mining? The types of sensitive knowledge that could be extracted via data mining techniques are: patterns (association rules, sequences), clusters that describe the data, and classification models for prediction.

Association Rule Hiding. Large amounts of customer transaction data are collected by supermarket chains to find association rules in customer buying patterns; a lot of research has been conducted on finding association rules efficiently, and tools have been developed. Association rule mining is deterministic for given support and confidence thresholds, so association rules are a good starting point.

Motivating examples: sniffing out Prozac users.

Association Rule Hiding. Rules have the form "Body → Head". Ex1: "Diapers → Beer". Ex2: "Internetworking with TCP/IP" → "Interconnections: bridges, routers, ...". Rule parameters: (support, confidence). Minimum support and confidence thresholds are used to prune the non-significant rules.

[Example transaction database: min. support 50%, min. confidence 70%]

Algorithms for Rule Hiding. What we try to achieve: let D be the source database, let R be the set of significant association rules mined from D with certain thresholds, and let r_i be a sensitive rule in R. Transform D into D' so that all rules in R can still be mined from D' except r_i. It has been proven that optimal hiding of association rules with minimal side effects is NP-hard.

Heuristic Methods. We developed heuristics to deal with the problem. Different techniques are implemented based on: modifying the database by inserting false data or by removing some data, or inserting unknown values to fuzzify the rules.

Basic Approach for Rule Hiding. Reduce the support of confidential rules, or reduce their confidence; this way mining tools are prevented from discovering these rules. The challenge is data quality. Our metric for data quality is the number of rules that can still be mined and the number of rules that appear as a side effect. We developed heuristic algorithms to minimize the newly appearing rules and to minimize the accidentally hidden rules.

Basics of the heuristic algorithms. If we want to remove an item from a transaction to reduce the support or the confidence: which item should we start from, and which transaction should we choose to hide the selected item? We can either select an item and a transaction in round-robin fashion, i.e., select the next item from the next transaction that supports it and then move to another item and another transaction, or select the item that will probably have the minimal impact on the other rules.

Basics of rule hiding. conf(X => Y) = sup(XY) / sup(X). Decreasing the confidence of a rule can be done by increasing the support of X in transactions not supporting Y, or by decreasing the support of Y in transactions supporting both X and Y. Decreasing the support of a rule can be done by decreasing the support of the corresponding large itemset XY.
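
To make the arithmetic concrete, here is a minimal sketch (the toy transactions and item names are invented for illustration) that computes sup(XY), sup(X), and conf(X => Y):

```python
# Minimal sketch: support and confidence of a rule X => Y over a toy
# transaction database (the data below is illustrative only).

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(body, head, transactions):
    """conf(X => Y) = sup(XY) / sup(X)."""
    return support(body | head, transactions) / support(body, transactions)

transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

X, Y = {"A", "B"}, {"C"}
print("sup(ABC) =", support(X | Y, transactions))          # 0.4
print("sup(AB)  =", support(X, transactions))               # 0.6
print("conf(AB => C) =", confidence(X, Y, transactions))    # ~0.67
```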

[Example transaction database: min. support 20%, min. confidence 80%]

Hiding AB->C by increasing support of AB

Hiding AB->C by decreasing support of ABC

Hiding AB->C by decreasing the support of C
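
The three hiding strategies illustrated above can be sketched as simple edits to a transaction database. Real heuristics choose which transaction and which item to touch so as to minimize side effects; the sketch below (hypothetical helper names, naively editing the first matching transaction) only shows the mechanics:

```python
# Sketch of the three strategies for hiding the rule AB => C.  Each function
# edits the first matching transaction; real heuristics pick the victim
# transaction/item that minimizes the impact on the other rules.

def decrease_support_of_ABC(transactions):
    """Drop A from a transaction supporting {A,B,C}: sup(ABC) and sup(AB)
    both fall, so the rule's support drops."""
    for t in transactions:
        if {"A", "B", "C"} <= t:
            t.discard("A")
            return

def increase_support_of_AB(transactions):
    """Add A and B to a transaction without C: sup(AB) rises while sup(ABC)
    stays put, so the rule's confidence drops."""
    for t in transactions:
        if "C" not in t:
            t.update({"A", "B"})
            return

def decrease_support_of_C(transactions):
    """Drop C from a transaction supporting {A,B,C}: sup(ABC) falls while
    sup(AB) stays put, so both support and confidence drop."""
    for t in transactions:
        if {"A", "B", "C"} <= t:
            t.discard("C")
            return

db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"D"}]
decrease_support_of_C(db)
print(db)   # the first transaction becomes {'A', 'B'}
```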

Rule Hiding by Fuzzification. In applications where publishing wrong data is not acceptable, unknown values may be inserted to blur the rules. When unknown values are inserted, support and confidence fall into a range instead of a fixed value. Similar heuristics for rule hiding can be employed to minimize the side effects.

Support and confidence become a range of values.
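
The range arises because each unknown can be read either as "item absent" (giving the minimum support) or as "item present" (giving the maximum). A minimal sketch, where the way a transaction records its fuzzified items is an assumption made for illustration:

```python
# Sketch: support becomes an interval [min_sup, max_sup] once unknowns are
# inserted.  Each transaction is a pair (items known present, items replaced
# by '?'); the representation is illustrative only.

def support_range(itemset, transactions):
    n = len(transactions)
    min_count = 0   # unknowns treated as "absent"
    max_count = 0   # unknowns treated as "present"
    for present, unknown in transactions:
        if itemset <= present:
            min_count += 1
            max_count += 1
        elif itemset <= present | unknown:
            max_count += 1
    return min_count / n, max_count / n

transactions = [
    ({"A", "B", "C"}, set()),    # fully known
    ({"A", "B"}, {"C"}),         # C fuzzified to '?'
    ({"A"}, {"B"}),
    ({"B", "C"}, set()),
]

print(support_range({"A", "B", "C"}, transactions))   # (0.25, 0.5)
```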

Classification models as a threat to privacy. Document classification for authorship identification: based on a database of documents and authors, assign the most probable author to a new document. This is a possible threat to privacy when the text needs to stay anonymous.

The fact that each author uses a characteristic frequency distribution over words and phrases helps us. Feature representation used: T: total number of tokens, V: total number of types (distinct words), C: total number of characters. Classify the document with a learning algorithm, and then try to perturb the classification.
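
A tiny sketch of those three counts on a toy string (the whitespace tokenizer is an assumption; real stylometric pipelines are more careful):

```python
# Sketch: the three stylometric features named on the slide, computed for a
# toy document using a simple whitespace tokenizer (an assumption).

def features(text):
    tokens = text.split()                       # T: total number of tokens
    types = set(t.lower() for t in tokens)      # V: total number of types (distinct words)
    chars = len(text)                           # C: total number of characters
    return {"T": len(tokens), "V": len(types), "C": chars}

print(features("the quick brown fox jumps over the lazy dog"))
# {'T': 9, 'V': 8, 'C': 43}
```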

Another Motivating Application. Suppose a set of attribute values is confidential and is therefore downgraded by inserting unknown values in place of the actual ones before release. Can someone build a classification model using the rest of the attributes to predict the hidden values?

Classification Models as a threat to privacy. How do we prevent a row from being classified as class C by perturbing the data? The main challenge is that (unlike association rule mining) the resulting classification models depend on the technique, the selected training data, and the pruning methodology. The purpose is to decrease the accuracy of the classification model. Approach: insert unknown values into selected attribute values in the rest of the database.

Mining the data without actually seeing it. Things that we need to consider are: the data type, the data mining technique, and the data distribution (centralized, or distributed vertically or horizontally).

Classification on perturbed data. Reference: Rakesh Agrawal and Ramakrishnan Srikant, "Privacy-Preserving Data Mining", SIGMOD 2000, Dallas, TX. They developed a technique for constructing a classification model on perturbed data. The data is assumed to be stored in a centralized database and outsourced to a third party for mining, so the confidential values need to be handled. The following slides are based on the slides by the authors of the paper above.

Reconstruction Problem. Original values x1, x2, ..., xn are drawn from an (unknown) probability distribution X. To hide these values, we add y1, y2, ..., yn drawn from a probability distribution Y. Given x1+y1, x2+y2, ..., xn+yn and the probability distribution of Y, estimate the probability distribution of X.

Intuition (reconstructing a single point): use Bayes' rule for density functions to estimate, from an observed value w = x + y and the known noise density f_Y, how likely each candidate original value is.

Reconstructing the Distribution. Combine the estimates of where each point came from, over all the points; this gives an estimate of the original distribution.

Reconstruction: Bootstrapping
f_X^0 := uniform distribution
j := 0  // iteration number
repeat
  f_X^{j+1}(a) := (Bayes' rule)
  j := j + 1
until (stopping criterion met)
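
A minimal, discretized sketch of this loop (all data, grid, and noise parameters below are invented for illustration). The Bayes'-rule step, standing in for the formula image missing from the transcript, follows the update in the Agrawal-Srikant paper: each observation w_i = x_i + y_i contributes a posterior over candidate values a proportional to f_Y(w_i - a) * f_X^j(a), and the new estimate f_X^{j+1} is the average of these per-observation posteriors.

```python
import numpy as np

# Discretized sketch of the reconstruction bootstrap.  w_i = x_i + y_i are the
# observed randomized values; f_Y is the (known) noise density; f_X is the
# estimate being refined.  Sample sizes, grid, and noise are illustrative.

rng = np.random.default_rng(0)
n = 10_000
x = rng.choice([20.0, 60.0], size=n, p=[0.3, 0.7])   # hidden "true" values
y = rng.normal(0.0, 10.0, size=n)                    # noise drawn from known Y
w = x + y                                            # what the miner actually sees

grid = np.linspace(-20, 100, 121)                    # candidate values a

def f_Y(v):                                          # Gaussian noise density
    return np.exp(-(v ** 2) / (2 * 10.0 ** 2)) / (10.0 * np.sqrt(2 * np.pi))

f_X = np.full_like(grid, 1.0 / len(grid))            # start from the uniform distribution
for _ in range(10):                                  # fixed iteration count as stopping rule
    lik = f_Y(w[:, None] - grid[None, :])            # f_Y(w_i - a) for every i and a
    post = lik * f_X[None, :]
    post /= post.sum(axis=1, keepdims=True)          # Bayes' rule, per observation
    f_X = post.mean(axis=0)                          # average over all observations
    f_X /= f_X.sum()                                 # keep it a distribution

print(grid[f_X.argmax()])   # the mode should sit near 60, the more likely true value
```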

Shown to work in experiments on large data sets.

Algorithms. “Global” algorithm: reconstruct each attribute once at the beginning. “By Class” algorithm: for each attribute, first split by class, then reconstruct separately for each class. See the SIGMOD 2000 paper for details.

Experimental Methodology. Compare accuracy against: Original (unperturbed data without randomization) and Randomized (perturbed data without making any corrections for randomization). Test data is not randomized. Synthetic data benchmark; training set of 100,000 records, split equally between the two classes.

Quantifying Privacy. Add a random value between -30 and +30 to age. If the randomized value is 60, we know with 90% confidence that the age is between 33 and 87. Interval width measures the amount of privacy. Example: (interval width 54) / (range of age 100) gives 54% privacy at 90% confidence.
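
A tiny sketch reproducing the arithmetic on the slide: with uniform noise in [-30, +30] the interval that holds the true age with 90% confidence is 0.9 x 60 = 54 wide, and dividing by the attribute range (assumed to be 100) gives 54% privacy.

```python
noise_half_width = 30          # noise uniform in [-30, +30]
confidence = 0.90
age_range = 100                # assumed range of the age attribute

interval_width = confidence * (2 * noise_half_width)   # 54
privacy = interval_width / age_range                   # 0.54
print(f"{interval_width:.0f}-wide interval -> {privacy:.0%} privacy at {confidence:.0%} confidence")
```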

Privacy Preserving Distributed Data Mining. Consider the case where data is distributed horizontally or vertically across multiple sites. Each site is autonomous and does not want to share its actual data. Let's consider the following scenario: multiple hospitals each have their own local database, and they would like to participate in a scientific study that analyzes the results of treatments for different patients. The privacy concern here is that a hospital would not like to share its knowledge unless the other sites also have it, to protect the privacy of itself and its operations. Another scenario: two bookstores would like to learn which books are sold together so that they can make offers to their customers (Amazon actually does that).

Case study: Association rules. How do we mine association rules from distributed sources while preserving the privacy of the data owners? The confidential information in this case is: the data itself, and the fact that a local site supports a rule with a certain confidence and support (no company wants to lose competitive advantage, and would not like to reveal anything if it will not benefit from the release of the data). Privacy-preserving distributed association rule mining methods build on distributed rule mining techniques.

Distributed rule mining. We know how rules are mined from centralized databases; the distributed scenario is similar. Consider that we have only two sites, S1 and S2, which have databases D1 (with 3 transactions) and D2 (with 5 transactions).

Distributed rule mining. We would like to mine the databases as if they were parts of a single centralized database of 8 transactions. In order to do this, we need to calculate the local supports. For example, the local support of A in D1 is 100%; the local support of the itemset {A,B,C} in D1 is 66%, and the local support of {A,B,C} in D2 is 40%.

Distributed rule mining. Assume that the minimum support threshold is 50%: then {A,B,C} is frequent in D1, but it is not frequent in D2. However, when the databases are combined, the support of {A,B,C} in D1 ∪ D2 is 50%, which means that an itemset can be locally frequent in one database, infrequent in another, and still be frequent globally. For an itemset to be frequent globally, it must be locally frequent in at least one database.

Distributed rule mining. The algorithm is based on Apriori, which prunes candidate itemsets by looking at their support. Apriori also uses the fact that an itemset can be frequent only if all its subsets are frequent; therefore only frequent itemsets need to be used to generate larger candidate itemsets.

Distributed rule mining
1) The local sites find their locally frequent itemsets.
2) They broadcast these frequent itemsets to each other.
3) Each site counts the frequencies of the received itemsets in its local database.
4) They broadcast the resulting counts to every site.
5) Every site can now determine the globally frequent itemsets.

Distributed rule mining. Ex: 50% min. support threshold.
● We start from the singletons and calculate the frequencies of the items.
● In D1, A (freq 3), B (freq 2), and C (freq 3) are frequent; in D2, A (freq 4), B (freq 3), and C (freq 3) are frequent.
● The sites broadcast the results to each other, and each site updates the counts of A, B, C by adding the local counts.

Distributed rule mining. Ex: 50% min. support threshold.
● Each site eliminates the items that are not globally frequent; in this case all of A, B, C are globally frequent.
● Now, using the frequent items, each site generates the candidates of size 2, which are {A,B}, {A,C}, {B,C}.
● The same steps are then applied.
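
A sketch of steps 3 to 5 on this example. The individual transactions in D1 and D2 are not shown in the transcript, so the databases below are invented but reproduce the counts stated on the slides (D1: 3 transactions with A=3, B=2, C=3 and {A,B,C}=2; D2: 5 transactions with A=4, B=3, C=3 and {A,B,C}=2):

```python
from itertools import combinations

# Local databases consistent with the counts on the slides.
D1 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}]
D2 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"C"}, {"A"}]
sites = [D1, D2]
min_support = 0.5

def local_count(db, candidates):
    """Step 3: a site counts each candidate itemset in its own database."""
    return {c: sum(1 for t in db if c <= t) for c in candidates}

def globally_frequent(candidates):
    """Steps 4-5: add up the broadcast local counts and keep the frequent ones."""
    total_tx = sum(len(db) for db in sites)
    totals = {c: sum(local_count(db, candidates)[c] for db in sites) for c in candidates}
    return {c for c, n in totals.items() if n / total_tx >= min_support}

# Level 1: singletons, then level-2 candidates built from the frequent singletons.
items = sorted({i for db in sites for t in db for i in t})
L1 = globally_frequent({frozenset([i]) for i in items})
C2 = {a | b for a, b in combinations(L1, 2) if len(a | b) == 2}
L2 = globally_frequent(C2)
print(sorted(map(sorted, L1)))   # [['A'], ['B'], ['C']]
print(sorted(map(sorted, L2)))   # [['A', 'B'], ['A', 'C'], ['B', 'C']]
```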

Now we would like to do the same thing while preserving the privacy of the individual sites. The basic notions we need for that are commutative encryption and secure multi-party computation. An encryption scheme is commutative if the following two properties hold for any feasible encryption keys K1, K2, ..., Kn, any message M, and any two permutations i, j of the key indices: E_Ki1(... E_Kin(M) ...) = E_Kj1(... E_Kjn(M) ...), and for different messages M1 and M2 the probability of a collision is very low. RSA is a famous commutative encryption technique.
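
A minimal sketch of the commutativity property, using Pohlig-Hellman-style modular exponentiation as the commutative cipher (this particular cipher and its parameters are assumptions made for illustration; the slide names RSA): with E_K(M) = M^K mod p, encrypting with K1 and then K2 gives the same result as K2 and then K1.

```python
# Sketch of commutative encryption via exponentiation modulo a shared prime p
# (illustrative toy parameters, not production cryptography).

p = 2_147_483_647            # shared prime (2^31 - 1)

def encrypt(m, k):
    return pow(m, k, p)      # E_K(M) = M^K mod p

K1, K2 = 65537, 123457       # private exponents, chosen coprime to p - 1
M = 424242

assert encrypt(encrypt(M, K1), K2) == encrypt(encrypt(M, K2), K1)
print("order of encryption does not matter:", encrypt(encrypt(M, K1), K2))
```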

A simple application of commutative encryption. Assume that person A has salary S1 and person B has salary S2. How can they learn whether their salaries are equal, without revealing the salaries themselves? Assume that A and B have their own encryption keys, say K1 and K2, and we go from there!
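
The slide leaves the protocol as an exercise; a sketch of one standard way to finish it with a commutative cipher (again assuming the exponentiation cipher from the previous sketch): A sends E_K1(S1) to B, B returns E_K2(E_K1(S1)) and also sends E_K2(S2) to A, A computes E_K1(E_K2(S2)), and the two double-encrypted values are compared.

```python
# Sketch: equality test on salaries without revealing them, assuming the
# commutative cipher E_K(M) = M^K mod p (illustrative parameters only).

p = 2_147_483_647

def E(m, k):
    return pow(m, k, p)

K1, K2 = 65537, 123457       # A's and B's private keys
S1, S2 = 85_000, 85_000      # the salaries being compared

msg_from_A = E(S1, K1)                 # A -> B
double_enc_S1 = E(msg_from_A, K2)      # B applies its own key
msg_from_B = E(S2, K2)                 # B -> A
double_enc_S2 = E(msg_from_B, K1)      # A applies its own key

# The double-encrypted values are exchanged and compared; for this cipher
# they are equal exactly when S1 == S2.
print("salaries equal?", double_enc_S1 == double_enc_S2)
```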

Distributed PP Association Rule Mining. For distributed association rule mining, each site needs to distribute its locally frequent itemsets to the rest of the sites. Instead of circulating the actual itemsets, the encrypted versions are circulated. Example: S1 contains A, S2 contains B, S3 contains A. Each of them has its own key: K1, K2, K3. At the end of step 1, all sites have items encrypted by all sites. The encrypted items are then passed to a common site to eliminate the duplicates and to start decryption; this way the sites will not know who sent which item. Decryption can now start, and after everybody has finished decrypting, they have the actual items.
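
A sketch of this union step on the slide's example (S1 contributes A, S2 contributes B, S3 contributes A), again assuming the exponentiation cipher and a made-up item encoding: once every site has applied its key, the two copies of A collapse to the same ciphertext, so the common site can deduplicate without learning who sent what.

```python
# Sketch: circulating encrypted itemsets so duplicates can be removed without
# revealing which site contributed which itemset (cipher, keys, and the item
# encoding are illustrative assumptions).

p = 2_147_483_647
keys = {"S1": 65537, "S2": 123457, "S3": 900001}

def E(m, k):
    return pow(m, k, p)

encode = {"A": 1000003, "B": 1000033}           # toy integer encoding of itemsets
contributions = [("S1", "A"), ("S2", "B"), ("S3", "A")]

# Each itemset is encrypted by every site's key; commutativity makes the
# order in which the keys are applied irrelevant.
fully_encrypted = []
for _origin, item in contributions:
    c = encode[item]
    for site in keys:                           # pass the ciphertext around all sites
        c = E(c, keys[site])
    fully_encrypted.append(c)

# A common site removes duplicates; the two copies of "A" collide as intended.
unique_ciphertexts = set(fully_encrypted)
print(len(contributions), "contributions ->", len(unique_ciphertexts), "distinct itemsets")   # 3 -> 2
```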

Distributed PP Association Rule Mining. Now we need to check whether the global support of an itemset is larger than the threshold, but we do not want to reveal the local supports, since the support of an itemset is assumed to be confidential. A secure multi-party computation technique is used for this. Assume that there are three sites, each of them has {A,B,C}, and its frequency in S1 is 5 (out of 100 transactions), in S2 it is 6 (out of 200), and in S3 it is 20 (out of 300); the minimum support is 5%. S1 selects a random number, say 17. S1 adds the difference 5 - 5%x100 to 17 and sends the result (17) to S2. S2 adds 6 - 5%x200 to 17 and sends the result (13) to S3. S3 adds 20 - 5%x300 to 13 and sends the result (18) back to S1. 18 > the chosen random number (17), so {A,B,C} is globally frequent.
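
A sketch reproducing the round described on the slide (the local counts, database sizes, 5% threshold, and random offset 17 are taken from the text):

```python
# Secure-sum style check of whether {A,B,C} exceeds the global 5% support
# threshold, using the numbers from the slide.

threshold_pct = 5                          # 5% minimum support
sites = [(5, 100), (6, 200), (20, 300)]    # (local count, local database size)

random_offset = 17                         # chosen by S1, never revealed to the others
running = random_offset
for count, db_size in sites:               # S1 -> S2 -> S3 -> back to S1
    excess = count - (db_size * threshold_pct) // 100   # how far above/below threshold locally
    running += excess
    print("value passed on:", running)     # 17, 13, 18

# Only S1, who knows the offset, can do the final comparison.
print("globally frequent:", running >= random_offset)   # True: 18 >= 17
```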

Distributed PP Association Rule Mining. This technique assumes a semi-honest model, where each party follows the rules of the protocol using its correct input, but is free to later use what it sees during the execution of the protocol to compromise security. The cost of encryption is the key issue, since encryption is used heavily in this method.