CS573 Data Privacy and Security

CS573 Data Privacy and Security
Secure Multiparty Computation – Applications for Privacy Preserving Distributed Data Mining
Li Xiong
Slides credit: Chris Clifton, Purdue University; Murat Kantarcioglu, UT Dallas

Distributed Data Mining
Government / public agencies. Example: The Centers for Disease Control want to identify disease outbreaks. Insurance companies have data on disease incidents, seriousness, patient background, etc. But can/should they release this information?
Industry collaborations / trade groups. Example: An industry trade group may want to identify best practices to help members, but some practices are trade secrets. How do we provide "commodity" results to all (manufacturing using chemical supplies from supplier X has high failure rates) while still preserving secrets (manufacturing process Y gives low failure rates)?

Classification
Data partition: horizontally partitioned, vertically partitioned
Techniques: data perturbation, secure multi-party computation protocols
Mining applications: decision tree, Bayes classifier, …

Horizontally Partitioned Data
Data can be unioned to create the complete set. Each of the r sites holds full records (key, X1…Xd) for a disjoint subset of the keys: Site 1 holds keys k1…ki, Site 2 holds ki+1…kj, …, Site r holds km+1…kn.

Vertically Partitioned Data
Data can be joined on the key to create the complete set. Each site holds all keys but a disjoint subset of the attributes: Site 1 holds X1…Xi, Site 2 holds Xi+1…Xj, …, Site r holds Xm+1…Xd.
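A tiny sketch (with hypothetical data) of how the two partitioning styles recombine: horizontally partitioned sites union their row sets, while vertically partitioned sites join their columns on the shared key.

```python
# Horizontal partitioning: sites hold disjoint ROWS; union restores the table.
site1_rows = {"k1": (30, "yes"), "k2": (45, "no")}   # Site 1: keys k1, k2
site2_rows = {"k3": (27, "yes")}                     # Site 2: key k3

horizontal_full = {**site1_rows, **site2_rows}       # union of the row sets

# Vertical partitioning: sites hold disjoint COLUMNS; join on key restores it.
site_a_cols = {"k1": {"age": 30}, "k2": {"age": 45}}          # Site A: 'age'
site_b_cols = {"k1": {"buys": "yes"}, "k2": {"buys": "no"}}   # Site B: 'buys'

vertical_full = {k: {**site_a_cols[k], **site_b_cols[k]} for k in site_a_cols}
```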

Secure Multiparty Computation
Goal: compute a function when each party has some of the inputs.
Secure: can be simulated in the ideal model, where nobody learns anything but their own input and the results. Formally: ∃ a polynomial-time simulator S such that {S(x, f(x,y))} ≡ {View(x,y)}.
Semi-honest model: parties follow the protocol, but remember intermediate exchanges. Malicious model: parties may "cheat" to find something out.

Secure Multiparty Computation
It can be done! Basic cryptographic tools: oblivious transfer, oblivious circuit evaluation. Yao's Millionaires' problem (Yao '86): secure computation is possible if the function can be represented as a circuit. This works for multiple parties as well (Goldreich, Micali, and Wigderson '87). Idea: securely compute each gate, and continue to evaluate the circuit.

Why aren't we done?
Secure multiparty computation is possible, but is it practical? Generic circuit evaluation builds a circuit that represents the computation over all possible inputs, which is impossibly large for typical data mining tasks. The next step: efficient techniques for specialized tasks and computations.

Outline
Privacy preserving two-party decision tree mining (Lindell & Pinkas '00)
Privacy preserving distributed data mining toolkit (Clifton '02): secure sum, secure union
Association rule mining on horizontally partitioned data (Kantarcioglu '04)

Decision Tree Construction (Lindell & Pinkas '00)
Two-party horizontal partitioning: each site has the same schema, the attribute set is known, and the individual entities are private.
Goal: learn a decision tree classifier (ID3). Essentially ID3 meeting the secure multiparty computation definitions, in the semi-honest model.

A Decision Tree for "buys_computer"
[Figure: the root tests age?; branch <=30 leads to a student? test (no / yes), branch 31..40 leads to leaf yes, branch >40 leads to a credit rating? test (fair / excellent).]
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. It is used for classification: given a tuple with an unknown class label, the attribute values are tested against the tree, and a path is traced from the root to a leaf, which holds the classification of the tuple. Decision trees are intuitive and simple.

ID3 Algorithm for Decision Tree Induction
Greedy algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. A test attribute is selected that "best" separates the data into partitions (by information gain), and samples are partitioned recursively based on the selected attributes. Conditions for stopping partitioning: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is employed for labeling the leaf); or there are no samples left.
Quinlan developed the decision tree algorithm known as ID3, later extended into C4.5. Classification and regression trees (CART) produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric. The two were invented independently around the same time, follow a similar greedy approach, and form the cornerstone algorithms. Attributes are categorical (continuous-valued attributes are discretized in advance). Differences include how attributes are selected and the mechanisms used for pruning. ID3 requires one pass over the database for each level of the tree.
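The attribute-selection step can be sketched in plain, non-private Python: a minimal information-gain computation over a hypothetical toy table (the privacy-preserving protocol in the following slides computes the same quantities on shared data).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Information gain of splitting `rows` on `attr` (class key: 'class')."""
    labels = [r["class"] for r in rows]
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r["class"] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy training set in the style of the buys_computer example
rows = [
    {"age": "<=30", "student": "no",  "class": "no"},
    {"age": "<=30", "student": "no",  "class": "no"},
    {"age": ">40",  "student": "yes", "class": "yes"},
    {"age": ">40",  "student": "no",  "class": "yes"},
]
# ID3 picks the attribute with the largest information gain at each node
best = max(["age", "student"], key=lambda a: info_gain(rows, a))
```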

Privacy Preserving ID3
Input: R – the set of attributes; C – the class attribute; T – the set of transactions.
Step 1: If R is empty, return a leaf node with the class value with the most transactions in T. The set of attributes is public, so both parties know if R is empty. Use a secure protocol (Yao's protocol) for majority voting. Inputs: (|T1(c1)|,…,|T1(cL)|) and (|T2(c1)|,…,|T2(cL)|). Output: i where |T1(ci)| + |T2(ci)| is largest.
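The ideal functionality that the Yao-style majority vote realizes can be sketched as follows (a non-private simulation: in the real protocol neither party learns the other's counts, only the winning index).

```python
def majority_class(counts_p1, counts_p2):
    """Ideal functionality of the Step-1 majority vote: each party holds
    per-class transaction counts; only the argmax of the combined counts
    is revealed, not the counts themselves."""
    totals = [a + b for a, b in zip(counts_p1, counts_p2)]
    return max(range(len(totals)), key=totals.__getitem__)

winner = majority_class([3, 5], [4, 1])  # combined counts [7, 6] -> class 0
```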

Privacy Preserving ID3
Step 2: If T consists of transactions which all have the same value c for the class attribute, return a leaf node with the value c. Represent having more than one class in the transaction set by a fixed symbol different from ci, force the parties to input either this fixed symbol or ci, and check equality to decide whether we are at a leaf node for class ci. Various approaches exist for equality checking: Yao '86; Fagin, Naor '96; Naor, Pinkas '01.

Privacy Preserving ID3
Step 3(a): Determine the attribute that best classifies the transactions in T; call it A. This is essentially done by securely computing x·ln(x), where x = x1 + x2 is the sum of the values held by the two parties P1 and P2, respectively. The computation is decomposed into several steps, and at each step each party knows only a random share of the result.
Step 3(b, c): Recursively call ID3 for the remaining attributes on the transaction sets T(a1), …, T(am), where a1, …, am are the values of attribute A. Since the result of 3(a) and the attribute values are public, both parties can individually partition the database and prepare their inputs for the recursive calls.

Security proof tools
Real/ideal model: the real model can be simulated in the ideal model. Key idea: show that whatever can be computed by a party participating in the protocol can be computed based on its input and output only, i.e., ∃ a polynomial-time simulator S such that {S(x, f(x,y))} ≡ {View(x,y)}. Intuitively, a party's view in a protocol execution should be simulatable given only its input and output.

Security proof tools
Composition theorem: if a protocol is secure in the hybrid model, where it uses a trusted party to compute its sub-functionalities, and we replace the calls to the trusted party by calls to secure protocols, then the resulting protocol is secure. So we prove that the component protocols are secure, and conclude that the combined protocol is secure.

Secure Sub-protocols for PPDDM
In general, PPDDM protocols depend on a few common sub-protocols, which can be re-used to implement many different PPDDM protocols.

Privacy preserving data mining toolkit (Clifton '02)
Many different data mining techniques often perform similar computations at various stages (e.g., computing a sum, counting the number of items). The toolkit provides simple computations (sum, union, intersection, …) that can be assembled to solve specific mining tasks: association rule mining, Bayes classifiers, …. The protocols may not be truly secure, but they are more efficient than traditional SMC methods. (Tools for Privacy Preserving Data Mining, Clifton, 2002)

Toolkit
Secure functions: secure sum, secure union, …
Applications: association rule mining for horizontally partitioned data

Secure Sum
Goal: sites 1…s compute the sum v1 + v2 + … + vs (known to lie in [0, n)) without revealing the individual values.
The leading site chooses a random R in [0, n) and sends (v1 + R) mod n to site 2. Site l receives (R + v1 + … + v(l−1)) mod n, adds vl, and sends the result to site l+1. Finally, the leading site receives (R + Σ vi) mod n and subtracts R to obtain the sum.


Secure sum – security
The protocol does not reveal the real numbers. But is it secure? Sites can collude: for example, the two neighbors of site l can together recover vl. Countermeasure: each site divides its number into shares, and the algorithm is run multiple times with permuted node orderings.
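A minimal single-round simulation of the secure-sum ring (no share splitting; `modulus` is assumed to exceed the true sum):

```python
import random

def secure_sum(values, modulus):
    """Single-round simulation of the secure-sum ring: the leading site
    masks v1 with a random R; each later site adds its value mod n; the
    leader finally subtracts R. Assumes sum(values) < modulus."""
    R = random.randrange(modulus)
    running = (values[0] + R) % modulus      # leader sends v1 + R mod n
    for v in values[1:]:
        running = (running + v) % modulus    # site l adds v_l
    return (running - R) % modulus           # leader removes the mask

total = secure_sum([13, 7, 22], modulus=10**6)  # 42
```

Every intermediate value each site sees is uniformly random given its own input, which is why the masked running total leaks nothing to a single non-colluding site.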

Secure Union
Commutative encryption: for any permutation of the keys and any message M, encrypting in any order yields the same value, E_k1(…E_kn(M)…) = E_kπ(1)(…E_kπ(n)(M)…); and for any messages M1 ≠ M2, the multiply-encrypted values are distinct, so duplicates can be detected under encryption.
Secure union: each site encrypts its own items and the items received from the other sites; once every item has been encrypted by every site, duplicates are removed, and the remaining items are decrypted by each site in turn.
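A sketch of commutative encryption via modular exponentiation (Pohlig–Hellman style). The prime and keys below are illustrative toy parameters, not secure choices; keys must be coprime to P − 1 so that decryption exponents exist.

```python
# Toy parameters, NOT secure choices: a shared prime P; each site's key is
# an exponent coprime to P - 1.
P = 2**127 - 1          # a Mersenne prime
k1, k2 = 65537, 257     # hypothetical site keys

def enc(key, m):
    # E_k(m) = m^k mod P; exponentiation commutes, so encryption order
    # does not matter: E_k1(E_k2(m)) == E_k2(E_k1(m))
    return pow(m, key, P)

m = 123456789
both = enc(k1, enc(k2, m))          # doubly encrypted item

# Decryption peels one layer with the inverse exponent d = k^(-1) mod (P-1)
d1 = pow(k1, -1, P - 1)
recovered = enc(d1, enc(k1, m))     # back to m
```

In the union protocol, once every site has applied its key, identical itemsets encrypt to identical values, so duplicates are removed under encryption before the layers are peeled off.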


Secure Union – security
The protocol does not reveal which item belongs to which site. Is it secure under the definition of secure multi-party computation? Not quite: it reveals the number of items that are common to the sites! Revealing this innocuous information allows a more efficient algorithm than a fully secure one.

Association Rules Mining
Assume data is horizontally partitioned: each site has complete information on a set of entities, with the same attributes at each site. If the goal is only to avoid disclosing entities, the problem is easy. Basic idea, a two-phase algorithm: first phase, compute candidate rules (frequent globally ⇒ frequent at some site); second phase, compute the frequency of the candidates.

Association Rules in Horizontally Partitioned Data
[Figure: each site mines its local data; a combiner requests local bound-tightening analysis, collects locally supported rules such as A&B ⇒ C (4% support) and D ⇒ F, and merges them into combined results.]
If a rule holds globally with some level of support, it MUST hold in at least one database with that level of support. Therefore we can request each site's local rules, then ask each site to provide exact support for the rules in the union of the locally supported rules. This gives the same results as if we had access to all the data. One issue is ensuring that the results don't compromise privacy: vetting and release of data mining results is analogous to census department rules for releasing aggregate data. Rather than vetting individual rules for release, we develop a release policy and guarantee that local mining only provides results that meet that policy.

Association Rule Mining: Horizontal Partitioning What if we do not want to reveal which rule is supported at which site, the support count of each rule, or database sizes? Hospitals want to participate in a medical study But rules only occurring at one hospital may be a result of bad practices

Privacy-preserving Association rule mining for horizontally partitioned data (Kantarcioglu’04) Find the union of the locally large candidate itemsets securely After the local pruning, compute the globally supported large itemsets securely At the end check the confidence of the potential rules securely Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, Kantarcioglu, 2004

Securely Computing Candidates Compute local candidate set Using secure union!

Computing Candidate Sets
[Figure: three sites pass candidate itemsets around a ring. Site 1 holds ABC, Site 2 holds ABD, Site 3 holds ABC. Each site encrypts its own itemsets with its key and re-encrypts everything it receives, e.g. ABC → E3(ABC) → E2(E3(ABC)) → E1(E2(E3(ABC))). Once every itemset carries all three encryption layers, commutativity makes the two copies of ABC identical, so duplicates are removed and the union {ABC, ABD} is recovered by successive decryption.]

Compute Which Candidates Are Globally Supported?
Goal: check whether itemset X is globally supported at level s, i.e., whether
(1) X.sup ≥ s · Σi |DBi|
(2) Σi X.supi ≥ s · Σi |DBi|
(3) Σi (X.supi − s · |DBi|) ≥ 0
Note that checking inequality (1) is equivalent to checking inequality (3).

Which Candidates Are Globally Supported? (continued)
Securely compute the sum Σi (X.supi − s · |DBi|), then check whether sum ≥ 0. Is this a good approach? No: the sum itself is disclosed! Instead, the leading site adds a random offset R, so the sites securely compute sum + R, and then securely compare sum + R ≥ R using oblivious transfer.

Computing Frequent: Is ABC ≥ 5%?
[Figure, reconstructed: Site 1 has ABC = 18, DBSize = 300; Site 2 has ABC = 9, DBSize = 200; Site 3 has ABC = 5, DBSize = 100. The leading site picks R = 17, and each site adds its excess support, count − 5%·DBSize: Site 1: 17 + 18 − 0.05·300 = 20; Site 2: 20 + 9 − 0.05·200 = 19; Site 3: 19 + 5 − 0.05·100 = 19. A final secure comparison of 19 ≥ R = 17 tells the leading site: ABC is globally frequent. YES!]
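The excess-support round can be replayed in a few lines (hypothetical numbers matching the slide's example: local counts 18/9/5, database sizes 300/200/100, leader's offset R = 17):

```python
MIN_SUPP = 0.05
sites = [(18, 300), (9, 200), (5, 100)]   # (local ABC count, local DB size)

R = 17                                    # leading site's random offset
running = R
for count, dbsize in sites:
    running += count - MIN_SUPP * dbsize  # each site adds its excess support

# Final secure comparison: total excess support >= 0  <=>  running >= R
globally_frequent = running >= R
```

Only the masked running total travels between sites; the leader learns just the outcome of the comparison, not the individual counts.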

Computing Confidence
Checking confidence can be done by the previous protocol. Note that checking the confidence of X ⇒ Y, i.e., conf = (X∪Y).sup / X.sup ≥ c, is equivalent to checking Σi ((X∪Y).supi − c · X.supi) ≥ 0.
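The confidence test reduces to the same sum-and-compare pattern; a non-private sketch of the quantity the secure protocol compares:

```python
def confidence_holds(local_counts, c):
    """Check conf(X => Y) = sup(XY)/sup(X) >= c via the equivalent linear
    test sum_i (XY_i - c * X_i) >= 0 over per-site counts."""
    return sum(xy - c * x for xy, x in local_counts) >= 0

# Hypothetical counts at two sites: (XY count, X count) per site
ok = confidence_holds([(8, 10), (3, 10)], c=0.5)  # conf = 11/20 = 0.55
```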

Secure Functionalities
Secure comparison: comparing two integers without revealing the integer values.
Secure polynomial evaluation: Party A has a polynomial P(x) and Party B has a value b; the goal is to calculate P(b) without revealing P(x) or b.
Secure set intersection: Party A has set SA and Party B has set SB; the goal is to calculate SA ∩ SB without revealing anything else.

Secure Functionalities Used
Secure set union: Party A has set SA and Party B has set SB; the goal is to calculate SA ∪ SB without revealing anything else.
Secure dot product: Party A has a vector X and Party B has a vector Y; the goal is to calculate X·Y without revealing anything else.
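An ideal-functionality sketch of the secure dot product: real protocols leave each party with only an additive share of the result, which we can imitate directly (the function names and sharing scheme here are illustrative, not from a specific protocol).

```python
import random

def shared_dot_product(x, y, modulus=2**31):
    """Ideal functionality of a secure dot product: output X.Y split into
    two random additive shares mod `modulus`, one per party, so neither
    party alone learns the result."""
    result = sum(a * b for a, b in zip(x, y)) % modulus
    share_a = random.randrange(modulus)      # Party A's random share
    share_b = (result - share_a) % modulus   # Party B's complementary share
    return share_a, share_b

sa, sb = shared_dot_product([1, 2, 3], [4, 5, 6])  # shares of 32
```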

Data Mining on Horizontally Partitioned Data: Specific Secure Tools
Secure tools: secure sum, secure comparison, secure union, secure logarithm, secure polynomial evaluation
Applications: association rule mining, decision trees, EM clustering, Naïve Bayes classifier

Data Mining on Vertically Partitioned Data: Specific Secure Tools
Secure tools: secure comparison, secure set intersection, secure dot product, secure logarithm, secure polynomial evaluation
Applications: association rule mining, decision trees, k-means clustering, Naïve Bayes classifier, outlier detection

Summary of SMC-Based PPDDM
Mainly used for distributed data mining. Provably secure under some assumptions. Efficient, problem-specific cryptographic solutions have been developed for many distributed data mining tasks. Mostly under the semi-honest assumption (i.e., parties follow the protocols).

Drawbacks of SMC-Based PPDDM
Still not efficient enough for very large datasets or a high number of distributed nodes. The semi-honest model may not be realistic, and the malicious model is even slower. Possible future directions: crypto methods assume either malicious or semi-honest adversaries, but what about rational adversaries that cheat for profit? Game-theoretic models can provide efficient solutions, and combining anonymization with SMC techniques may yield efficient and accurate solutions.

Possible New Directions New models that can trade-off better between efficiency and security Game theoretic / incentive issues in PPDM Combining anonymization and cryptographic techniques for PPDM

References
Y. Lindell and B. Pinkas. Privacy Preserving Data Mining, 2000.
C. Clifton et al. Tools for Privacy Preserving Distributed Data Mining, 2002.
M. Kantarcioglu and C. Clifton. Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, 2004.