CS573 Data Privacy and Security

Slides:



Advertisements
Similar presentations
Computational Privacy. Overview Goal: Allow n-private computation of arbitrary funcs. –Impossible in information-theoretic setting Computational setting:
Advertisements

Data Mining Lecture 9.
Secure Evaluation of Multivariate Polynomials
Foundations of Cryptography Lecture 10 Lecturer: Moni Naor.
Data Mining Techniques: Classification. Classification What is Classification? –Classifying tuples in a database –In training set E each tuple consists.
ITIS 6200/ Secure multiparty computation – Alice has x, Bob has y, we want to calculate f(x, y) without disclosing the values – We can only do.
Rational Oblivious Transfer KARTIK NAYAK, XIONG FAN.
Introduction to Modern Cryptography, Lecture 12 Secure Multi-Party Computation.
Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques.
Yan Huang, Jonathan Katz, David Evans University of Maryland, University of Virginia Efficient Secure Two-Party Computation Using Symmetric Cut-and-Choose.
CSCE 715 Ankur Jain 11/16/2010. Introduction Design Goals Framework SDT Protocol Achievements of Goals Overhead of SDT Conclusion.
An architecture for Privacy Preserving Mining of Client Information Jaideep Vaidya Purdue University This is joint work with Murat.
Co-operative Private Equality Test(CPET) Ronghua Li and Chuan-Kun Wu (received June 21, 2005; revised and accepted July 4, 2005) International Journal.
Privacy Preserving K-means Clustering on Vertically Partitioned Data Presented by: Jaideep Vaidya Joint work: Prof. Chris Clifton.
1 Introduction to Secure Computation Benny Pinkas HP Labs, Princeton.
Privacy Preserving Data Mining Yehuda Lindell & Benny Pinkas.
Classification and Prediction by Yen-Hsien Lee Department of Information Management College of Management National Sun Yat-Sen University March 4, 2003.
Classification.
Privacy Preserving Learning of Decision Trees Benny Pinkas HP Labs Joint work with Yehuda Lindell (done while at the Weizmann Institute)
CS573 Data Privacy and Security
How to play ANY mental game
DATA MINING : CLASSIFICATION. Classification : Definition  Classification is a supervised learning.  Uses training sets which has correct answers (class.
Overview of Privacy Preserving Techniques.  This is a high-level summary of the state-of-the-art privacy preserving techniques and research areas  Focus.
Data mining and machine learning A brief introduction.
Secure Computation of the k’th Ranked Element Gagan Aggarwal Stanford University Joint work with Nina Mishra and Benny Pinkas, HP Labs.
Secure Cloud Database using Multiparty Computation.
Tools for Privacy Preserving Distributed Data Mining
Cryptographic methods for privacy aware computing: applications.
Mining Multiple Private Databases Topk Queries Across Multiple Private Databases (2005) Li Xiong (Emory University) Subramanyam Chitti (GA Tech) Ling Liu.
1 Privacy Preserving Data Mining Haiqin Yang Extracted from a ppt “Secure Multiparty Computation and Privacy” Added “Privacy Preserving SVM”
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.
Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity.
Mining Multiple Private Databases Topk Queries Across Multiple Private Databases (2005) Mining Multiple Private Databases Using a kNN Classifier (2007)
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Decision Trees Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no.
1 Classification: predicts categorical class labels (discrete or nominal) classifies data (constructs a model) based on the training set and the values.
Privacy-Preserving Data Aggregation without Secure Channel: Multivariate Polynomial Evaluation Taeho Jung 1, XuFei Mao 2, Xiang-Yang Li 1, Shao-Jie Tang.
DECISION TREE INDUCTION CLASSIFICATION AND PREDICTION What is classification? what is prediction? Issues for classification and prediction. What is decision.
Chapter 11 Sorting Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and Mount.
Topic 36: Zero-Knowledge Proofs
Chapter 6 Decision Tree.
Data Transformation: Normalization
Chapter 7. Classification and Prediction
DECISION TREES An internal node represents a test on an attribute.
Privacy-Preserving Data Mining
Privacy-Preserving Clustering
On the Size of Pairing-based Non-interactive Arguments
Decision Trees (suggested time: 30 min)
MPC and Verifiable Computation on Committed Data
Ch9: Decision Trees 9.1 Introduction A decision tree:
Chapter 6 Classification and Prediction
Privacy-preserving Release of Statistics: Differential Privacy
Data Science Algorithms: The Basic Methods
The first Few Slides stolen from Boaz Barak
Evaluation of Relational Operations
Course Business I am traveling April 25-May 3rd
CS573 Data Privacy and Security
Classification and Prediction
Differential Privacy in Practice
Quick-Sort 11/19/ :46 AM Chapter 4: Sorting    7 9
Privacy Preserving Data Mining
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Chapter 11 Limitations of Algorithm Power
Quick-Sort 2/23/2019 1:48 AM Chapter 4: Sorting    7 9
Compact routing schemes with improved stretch
©Jiawei Han and Micheline Kamber
A task of induction to find patterns
Decision Trees Jeff Storey.
Presentation transcript:

CS573 Data Privacy and Security Secure Multiparty Computation – Applications for Privacy Preserving Distributed Data Mining Li Xiong CS573 Data Privacy and Security Slides credit: Chris Clifton, Purdue University; Murat Kantarcioglu, UT Dallas 1

Distributed Data Mining The setting: Data is distributed at different sites These sites may be third parties (e.g., hospitals, government bodies) or may be the individual him or herself Privacy and security constraints: Privacy of individual data Security of data holders xn x1 x3 x2 f(x1,x2,…, xn) Secure computation can be used to solve distributed data-mining problem That is, carry out a secure computation

Distributed Data Mining Government / public agencies. Example: The Centers for Disease Control want to identify disease outbreaks Insurance companies have data on disease incidents, seriousness, patient background, etc. But can/should they release this information? Industry Collaborations / Trade Groups. Example: An industry trade group may want to identify best practices to help members But some practices are trade secrets How do we provide “commodity” results to all (Manufacturing using chemical supplies from supplier X have high failure rates), while still preserving secrets (manufacturing process Y gives low failure rates)?

Taxonomy Data partition Techniques Mining applications Horizontally partitioned Vertically partitioned Techniques Data perturbation Secure Multi-party Computation protocols Mining applications Decision tree Bayes classifier …

Horizontally Partitioned Data Data can be unioned to create the complete set key X1…Xd K1 k2 kn key X1…Xd key X1…Xd key X1…Xd K1 k2 ki Ki+1 ki+2 kj Km+1 km+2 kn Site 1 Site 2 … Site r 5

Vertically Partitioned Data Data can be joined to create the complete set key X1…Xi Xi+1…Xj … Xm+1…Xd key X1…Xi key Xi+1…Xj key Xm+1…Xd Site 1 Site 2 … Site r

Secure Multiparty Computation Goal: Compute function when each party has some of the inputs Secure Can be simulated by ideal model - nobody knows anything but their own input and the results Formally:  polynomial time S such that {S(x,f(x,y))} ≡ {View(x,y)} Semi-Honest model: follow protocol, but remember intermediate exchanges Malicious: “cheat” to find something out

Application of SMC to Private Data Mining The aim: Compute the data mining algorithm on the data so that nothing but the output is learned Is it sufficient? xn x1 x3 x2 f(x1,x2,…, xn) Secure computation can be used to solve distributed data-mining problem That is, carry out a secure computation

Privacy and Secure Computation Privacy  Security Secure computation only deals with the process of computing the function It does not ask whether or not the function should be computed A two-stage process: Decide that the function/algorithm should be computed – an issue of privacy (e.g. differential privacy) Apply secure computation techniques to compute it securely – security (focus of this class) Cryptography aims for the following (regarding privacy): A secure protocol must reveal no more information than the output of the function itself That is, the process of protocol computation reveals nothing. Cryptography does not deal with the question of whether or not the function reveals much information E.g., mean of two parties’ salaries Deciding which functions to compute is a different challenge that must also be addressed in the context of privacy preserving data mining. Secure computation: Assume that there is a function that all parties wish to compute Secure computation shows how to compute that function in the safest way possible In particular, it guarantees minimal information leakage (the output only) Privacy: Does the function output itself reveal “sensitive information”, or Should the parties agree to compute this function?

Secure Multiparty Computation Basic cryptographic tools Oblivious transfer Random shares Oblivious circuit evaluation Yao’s Millionaire’s problem (Yao ’86) Secure computation possible if function can be represented as a circuit Works for multiple parties as well (Goldreich, Micali, and Wigderson ’87) Idea: Securely compute gate Continue to evaluate circuit

But we aren’t done yet Circuit evaluation: Build a circuit that represents the computation For all possible inputs Impossibly large for typical data mining tasks Next step: Efficient techniques for specialized tasks and computations Tradeoff between security, efficiency, and accuracy

Outline Privacy preserving two-party decision tree mining (Lindell & Pinkas ’00) Primitive protocols Secure sum Secure union (encryption based) Secure max (probabilistic random response based) Secure data mining using sub protocols Privacy preserving data mining, Lindell, 2000 Tools for Privacy Preserving Data Mining, Clifton, 2002 Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, Kantarcioglu, 2004

Decision Tree Construction (Lindell & Pinkas ’00) Two-party horizontal partitioning Each site has same schema Attribute set known Individual entities private Learn a decision tree classifier ID3 Semi-honest model

A Decision Tree for “buys_computer” age? overcast student? credit rating? <=30 >40 no yes 31..40 fair excellent A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Used for classification: Given a tuple with unknown attribute, the attribute values are tested against the tree, and a path is traced from root to the leaf, which holds the classification of the tuple. Decision tree is intuitive and simple. 14 14 14

ID3 Algorithm for Decision Tree Induction Greedy algorithm - tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root A test attribute is selected that “best” separate the data into partitions - information gain Samples are partitioned recursively based on selected attributes Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf There are no samples left Quinlan developed a decision tree algorithm known as ID3, then later extended into C4.5. Classification and regression trees (CART) is a technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. They are invented independently around the same time, yet follow a similar approach. They form the cornerstone algorihtms. They both adopt a greedy strategy Attributes are categorical (if continuous-valued, they are discretized in advance) Differences include how the attributes are selected and the mechanisms used for pruning. Requires one pass of the database for each level of the tree 15 15 15

Privacy Preserving ID3 Input: R – the set of attributes C – the class attribute T – the set of transactions Step 1: If R is empty, return a leaf-node with the class value with the most transactions in T Set of attributes is public Both know if R is empty Use secure protocol for majority voting Yao’s protocol Inputs (|T1(c1)|,…,|T1(cL)|), (|T2(c1)|,…,|T2(cL)|) Output i where |T1(ci)|+|T2(ci)| is largest

Privacy Preserving ID3 Step 2: If T consists of transactions which have all the same value c for the class attribute, return a leaf node with the value c Represent having more than one class (in the transaction set), by a fixed symbol different from ci, Force the parties to input either this fixed symbol or ci Check equality to decide if at leaf node for class ci Various approaches for equality checking Yao’86 Fagin, Naor ’96 Naor, Pinkas ‘01

Privacy Preserving ID3 Step 3:(a) Determine the attribute that best classifies the transactions in T, let it be A Each party compute P1 and P2 , i.e., x1 and x2, respectively Essentially done by securely computing x*(ln x) where x is the sum of values from the two parties Step 3: (b,c) Recursively call ID3 for the remaining attributes on the transaction sets T(a1),…,T(am) where a1,…, am are the values of the attribute A Since the results of 3(a) and the attribute values are public, both parties can individually partition the database and prepare their inputs for the recursive calls decomposed to several steps Each step each party knows only a random share of the result

Security proof tools Real/ideal model: the real model can be simulated in the ideal model Key idea – Show that whatever can be computed by a party participating in the protocol can be computed based on its input and output only  polynomial time S such that {S(x,f(x,y))} ≡ {View(x,y)} Intuitively, a party’s view in a protocol execution should be simulatable given only its input and output

Security proof tools Composition theorem if a protocol is secure in the hybrid model where the protocol uses a trusted party that computes the (sub) functionalities, and we replace the calls to the trusted party by calls to secure protocols, then the resulting protocol is secure Prove that component protocols are secure, then prove that the combined protocol is secure

Secure Sub-protocols for PPDDM In general, PPDDM protocols depend on few common sub-protocols. Those common sub-protocols could be re-used to implement PPDDM protocols

Privacy preserving data mining toolkit (Clifton ‘02) Many different data mining techniques often perform similar computations at various stages (e.g., computing sum, counting the number of items) Toolkit simple computations – sum, union, intersection … assemble them to solve specific mining tasks – association rule mining, bayes classifier, … The protocols may not be truly secure but more efficient than traditional SMC methods Tools for Privacy Preserving Data Mining, Clifton, 2002

Primitive protocols Secure functions Secure sum Secure union …

Secure Sum Leading site: (v1 + R) mod n Site l receives:

Secure Sum

Secure sum - security Does not reveal the real number Is it secure? Site can collude! Each site can divide the number into shares, and run the algorithm multiple times with permutated nodes

Secure Union Commutative encryption Secure union For any set of permutated keys and a message M For any set of permutated keys and message M1 and M2 Secure union Each site encrypt its items and items from other site, remove duplicates, and decrypt

Secure Union 1 ABC E1(E2(E3(ABC))) E1(E2(E3(ABD))) E2(E3(ABC))

Secure Union Security Does not reveal which item belongs to which site Is it secure under the definition of secure multi-party computation? It reveals the number of items that are common in the sites! Revealing innocuous information leakage allows a more efficient algorithm than a fully secure algorithm

Random Shares based Secure Union Phase 1: random item addition Multiple rounds with permutated ring Each node sends a random share of its item set and a random share of a random item set Phase 2: random item removal Each node subtracts its random items set

Random Shares based Secure Union - Analysis Item exposure attack An adversary makes a claim C on a particular item a node i contributes to the final result (C: vi in xi) Set exposure attack An adversary makes a claim C on the whole set of items a node i contributes to the final union result X (C: xi = ai). Change of beliefs (posterior probability and prior probability) P(C|IR,X) - P(C|X) P(C|IR,X)/P(C|X)

Exposure Risk – Set Exposure Disclosure decreases with increasing number of generated random items and increasing number of participating nodes Set exposure risk is or close to 0 for probabilistic and crypto approach Random items generation becomes more important in small network

Exposure Risk – Risk Exposure Item exposure risk decreases with increasing number of generated random items and participating nodes Item exposure risk for probabilistic approach is quite high Random items generation becomes more important in small network

Cost Comparison Commutative protocol and anonymous communication protocol efficient but sensitive to union size Probabilistic protocol efficient but sensitive to domain size Estimated runtime for the general circuit-based protocol implemented by FairplayMP framework is 15 days, 127 days and 1.4 years for the domain sizes tested

Multi-round protocols Multi-round probabilistic protocols for max/min and top k (Xiong ‘07)

Protocol Structure Random response (Warner 1965) 2018/11/27 Protocol Structure Social Survey Random response (Warner 1965) Multi-round randomized protocol Randomized local computation Multi-node anonymity Assumption: semi-honest model Private Data D1 D3 D2 Dn … Output Input Local Computation Preserving Privacy for outsourcing aggregation services, Xiong, 2007

A Naïve Max/Min Protocol 2018/11/27 A Naïve Max/Min Protocol 1 3 2 4 30 20 40 10 start i gi-1 gi vi gi-1>=vi gi-1<vi gi gi-1 vi Privacy Characteristics Starting node has 100% data value exposure Nodes close to the starting node has a high probability of data value exposure Data range exposure An anonymous protocol can be used to avoid worst case but will not help the average loss of privacy Intuitive idea Add in randomization When, how, and how much randomization to add in? Goal Ensure correct result Minimize data disclosure Add in randomization – how, when, and how much?

Max Protocol – Random response 2018/11/27 Max Protocol – Random response Random response at node i: gi-1>=vi gi-1(r)<vi gi(r) gi-1(r) w/ prob Pr: random number w/ prob 1-Pr: vi i gi-1 gi vi

Max Protocol – multi-round random response 2018/11/27 Max Protocol – multi-round random response Multiple rounds Randomization Probability at round r : Pr(r) = Local algorithm at round r and node i: gi-1(r)>=vi gi-1(r)<vi gi(r) gi-1(r) w/ prob Pr: rand [gi-1(r), vi) w/ prob 1-Pr: vi i gi-1(r) gi(r) vi

Max Protocol - Illustration 2018/11/27 Max Protocol - Illustration Start 18 32 35 D2 D2 30 10 32 35 40 18 32 35 20 40 D4 D3 32 35 40

PrivateTopK Protocol Gi-1(r) Vi Gi(r) 11/27/2018 Gi’(r)=topk(Gi-1(r) U Vi); Vi’ = Gi’(r) – Gi-1’(r); m = |Vi’|; if m=0 then Gi(r)= Gi-1(r); else with probability 1-Pr(r): Gi(r)= Gi-1(r) with probability Pr: Gi(r)[1:k-m] = Gi-1(r)[1:k-m] Gi(r)[k-m+1:k] = a sorted list of m random values generated from [min(Gi’(r)[k]-delta,Gi-1(r-1)[k-m+1]), Gi’(r)[k]) end Gi(r) Gi-1(r) 10 60 80 100 125 145 Vi 40 50 110 130 150 D1 … D2 Dn Multiple rounds Randomization Probability at round r : Pr (r) = Probabilistic local computation 11/27/2018

Min/Max Protocol - Correctness 2018/11/27 Min/Max Protocol - Correctness Precision bound: Converges with r Smaller p0 and d provides faster convergence

Min/Max Protocol - Cost 2018/11/27 Min/Max Protocol - Cost Communication cost single round: O(n) Minimum # of rounds given precision guarantee (1-e):

Min/Max Protocol - Security 2018/11/27 Min/Max Protocol - Security Probability/confidence based metric: P(C|IR,R) Different types of exposures based on claim Data value: vi=a Data ownership: Vi contains a Loss of Privacy (LoP) = | P(C|IR,R) – P(C|R) | Information entropy based metric: Loss of privacy as a measure of randomness of information: H(D|R) - H(D|IR,R) 0.5 1 Absolute Privacy Provable Exposure

Min/Max Protocol – Security (Analysis) 2018/11/27 Min/Max Protocol – Security (Analysis) Upper bound for average expected LoP: max r 1/2r-1 * (1-P0*dr-1) Larger p0 and d provides better privacy

Min/Max Protocol – Security (Experiments) 2018/11/27 Min/Max Protocol – Security (Experiments) Loss of privacy decreases with increasing number of nodes Probabilistic protocol achieves better privacy (close to 0) When n is large, anonymous protocol is actually okay!

Union Commutative encryption based approach Number of rounds: 2 rounds Each round: encryption and decryption Multi-round random-response approach?

Vector Each database has a boolean vector of the data items p1 p2 pc VG 1 b1 b2 bL … 1 1 1 1 … OR OR OR = … … … Each database has a boolean vector of the data items Union vector is a logical OR of all vectors Privacy Preserving Indexing of Documents on the Network, Bawa, 2003

Group Vector Protocol … 1 v1 1 … v2 1 … vc … 1 v1 1 … v2 1 … vc … Processing of VG’ at ps of round r Pex=1/2r, Pin=1-Pex for(i=1; i<L; i++) if (Vs[i]=1 and VG’[i]=0) Set VG’[i]=1 with prob. Pin if (Vs[i]=0 and VG’[i]=1) Set VG’[i]=0 with prob. Pex p1 p2 pc 1 … vG’ … vG’ 1 … vG’ 1 … vG’ 1 … vG’ 1 … vG’ 1 … vG’ r=1, Pex=1/2, Pin=1/2 r=2, Pex=1/4, Pin=3/4

Open issues Tradeoff between accuracy, efficiency, and security How to quantify security How to design adjustable protocols Can we generalize the algorithms for a set of operators based on their properties Operators: sum, union, max, min … Properties: commutative, associative, invertible, randomizable

Secure Functionalities Secure Comparison: Comparing two integers without revealing the integer values. Secure Polynomial Evaluation: Party A has polynomial P(x) and Part B has a value b, the goal is to calculate P(b) without revealing P(x) or b Secure Set Intersection: Party A has set SA and Party B has set SB , the goal is to calculate without revealing anything else.

Secure Functionalities Used Secure Set Union: Party A has set SA and Party B has set SB , the goal is to calculate without revealing anything else. Secure Dot Product: Party A has a vector X and Party B has a vector Y. The goal is to calculate X.Y without revealing anything else.

Secure Poly. Evaluation Association Rule Mining Secure Sum Secure Comparison Secure Union Secure Logarithm Secure Poly. Evaluation Association Rule Mining Decision Trees EM Clustering Naïve Bayes Classifier Data Mining on Horizontally Partitioned Data Specific Secure Tools

Data Mining on Vertically Partitioned Data Specific Secure Tools Data Mining on Vertically Partitioned Data Secure Comparison Secure Set Intersection Secure Dot Product Secure Logarithm Secure Poly. Evaluation Association Rule Mining Decision Trees K-means Clustering Naïve Bayes Classifier Outlier Detection

Summary of SMC Based PPDDM Mainly used for distributed data mining. Provably secure under some assumptions. Efficient/specific cryptographic solutions for many distributed data mining problems are developed. Mainly semi-honest assumption(i.e. parties follow the protocols)

Ongoing research New models that can trade-off better between efficiency and security Game theoretic / incentive issues in PPDM Combining anonymization and cryptographic techniques for PPDM