Privacy-Preserving Data Mining

Presentation transcript:

Privacy-Preserving Data Mining Jaideep Vaidya (jsvaidya@rbs.rutgers.edu) Joint work with Chris Clifton (Purdue University)

Outline
- Introduction
  - Privacy-Preserving Data Mining
  - Horizontal / Vertical Partitioning of Data
  - Secure Multi-party Computation
- Privacy-Preserving Outlier Detection
- Privacy-Preserving Association Rule Mining
- Conclusion
Security proofs: very necessary and quite complex; not discussed in this talk, but present in the paper.

Back in the good ol’ days / Now / Future
[Figure labels: Dominick’s, Safeway, Jewel]

A “real” example: Ford / Firestone
- Individual databases
- Possible to join both databases (find corresponding transactions)
- Commercial reasons not to share data
  - Valuable corporate information: complete cost structures / business structures
- Ford Explorers with Firestone tires → tread separation problems (accidents!)
- Might have been able to figure this out a bit earlier (tires from the Decatur, Ill. plant, in certain situations)

Public (mis)Perception of Data Mining: Attack on Privacy
- Fears of loss of privacy constrain data mining
  - Protests over a national registry in Japan
  - Data Mining Moratorium Act: would stop all data mining R&D by DoD
  - Terrorism Information Awareness ended
- Data mining could be a key technology
Speaker note: the other facet is security.

Is Data Mining a Threat?
- Data mining summarizes data
  - (Possible?) exception: anomaly / outlier detection
- Summaries aren’t private
  - Or are they?
  - Does generating them raise issues?
- Data mining can be a privacy solution
  - Data mining enables safe use of private data

Privacy Problems with Data Mining
- The problem isn’t data mining, it is the infrastructure to support it!
  - Japanese registry data was already held by prefectures; protests arose over moving to a national registry
  - The Total Information Awareness program doesn’t generate new data; the goal is to enable use of data from multiple agencies
- Loss of separation of control increases the potential for misuse
- Find patterns while seeing only your own data!

Privacy-Preserving Data Mining
How can we mine data if we cannot see it?
- Perturbation (Agrawal & Srikant; Evfimievski et al.)
  - Extremely scalable, approximate results
  - Debate about security properties
- Cryptographic (Lindell & Pinkas; Vaidya & Clifton)
  - Completely accurate, completely secure (tight bound on disclosure)
  - Appropriate for a small number of parties
- Condensation / Hybrid
- Access control
- Secure Multiparty Computation
  - Proof that this is (theoretically) possible
Speaker notes: give the tradeoff (accuracy vs. scalability); first to address vertical partitioning (clustering, classification, association rules done); applied SMC (similar to what we are doing) has many papers, but most are restricted to 2 parties.

Assumptions
- Data distributed
  - Each data set is held by a source authorized to see it
  - Nobody is allowed to see aggregate data
  - Knowing all data about an individual violates privacy
- Data holders don’t want to disclose data
  - Won’t collude to violate privacy

Gold Standard: Trusted Third Party

Horizontal Partitioning of Data
Columns: CC# | Active? | Delinquent? | Amount
Bank of America: 123, Yes, <$300; 324, No, $300-500; 919, >$1000
Chase Manhattan: 3450, Yes, <$300; 4127, No, $300-500; 8772, >$1000

Vertical Partitioning of Data
Global database view: TID | Brain Tumor? | Diabetes? | Model | Battery
Medical Records (TID, Brain Tumor?, Diabetes?): RPJ: Yes, Diabetic; CAC: No Tumor, No; PTR
Cell Phone Data (TID, Model, Battery): RPJ: 5210, Li/Ion; CAC: none; PTR: 3650, NiCd
Speaker note: need to give horizontal partitioning.

Secure Multi-Party Computation (SMC)
- Given a function f and n inputs, distributed at n sites, compute the result while revealing nothing to any site except its own input(s) and the result.
Speaker notes: meaning of security (excepting polynomial predicates: not clear or necessary); skip input problems with semi-honest input.

Secure Multi-Party Computation
- It can be done!
  - Yao’s Millionaires’ problem (Yao ’86)
  - Secure computation is possible if the function can be represented as a circuit
  - Idea: securely compute a gate, then continue to evaluate the circuit
  - Extended to multiple parties (BGW/GMW ’87)
- Biggest problem: efficiency
  - Will not work for lots of parties / large quantities of data
- Proof of security: simulator-based approach (mentioned later)
Speaker notes: efficiency and Yao’s protocol; maybe use simulation figures from Agrawal and Srikant?

SMC: Models of Computation
- Semi-honest model
  - Parties follow the protocol faithfully
- Malicious model
  - Anything goes!
- Provably secure
  - In either case, input can always be modified
- No-collusion model
  - No collusion allowed
  - Only sensible for multiple parties
Speaker notes: there are ways of proving security in both kinds of models; basically, secure protocols exist in both models.

Incentive compatibility
- From a higher-level perspective (economic notion)
- If a party cheats, either the party is caught or the party suffers an economic loss
- Possible for many useful collaboration problems
- If a protocol is incentive compatible, the semi-honest model is sufficient for security

What is an Outlier?
- An object O in a dataset T is a DB(p,dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O
- Centralized solution from Knorr and Ng
  - Nested loop comparison
  - Maintain count of objects inside threshold
  - If count exceeds threshold, declare non-outlier and move to next
  - Clever processing order minimizes I/O cost
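The nested-loop idea on this slide can be sketched as follows. This is a minimal in-memory illustration, not Knorr and Ng's I/O-optimized algorithm; the function name `db_outliers`, the use of Euclidean distance, and the choice to count an object as "close" to itself are assumptions made for the example.

```python
import math

def db_outliers(points, p, dt):
    """Return indices of DB(p, dt)-outliers using a nested-loop scan."""
    n = len(points)
    # An object is an outlier if at least ceil(p * n) objects are farther
    # than dt, i.e. if at most max_close objects (itself included) are within dt.
    max_close = n - math.ceil(p * n)
    outliers = []
    for i, a in enumerate(points):
        close = 0
        is_outlier = True
        for b in points:
            if math.dist(a, b) <= dt:
                close += 1
                if close > max_close:
                    # Too many neighbours: declare non-outlier and move on.
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(i)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(db_outliers(points, p=0.75, dt=1.0))  # prints [4]
```

The early `break` is the "declare non-outlier and move to next" step; the privacy-preserving version has to give up this shortcut, because skipping comparisons would itself leak information.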

Privacy-Preserving Solution
- Key idea: share splitting
  - Computations leave results (randomly) split between parties
  - The only outcome is whether the count of points within the distance threshold exceeds the outlier threshold
- Requires pairwise comparison of all points
  - But failure to compare all points reveals information about non-outliers
  - This alone makes it possible to cluster points
  - This is a privacy violation
- Asymptotically equivalent to Knorr & Ng

Solution: Horizontal Partition
- Compare locally with your own points
- For remote points, get random share of distance
- Calculate random share of “exceeds threshold or doesn’t”
- Sum shares and test if enough “close” points
[Figure: numeric example of random shares]
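To make "random shares" concrete, here is a minimal sketch of additive share splitting. The modulus, the helper names `split` and `reconstruct`, and the example bits are all assumptions for illustration. In the actual protocol the final threshold test would itself run as a secure comparison; the count is reconstructed in the clear here only to show that the shares add up.

```python
import secrets

MODULUS = 2**61 - 1  # assumed share modulus; anything larger than the counts works

def split(value):
    """Split an integer into two additive shares modulo MODULUS."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS

def reconstruct(share_a, share_b):
    """Recombine two additive shares."""
    return (share_a + share_b) % MODULUS

# Hypothetical results of "is this remote point within the distance threshold?"
close_bits = [1, 0, 0, 0, 1]
shares = [split(bit) for bit in close_bits]

# Each party sums only the shares it holds; neither sum reveals the count alone.
sum_a = sum(s[0] for s in shares) % MODULUS
sum_b = sum(s[1] for s in shares) % MODULUS

count = reconstruct(sum_a, sum_b)
print(count)  # prints 2
```

Because each share is uniformly random on its own, a party looking only at its own column of shares learns nothing about the individual bits.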

Random share of distance
- x², y² are computed locally; the sum of xy is a scalar product
- Several protocols for share-splitting scalar product (Du & Atallah ’01; Vaidya & Clifton ’02; Ioannidis, Grama, Atallah ’02)
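The decomposition the slide alludes to can be written out as below. This is a sketch assuming squared Euclidean distance between a point X = (x_1, ..., x_d) held by one party and a point Y = (y_1, ..., y_d) held by the other; the symbols d, x_i and y_i are introduced here for illustration.

```latex
\[
\operatorname{dist}^2(X, Y)
  = \sum_{i=1}^{d} (x_i - y_i)^2
  = \underbrace{\sum_{i=1}^{d} x_i^2}_{\text{local to the first party}}
  + \underbrace{\sum_{i=1}^{d} y_i^2}_{\text{local to the second party}}
  - 2 \underbrace{\sum_{i=1}^{d} x_i y_i}_{\text{scalar product}}
\]
```

Only the last term needs input from both parties, which is why a share-splitting scalar product protocol is enough to give each party a random share of the distance.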

Shares of “Within Threshold”
- Goal: is x + y ≤ dt?
- Essentially Yao’s Millionaires’ problem (Yao ’86)
  - Represent function to be computed as circuit
  - Cryptographic protocol gives random shares of each wire
- Solves “sum of shares from within dt exceeds minimum” as well

Vertically Partitioned Data
- Each party computes its part of the distance
- Secure comparison (circuit evaluation) gives each party shares of 1/0 (close/not)
- Sum and compare as with horizontal partitioning
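A small sketch of "each party computes its part of the distance", assuming squared Euclidean distance and a hypothetical split of three attributes between two parties. The final comparison is done in the clear here purely for illustration; in the protocol it would be evaluated securely so that neither partial sum is revealed.

```python
def partial_squared_distance(u, v):
    """Squared distance over the attributes that one party holds."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical data: party 1 holds the first two attributes of both objects,
# party 2 holds the third attribute.
party1_obj1, party1_obj2 = (1, 2), (4, 6)
party2_obj1, party2_obj2 = (3,), (3,)

x = partial_squared_distance(party1_obj1, party1_obj2)  # party 1's part
y = partial_squared_distance(party2_obj1, party2_obj2)  # party 2's part

dt = 6  # assumed distance threshold
is_close = x + y <= dt ** 2
print(x, y, is_close)  # prints 25 0 True
```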

Why is this Secure?
- Random shares are indistinguishable from random values
  - Contain no knowledge in isolation
  - Assuming no collusion, so shares are viewed in isolation
- Number of values (= number of shares) known
  - Nothing new revealed
- Too few close points is the outlier definition
  - This is the desired result
- No knowledge that can’t be discovered from one’s own input and the result!

Conclusion (Outlier Detection)
- Outlier detection is feasible without revealing anything but the outliers
  - Possibly expensive (quadratic)
  - But a more efficient solution for this definition of outlier inherently reveals potentially privacy-violating information
- Key: privacy of non-outliers is preserved
  - The reason why outliers are outliers is also hidden
- Allows search for “unusual” entities without disclosing private information about entities

Association Rules
- Association rules are a common data mining task
  - Find A, B, C such that AB => C holds frequently (e.g. Diapers => Beer)
- Fast algorithms for centralized and distributed computation
  - Basic idea: for AB => C to be frequent, AB, AC, and BC must all be frequent
  - Require sharing data
- Secure Multiparty Computation is too expensive
Speaker note: make it clear that rules can go beyond 3 items, e.g. ABCD => E.

Association Rule Mining
- Find out if itemset {A1, B1} is frequent (i.e. if support of {A1, B1} ≥ k)
- Support of an itemset is defined as the number of transactions in which all attributes of the itemset are present
- For binary data, support = |Ai ∧ Bi|
[Table: two columns, A1 and B1, indexed by keys k1 to k5]
- {A1, B1} is supported for keys k4, k5. Support is 2.
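As a small illustration of the support count for binary data: the 0/1 columns below are hypothetical, chosen only so that both attributes are present for keys k4 and k5 as the slide states; the slide's own table values are not reproduced here.

```python
keys = ["k1", "k2", "k3", "k4", "k5"]

# Hypothetical binary columns; A1 would sit with one party, B1 with another.
a1 = [1, 0, 0, 1, 1]
b1 = [0, 1, 0, 1, 1]

# A key supports {A1, B1} when both attributes are present in that transaction.
supporting_keys = [key for key, a, b in zip(keys, a1, b1) if a and b]

# For 0/1 data the support is the count of positions where both are 1,
# which is the same as the scalar product of the two columns.
support = sum(a * b for a, b in zip(a1, b1))

def is_frequent(support, k):
    """Itemset is frequent if its support reaches the threshold k."""
    return support >= k

print(supporting_keys, support, is_frequent(support, k=2))
# prints ['k4', 'k5'] 2 True
```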

Association Rule Mining
- Idea based on TID-list representation of data
  - Represent attribute A as TID-list Atid
  - Support of ABC is | Atid ∩ Btid ∩ Ctid |
- Use a secure protocol to find the size of the set intersection to find candidate sets
Speaker notes: we now know how to compute one of them (half a slide on how to compute one frequent set from another); with millions of candidate itemsets, this won’t work as is.

Cardinality of Set Intersection
- Use a secure commutative hash function
  - Pohlig-Hellman encryption
- Each party generates its own encryption key
- All parties encrypt all the input sets
  - E1(E2(…Ek(X)…)) = El(Ei(…Ej(X)…))
- Result is the number of common objects in all sets
  - No need to decrypt

Cardinality of Set Intersection
- Hashing
  - All parties hash all sets with their key
- Initial intersection
  - Each party finds the intersection of all sets (except its own)
- Final intersection
  - Parties exchange the final intersection set and compute the intersection of all sets
- Order is permuted in each hashing step. Finally, each hashed set is sent to every party except its original owner
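The commutative-encryption idea can be sketched with modular exponentiation in the style of Pohlig-Hellman, where E_k(x) = x^k mod p and therefore the order of encryption does not matter. Everything below is a toy under stated assumptions: the prime `P`, the SHA-256 item encoding in `to_group`, and the single-process simulation of all parties are illustrative choices, not the authors' implementation, and the parameters are not sized for real security. The intermediate-intersection step is omitted.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized modulus assumed for illustration

def keygen():
    """Pick an exponent coprime to P - 1 so that encryption is a bijection."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e

def to_group(item):
    """Map an item to an integer in [2, P - 2] (hypothetical encoding)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 3) + 2

def encrypt(key, values):
    """Exponentiate every value, then permute the order of the results."""
    out = [pow(v, key, P) for v in values]
    secrets.SystemRandom().shuffle(out)
    return out

def intersection_size(sets, keys):
    """Encrypt every set under every key and count the common ciphertexts."""
    fully_encrypted = []
    for j, s in enumerate(sets):
        values = [to_group(x) for x in s]
        # Each set visits the parties in a different order; commutativity
        # guarantees equal items still end up as equal ciphertexts.
        for key in keys[j:] + keys[:j]:
            values = encrypt(key, values)
        fully_encrypted.append(set(values))
    return len(set.intersection(*fully_encrypted))

x = {"alpha", "lambda", "sigma", "beta"}
y = {"lambda", "sigma", "phi", "upsilon", "beta"}
z = {"alpha", "beta", "kappa", "lambda", "gamma"}
keys = [keygen() for _ in range(3)]
print(intersection_size([x, y, z], keys))  # prints 2
```

No ciphertext is ever decrypted: equality of fully encrypted values is all that is needed to count the common items.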

Computing Size of Intersection
[Figure: three parties 1, 2, 3 hold sets X: α,λ,σ,β; Y: λ,σ,φ,υ,β; Z: α,β,κ,λ,γ. Each set is encrypted by every party in turn, e.g. E1(X), E3(E1(X)), E2(E3(E1(X))); E2(Y), E1(E2(Y)), E3(E1(E2(Y))); E3(Z), E2(E3(Z)), E1(E2(E3(Z))). Intermediate intersections: X∩Y: λ,σ,β; X∩Z: α,β,λ; Y∩Z: λ,β. Final result at every party: X∩Y∩Z: λ,β.]
Speaker notes: probing attacks; it is possible to design algorithms to prevent / detect certain kinds of inputs. Too many concepts on one slide.

Why need an intermediate intersection step?
- Probing
  - One party is only interested in a particular item
  - Input set is composed of the interesting item and junk
  - Output reveals information about the presence / absence of the item (what if the item represents medical records for a celebrity?)
- Solution
  - Intermediate step: every party receives the encrypted sets of all other parties (but not its own)
  - If the intersection size is lower than a threshold, there is a possibility of probing => abort protocol

Proof of Security
- Proof by simulation
  - What is known (what site i learns): the size of the intersection set
  - How it can be simulated
- Protocol is symmetric, so simulating the view of one party is sufficient

Proof of Security
- Hashing
  - Party i receives an encrypted set from party i-1
  - Can use random numbers to simulate this
- Intersection
  - Party i receives fully hashed sets of all parties

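Simulating Fully Encrypted Sets
Given: |ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8
- In all of A, B, C: 2
- In A and B only: 3 - 2 = 1
- In A and C only: 4 - 2 = 2
- In B and C only: 2 - 2 = 0
- In A only: 6 - 2 - 1 - 2 = 1
- In B only: 7 - 2 - 1 - 0 = 4
- In C only: 8 - 2 - 2 - 0 = 4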

A: R1 R2 R3 R4 R5 R6
B: R1 R2 R3 R7 R8 R9 R10
C: R1 R2 R4 R5 R11 R12 R13 R14
Speaker note: explain why this is computationally indistinguishable; the simulation is of no use without it.
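The simulation step can be sketched as code: given only the intersection sizes, build three sets of random values with exactly those sizes. The function names and the use of 128-bit random integers as stand-ins for fully encrypted items are assumptions; the sketch shows the counting argument, not the indistinguishability proof itself.

```python
import secrets

def fresh_values(count, used):
    """Draw `count` distinct random 128-bit values not handed out before."""
    out = set()
    while len(out) < count:
        r = secrets.randbits(128)
        if r not in used:
            used.add(r)
            out.add(r)
    return out

def simulate_three_sets(n_abc, n_ab, n_ac, n_bc, n_a, n_b, n_c):
    """Build random sets A, B, C whose intersection sizes match the inputs."""
    used = set()
    abc = fresh_values(n_abc, used)
    ab_only = fresh_values(n_ab - n_abc, used)
    ac_only = fresh_values(n_ac - n_abc, used)
    bc_only = fresh_values(n_bc - n_abc, used)
    a_only = fresh_values(n_a - n_abc - len(ab_only) - len(ac_only), used)
    b_only = fresh_values(n_b - n_abc - len(ab_only) - len(bc_only), used)
    c_only = fresh_values(n_c - n_abc - len(ac_only) - len(bc_only), used)
    a = abc | ab_only | ac_only | a_only
    b = abc | ab_only | bc_only | b_only
    c = abc | ac_only | bc_only | c_only
    return a, b, c

# Sizes from the slide: |ABC|=2, |AB|=3, |AC|=4, |BC|=2, |A|=6, |B|=7, |C|=8
a, b, c = simulate_three_sets(2, 3, 4, 2, 6, 7, 8)
print(len(a), len(b), len(c), len(a & b), len(a & c), len(b & c), len(a & b & c))
# prints 6 7 8 3 4 2 2
```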

Optimized version

Association Rule Mining (Revisited)
- Naïve algorithm: simply use Apriori
  - A single set intersection determines the frequency of a single candidate itemset
  - Thousands of itemsets
- Key intuition
  - The set intersection algorithm developed also allows computation of intermediate sets
  - All parties get fully encrypted sets for all attributes
  - Local computation allows efficient discovery of all association rules

Communication Cost
- k parties, m set size, p frequent attributes
- k*(2k-2) = O(k^2) messages
- p*(2p-2)*m*encrypted message size = O(p^2 m) bits
- k rounds
- Independent of number of itemsets found
Speaker notes: these are big-O estimates. The metric is not whether this is as efficient as the non-privacy-preserving computation; the right question is whether it is sufficiently fast for practical use. People might ask about the cost of the non-secure method: it would be faster, but that is not the right metric of comparison. Consider dropping the non-secure method (especially if giving actual times). The answer needs to be solid and delivered in a non-defensive tone.
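To get a feel for the magnitudes, the two formulas on this slide can be evaluated directly. The function name and the example parameter values (3 parties, 10 frequent attributes, sets of 1000 items, 1024-bit ciphertexts) are hypothetical and chosen only for illustration.

```python
def communication_cost(k, p, m, enc_bits):
    """Evaluate the slide's message-count and bit-count formulas."""
    messages = k * (2 * k - 2)
    bits = p * (2 * p - 2) * m * enc_bits
    return messages, bits

messages, bits = communication_cost(k=3, p=10, m=1000, enc_bits=1024)
print(messages, bits)  # prints 12 184320000
```

Neither quantity depends on how many frequent itemsets are eventually found, which is the point made on the slide.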

Other Results
- ID3 decision tree learning
  - Horizontal partitioning: Lindell & Pinkas ’00
  - Also vertical partitioning (Du, Vaidya)
- Association rules
  - Horizontal partitioning: Kantarcıoğlu
- K-Means / EM clustering
- K-nearest neighbor
- Naïve Bayes, Bayes network structure
- And many more

Challenges
- What do the results reveal?
- A general approach (instead of per data mining technique)
- Experimental results
- Incentive compatibility
Note: upcoming book in the Advances in Information Security series by Springer-Verlag

Questions