Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques.



Outline
Privacy preserving two-party decision tree mining using SMC protocols (Lindell & Pinkas ’00)
Primitive SMC protocols
– Secure sum
– Secure union (encryption based)
– Secure max (probabilistic random response based)
– Secure union (probabilistic and randomization based)
Secure data mining using sub protocols
Random response for privacy preserving data mining or data sanitization

Random response protocols
Multi-round probabilistic protocols
Randomization probability associated with each round
Random response with randomization probability

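Max Protocol – multi-round random response
Multiple rounds
Randomization probability at round r:
– Pr(r) = [formula missing in transcript]
Local algorithm at round r and node i, where g_{i-1}(r) is the value received from the previous node, v_i is the local value of node i, and g_i(r) is the value passed on:
– If g_{i-1}(r) >= v_i: g_i(r) = g_{i-1}(r)
– If g_{i-1}(r) < v_i: with prob. Pr(r), g_i(r) = rand[g_{i-1}(r), v_i); with prob. 1 - Pr(r), g_i(r) = v_i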
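The local algorithm above can be simulated in a few lines. This is only a sketch under stated assumptions: the transcript does not preserve the formula for Pr(r), so the schedule Pr(r) = p0 * d^(r-1) used here is an assumption (chosen to agree with the p0 and d terms on the later analysis slides). The function names, the fixed ring order, and the starting value 0 (assumed to be the domain minimum for non-negative integer data) are likewise hypothetical.

```python
import random

def randomization_probability(r, p0=1.0, d=0.5):
    # ASSUMPTION: Pr(r) = p0 * d**(r-1); the slide's own formula is missing.
    return p0 * d ** (r - 1)

def node_step(g_prev, v_i, pr, rng):
    """Local algorithm at one node: g_prev is the incoming value, v_i the local value."""
    if g_prev >= v_i:
        return g_prev                       # nothing to contribute, pass value on
    if rng.random() < pr:
        return rng.randrange(g_prev, v_i)   # random value in [g_prev, v_i)
    return v_i                              # reveal the true local value

def max_protocol(values, rounds, p0=1.0, d=0.5, seed=0):
    """Pass a running value around a ring of nodes for several rounds."""
    rng = random.Random(seed)
    g = 0  # ASSUMPTION: non-negative integer data, 0 is the domain minimum
    for r in range(1, rounds + 1):
        pr = randomization_probability(r, p0, d)
        for v_i in values:                  # simplified: fixed ring order each round
            g = node_step(g, v_i, pr, rng)
    return g

if __name__ == "__main__":
    data = [3, 17, 8, 42, 5]
    for rounds in (1, 2, 4, 8):
        print(rounds, max_protocol(data, rounds))
```

The running value never exceeds the true maximum, and because Pr(r) shrinks with r, later rounds are increasingly likely to push it up to the exact maximum.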

Max Protocol - Illustration
[Figure: value passed from a start node around the ring of nodes D2, D3, D4]

Min/Max Protocol - Correctness
Precision bound: [formula missing in transcript]
– Converges with r
– Smaller p0 and d provide faster convergence

Min/Max Protocol - Cost
Communication cost
– Single round: O(n)
– Minimum number of rounds given precision guarantee (1-e): [formula missing in transcript]

Min/Max Protocol - Security
Probability/confidence based metric: P(C|IR,R)
– Different types of exposure based on the claim C
  Data value: v_i = a
  Data ownership: V_i contains a
– Change of beliefs, as a difference P(C|IR,R) - P(C|R) or as a ratio P(C|IR,R) / P(C|R)
Relationship to privacy in anonymization
– Change of beliefs P(C|D*,BR) - P(C|BR)
[Figure: scale from Absolute Privacy to Provable Exposure]

Min/Max Protocol – Security (Analysis)
Upper bound for average expected change of beliefs: $\max_r \frac{1}{2^{r-1}} (1 - p_0 d^{r-1})$
Larger p0 and d provide better privacy
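To see how the bound behaves, it can be evaluated numerically. The sketch below assumes the bound reads max over r of (1 - p0 * d^(r-1)) / 2^(r-1), which is how the garbled expression on the slide is interpreted here. The function name and the cap on the number of rounds examined are hypothetical.

```python
def belief_change_bound(p0, d, max_rounds=50):
    """Evaluate max_r (1 - p0 * d**(r-1)) / 2**(r-1) over r = 1..max_rounds."""
    return max(
        (1 - p0 * d ** (r - 1)) / 2 ** (r - 1)
        for r in range(1, max_rounds + 1)
    )

if __name__ == "__main__":
    for p0, d in [(1.0, 0.5), (0.5, 0.5), (1.0, 1.0), (0.0, 0.5)]:
        print(f"p0={p0}, d={d}: bound={belief_change_bound(p0, d)}")
```

Under this reading, the bound falls as p0 and d grow, which matches the slide's remark that larger p0 and d give better privacy.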

Min/Max Protocol – Security (Experiments)
Loss of privacy decreases with increasing number of nodes
Probabilistic protocol achieves better privacy (close to 0)
When n is large, anonymous protocol is actually okay!

Union
Commutative encryption based approach
– Number of rounds: 2
– Each round: encryption and decryption
Multi-round random-response approach?
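A toy illustration of the commutative-encryption idea: every item is encrypted under all parties' keys, duplicates are detected on the fully encrypted values, and the result is then decrypted by all parties. The cipher here is an exponentiation cipher (x^k mod p), picked only because it commutes; the slides do not name a scheme, so that choice, the prime, and all function names are assumptions. It is not a secure implementation, and this single-process simulation holds all keys in one place, which real parties would not do.

```python
import random
from math import gcd

P = 2 ** 127 - 1  # a known Mersenne prime; toy parameter, not a security recommendation

def keygen(rng):
    # The key must be invertible modulo P - 1 so that decryption exists.
    while True:
        k = rng.randrange(3, P - 1)
        if gcd(k, P - 1) == 1:
            return k

def encrypt(x, k):
    return pow(x, k, P)

def decrypt(c, k):
    return pow(c, pow(k, -1, P - 1), P)  # modular inverse needs Python 3.8+

def encode(item):
    # ASSUMPTION: items are non-empty ASCII strings of at most 15 bytes.
    return int.from_bytes(item.encode("ascii"), "big")

def decode(x):
    return x.to_bytes((x.bit_length() + 7) // 8, "big").decode("ascii")

def secure_union(item_sets, seed=0):
    rng = random.Random(seed)
    keys = [keygen(rng) for _ in item_sets]   # one private key per party
    # Round 1: each item ends up encrypted by every party (order is irrelevant).
    encrypted = set()
    for items in item_sets:
        for item in items:
            c = encode(item)
            for k in keys:
                c = encrypt(c, k)
            encrypted.add(c)                  # equal items collide here
    # Round 2: every party removes its layer of encryption.
    union = set()
    for c in encrypted:
        for k in keys:
            c = decrypt(c, k)
        union.add(decode(c))
    return union

if __name__ == "__main__":
    print(sorted(secure_union([{"a", "b"}, {"b", "c"}, {"c", "d"}])))
```

Because encryption commutes, the same item yields the same ciphertext whichever party encrypts first, so duplicates can be dropped without revealing which party contributed them.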

Vector
Each database has a boolean vector of the data items
Union vector is a logical OR of all vectors
[Figure: bit vectors (b1, b2, …, bL) of nodes p1, p2, …, pc combined by OR into the union vector V_G]
Privacy Preserving Indexing of Documents on the Network, Bawa, 2003

Group Vector Protocol
Processing of V_G' at node p_s in round r:
P_ex = 1/2^r, P_in = 1 - P_ex
for (i = 1; i <= L; i++)
  if (V_s[i] = 1 and V_G'[i] = 0) set V_G'[i] = 1 with prob. P_in
  if (V_s[i] = 0 and V_G'[i] = 1) set V_G'[i] = 0 with prob. P_ex
Example: r = 1 gives P_ex = 1/2, P_in = 1/2; r = 2 gives P_ex = 1/4, P_in = 3/4
[Figure: V_G' passed through nodes p1, p2, …, pc, which hold local vectors v1, v2, …, vc]
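The per-node processing rule can be tried out with a small simulation. This is a sketch, not the protocol from the cited paper: the all-zero initial group vector, the fixed node order, and the function names are assumptions made for illustration.

```python
import random

def process_vector(v_group, v_site, r, rng):
    """Update the group vector at one node in round r (P_ex = 1/2**r, P_in = 1 - P_ex)."""
    p_ex = 1 / 2 ** r
    p_in = 1 - p_ex
    for i in range(len(v_group)):
        if v_site[i] == 1 and v_group[i] == 0:
            if rng.random() < p_in:
                v_group[i] = 1      # include a locally present item
        elif v_site[i] == 0 and v_group[i] == 1:
            if rng.random() < p_ex:
                v_group[i] = 0      # exclude an item this node does not hold
    return v_group

def group_vector_union(site_vectors, rounds, seed=0):
    rng = random.Random(seed)
    # ASSUMPTION: the group vector starts as all zeros.
    v_group = [0] * len(site_vectors[0])
    for r in range(1, rounds + 1):
        for v_site in site_vectors:  # simplified: same node order every round
            process_vector(v_group, v_site, r, rng)
    return v_group

if __name__ == "__main__":
    sites = [[1, 0, 0, 1, 0], [0, 0, 1, 1, 0], [0, 1, 0, 0, 0]]
    for rounds in (1, 2, 5, 10):
        print(rounds, group_vector_union(sites, rounds))
```

In early rounds the vector a node sees is noisy, which hides individual contributions. As r grows, P_ex tends to 0 and P_in to 1, so the group vector settles on the logical OR of all local vectors.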

Random Shares based Secure Union
Phase 1: random item addition
– Multiple rounds with permuted ring
– Each node sends a random share of its item set and a random share of a random item set
Phase 2: random item removal
– Each node subtracts its random item set

Random Shares based Secure Union - Analysis
Item exposure attack
– An adversary makes a claim C about a particular item that node i contributes to the final result (C: vi in xi)
Set exposure attack
– An adversary makes a claim C about the whole set of items that node i contributes to the final union result X (C: xi = ai)
Change of beliefs (posterior probability versus prior probability)
– P(C|IR,X) - P(C|X)
– P(C|IR,X) / P(C|X)

Exposure Risk – Set Exposure
Disclosure decreases with increasing number of generated random items and increasing number of participating nodes
Set exposure risk is at or close to 0 for the probabilistic and crypto approaches

Exposure Risk – Item Exposure
Item exposure risk decreases with increasing number of generated random items and participating nodes
Item exposure risk for the probabilistic approach is quite high

Cost Comparison
Commutative protocol and anonymous communication protocol are efficient but sensitive to union size
Probabilistic protocol is efficient but sensitive to domain size
Estimated runtime for the general circuit-based protocol implemented with the FairplayMP framework is 15 days, 127 days, and 1.4 years for the domain sizes tested

Open issues
Tradeoff between accuracy, efficiency, and security
How to quantify security
How to design adjustable protocols
Can we generalize the random-response algorithms and randomization algorithms for operators based on their properties?
– Operators: sum, union, max, min, …
– Properties: commutative, associative, invertible, randomizable

Data Mining on Horizontally Partitioned Data
Specific secure tools: Secure Sum, Secure Comparison, Secure Union, Secure Logarithm, Secure Poly. Evaluation
Data mining tasks: Association Rule Mining, Decision Trees, EM Clustering, Naïve Bayes Classifier

Data Mining on Vertically Partitioned Data
Specific secure tools: Secure Comparison, Secure Set Intersection, Secure Dot Product, Secure Logarithm, Secure Poly. Evaluation
Data mining tasks: Association Rule Mining, Decision Trees, K-means Clustering, Naïve Bayes Classifier, Outlier Detection

Summary of SMC Based PPDDM
Mainly used for distributed data mining.
Efficient, specific cryptographic solutions have been developed for many distributed data mining problems.
Random response or randomization based protocols offer a tradeoff between accuracy, efficiency, and security.
Mainly rely on the semi-honest assumption (i.e., parties follow the protocols).

Ongoing research
New models that offer a better trade-off between efficiency and security
Game theoretic / incentive issues in PPDM

Outline
Privacy preserving two-party decision tree mining using SMC protocols (Lindell & Pinkas ’00)
Primitive SMC protocols
– Secure sum
– Secure union (encryption based)
– Secure max (probabilistic random response based)
– Secure union (probabilistic and randomization based)
Secure data mining using sub protocols
Random response for privacy preserving data mining or data collection

Data Collection Model
Data cannot be shared directly because of privacy concerns

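Randomized Response
Question: Do you smoke? (The true answer is “Yes”.)
The respondent flips a biased coin: one outcome leads to the true answer “Yes”, the other to the false answer “No”.
[Figure: coin outcomes (Head, Tail) mapped to answers; the coin bias is missing in transcript]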
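The aggregate can still be estimated from the randomized answers. The sketch below uses the classic Warner-style estimator: if each respondent tells the truth with probability theta (theta != 0.5), the observed "Yes" rate is lambda = theta * pi + (1 - theta) * (1 - pi), so pi = (lambda - (1 - theta)) / (2 * theta - 1). The value theta = 0.7 and the function names are assumptions for illustration, since the slide's coin bias is not preserved.

```python
import random

def randomized_answer(truth, theta, rng):
    """Report the true answer with probability theta, otherwise the opposite."""
    return truth if rng.random() < theta else not truth

def simulate_survey(true_answers, theta, seed=0):
    """Return the observed fraction of 'Yes' answers after randomization."""
    rng = random.Random(seed)
    reports = [randomized_answer(t, theta, rng) for t in true_answers]
    return sum(reports) / len(reports)

def estimate_proportion(observed_yes_rate, theta):
    """Invert lambda = theta*pi + (1-theta)*(1-pi) to recover the true 'Yes' rate pi."""
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 makes the answers uninformative")
    return (observed_yes_rate - (1 - theta)) / (2 * theta - 1)

if __name__ == "__main__":
    population = [True] * 7000 + [False] * 13000   # true 'Yes' rate is 0.35
    observed = simulate_survey(population, theta=0.7)
    print("observed:", observed, "estimated:", estimate_proportion(observed, 0.7))
```

No individual answer can be taken at face value, yet the population-level rate is recoverable. That is the privacy/utility split the later slides quantify.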

Randomized Response
Multiple attributes encoded in bits
Biased coin: one outcome leads to the true answer E: 110, the other to the false answer !E: 001 (all bits flipped)
Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
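For the multi-attribute case, the same inversion works on bit patterns. Assuming each respondent sends the true pattern E with probability theta and its bitwise complement !E otherwise, the observed frequencies satisfy P*(E) = theta * P(E) + (1 - theta) * P(!E) and P*(!E) = theta * P(!E) + (1 - theta) * P(E), which gives P(E) = (theta * P*(E) - (1 - theta) * P*(!E)) / (2 * theta - 1). The function names and theta value below are illustrative assumptions.

```python
import random

def complement(bits):
    """Flip every bit of a pattern given as a string, e.g. '110' -> '001'."""
    return "".join("1" if b == "0" else "0" for b in bits)

def randomize(bits, theta, rng):
    """Send the true pattern with probability theta, else its complement."""
    return bits if rng.random() < theta else complement(bits)

def estimate_pattern_frequency(reports, pattern, theta):
    """Estimate the true frequency of `pattern` from randomized reports."""
    n = len(reports)
    p_star = reports.count(pattern) / n
    p_star_comp = reports.count(complement(pattern)) / n
    return (theta * p_star - (1 - theta) * p_star_comp) / (2 * theta - 1)

if __name__ == "__main__":
    rng = random.Random(0)
    truth = ["110"] * 6000 + ["010"] * 4000        # true frequency of '110' is 0.6
    reports = [randomize(t, 0.7, rng) for t in truth]
    print(estimate_pattern_frequency(reports, "110", 0.7))
```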

Generalization for Multi-Valued Categorical Data
True value S_i is reported as S_i, S_{i+1}, S_{i+2}, S_{i+3} with probabilities q1, q2, q3, q4
These probabilities form the RR matrix M

A Generalization
RR Matrices: [Warner 65], [R. Agrawal 05], [S. Agrawal 05]
RR Matrix can be arbitrary
Can we find optimal RR matrices?
OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
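With an arbitrary RR matrix, the estimation step becomes a linear system. The sketch below assumes the convention M[j][i] = P(report value j | true value i), so the distribution of reports is p* = M p, and the true distribution is recovered by solving for p. The convention, the helper names, and the small pure-Python solver are assumptions for illustration, not taken from the cited papers.

```python
def apply_rr(matrix, true_dist):
    """Distribution of reported values: p*[j] = sum_i M[j][i] * p[i]."""
    return [sum(m_ji * p_i for m_ji, p_i in zip(row, true_dist)) for row in matrix]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular RR matrix: true distribution not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def estimate_true_distribution(matrix, observed_dist):
    """Invert p* = M p to estimate the true distribution p."""
    return solve(matrix, observed_dist)

if __name__ == "__main__":
    # Warner's scheme as a 2x2 RR matrix with truth probability 0.7.
    warner = [[0.7, 0.3], [0.3, 0.7]]
    observed = apply_rr(warner, [0.35, 0.65])
    print(observed, estimate_true_distribution(warner, observed))
```

A matrix close to the identity reveals more about individuals but gives accurate estimates. A matrix close to uniform hides individuals but is nearly singular, so the estimate becomes unstable.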

What is an optimal matrix? Which of the following is better?

What is an optimal matrix?
Which of the following is better?
– Privacy: M2 is better
– Utility: M1 is better
So, what is an optimal matrix?

Optimal RR Matrix
An RR matrix M is optimal if no other RR matrix has both better privacy and better utility than M (i.e., no other matrix dominates M).
– Privacy Quantification
– Utility Quantification
A number of privacy and utility metrics have been proposed.
– Privacy: how accurately one can estimate individual info.
– Utility: how accurately we can estimate aggregate info.
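The dominance test in this definition is easy to state in code. The sketch below assumes each candidate matrix has already been scored as a (privacy, utility) pair where higher is better. The scores and names are made up for illustration. Following the slide's wording, one matrix dominates another only if it is strictly better on both objectives; the more common Pareto definition also counts a tie on one objective.

```python
def dominates(a, b):
    """True if score a = (privacy, utility) is strictly better than b on both."""
    return a[0] > b[0] and a[1] > b[1]

def optimal_set(scores):
    """Names of the candidates that no other candidate dominates."""
    return {
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    }

if __name__ == "__main__":
    # Hypothetical (privacy, utility) scores, higher is better.
    scores = {"M1": (0.2, 0.9), "M2": (0.8, 0.3), "M3": (0.1, 0.2)}
    print(sorted(optimal_set(scores)))
```

Here M1 and M2 are both optimal: each wins on one objective, so neither dominates the other, while M3 loses to M1 on both.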

Optimization Methods
Approach 1: Weighted sum: $w_1 \cdot \text{Privacy} + w_2 \cdot \text{Utility}$
Approach 2
– Fix Privacy, find M with the optimal Utility.
– Fix Utility, find M with the optimal Privacy.
– Challenge: Difficult to generate M with a fixed privacy or utility.
Proposed Approach: Multi-Objective Optimization

Optimization algorithm
Evolutionary Multi-Objective Optimization (EMOO)
The algorithm
– Start with a set of initial RR matrices
– Repeat the following steps in each iteration
  Mating: select two RR matrices from the pool
  Crossover: exchange several columns between the two RR matrices
  Mutation: change some values in an RR matrix
  Meet the privacy bound: filter the resultant matrices
  Evaluate the fitness value for the new RR matrices
Note: the fitness value is defined in terms of privacy and utility metrics
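The crossover and mutation steps can be sketched as follows. This covers only those two operators, not the full EMOO loop with privacy filtering and fitness evaluation. The representation (a matrix stored as a list of columns, each column a probability distribution over reported values), the renormalization after mutation, and all names and parameters are assumptions for illustration.

```python
import random

def crossover(cols_a, cols_b, rng):
    """Exchange a random, non-empty subset of columns between two RR matrices."""
    n = len(cols_a)                      # requires n >= 2
    k = rng.randint(1, n - 1)            # how many columns to swap
    swap = set(rng.sample(range(n), k))
    child_a = [cols_b[i][:] if i in swap else cols_a[i][:] for i in range(n)]
    child_b = [cols_a[i][:] if i in swap else cols_b[i][:] for i in range(n)]
    return child_a, child_b

def mutate(cols, rng, scale=0.1):
    """Perturb one entry of one column, then renormalize that column to sum to 1."""
    cols = [c[:] for c in cols]          # leave the parent untouched
    i = rng.randrange(len(cols))
    j = rng.randrange(len(cols[i]))
    cols[i][j] = max(0.0, cols[i][j] + rng.uniform(-scale, scale))
    total = sum(cols[i])
    cols[i] = [x / total for x in cols[i]]
    return cols

if __name__ == "__main__":
    rng = random.Random(0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    uniform = [[1 / 3] * 3 for _ in range(3)]
    child_a, child_b = crossover(identity, uniform, rng)
    print(mutate(child_a, rng))
```

Swapping whole columns keeps every child a valid RR matrix without extra work, since each column remains a probability distribution. Mutation has to renormalize to preserve that property.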

Illustration

Output of Optimization
[Figure: candidate matrices M1 to M8 plotted in the objective space, with Privacy and Utility axes each running from Worse to Better]
The optimal set is often plotted in the objective space as a Pareto front.

[Figure: results for the first attribute of the Adult data]

Summary
Privacy preserving data mining
– Secure multi-party computation protocols
– Random response techniques for computation and data collection
Knowledge sensitive data mining