Cryptographic methods for privacy aware computing: applications.

Outline
 Review: three basic methods
 Two applications
– Distributed decision tree over horizontally partitioned data
– Distributed k-means over vertically partitioned data

Three basic methods
 1-out-of-K Oblivious Transfer
 Random shares
 Homomorphic encryption
* Cost is the major concern.

Two example protocols
 The basic idea:
– Do not release the original data.
– Exchange only intermediate results.
 Apply the three basic methods to combine the intermediate results securely.

Building decision trees over horizontally partitioned data
 Horizontally partitioned data
 Entropy-based information gain
 Major ideas in the protocol

Horizontally Partitioned Data
 A table with a key and d attributes X1…Xd; each of the r sites holds a disjoint subset of the rows:
– Site 1 holds records k1…ki, Site 2 holds records k(i+1)…kj, …, Site r holds records k(m+1)…kn.

Review: the decision tree algorithm (ID3)
 Find the cut that maximizes information gain:
– for an attribute Ai with sorted values v1…vn, a cut partitions the records at a certain value;
– for categorical data the test is Ai = vi; for numerical data it is Ai < vi.
 E(): entropy of the label distribution.
 Choose the attribute/value pair that gives the highest gain, then recurse on each branch (Ai < vi? yes/no, then Aj < vj?, …).
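As a concrete (non-private) reference point, the entropy-based gain described above can be sketched as follows; the function and variable names here are illustrative, not from the protocol itself:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, cut):
    """Information gain of the numerical test 'value < cut'."""
    left = [l for v, l in zip(values, labels) if v < cut]
    right = [l for v, l in zip(values, labels) if v >= cut]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

values = [1, 2, 3, 8, 9, 10]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(info_gain(values, labels, cut=5))  # perfect split: gain = 1.0
```

ID3 evaluates this gain for every candidate (Ai, vi) pair and picks the maximum; the privacy-preserving version computes the same quantity without either party revealing its counts.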

Key points
 The core computation is the entropy, which reduces to evaluating x log x, where x is the sum of values held by the two parties P1 and P2, i.e., x = x1 + x2.
 The computation is decomposed into several steps; at each step, each party learns only a random share of the result.

Steps
Step 1: compute random shares w1, w2 such that w1 + w2 = (x1 + x2) ln(x1 + x2).
* A dedicated sub-protocol is used to compute ln(x1 + x2).
Step 2: for a candidate condition (Ai, vi), find the random shares of E(S), E(S1), and E(S2), respectively.
Step 3: repeat steps 1 and 2 for all possible (Ai, vi) pairs.
Step 4: use a circuit gate to determine which (Ai, vi) pair yields the maximum gain.
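The share structure in Step 1 can be illustrated with a toy "dealer's-eye view". This is only a sketch of what the shares look like, not the actual protocol: the real computation of ln(x1 + x2) (Lindell & Pinkas) is done obliviously, without any party seeing x1 + x2.

```python
import math
import random

def share_xlnx(x1, x2):
    """Toy sketch: produce additive shares w1, w2 with
    w1 + w2 == (x1 + x2) * ln(x1 + x2).
    A real protocol computes this obliviously; this dealer view
    only shows that each share alone is a uniformly random value
    and reveals nothing about x1 + x2."""
    x = x1 + x2
    w = x * math.log(x)
    w1 = random.uniform(-1e6, 1e6)  # random mask held by P1
    w2 = w - w1                     # complementary share held by P2
    return w1, w2

w1, w2 = share_xlnx(40, 60)
print(w1 + w2)  # == 100 * ln(100), recoverable only by combining shares
```

Each entropy term in Step 2 is assembled from such shares, so neither party ever learns the other's partial counts.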

2. K-means over vertically partitioned data
 Vertically partitioned data
 The standard k-means algorithm
 Applying secure sum and secure comparison among multiple sites in the secure distributed algorithm

Vertically Partitioned Data
 A table with a key and d attributes; each of the r sites holds a disjoint subset of the columns:
– Site 1 holds X1…Xi, Site 2 holds X(i+1)…Xj, …, Site r holds X(m+1)…Xd.

Motivation
 Naïve approach: send all data to a trusted site and run k-means there.
– Costly
– Requires a trusted third party
 Preferable: a distributed, privacy-preserving k-means.

Basic k-means algorithm
Four main steps:
Step 1. Randomly select k initial cluster centers (the k means).
Repeat:
Step 2. Assign each point to its closest cluster center.
Step 3. Recalculate the k means from the new point assignment.
Until (Step 4) the k means no longer change.
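The four steps above can be sketched as plain, centralized k-means (no privacy yet); this is the baseline the distributed protocol decomposes:

```python
import random

def kmeans(points, k, iters=100):
    """Plain (non-private) k-means; points are tuples of floats."""
    centers = random.sample(points, k)              # step 1: random initial means
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # step 2: assign to closest center
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]    # step 3: recompute means
        if new == centers:                          # step 4: stop when means stabilize
            break
        centers = new
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))
```

In the vertically partitioned setting, each site runs the pieces of steps 2 and 3 that touch its own columns, and only the assignment decision is computed jointly.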

Distributed k-means
 Why k-means works over vertically partitioned data:
– All four steps are decomposable!
– The most costly parts (steps 2 and 3) can be done locally.
 We focus on step 2 (assign each point to its closest cluster center).

Step 1
 All sites share the indices of k initial random records as the centroids µ1…µk; each site holds the components of each centroid corresponding to its own attributes (Site 1 holds µ11…µ1i through µk1…µki, Site 2 holds µ1(i+1)…µ1j through µk(i+1)…µkj, …, Site r holds µ1(m+1)…µ1d through µk(m+1)…µkd).

Step 2
 Assign each point x to its closest cluster center:
1. Calculate the distance of point X = (X1, X2, …, Xd) to each cluster center µk; each distance calculation is decomposable!
d² = [(X1 − µk1)² + … + (Xi − µki)²] + [(X(i+1) − µk(i+1))² + … + (Xj − µkj)²] + …
– The first bracket is Site 1's partial distance d1, the second is Site 2's d2, and so on: d² = d1 + d2 + …
2. Compare the k full distances to find the minimum one.
 For each X, each site i holds a k-element vector of its partial distances to the k centroids, denoted Xi.
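The decomposability claim is simple to verify: the squared Euclidean distance is a sum over attributes, so each site can sum its own attributes independently. A minimal sketch (attribute split points chosen for illustration):

```python
def partial_sq_dist(x_part, mu_part):
    """Squared distance over one site's slice of the attributes."""
    return sum((a - b) ** 2 for a, b in zip(x_part, mu_part))

x = [1.0, 2.0, 3.0, 4.0]    # full record
mu = [0.0, 0.0, 1.0, 1.0]   # full centroid
# Site 1 holds attributes 0-1, Site 2 holds attributes 2-3
d1 = partial_sq_dist(x[:2], mu[:2])
d2 = partial_sq_dist(x[2:], mu[2:])
full = partial_sq_dist(x, mu)
print(d1 + d2 == full)  # True: the squared distance decomposes by site
```

The privacy problem is therefore not *computing* d1, d2, … but *combining and comparing* them without revealing the individual terms.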

Privacy concerns for step 2
 Concerns:
– The partial distances d1, d2, … may breach privacy (they reveal the Xi and µki); they must be hidden.
– The distance of a point to each cluster may breach privacy; it must be hidden as well.
 Basic ideas to ensure security:
– Disguise the partial distances.
– Compare distances so that only the comparison result is learned.
– Permute the order of the clusters so the real meaning of the comparison results is unknown.
– Requires 3 non-colluding sites (P1, P2, Pr).

Secure Computing of Step 2
 Stage 1: prepare for the secure sum of partial distances.
– P1 generates random k-element vectors with V1 + V2 + … + Vr = 0; Vi hides the partial-distance vector of site i.
– Homomorphic encryption performs the randomization: Ei(Xi)·Ei(Vi) = Ei(Xi + Vi).
 Stage 2: calculate the secure sum over r − 1 parties.
– P1, P3, P4, …, P(r−1) send their perturbed and permuted partial distances to Pr.
– Pr sums up the r − 1 partial distances (including its own part).
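The key cancellation property of Stage 1 and 2 can be sketched without the encryption layer. The masks V1…Vr sum to zero, so the masked vectors still sum to the true totals while each individual masked vector looks random. (In the actual protocol the masks are applied under homomorphic encryption, Ei(Xi)·Ei(Vi) = Ei(Xi + Vi); this sketch shows only the arithmetic.)

```python
import random

def zero_sum_masks(r, k):
    """P1 generates r random k-vectors V1..Vr with V1 + ... + Vr = 0."""
    masks = [[random.uniform(-100, 100) for _ in range(k)] for _ in range(r - 1)]
    last = [-sum(col) for col in zip(*masks)]  # forces the column sums to zero
    return masks + [last]

# r = 3 sites, k = 2 clusters: each site masks its partial-distance vector
partials = [[5.0, 1.0], [2.0, 2.0], [1.0, 3.0]]   # true column sums: [8.0, 6.0]
masks = zero_sum_masks(3, 2)
masked = [[p + v for p, v in zip(ps, vs)] for ps, vs in zip(partials, masks)]
totals = [sum(col) for col in zip(*masked)]
print(totals)  # masks cancel: only the summed distances are revealed
```

Pr, who adds up the masked vectors, learns the k total distances but none of the individual partial distances.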

Secure Computing of Step 2 (notation)
* Xi contains the partial distances to the k partial centroids at site i.
* Ei(Xi)·Ei(Vi) = Ei(Xi + Vi): homomorphic encryption; Ei is a public key.
* π(Xi): a permutation function that perturbs the order of the elements in Xi.
* V1 + V2 + … + Vr = 0; Vi hides the partial distances.

 Stage 3: secure_add_and_compare finds the minimum distance.
– Involves only Pr and P2.
– Uses a standard secure multiparty computation protocol (k − 1 comparisons) to find the result.
 Stage 4: the index of the minimum distance (the permuted cluster id) is sent back to P1.
– P1 knows the permutation function and thus the original cluster id.
– P1 broadcasts the cluster id to all parties.

Step 3 can also be done locally
 Each site updates its partial means µi locally according to the new cluster assignments.

Extra communication cost
 O(nrk)
– n: number of records
– r: number of parties
– k: number of means
 Also depends on the number of iterations.

Conclusion
 Cryptographic privacy-preserving protocols are appealing.
 Cost is the major concern; it can be reduced with novel algorithms.