Tools for Privacy Preserving Distributed Data Mining

1 Tools for Privacy Preserving Distributed Data Mining
By Michael Holmes

2 Why Private Data Mining
The CDC may want to use data mining techniques to identify trends in disease outbreaks. Insurance companies have useful data but can’t disclose it because of privacy concerns. Is there a way to obtain this data without revealing the identity of the patients?

3 Private Data Mining Techniques
Secure Sum
Secure Set Union
Secure Size of Set Intersection
Scalar Product

4 Private Data Mining Toolkit
Association Rules in horizontally partitioned data
Association Rules in vertically partitioned data
EM Clustering

5 Secure Sum
Securely compute the sum of the values held in individual databases. Site 1 randomly generates a number R, adds it to each of its N values, and sends the masked values to Site 2. Site 2 adds each of its own values to the values received from Site 1 and returns a single number to Site 1. Site 1 then subtracts R N times (that is, N × R) to recover the correct sum.
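
The two-site exchange described on this slide can be sketched in Python as follows. This is a minimal illustration, not a hardened protocol: the function names and the range of R are assumptions made for the example, and both "sites" run in one process.

```python
import secrets


def site1_mask(values, r):
    """Site 1: add the same random number r to each of its values."""
    return [v + r for v in values]


def site2_combine(masked_values, own_values):
    """Site 2: add its own values to the masked values and return one number."""
    return sum(masked_values) + sum(own_values)


def site1_unmask(total, r, n):
    """Site 1: subtract r once for each of its n masked values."""
    return total - n * r


def secure_sum(site1_values, site2_values):
    # The range of r is an assumption for this sketch.
    r = secrets.randbelow(10**9)
    masked = site1_mask(site1_values, r)          # sent from Site 1 to Site 2
    total = site2_combine(masked, site2_values)   # returned to Site 1
    return site1_unmask(total, r, len(site1_values))


print(secure_sum([3, 5, 7], [10, 20]))
```

Because the same R is added to every value in this sketch, Site 2 can still see the differences between Site 1's values, so it hides the magnitudes only.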

6 Secure Sum

7 Secure Set Union

8 Secure Size of Set Intersection
Requires commutative encryption. Every party encrypts its data and then sends it to another party. The next party also encrypts the already encrypted data. After every party has encrypted the data from every other party, the fully encrypted items are shared. Because the encryption is commutative, items held by more than one party end up as identical ciphertexts. Count the duplicates and you know the size of the intersection.
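
A toy two-party sketch of this idea is shown below. It assumes modular exponentiation as the commutative cipher, since (x^a)^b = (x^b)^a mod p; the slides do not name a specific scheme. The modulus, the hashing step, and all function names are illustrative assumptions, and the parameters are far too small for real use.

```python
import hashlib
import math
import secrets

# Assumed toy modulus: 2**127 - 1 is a Mersenne prime. Illustration only.
P = 2**127 - 1


def keygen():
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def to_group(item):
    """Hash a string item to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(value, key):
    """Commutative encryption by exponentiation modulo P."""
    return pow(value, key, P)


def intersection_size(items_a, items_b):
    key_a, key_b = keygen(), keygen()
    # Each party encrypts its own items with its own key.
    a_once = [encrypt(to_group(i), key_a) for i in set(items_a)]
    b_once = [encrypt(to_group(i), key_b) for i in set(items_b)]
    # Each party then encrypts the other party's ciphertexts.
    a_twice = {encrypt(c, key_b) for c in a_once}
    b_twice = {encrypt(c, key_a) for c in b_once}
    # Common items produce identical doubly encrypted values.
    return len(a_twice & b_twice)


print(intersection_size(["flu", "measles", "mumps"], ["mumps", "flu", "rubella"]))
```

Neither party sees the other's plaintext items in this sketch, but each does see how many ciphertexts the other sent.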

9 Scalar Product
Want to compute the sum of x_i * y_i over all i, where one database holds the x values and the other holds the y values. Use linear combinations of random numbers to disguise the elements, then remove them computationally once you have the result.
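
The disguise-and-remove idea can be sketched as below. This is a simplified illustration and not the full protocol: here the second site returns a few linear combinations of its own vector in the clear, which a complete scheme would also disguise. The names `coeffs`, `rand`, and `num_random`, and the ranges of the random numbers, are assumptions made for the example.

```python
import random


def alice_disguise(x, coeffs, rand):
    """Add a linear combination of random numbers to each element of x."""
    return [xi + sum(c * r for c, r in zip(row, rand))
            for xi, row in zip(x, coeffs)]


def bob_respond(x_disguised, y, coeffs):
    """Return the disguised product plus one combination of y per random number."""
    s = sum(xd * yi for xd, yi in zip(x_disguised, y))
    t = [sum(row[j] * yi for row, yi in zip(coeffs, y))
         for j in range(len(coeffs[0]))]
    return s, t


def alice_recover(s, t, rand):
    """Remove the contribution of the random numbers from the result."""
    return s - sum(r * tj for r, tj in zip(rand, t))


def scalar_product(x, y, num_random=2, seed=None):
    rng = random.Random(seed)
    # Coefficient matrix known to both sites (assumption of this sketch).
    coeffs = [[rng.randint(1, 100) for _ in range(num_random)] for _ in x]
    # Random numbers known only to the first site.
    rand = [rng.randint(1, 10**6) for _ in range(num_random)]
    x_disguised = alice_disguise(x, coeffs, rand)
    s, t = bob_respond(x_disguised, y, coeffs)
    return alice_recover(s, t, rand)


print(scalar_product([1, 2, 3, 4], [5, 6, 7, 8]))
```

Integer arithmetic keeps the cancellation exact, so the recovered value equals the plain scalar product.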

10 Association Rules in Horizontally Partitioned Data
Candidate Set Generation
Local Pruning
Itemset Exchange (Secure Union step here)
Support Count Exchange

11 Association Rules in Vertically Partitioned Data
Uses the scalar product to determine whether the count of an itemset is greater than a threshold. If the count is above the threshold, you have determined that the database is worth querying. Can also use Secure Size of Set Intersection to see how much the sites have in common. Useful with algorithms such as Apriori.
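
To see why a scalar product gives the itemset count, suppose each site holds one 0/1 column per item, with rows aligned by transaction. The sketch below computes the product in the clear purely to show the reduction; in the private setting the sum would be obtained through the secure scalar product instead. The function names and sample columns are hypothetical.

```python
def support_count(column_a, column_b):
    """Count transactions containing both items: the scalar product of two 0/1 columns."""
    return sum(a * b for a, b in zip(column_a, column_b))


def exceeds_threshold(column_a, column_b, threshold):
    """Check whether the itemset count is greater than the threshold."""
    return support_count(column_a, column_b) > threshold


# Hypothetical data: one row per transaction, one column per site.
site_a_item = [1, 0, 1, 1, 0]
site_b_item = [1, 1, 1, 0, 0]

print(support_count(site_a_item, site_b_item))
print(exceeds_threshold(site_a_item, site_b_item, 1))
```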

12 EM Clustering
Uses secure sum to compute a global value aggregated over all sites involved. Once the global sum is computed, it can be used in the expectation-maximization (EM) method to generate statistical models.
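
As one possible illustration, the sketch below updates a single cluster mean from per-site totals. It assumes each site holds one-dimensional points with a membership weight per point, which the slides do not specify, and the plain `sum` calls stand in for the secure sum. All names are hypothetical.

```python
def local_statistics(points, weights):
    """Per-site totals for one cluster: weighted sum of points and sum of weights."""
    weighted_sum = sum(w * x for w, x in zip(weights, points))
    total_weight = sum(weights)
    return weighted_sum, total_weight


def global_mean(site_stats):
    """Combine per-site totals into the cluster mean."""
    # In the private setting, each of these two sums would be a secure sum.
    total_weighted = sum(s for s, _ in site_stats)
    total_weight = sum(w for _, w in site_stats)
    return total_weighted / total_weight


site1 = local_statistics([1.0, 3.0], [1.0, 1.0])
site2 = local_statistics([5.0], [1.0])
print(global_mean([site1, site2]))
```

Only the two totals leave each site, so the individual points stay local.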

14 Things to Note
These algorithms are not fully private; some information is learned in the process. For example, in the set intersection, sites can potentially learn the size of each database. Make sure to pick the appropriate algorithms for what you need to accomplish. Watch out for intermediate information being leaked!

15 Thank you

