Mining Multiple Private Databases: "Top-k Queries Across Multiple Private Databases" (2005) and "Mining Multiple Private Databases Using a kNN Classifier" (2007). Li Xiong (Emory University), Subramanyam Chitti (GA Tech), Ling Liu (GA Tech). Presented by: Chris Baker
2 About Me: 4th-year undergraduate, graduating December 2009; interests in data management and software engineering; from Macon, GA; (very) small business owner (web development); enjoys SCUBA diving and travel
3 Outline Intro. & Motivation Problem Definition Important Concepts & Examples kNN Algorithm kNN Experiment Conclusion
4 Introduction: ↓ technological restrictions on information sharing → ↑ need for distributed data-mining tools that preserve privacy. Trade-off: Privacy vs. Accuracy vs. Efficiency
5 Motivating Scenarios: The CDC needs to study insurance data (disease incidents, disease seriousness, patient background) to detect disease outbreaks, but legal and commercial problems prevent release of policyholders' information
6 Motivating Scenarios (cont'd): Industrial trade group collaboration. Useful pattern: "manufacturing processes using chemical supplies from supplier X have high failure rates." Trade secret: "manufacturing process Y gives a low failure rate."
7 Problem & Assumptions: Model: n nodes with horizontally partitioned data. Semi-honesty assumed: nodes follow the specified protocol, but may attempt to learn additional information about other nodes
8 Challenges Why not use a Trusted Third Party (TTP)? Difficult to find one that is trusted Increased danger from single point of compromise Why not use secure multi-party computation techniques? High communication overhead Feasible for small inputs only
9 Recall Our 3-D Goal Privacy Accuracy Efficiency
10 Important Concepts: Successor, Multi-round, Local Computation, Randomization (Probabilistic Protocols). [Diagram: from a starting point, nodes holding private data D1, D2, D3, …, Dn form a ring; each node's local computation takes its predecessor's output as input and passes its own output to its successor.]
11 Multi-Round Protocol Examples Primitive Min/Max Top K Sum Union Join/Intersection Complex kNN Classification K-Means Clustering k-Anonymization
12 Naïve Max: actual data is sent on the first pass, and the static starting point is known
13 Multi-Round Max: randomly perturbed data is passed to the successor during multiple passes, so no successor can determine the actual data of its predecessor; the starting point is randomized
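The multi-round idea can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the truth-telling schedule `p_truth` and the subtractive noise model are hypothetical choices, picked only so that early rounds hide each node's data while the final round still yields the exact maximum.

```python
import random

def multi_round_max(local_values, rounds=3, noise_scale=10.0):
    """Illustrative randomized multi-round max (not the paper's exact
    protocol). Each node holds one private value. In early rounds a node
    may pass on a randomly under-reported value, so its successor cannot
    tell whether the received value is real data or noise."""
    n = len(local_values)
    start = random.randrange(n)       # randomized starting point
    current = float("-inf")           # value passed around the ring
    for r in range(rounds):
        # Hypothetical schedule: probability of sending the true local
        # max grows each round, reaching 1 in the final round.
        p_truth = (r + 1) / rounds
        for i in range(n):
            node = (start + i) % n
            local_max = max(current, local_values[node])
            if random.random() < p_truth:
                current = local_max
            else:
                # Pass a value strictly below the local max, so the
                # successor learns nothing certain about this node.
                current = local_max - random.uniform(0, noise_scale)
    return current
```

Because perturbation only subtracts and the final round is fully truthful, the last pass around the ring restores the exact global maximum while earlier passes reveal only noisy candidates.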
14 kNN Sub-Problems: (1) Nearest-neighbor selection: identify the k nearest neighbors of query x. (2) Classification: each node classifies x, and the nodes cooperate to determine the global classification. Both parts must function in a privacy-preserving manner: nodes should not be able to tell what information comes from any other node, and local classifications must not be revealed during global classification
15 Parameters: Large k: "avoid information leaks." Large d: more randomization, hence more privacy. Small d: more accurate (more deterministic). Large r: "as accurate as ordinary classifier"
16 kNN Classification Algorithm
Input: x, an instance to be classified. Output: classification(x), the classification of x.
1. Each node computes the distance d(x,y) between x and each point y in its database, selects the k smallest distances locally, and stores them in a local distance vector ldv.
2. Using the ldv's as input, the nodes run the privacy-preserving nearest-distance selection protocol to select the k globally nearest distances, stored in gdv.
3. Each node selects the kth nearest distance: ∆ = gdv(k).
4. Assuming there are v classes c_1, …, c_v, each node computes a local classification vector lcv over the points y in its database: lcv(j) = Σ_{y : d(x,y) ≤ ∆} [f(y) = c_j], where f(y) is the classification of point y and [p] evaluates to 1 if predicate p is true and 0 otherwise.
5. Using the lcv's as input, the nodes run the privacy-preserving classification protocol to calculate the global classification vector gcv.
6. Each node assigns classification(x) = c_j, where j = argmax_j gcv(j).
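The steps above can be sketched in plain Python. This is a non-private sketch: the two global merges below stand in for the paper's privacy-preserving selection and classification protocols (which are the actual contribution and are not implemented here), and the one-dimensional distance and node layout in the usage example are illustrative assumptions.

```python
import heapq
from collections import Counter

def knn_classify(nodes, x, k, classes, dist):
    """Plaintext sketch of the slide's kNN steps. Each node is a list of
    (point, label) pairs; `dist` is the distance function d(x, y)."""
    # Step 1: local distance vectors (k smallest distances per node).
    ldvs = [sorted(dist(x, y) for y, _ in db)[:k] for db in nodes]
    # Step 2: global k nearest distances (privately selected in the paper,
    # merged in the clear here).
    gdv = heapq.nsmallest(k, (d for ldv in ldvs for d in ldv))
    # Step 3: the kth nearest distance.
    delta = gdv[k - 1]
    # Steps 4-5: local classification vectors lcv(j) = #{y : d(x,y) <= delta,
    # f(y) = c_j}, summed into the global vector gcv (a plaintext stand-in
    # for the privacy-preserving classification protocol).
    gcv = Counter()
    for db in nodes:
        for y, label in db:
            if dist(x, y) <= delta:
                gcv[label] += 1
    # Step 6: the class with the largest global count.
    return max(classes, key=lambda c: gcv[c])
```

For example, with two hypothetical nodes holding 1-D points, `knn_classify([[(1.0, 'a'), (2.0, 'a')], [(10.0, 'b'), (11.0, 'b'), (1.5, 'a')]], 1.2, 3, ['a', 'b'], lambda a, b: abs(a - b))` classifies the query by majority vote among the 3 globally nearest points.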
17 Experimental Datasets
GLASS: 214 instances, 9 attributes, 7 classes (glass identification for crime-scene investigation)
PIMA: 768 instances, 8 attributes, 2 classes (diabetes diagnosis)
ABALONE: 4177 instances, 8 attributes, 29 classes (age prediction of abalone, a marine mollusk)
18 Accuracy Results
19 Varying Rounds
20 Privacy Results
21 Conclusion Problems Tackled Preserving efficiency and accuracy while introducing provable privacy to the system Constructing k-nearest neighbor classifier over horizontally partitioned databases
22 Critique: Weakness: assumes semi-honesty (unrealistic against active adversaries). Few or no illustrations. Heavy dependence on prior research