1 Mining Multiple Private Databases
Top-k Queries Across Multiple Private Databases (2005)
Mining Multiple Private Databases Using a kNN Classifier (2007)
Li Xiong (Emory University), Subramanyam Chitti (GA Tech), Ling Liu (GA Tech)
Presented by: Chris Baker
2 About Me
4th-year undergraduate, graduating December 2009
Interests: data management, software engineering
From Macon, GA; (very) small business owner (web development)
SCUBA diving, travel
3 Outline
Introduction & Motivation
Problem Definition
Important Concepts & Examples
kNN Algorithm
kNN Experiments
Conclusion
4 Introduction
Technology has lowered the barriers to information sharing
This increases the need for distributed data-mining tools that preserve privacy
Three-way trade-off: accuracy, efficiency, privacy
5 Motivating Scenarios
The CDC needs to study insurance data to detect disease outbreaks: disease incidents, disease seriousness, patient background
Legal and commercial concerns prevent release of policyholders' information
6 Motivating Scenarios (cont'd)
Industrial trade-group collaboration
Useful shared pattern: "manufacturing using chemical supplies from supplier X has high failure rates"
Trade secret to protect: "manufacturing process Y gives a low failure rate"
7 Problem & Assumptions
Model: n nodes, horizontally partitioned data
Semi-honesty assumed: nodes follow the specified protocol, but may attempt to learn additional information about other nodes
8 Challenges
Why not use a trusted third party (TTP)? Difficult to find one that is actually trusted; increased danger from a single point of compromise
Why not use secure multi-party computation techniques? High communication overhead; feasible only for small inputs
9 Recall Our 3-D Goal
Privacy, Accuracy, Efficiency
10 Important Concepts
Successor
Multi-round protocols
Local computation
Randomization (probabilistic protocols)
[Diagram: nodes holding private data D1 … Dn arranged in a ring; starting from a designated starting point, each node performs a local computation and passes its output as input to its successor]
11 Multi-Round Protocol Examples
Primitive: min/max, top-k, sum, union, join/intersection
Complex: kNN classification, k-means clustering, k-anonymization
12 Naïve Max
[Diagram: four nodes holding the values 30, 20, 40, 10; starting from node 1, each node passes max(received value, local max) to its successor]
Actual data is sent on the first pass
The static starting point is known to all nodes
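The naïve max protocol above can be sketched in a few lines of Python (the node values 30, 20, 40, 10 match the slide; the ring-passing structure is a simplification for illustration). Note the leak: the starting node sends its true local max, so its successor learns that node's actual data.

```python
def naive_max(node_values, start=0):
    """Naive ring protocol: each node passes max(received, local max) on.

    node_values: list of per-node private value lists.
    The first node in the ring reveals its actual local max -- the
    privacy leak this protocol suffers from.
    """
    n = len(node_values)
    current = float("-inf")  # value carried around the ring
    for i in range(n):
        node = (start + i) % n
        # Each node forwards the larger of what it received and its own max.
        current = max(current, max(node_values[node]))
    return current

print(naive_max([[30], [20], [40], [10]]))  # global max: 40
```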
13 Multi-Round Max
[Diagram: nodes D1–D4 holding 30, 20, 40, 10 exchange randomly perturbed intermediate values over multiple passes, starting from a randomized starting point]
Randomly perturbed data is passed to the successor during multiple passes
No successor can determine the actual data of its predecessor
Randomized starting point
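A minimal sketch of the multi-round idea in Python (a simplification of the paper's protocol, not the exact randomization scheme): in early rounds, a node whose local max exceeds the received value injects a random value between the two, hiding its true data; in the final round, values are passed deterministically, so the protocol still converges to the exact global max.

```python
import random

def multiround_max(node_values, rounds=4, seed=42):
    """Simplified randomized multi-round max over a ring of nodes.

    Early rounds: a node with local max > received value injects a
    random value in (received, local max), so its successor cannot
    tell the node's true data. Final round: deterministic, exact.
    """
    rng = random.Random(seed)
    current = 0.0  # starting value (the starting node is also randomized
                   # in the real protocol; fixed here for simplicity)
    for r in range(rounds):
        final_round = (r == rounds - 1)
        for values in node_values:
            local = max(values)
            if local > current:
                if final_round:
                    current = local  # exact value: guarantees correctness
                else:
                    # Perturbed value between received and true local max.
                    current = rng.uniform(current, local)
    return current

print(multiround_max([[30], [20], [40], [10]]))  # converges to 40
```

The design point is that randomization buys privacy in intermediate rounds while the deterministic final round preserves accuracy, which is exactly the privacy/accuracy trade-off the deck emphasizes.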
14 kNN Sub-Problems
Nearest-neighbor selection: identify the k nearest neighbors of the query x
Classification: each node classifies x and cooperates to determine the global classification
Both parts must function in a privacy-preserving manner: nodes should not be able to tell which information came from any other node, and local classifications must not be revealed during global classification
15 Parameters
Large k (neighbors): "avoids information leaks"
Large d (randomization factor): more randomization, hence more privacy
Small d: more accurate (closer to deterministic)
Large r (number of rounds): "as accurate as an ordinary classifier"
16 kNN Classification Algorithm
Input: x, an instance to be classified
Output: classification(x), the classification of x
1. Each node computes the distance d(x,y) between x and each point y in its database, selects the k smallest distances locally, and stores them in a local distance vector ldv.
2. Using ldv as input, the nodes run the privacy-preserving nearest-distance selection protocol to select the k globally nearest distances, stored in a global distance vector gdv.
3. Each node selects the kth nearest distance Δ: Δ = gdv(k).
4. Assuming there are v classes, each node calculates a local classification vector lcv over all points y in its database: lcv(j) = Σ_y [d(x,y) ≤ Δ]·[f(y) = j] for j = 1, …, v, where f(y) is the classification of point y and [p] evaluates to 1 if predicate p is true and 0 otherwise.
5. Using lcv as input, the nodes run the privacy-preserving classification protocol to calculate the global classification vector gcv.
6. Each node assigns classification(x) = argmax_j gcv(j).
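The local (per-node) steps of the algorithm can be sketched in Python. This shows only the local computations of steps 1, 4, and 6; the privacy-preserving protocols that combine ldv into gdv and lcv into gcv (steps 2 and 5) are the multi-round primitives and are omitted here. The data layout (a list of (point, label) pairs) is an assumption for illustration.

```python
import math

def local_distances(x, db, k):
    """Step 1: a node's k smallest distances from x to its own points (ldv)."""
    return sorted(math.dist(x, y) for y, _ in db)[:k]

def local_class_vector(x, db, delta, num_classes):
    """Step 4: lcv[j] counts local points within delta of x having class j."""
    lcv = [0] * num_classes
    for y, label in db:
        if math.dist(x, y) <= delta:
            lcv[label] += 1
    return lcv

def classify(gcv):
    """Step 6: the class with the largest global vote, argmax_j gcv(j)."""
    return max(range(len(gcv)), key=lambda j: gcv[j])

# Toy example on one node's database of (point, class) pairs:
db = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((5.0, 5.0), 1)]
x = (0.0, 0.0)
ldv = local_distances(x, db, k=2)              # [0.0, 1.0]
lcv = local_class_vector(x, db, delta=1.0, num_classes=2)  # [1, 1]
```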
17 Experimental Datasets
GLASS: 214 instances, 7 classes, 9 attributes (broken glass in crime-scene investigation)
PIMA: 768 instances, 2 classes, 8 attributes (diabetes determination)
ABALONE: 4177 instances, 29 classes, 8 attributes (age prediction of mollusks, mother-of-pearl)
18 Accuracy Results
19 Varying Rounds
20 Privacy Results
21 Conclusion
Problems tackled:
Preserving efficiency and accuracy while introducing provable privacy to the system
Constructing a k-nearest-neighbor classifier over horizontally partitioned databases
22 Critique
Weakness: the semi-honesty assumption is unrealistic
Few or no illustrations
Heavy dependency on prior research