Mining Multiple Private Databases: Top-k Queries Across Multiple Private Databases (2005); Mining Multiple Private Databases Using a kNN Classifier (2007)

Presentation transcript:

1 Mining Multiple Private Databases: Top-k Queries Across Multiple Private Databases (2005); Mining Multiple Private Databases Using a kNN Classifier (2007). Li Xiong (Emory University), Subramanyam Chitti (GA Tech), Ling Liu (GA Tech). Presented by: Chris Baker

2 About Me: 4th-year undergraduate (junior), graduating December 2009. Data management, software engineering. Macon, GA. (Very) small business owner (web development). SCUBA diving, travel.

3 Outline: Intro & Motivation; Problem Definition; Important Concepts & Examples; kNN Algorithm; kNN Experiment; Conclusion

4 Introduction: Technology has lowered information-sharing restrictions, increasing the need for distributed data-mining tools that preserve privacy. Trade-off: accuracy vs. efficiency vs. privacy.

5 Motivating Scenarios: The CDC needs to study insurance data to detect disease outbreaks (disease incidents, disease seriousness, patient background), but legal and commercial problems prevent release of policy holders' information.

6 Motivating Scenarios (cont'd): Industrial trade-group collaboration. Useful pattern: "manufacturing using chemical supplies from supplier X has high failure rates". Trade secret: "manufacturing process Y gives a low failure rate".

7 Problem & Assumptions: Model: n nodes, horizontal partitioning. Assume semi-honesty: nodes follow the specified protocol, but attempt to learn additional information about other nodes.

8 Challenges: Why not use a Trusted Third Party (TTP)? Difficult to find one that is trusted; increased danger from a single point of compromise. Why not use secure multi-party computation techniques? High communication overhead; feasible for small inputs only.

9 Recall Our 3-D Goal: Privacy, Accuracy, Efficiency

10 Important Concepts: successor, multi-round, local computation, randomization (probabilistic protocols), starting point. [Diagram: ring of nodes holding private data D1, D2, D3, ..., Dn; each node performs a local computation and passes its output as input to its successor.]

11 Multi-Round Protocol Examples. Primitive: min/max, top-k, sum, union, join/intersection. Complex: kNN classification, k-means clustering, k-anonymization.
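As a feel for the primitives listed above, here is a sketch of a ring-based sum. This is the textbook random-mask trick, not necessarily the exact protocol from the papers: the starting node adds a random mask before the running total travels the ring, then removes it at the end, so no node sees a partial sum it can attribute to the others. The function name `secure_sum` is my own.

```python
import random

def secure_sum(values, modulus=1 << 32):
    """Sketch of a ring-based secure-sum primitive (textbook masking
    trick, not the paper's exact protocol). The starting node adds a
    random mask, each node adds its private value modulo `modulus`,
    and the starting node removes the mask at the end."""
    rng = random.SystemRandom()
    mask = rng.randrange(modulus)   # known only to the starting node
    running = mask
    for v in values:                # each node adds its private value
        running = (running + v) % modulus
    return (running - mask) % modulus  # starting node unmasks the total

print(secure_sum([30, 20, 40, 10]))  # 100
```

Each intermediate value is uniformly distributed, so a semi-honest node learns nothing about its predecessors' inputs from the running total alone (collusion is a separate issue).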

12 Naïve Max. [Diagram: four nodes in a ring holding values 30, 20, 40, 10; the running max is passed around from a fixed starting point, yielding 40.] Actual data is sent on the first pass; the static starting point is known.
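The naïve protocol's leak can be seen in a few lines. This is a minimal sketch (function name mine): each node simply forwards the max of what it received and its own value, so the first node after the known starting point sends its actual value, and any node that raises the running max reveals its value exactly to its successor.

```python
def naive_max(values, start=0):
    """Naive ring max: each node forwards max(received, own value).

    Privacy leak: with a static, known starting point, the first node
    sends its true value, and every node whose value raises the
    running max exposes it exactly to its successor."""
    g = start
    for v in values:      # fixed ring order, known starting point
        g = max(g, v)     # node forwards its actual data
    return g

print(naive_max([30, 20, 40, 10]))  # 40; node 1 exposed its 30 on the first pass
```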

13 Multi-Round Max. [Diagram: nodes holding 30, 20, 40, 10 exchange randomly perturbed running values over multiple passes, starting from a randomized starting point.] Randomly perturbed data is passed to the successor during multiple passes, so no successor can determine the actual data of its predecessor. Randomized starting point.
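The multi-round idea can be sketched as follows. This is a simplified illustration under my own assumptions, not the authors' exact randomization scheme: in early rounds each node may report a random value between the incoming running value and its true local max, so a successor cannot attribute what it receives to its predecessor; the final round is deterministic, so the protocol still converges to the true maximum.

```python
import random

def multi_round_max(values, rounds=3, seed=0):
    """Simplified sketch of a multi-round randomized max protocol
    (illustrative only; not the paper's exact randomization scheme).

    Early rounds: each node reports a random value between the
    incoming running value and its local max, hiding its true data.
    Final round: deterministic, so the true maximum is reached."""
    rng = random.Random(seed)
    g = 0  # a randomized starting point would go here; 0 for simplicity
    for r in range(rounds):
        final = (r == rounds - 1)
        for v in values:                 # ring order
            local_best = max(g, v)
            if final:
                g = local_best           # deterministic last pass
            else:
                # perturb: report something between the incoming value
                # and the local best, never exceeding the true max
                g = rng.uniform(g, local_best)
    return g

print(multi_round_max([30, 20, 40, 10]))  # 40
```

The perturbed value never exceeds the global max, so the deterministic final pass guarantees correctness while intermediate passes reveal only randomized bounds.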

14 kNN Sub-Problems. Nearest-neighbor selection: identify the k nearest neighbors of query x. Classification: each node classifies x and the nodes cooperate to determine the global classification. Both parts must function in a privacy-preserving manner: nodes should not be able to tell what information came from any other node, and local classifications must not be revealed during global classification.

15 Parameters: Large k helps "avoid information leaks". Large d means more randomization and thus more privacy; small d means more accurate (more deterministic) behavior. Large r (more rounds) makes the protocol "as accurate as an ordinary classifier".

16 kNN Classification Algorithm
Input: x, an instance to be classified. Output: classification(x), the classification of x.
1. Each node computes the distance d(x,y) between x and each point y in its database, selects the k smallest distances locally, and stores them in a local distance vector ldv.
2. Using ldv as input, the nodes run the privacy-preserving nearest-distance selection protocol to select the k nearest distances globally, stored in gdv.
3. Each node selects the kth nearest distance Δ = gdv(k).
4. Assuming there are v classes, each node calculates a local classification vector lcv over all points y in its database: lcv(j) = Σ_{y: d(x,y) ≤ Δ} [f(y) = j] for j = 1, ..., v, where d(x,y) is the distance between x and y, f(y) is the classification of point y, and [p] is a function that evaluates to 1 if the predicate p is true and 0 otherwise.
5. Using lcv as input, the nodes run the privacy-preserving classification protocol to calculate the global classification vector gcv.
6. Each node assigns the classification of x as classification(x) = argmax_j gcv(j).
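The data flow of the six steps above can be sketched without the cryptographic machinery. In this sketch (names `knn_classify`, `node_data` are mine) the two privacy-preserving protocols are replaced by plain aggregation; the real protocols compute the same gdv and gcv vectors obliviously.

```python
import math
from collections import Counter

def knn_classify(x, node_data, k):
    """Sketch of the slide's kNN data flow, with the two
    privacy-preserving protocols replaced by plain aggregation
    (the real protocols compute the same vectors obliviously).

    node_data: list of (points, labels) pairs, one per node."""
    dist = math.dist

    # Step 1: each node keeps its k smallest local distances (ldv)
    ldvs = [sorted(dist(x, y) for y in pts)[:k] for pts, _ in node_data]
    # Steps 2-3: k globally nearest distances (gdv); threshold Δ = gdv(k)
    gdv = sorted(d for ldv in ldvs for d in ldv)[:k]
    delta = gdv[-1]
    # Step 4: local class counts over points within Δ (lcv)
    lcvs = [Counter(lab for y, lab in zip(pts, labs) if dist(x, y) <= delta)
            for pts, labs in node_data]
    # Steps 5-6: global classification vector gcv; pick the majority class
    gcv = sum(lcvs, Counter())
    return gcv.most_common(1)[0][0]

node_data = [
    ([(0, 0), (1, 1)], ["a", "a"]),
    ([(5, 5), (0.5, 0.5)], ["b", "a"]),
]
print(knn_classify((0, 0), node_data, k=3))  # a
```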

17 Experimental Datasets
GLASS: 214 instances, 7 classes, 9 attributes (broken glass in crime-scene investigation)
PIMA: 768 instances, 8 classes, 8 attributes (diabetes determination)
ABALONE: 4177 instances, 29 classes, 8 attributes (age prediction of mollusks, mother-of-pearl)

18 Accuracy Results

19 Varying Rounds

20 Privacy Results

21 Conclusion. Problems tackled: preserving efficiency and accuracy while introducing provable privacy to the system; constructing a k-nearest-neighbor classifier over horizontally partitioned databases.

22 Critique. Weakness: assuming (semi-)honesty is unrealistic. Few/no illustrations. Dependency on prior research.

