1
Privacy Preserving Data Mining
Yehuda Lindell and Benny Pinkas
Presenter: Justin Brickell
2
Mining Joint Databases
– Parties P1 and P2 own databases D1 and D2
– f is a data mining algorithm
– Goal: compute f(D1 ∪ D2) without revealing “unnecessary information”
3
Unnecessary Information
Intuitively, the protocol should function as if a trusted third party (TTP) computed the output.
[Diagram: P1 and P2 send D1 and D2 to the TTP, which returns f(D1 ∪ D2) to both parties.]
4
Simulation
Let msg(P2) be P2’s messages. If a simulator S1 can simulate msg(P2) to P1 given only P1’s input and the protocol output, then msg(P2) must not contain unnecessary information (and vice versa):
S1(D1, f(D1,D2)) ≡c msg(P2)
where ≡c denotes computational indistinguishability.
5
More Simulation Details
– The simulator S1 can also recover r1, the internal coin tosses of P1
– The definition can be extended to allow distinct outputs f1(x,y) and f2(x,y), but this complicates the definition and is not necessary for data mining applications
6
The Semi-Honest Model
A malicious adversary can alter his input: f(Ø ∪ D2) = f(D2)!
A semi-honest adversary
– adheres to the protocol
– tries to learn extra information from the message transcript
7
General Secure Two-Party Computation
Any algorithm can be made private (in the semi-honest model) using Yao’s protocol. So why write this paper?
– Yao’s protocol is inefficient in general
– This paper privately computes one particular algorithm much more efficiently
8
Yao’s Protocol (Basically)
– Convert the algorithm to a circuit
– P1 hard-codes his input into the circuit
– P1 transforms each gate so that it maps garbled inputs to garbled outputs
– Using 1-out-of-2 oblivious transfer, P1 sends P2 garbled versions of P2’s input bits
9
Garbled Wire Values
– P1 assigns to each wire i two random values (Wi^0, Wi^1), long enough to seed a pseudo-random function F
– P1 also assigns to each wire i a random permutation πi over {0,1}, with πi(bi) = ci
– The pair ⟨Wi^bi, ci⟩ is the ‘garbled value’ of wire i
10
Garbled Gates
Gate g computes bk = g(bi, bj). The garbled gate is a table Tg mapping the garbled inputs ⟨Wi^bi, ci⟩, ⟨Wj^bj, cj⟩ to the garbled output ⟨Wk^bk, ck⟩.
Tg has four entries, indexed by (ci, cj):
⟨Wk^g(bi,bj), ck⟩ ⊕ F[Wi^bi](cj) ⊕ F[Wj^bj](ci)
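The table construction above can be sketched in code. This is a toy illustration, not the paper's implementation: it instantiates the pseudo-random function F with HMAC-SHA256, realizes each permutation over {0,1} as XOR with a random bit r, and all function names are made up for this sketch.

```python
import hmac, hashlib, secrets

def prf(key: bytes, c: int) -> bytes:
    # F[W](c): the pseudo-random function, instantiated here with HMAC-SHA256
    return hmac.new(key, bytes([c]), hashlib.sha256).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))  # truncates to the shorter input

def garble_gate(g, Wi, Wj, Wk, ri, rj, rk):
    # Wi[b] is wire i's random value for bit b; ri realizes the permutation
    # over {0,1} as c = b XOR r. Each table entry hides <Wk^g(bi,bj), ck>
    # under the pad F[Wi^bi](cj) XOR F[Wj^bj](ci).
    table = {}
    for bi in (0, 1):
        for bj in (0, 1):
            ci, cj = bi ^ ri, bj ^ rj
            bk = g(bi, bj)
            plaintext = Wk[bk] + bytes([bk ^ rk])
            pad = xor(prf(Wi[bi], cj), prf(Wj[bj], ci))
            table[(ci, cj)] = xor(plaintext, pad)
    return table

def eval_gate(table, Wi_b, ci, Wj_b, cj):
    # The evaluator holds one garbled value per input wire and learns
    # nothing except the garbled output <Wk^bk, ck>
    pt = xor(table[(ci, cj)], xor(prf(Wi_b, cj), prf(Wj_b, ci)))
    return pt[:-1], pt[-1]

# Garble a single AND gate and evaluate it on bi = 1, bj = 1
Wi = {b: secrets.token_bytes(16) for b in (0, 1)}
Wj = {b: secrets.token_bytes(16) for b in (0, 1)}
Wk = {b: secrets.token_bytes(16) for b in (0, 1)}
ri, rj, rk = (secrets.randbelow(2) for _ in range(3))
table = garble_gate(lambda a, b: a & b, Wi, Wj, Wk, ri, rj, rk)
wk, ck = eval_gate(table, Wi[1], 1 ^ ri, Wj[1], 1 ^ rj)
```

Evaluating with the garbled values for (1, 1) yields exactly Wk^1 and c = 1 ⊕ rk, while the other three table entries remain pseudo-random to the evaluator.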
11
Yao’s Protocol
P1 sends
– P2’s garbled input bits (via 1-out-of-2 oblivious transfer)
– the Tg tables
– a table from garbled output values to output bits
P2 can compute the output values, but P1’s input and all intermediate values appear random to him.
12
Cost of a circuit with n inputs and m gates
– Communication: m gate tables, i.e. 4m times the length of a pseudo-random function output
– Computation: n oblivious transfers, typically much more expensive than the m pseudo-random function applications
Too expensive for data mining.
13
Classification by Decision Tree Learning
A classic machine learning / data mining problem: develop rules for when a transaction belongs to a class, based on its attribute values. Smaller decision trees are better. ID3 is one particular algorithm.
14
A Database…
Outlook   Temp  Humidity  Wind    Play Tennis
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Mild  High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No
15
… and its Decision Tree
[Tree: Outlook at the root. Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
16
The ID3 Algorithm: Definitions
– R: the set of attributes (Outlook, Temperature, Humidity, Wind)
– C: the class attribute (Play Tennis)
– T: the set of transactions (the 14 database entries)
17
The ID3 Algorithm ID3(R,C,T)
– If R is empty, return a leaf node with the most common class value in T
– If all transactions in T have the same class value c, return the leaf node c
– Otherwise, determine the attribute A that best classifies T, create a tree node labeled A, and recur to compute the child trees; edge ai goes to subtree ID3(R − {A}, C, T(ai))
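The recursion above can be sketched as plain (non-private) Python; this is a minimal illustration of standard ID3, not the paper's private protocol, and the transaction representation (one dict per row) is an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(T, C):
    # Shannon entropy of the class attribute C over transactions T
    counts = Counter(t[C] for t in T)
    return -sum(n/len(T) * math.log2(n/len(T)) for n in counts.values())

def best_attribute(R, C, T):
    # The attribute with maximum information gain (gain is defined below
    # in the deck as H_C(T) - H_C(T|A); H_C(T) is constant over A here)
    def gain(A):
        vals = Counter(t[A] for t in T)
        rem = sum(n/len(T) * entropy([t for t in T if t[A] == v], C)
                  for v, n in vals.items())
        return entropy(T, C) - rem
    return max(R, key=gain)

def id3(R, C, T):
    classes = [t[C] for t in T]
    if not R:                        # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:       # pure node: single class value
        return classes[0]
    A = best_attribute(R, C, T)
    return (A, {v: id3(R - {A}, C, [t for t in T if t[A] == v])
                for v in set(t[A] for t in T)})

# A tiny slice of the tennis database from the slides
T = [
    {'Outlook': 'Sunny',    'Wind': 'Weak',   'Play': 'No'},
    {'Outlook': 'Sunny',    'Wind': 'Strong', 'Play': 'No'},
    {'Outlook': 'Overcast', 'Wind': 'Weak',   'Play': 'Yes'},
    {'Outlook': 'Rain',     'Wind': 'Weak',   'Play': 'Yes'},
    {'Outlook': 'Rain',     'Wind': 'Strong', 'Play': 'No'},
]
tree = id3({'Outlook', 'Wind'}, 'Play', T)
```

On this slice, Outlook has the highest gain, so it becomes the root; the Overcast and Sunny branches are pure, and the Rain branch recurses on Wind.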
18
The Best Predicting Attribute
Entropy! Gain(A) =def HC(T) − HC(T|A). Find the A with maximum gain.
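As a worked example of this formula on the 14-row database above: the (Yes, No) counts per Outlook value are Sunny (2, 3), Overcast (4, 0), Rain (3, 2), and the whole table is (9, 5).

```python
import math

def H(counts):
    # Entropy of a class distribution given as raw counts (0-count terms drop out)
    total = sum(counts)
    return -sum(n/total * math.log2(n/total) for n in counts if n)

# (Yes, No) counts per Outlook value, read off the 14-row table
by_outlook = {'Sunny': (2, 3), 'Overcast': (4, 0), 'Rain': (3, 2)}
n = 14
H_T = H((9, 5))                                        # class entropy of T
H_T_given_A = sum((y + no)/n * H((y, no))              # conditional entropy
                  for y, no in by_outlook.values())
gain = H_T - H_T_given_A
```

This gives Gain(Outlook) ≈ 0.247, the largest of the four attributes, which is why Outlook sits at the root of the decision tree shown earlier.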
19
Why can we do better than Yao?
Normally, private protocols must hide intermediate values. In this protocol, the assignment of attributes to nodes is part of the output and may be revealed.
– The H values themselves are not revealed, just the identity of the attribute with the greatest gain
This allows genuine recursion.
20
How do we do it?
Rather than maximize gain, minimize
H′C(T|A) =def HC(T|A) · |T| · ln 2
This quantity has a simple closed-form formula whose terms have the form (v1 + v2) · ln(v1 + v2), where P1 knows v1 and P2 knows v2.
21
Private x ln x
– Input: P1’s value v1, P2’s value v2
– Auxiliary input: a large field F
– Output: P1 obtains w1 ∈ F and P2 obtains w2 ∈ F such that w1 + w2 ≈ (v1 + v2) · ln(v1 + v2), with w1 and w2 each uniformly distributed in F
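The shape of this functionality can be mocked locally. The sketch below simulates only the ideal output (a trusted party splitting x ln x into uniform additive shares), not the actual two-party protocol; the field, the fixed-point scale, and all names are assumptions of this sketch.

```python
import math, secrets

P = 2**61 - 1      # an illustrative large prime field
SCALE = 2**20      # fixed-point scaling, since field elements are integers

def share_x_ln_x(v1, v2):
    # Ideal-functionality view: compute the true value, fixed-point encoded...
    x = v1 + v2
    target = round(SCALE * x * math.log(x))
    # ...then split it into two additive shares mod P. w1 is uniform in the
    # field, and w2 = target - w1 is therefore uniform as well.
    w1 = secrets.randbelow(P)
    w2 = (target - w1) % P
    return w1, w2

w1, w2 = share_x_ln_x(40, 25)   # P1 holds v1 = 40, P2 holds v2 = 65 - 40
```

Each share alone is a uniformly random field element and reveals nothing, yet the two shares sum (mod P) to the encoded (v1 + v2) · ln(v1 + v2).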
22
Private x ln x: some intuition
Compute shares of x and ln x, then privately multiply them. Shares of ln x are actually shares of n and ε, where x = 2^n(1 + ε) and −1/2 ≤ ε ≤ 1/2. Uses Taylor expansions.
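The decomposition and Taylor step can be checked in the clear. This sketch only verifies the math behind the intuition (ln x = n·ln 2 + ln(1 + ε), with ln(1 + ε) expanded as a Taylor series); the real protocol performs these steps on secret shares.

```python
import math

def decompose(x: int):
    # Write x = 2**n * (1 + eps); rounding log2(x) to the nearest integer
    # keeps eps within roughly [-0.3, 0.42], inside [-1/2, 1/2]
    n = round(math.log2(x))
    eps = x / 2**n - 1
    return n, eps

def ln_taylor(x: int, terms: int = 20):
    # ln x = n*ln 2 + ln(1 + eps), the latter from its Taylor series
    # ln(1 + eps) = eps - eps^2/2 + eps^3/3 - ...
    n, eps = decompose(x)
    series = sum((-1)**(i + 1) * eps**i / i for i in range(1, terms + 1))
    return n * math.log(2) + series
```

Because |ε| ≤ 1/2, the series converges geometrically, so a small fixed number of terms already gives high accuracy, which is what makes a low-degree approximation usable inside the protocol.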
23
Using the x ln x protocol
For every attribute A, every attribute value aj ∈ A, and every class ci ∈ C, the parties obtain shares wA,1(aj), wA,2(aj), wA,1(aj,ci), wA,2(aj,ci) such that
– wA,1(aj) + wA,2(aj) ≈ |T(aj)| · ln(|T(aj)|)
– wA,1(aj,ci) + wA,2(aj,ci) ≈ |T(aj,ci)| · ln(|T(aj,ci)|)
24
Shares of Relative Entropy
P1 and P2 can locally compute shares SA,1 and SA,2 with SA,1 + SA,2 ≈ H′C(T|A). Now, use Yao’s protocol to find the A with minimum relative entropy!
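The final step can be sketched as follows. This simulates only what the small Yao circuit computes, reconstructing each H′ value from the two parties' shares and taking the argmin; in the real protocol this comparison happens inside the garbled circuit, and the scores below are hypothetical fixed-point values invented for this sketch.

```python
import secrets

P = 2**61 - 1   # the shared field (illustrative)

def share(value):
    # Split a field element into two uniform additive shares mod P
    s1 = secrets.randbelow(P)
    return s1, (value - s1) % P

# Hypothetical H'_C(T|A) scores per attribute, as fixed-point integers;
# a lower score means A better classifies T
scores = {'Outlook': 100, 'Humidity': 250, 'Temp': 280, 'Wind': 310}
shares1, shares2 = {}, {}
for A, v in scores.items():
    shares1[A], shares2[A] = share(v)

# What the Yao circuit computes from the two parties' inputs: reconstruct
# each score and output only the identity of the minimizing attribute
best = min(scores, key=lambda A: (shares1[A] + shares2[A]) % P)
```

Only `best` is revealed; the reconstructed H′ values stay inside the circuit, matching the earlier observation that the attribute identity, not the entropy values, is part of the output.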
25
A Technical Detail
The logarithms are only approximate, so what is actually computed is an ID3δ algorithm: one that doesn’t distinguish between attributes whose relative entropies are within δ of each other.
26
Complexity for each node
For |R| attributes, m attribute values, and l class values
– the x ln x protocol is invoked O(m · l · |R|) times
– each invocation requires O(log|T|) oblivious transfers and bandwidth O(k · log|T| · |S|) bits, where k depends logarithmically on the desired approximation accuracy
The total cost depends only logarithmically on |T|, and is only a factor of k·|S| worse than non-private distributed ID3.
27
Conclusion
Private computation of ID3(D1 ∪ D2) is made feasible. Using Yao’s protocol directly would be impractical. Questions?