Download presentation
Presentation is loading. Please wait.
1
Privacy Preserving Learning of Decision Trees Benny Pinkas HP Labs Joint work with Yehuda Lindell (done while at the Weizmann Institute)
2
Cryptographic methods vs. perturbation methods inaccuracy overhead lack of privacy perturbation methods Cryptographic methods This work…
3
A story We’re experiencing a lot of fraud lately… Here too.. I can’t find a pattern to recognize fraud in advance.. Neither can I.. Maybe we should share information.. But, what about Patients’ privacy Business secrets Have you heard of “Secure function evaluation” ? This is all “theory”. It can’t be efficient.
4
Privacy preserving data mining Confidential database D1 Wish to “mine” D1 D2 without revealing more info Examples: Medical databases protected by law Competing businesses Government agencies (privacy, “need to know”) Confidential database D2 P1P1 P2P2 Huge
5
Secure Function Evaluation [Yao ‘86] x y C(x,y) and nothing else nothing Input: Output: F(x,y) – A public function. Represented as a Boolean circuit C(x,y). Implementation: Two passes O(|X|) “oblivious transfers”. O(|C|) communication. Pretty efficient for small circuits! One Exp per log “OT”s [NP]
6
Our Contribution An efficient sub-linear protocol for secure computation of a complex well-known data- mining alg (ID3), for “semi-honest” parties. A different approach offered by the data- mining community [AS’00]: Perturb each entry (add random noise). Analyze accuracy of using perturbed data as input to data mining algorithms. How much privacy?
7
The classification problem Age > 30Sex How long do we know him/her? Claim > $ 500 Did fraud occur? C1YesM t [0,4] years No C2NoF t [5,9] years Yes ……………… CnYesF t [10,15] years No
8
Classification using Decision Trees history No [0,4] years > 10 years [4,9] years Age > 30 NoYes No Claim > $ 500 NoYes No ID3: Choose attribute A that minimizes the conditional entropy of the attribute class
9
Privacy Preserving ID3 Core of the problem: Comparing entropies while preserving privacy. (entropy = x logx) Privacy: for each party, all intermediate values are random. Efficiency: most computation done independently by parties. Basic task: compute x log x. x = e.g. # of patients with (age > 30) and (fraud = yes)
10
Privacy Preserving ID3 Computing x log x: –x = x1 + x2 known to P1 and P2 respectively (independently computed from databases). –Might as well compute x lnx lnx. –First run a protocol to compute random shares, y1 + y2 = ln x ln x is Real. Crypto works over finite fields. Must do numerical analysis.
11
Cryptographic Tools x Implementation: Two passes, O(degree) (or O( log|F|) ) exponentiations. Q(. ) Q(x) and nothing else nothing Input: Output: Secure Function Evaluation (SFE) [Yao] Oblivious Polynomial Evaluation [NP]
12
Computing random shares of lnx = ln(x 1 +x 2 ) Use Taylor approximation for lnx –x = x 1 + x 2 = 2 n (1+ ) -½ < < ½ –lnx = ln(2 n (1+ )) = ln 2 n + ln(1+ ) ln 2 n + i=1..k (-1) i-1 i / i = ln 2 n + T( ) –T( ) is a polynomial of degree k. Error is exponentially small in k. –We only know how to work over finite fields Work in F, where |F| sufficiently large. Compute c·lnx, where c compensates for fractions.
13
ln(x 1 +x 2 ) Protocol (Cont.) Step 1 of the protocol – Find n, –Apply Yao’s protocol to the following small circuit Input: x 1 and x 2 Output (random shares): random a 1 and a 2 s.t. a 1 + a 2 = x-2 n = ·2 n random b 1 and b 2 s.t. b 1 + b 2 = ln 2 n Operation: The protocol finds 2 n closest to x 1 + x 2, computes 2 n = x 1 + x 2 - 2 n. x = x 1 + x 2 = 2 n + 2 n
14
ln(x 1 +x 2 ) Protocol (Cont) Step 2 of the protocol –Compute random shares of T( ) (Taylor approx.) –P 1 chooses a random w 1 F and defines a polynomial Q(x), s.t. w 1 +Q(a 2 ) = T( ) (recall a 1 + a 2 = ·2 n ) –Namely, Q(x) = T( (a 1 +x)/2 n ) – w 1. –Run an oblivious poly evaluation in which P 2 computes w 2 = Q( a 2 ) = T( ) – w 1. –Now the parties have random w 1 and w 2 s.t. –w 1 + w 2 = T( ) ln(1+ ) –(b 1 + w 1 ) + (b 2 + w 2 ) ln 2 n + ln(1+ ) = ln x
15
Computing x lnx Tool: Multiply(c 1,c 2 ) –Input: c 1, c 2 –Output: d 1, d 2 s.t. d 1 +d 2 = c 1 *c 2 –How? OPE of Q(z) = c 1 *z -d 1 –d 2 = Q(c 2 ) = c 1 *c 2 - d 1 Actual task: x lnx –Input: x 1 +x 2 =x, c 1 +c 2 = ln x –Output: x lnx = (x 1 +x 2 )*(c 1 +c 2 ) –Run Multiply(x 1,c 2 ), Multiply (c 1,x 2 )
16
The rest of the work.. Each party computes a share of the entropy by summing shares of x lnx A small circuit finds the attribute giving the minimal conditional entropy The attribute is assigned to the node The databases are divided according to the value of this attribute
17
Efficiency lnx protocol: –secure computation of a small circuit –one oblivious polynomial evaluation ID3 for a database with: –1,000,000 transactions –15 attributes –10 values per attribute –4 class values –Communication per node takes seconds (T1) –Computation per node takes minutes (P3)
18
Issues Only two participants “Curious but honest” participants Approximating ln x gives an approximation of ID3 The participants learn the decision tree, which reveals some information
19
Contributions A cryptographic protocol where the bulk of the operations is done independently. Data mining –Rigorous model for secure data-mining. –Efficient, secure protocol for ID3. Cryptography –Sub-linear complexity - secure computation for large data sets. –An efficient protocol for a complex known algorithm. –Secure computation of logarithms (real function - numerical analysis).
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.