1
Efficient and Private Distance Approximation
David Woodruff, MIT
2
Outline
1. Two-Party Communication
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
3
The Communication Model
Alice holds x ∈ Σ^n and Bob holds y ∈ Σ^n. What is the distance D(x,y) between x and y?
For example, if Σ = {0,1}, what is the Hamming distance?
If Σ = ℝ, what is the L_p distance for some p ∈ (0, ∞)?
The L_p distance is (Σ_{i=1}^n |x_i − y_i|^p)^{1/p}.
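To make these definitions concrete, here is a minimal Python sketch of both distances (the helper names are mine, not from the talk):

```python
def lp_distance(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p) for equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def hamming_distance(x, y):
    """Number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance([0, 1, 1], [1, 1, 0]))    # 2
print(lp_distance([1.0, 0.0], [0.0, 1.0], p=2))  # 1.414... = sqrt(2)
```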
4
Application – Streaming Model
Example stream: 7 1 1 3 7 3 4 …
Want to mine a massive data stream:
- How many distinct elements?
- What's the most frequent item?
- Is the data uniform or skewed?
Elements arranged in adversarial order. Algorithms only allowed one pass.
Goal: low-space algorithms.
5
Application – Streaming Model
Distance approximation captures streaming primitives: distinct elements (Hamming), frequent items (L₂), skew (L_p).
Two-party communication vs. the streaming model:
- Communication lower bounds always yield space lower bounds.
- Protocols often yield algorithms; in this talk, most protocols yield streaming algorithms.
Thus, communication equals space.
6
Application – IP Session Data

Source     Destination  Bytes  Duration  Protocol
18.6.7.1   19.7.3.2     40K    28        http
10.6.2.3   12.3.4.8     20K    18        ftp
11.1.0.6   11.6.8.2     58K    22        http
12.3.1.5   14.7.0.1     30K    32        …
…          …            …      …         …

AT&T collects 100+ GB of NetFlow data every day.
7
Application – IP Session Data
AT&T needs to process a massive stream of network data:
- Traffic estimation: what fraction of network IP addresses are active? (distinct-elements computation)
- Traffic analysis: what are the 100 IP addresses with the most traffic? (frequent-items computation)
- Security / denial of service: are there any IP addresses witnessing a spike in traffic? (skewness computation)
8
Application – Secure Data Mining
For medical research, hospitals wish to mine their joint data. Distance approximation is useful in many mining algorithms, e.g., classification and clustering.
Patient confidentiality imposes strict laws on what information can be shared: mining cannot leak anything sensitive.
9
Issues
- Exact vs. approximate solution
- Efficiency: communication complexity, round complexity
- Security: neither party learns more than what the solution and his/her input imply about the other party's input
10
Initial Observations

               Exact              Approximate
Deterministic  Ω(n) (folklore)    Ω(n) (folklore)
Randomized     Ω(n) [KS, R]       ?

To cope with the Ω(n) communication bound, we look for randomized approximation algorithms.
11
Previous Results
Goal: output D' such that for all x, y: Pr[D(x,y) ≤ D'(x,y) ≤ (1+ε)·D(x,y)] ≥ 2/3.

Hamming distance:
- Communication: upper bound 1/ε² [FM79, BJKST02, …]; lower bound 1/ε (folklore)
- Private communication: upper bound n^{1/2} [FIMNSW01]; lower bound 1/ε

L₂:
- Communication: upper bound 1/ε² [AMS96]; lower bound 1/ε (folklore)
- Private communication: upper bound n (via SFE); lower bound 1/ε

L_p, p > 2:
- Communication: upper bound n^{1−1/(p−1)} [AMS96, CK04, G05]; lower bound n^{1−2/p} [AMS96, BJKS02, CKS03]
- Private communication: upper bound n (via SFE); lower bound n^{1−2/p}
12
Our Results [IW03, W04, IW05, IW06]

Hamming distance:
- Communication: upper bound 1/ε²; new lower bound Ω(1/ε²) for 1-round protocols (was 1/ε)
- Private communication: new upper bound O(1/ε²) with O(1) rounds (was n^{1/2}); lower bound 1/ε

L₂:
- Communication: upper bound 1/ε²; new lower bound Ω(1/ε²) for 1-round protocols (was 1/ε)
- Private communication: new upper bound O(1/ε²) with O(1) rounds (was n); lower bound 1/ε

L_p, p > 2:
- Communication: new upper bound O(n^{1−2/p}), 1-round (was n^{1−1/(p−1)}); lower bound n^{1−2/p}
- Private communication: upper bound still open (SFE gives n); lower bound n^{1−2/p}
13
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
14
Private L₂ Estimation
We improve the n^{1/2} upper bound to 1/ε² for private L₂, and our protocol uses O(1) rounds.
- Optimal up to suppressed logarithmic factors.
- Also holds for Hamming distance.
There was speculation that private approximation is much harder than non-private approximation; we refute this speculation.
15
Security Definition
What does privacy mean for distance computation? Alice does not learn anything about y other than what follows from her input x and D(x,y). This is the minimal requirement.
What does privacy mean for distance approximation? A first attempt: Alice does not learn anything about y other than what follows from x and the approximation D'(x,y). Does this work? Not sufficient!
16
Security Definition
Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n. Suppose Σ = {0,1}.
Set the LSB of D'(x,y) to be y_n, and the remaining bits of D'(x,y) to agree with those of D(x,y).
Then D'(x,y) is a ±1 approximation, but Alice learns y_n, which doesn't follow from x and D(x,y).
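A tiny illustration of this counterexample (hypothetical helper, not from the talk): the output is always within ±1 of the exact distance, yet its low bit hands Alice the bit y_n.

```python
def leaky_approx(D, y_n):
    """Clear the LSB of the exact distance D and replace it with Bob's bit y_n.
    |leaky_approx(D, y_n) - D| <= 1, but the output reveals y_n exactly."""
    return (D & ~1) | y_n

for D in range(4):
    print(D, leaky_approx(D, 0), leaky_approx(D, 1))
```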
17
Security Definition
New requirement: what does privacy mean for distance approximation? Alice and Bob don't learn anything about each other's input other than what follows from their own input and the exact distance D(x,y).
Implication: D'(x,y) is determined by D(x,y) and the randomness.
How do we model the power of the cheating parties?
18
Security Models
Semi-honest: parties follow their instructions, but try to learn more than what is prescribed.
Malicious: parties deviate from the protocol arbitrarily:
- use a different input
- force the other party to output a wrong answer
- abort before the other party learns the answer
It is difficult to achieve security in the malicious model…
19
Reductions – Yao, GMW, NN
A protocol secure in the semi-honest model yields a protocol secure in the malicious model, with the efficiency of the new protocol equal to the efficiency of the old protocol.
So it suffices to design protocols in the semi-honest model: the parties follow the instructions of the protocol, and we don't need to worry about "weird" behavior. We just ensure neither party learns anything about the other's input except what follows from the exact distance.
20
Our Protocol
Running example: Alice has x = e₁, Bob has y = e₂.
A first try: randomly sample a few coordinates j, compute (x_j − y_j)², and scale to estimate ||x − y||₂².
Problem: with high probability, all samples return 0, so the estimate is 0.
A second try: randomly rotate the vectors over ℝ^n (e₁ ↦ Me₁, e₂ ↦ Me₂), then try the sampling approach. Since M is a rotation, ||Mx − My||₂² = ||x − y||₂², and now the mass is "spread out", so sampling is effective (see the sketch below).
Problem: neither party can learn the samples, since with knowledge of M this reveals extra information.
Solution: we build a private sub-protocol to output an estimate from the samples, without revealing the samples.
The parties need to agree on the rotation M; this can be done with low communication using a PRG. Thus, the correctness and desired efficiency of the protocol are easy to verify.
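A minimal, non-private sketch of the rotate-then-sample idea (names and constants are illustrative; the real protocol never reveals the samples and derives M from a PRG rather than from a shared seed and an explicit n×n matrix):

```python
import numpy as np

def rotate_and_sample_estimate(x, y, num_samples, seed=0):
    """Estimate ||x - y||_2^2: apply a shared random rotation M, sample a few
    coordinates of M(x - y), and rescale.  Non-private illustration only."""
    n = len(x)
    rng = np.random.default_rng(seed)            # stands in for shared PRG output
    # Random rotation: orthonormal factor of a Gaussian matrix
    M, _ = np.linalg.qr(rng.standard_normal((n, n)))
    z = M @ (np.asarray(x, float) - np.asarray(y, float))
    j = rng.integers(0, n, size=num_samples)     # sampled coordinates
    return n * np.mean(z[j] ** 2)                # E[n * z_j^2] = ||x - y||_2^2

# Alice has e_1, Bob has e_2: direct sampling would almost always see 0,
# but after the rotation the mass is spread out and a few samples suffice.
n = 512
x, y = np.zeros(n), np.zeros(n)
x[0], y[1] = 1.0, 1.0
print(rotate_and_sample_estimate(x, y, num_samples=200))  # close to 2.0
```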
21
Private Sub-protocol
Problem: Alice learns (My)_j for some j (Bob's case is similar).
Solution: use an oblivious masked sampling protocol [FIMNSW]:
- Alice learns (My)_j ⊕ b for a random mask b, and Bob has b.
- Alice does not learn j.
22
Private Sub-protocol
Alice (x ∈ Σ^n): computes Mx and creates mask a.
Bob (y ∈ Σ^n): computes My and creates mask b.
They run the oblivious masked sampling protocol (both know M):
- Alice gets b ⊕ (My)_j for an unknown j.
- Bob gets a ⊕ (Mx)_j for an unknown j.
23
Private Sub-protocol
Alice has mask a and gets b ⊕ (My)_j for an unknown j; Bob has mask b and gets a ⊕ (Mx)_j for an unknown j.
From these shares, a low-communication private protocol computes (M(x−y))_j², and since j is random,
E_j[(M(x−y))_j²] = ||Mx − My||₂²/n = ||x − y||₂²/n.
24
Private Sub-protocol
Thus, the expectation depends only on the length!
1. Let T be an upper bound on ||x − y||₂².
2. The protocol outputs a bit c with Pr[c = 1] = n·(M(x−y))_j²/T ≈ ||x − y||₂²/T ≤ 1.
3. Since c is a bit, it is determined by its expectation.
Repeat a few times to get tight concentration. If most repetitions return c = 0, adjust T and repeat (see the sketch below).
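A minimal, non-private sketch of this loop on z = M(x − y) (in the real protocol each bit c comes from a secure computation on the masked shares; the halving schedule and constants below are my own illustrative choices):

```python
import numpy as np

def bit_estimate(z, T, reps, rng):
    """Sample j uniformly and emit c = 1 with probability n * z_j^2 / T,
    so E[T * mean(c)] = ||z||_2^2 when no clipping occurs (clipping is
    rare after the random rotation has spread out the mass)."""
    n = len(z)
    j = rng.integers(0, n, size=reps)
    p = np.clip(n * z[j] ** 2 / T, 0.0, 1.0)
    c = rng.random(reps) < p
    return T * c.mean()

def adaptive_estimate(z, reps=2000, seed=1):
    """Halve the upper bound T while most bits come back 0, so the final T
    is tight and the relative error of the estimate is small."""
    rng = np.random.default_rng(seed)
    T = float(len(z) * np.max(z ** 2))           # guarantees probabilities <= 1
    if T == 0.0:
        return 0.0
    est = bit_estimate(z, T, reps, rng)
    while est < T / 4:                           # mostly zeros: T is too generous
        T /= 2
        est = bit_estimate(z, T, reps, rng)
    return est
```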
25
Wrapup
We give an O(1)-round, 1/ε²-communication private protocol for the L₂ distance.
- Optimal up to suppressed logarithmic factors.
Details:
- The randomness is not truly random: it comes from a pseudorandom generator against non-uniform machines.
- The parties have bounded precision.
26
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
27
L_p Estimation for p > 2
We improve the n^{1−1/(p−1)} communication upper bound to n^{1−2/p}, and our protocol is 1-round.
Achieving this privately is still an open problem.
28
L_p Estimation for p > 2
Problem: rotation doesn't work for p > 2. For example, a 45° rotation:

Vector            L₂²   L₄⁴
(1, 0)            1     1
(1/√2, 1/√2)      1     1/2

Rotation preserves the L₂ norm but changes the L₄ norm. It is not clear how to "re-randomize" L_p for p > 2. We need a new approach…
29
L_p Estimation for p > 2
Alice holds x ∈ {1, …, m}^n, Bob holds y ∈ {1, …, m}^n.
Strategy:
1. Classify the coordinates |x_j − y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip} (see the sketch below).
We will approximate ||x − y||_p^p to within a constant factor. One source of error: the s_i are approximate. Another source: the bucket values are approximate. Overall, the estimate is still within a constant factor.
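A minimal sketch of the bucketing estimate, using exact bucket sizes in place of the estimated s_i (the output formula Σ_i s_i·2^{ip} is my reconstruction of the elided "Output" expression):

```python
import math
from collections import Counter

def bucketed_lp_estimate(x, y, p):
    """Approximate ||x - y||_p^p by classifying |x_j - y_j| into buckets
    [2^i, 2^(i+1)) and summing s_i * 2^(i*p).  Since every value in bucket i
    lies within a factor 2 of 2^i, the sum is within a 2^p factor."""
    s = Counter()
    for a, b in zip(x, y):
        d = abs(a - b)
        if d > 0:
            s[int(math.log2(d))] += 1        # bucket i with 2^i <= d < 2^(i+1)
    est = sum(cnt * 2 ** (i * p) for i, cnt in s.items())
    exact = sum(abs(a - b) ** p for a, b in zip(x, y))
    return est, exact

print(bucketed_lp_estimate([9, 1, 5, 7], [1, 1, 2, 3], p=3))  # (584, 603)
```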
30
Estimating Bucket Sizes
Remaining problem: estimate s_i = the number of coordinates with |x_j − y_j| in the range [2^i, 2^{i+1}).
Is this easy? No! One can show we need Ω(n) communication if …
Sometimes! [CCF-C] can help estimate s_i when i is large.
Our approach: whenever s_i is hard to estimate, we can detect this and set it to 0; otherwise, we estimate it.
Problem: aren't we undercounting? Answer: no! Hard s_i don't matter!
31
The CountSketch Protocol
[CCF-C] gives a 1-round, B-communication protocol which computes all j for which (x_j − y_j)² ≥ ||x − y||₂²/B.
Intuition: we can detect very large coordinates, where "large" is with respect to the L₂ norm (an L_p-to-L₂ connection, useful when s_i is large).
Looks promising! If s_i = O(1), we can compute s_i with O(n^{1−2/p}) communication.
32
Random Restriction
We would like to estimate s_i, given that we can efficiently output all coordinates j for which (x_j − y_j)² ≥ ||x − y||₂²/B.
Ideas? Not so obvious if s_i is large. Randomly restrict to a ≈ 1/s_i fraction of the coordinates j!
33
Random Restriction
Example with p = 3. The values |x_j − y_j| come in three groups:
- Θ(n) coordinates of value 1: contributes Θ(n) to ||x − y||₃³.
- n^{1/2} coordinates of value n^{1/4}: contributes n^{1/2}·(n^{1/4})³ = n^{5/4} to ||x − y||₃³.
- 1 coordinate of value n^{1/3}: contributes 1·(n^{1/3})³ = n to ||x − y||₃³.
The middle group dominates, but the CountSketch protocol cannot detect it. The reason is that each value in the middle group is small, but the group itself is large.
34
Random Restriction
We randomly restrict to n^{1/2} coordinates. Now the middle group has ≈ 1 surviving coordinate of value n^{1/4}, while the restricted vector has squared L₂ norm only ≈ n^{1/2}, so CountSketch can detect it (see the simulation below).
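A small simulation of this effect, checking the detectability condition v_j² ≥ ||v||₂²/B directly instead of running CountSketch (B = 1000 and the keep-probability are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1_000_000, 1000

# |x - y| with three groups: ~n coordinates of value 1,
# n^(1/2) of value n^(1/4), and one coordinate of value n^(1/3)
z = np.ones(n)
z[: int(n ** 0.5)] = n ** 0.25
z[0] = n ** (1 / 3)
rng.shuffle(z)

def detectable(v, B):
    """Count coordinates a CountSketch-style protocol can recover:
    those j with v_j^2 >= ||v||_2^2 / B."""
    return int(np.sum(v ** 2 >= np.dot(v, v) / B))

print(detectable(z, B))               # 1: only the n^(1/3) coordinate stands out
keep = rng.random(n) < 5 / n ** 0.5   # restrict to ~5 * n^(1/2) coordinates
print(detectable(z[keep], B))         # the few surviving middle-group
                                      # coordinates are now detectable
```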
35
Recap
Algorithm:
1. Classify the coordinates |x_j − y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip}.
Subroutine for step 2:
1. Randomly restrict to n/2, n/4, n/8, … coordinates.
2. For each restriction, use CountSketch to retrieve the largest elements, and classify them into groups.
3. Scale back to estimate s_i.
Guarantee: either you estimate s_i well, or s_i is tiny.
36
Wrapup
We give a 1-round, n^{1−2/p}-communication protocol.
- Optimal due to lower bounds [AMS, BJKS, CKS].
- Yields an optimal n^{1−2/p}-space streaming algorithm (resolves [AMS]).
Lots of details:
- Naive use of [CCF-C] requires more than one round, but we get one round.
- The randomness needed for the restrictions cannot be pure for the streaming algorithm; we use a PRG.
37
My Other Work
- Algorithms: longest common/increasing subsequence, computational biology, clustering
- Complexity theory: graph spanners, locally decodable codes
- Cryptography: broadcast encryption, torus-based crypto, PIR, inference control, practical secure function evaluation
38
Thank you!
39
The [CCF-C] Protocol
Alice and Bob share a random linear map h: [n] → {−1, 1}.
Alice computes Σ_j x_j h(j) and Bob computes Σ_j y_j h(j), giving
R = Σ_j (x − y)_j h(j) = Σ_j x_j h(j) − Σ_j y_j h(j).
Then E[h(i)·R] = Σ_j (x − y)_j·E[h(i)h(j)] = x_i − y_i.
Repeat many times to reduce the variance of the estimator.
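A minimal sketch of this estimator (sign repetitions with plain averaging; the full [CCF-C] CountSketch also hashes coordinates into buckets and takes medians to save communication):

```python
import numpy as np

def ccfc_estimate(x, y, i, reps=5000, seed=0):
    """Estimate (x - y)_i from shared random signs h: Alice sends
    sum_j x_j h(j), Bob sends sum_j y_j h(j), and E[h(i) * R] = x_i - y_i
    because E[h(i)h(j)] is 1 if i == j and 0 otherwise."""
    rng = np.random.default_rng(seed)
    n = len(x)
    ests = np.empty(reps)
    for r in range(reps):                        # repeat to reduce variance
        h = rng.choice([-1.0, 1.0], size=n)      # shared random linear map
        R = np.dot(x, h) - np.dot(y, h)          # R = sum_j (x - y)_j h(j)
        ests[r] = h[i] * R
    return ests.mean()

x = np.array([5.0, 1.0, 0.0, 2.0])
y = np.array([1.0, 1.0, 3.0, 2.0])
print(ccfc_estimate(x, y, i=0))                  # close to 4.0 = x_0 - y_0
```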