1
Efficient and Private Distance Approximation
David Woodruff, MIT
2
Outline
1. Two-Party Communication
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
3
The Communication Model
Alice holds x ∈ Σ^n and Bob holds y ∈ Σ^n. What is the distance D(x,y) between x and y?
For example, if Σ = {0,1}, what is the Hamming distance?
If Σ = ℝ, what is the L_p distance for some p ∈ (0, ∞)?
The L_p distance is (Σ_{i=1}^n |x_i − y_i|^p)^{1/p}.
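To make these definitions concrete, here is a minimal Python sketch of both distances (the helper names are mine, not from the talk):

```python
def lp_distance(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p) for equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def hamming_distance(x, y):
    """Number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance([0, 1, 1], [1, 1, 0]))    # 2
print(lp_distance([1.0, 0.0], [0.0, 1.0], p=2))  # 1.414... = sqrt(2)
```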
4
Application – Streaming Model
Example stream: 7 1 1 3 7 3 4 …
Want to mine a massive data stream:
- How many distinct elements?
- What's the most frequent item?
- Is the data uniform or skewed?
Elements arranged in adversarial order. Algorithms only allowed one pass.
Goal: low-space algorithms.
5
Application – Streaming Model
Distance approximation captures streaming primitives: distinct elements (Hamming), frequent items (L₂), skew (L_p).
Two-party communication vs. the streaming model:
- Communication lower bounds always yield space lower bounds.
- Protocols often yield algorithms; in this talk, most protocols yield streaming algorithms.
Thus, communication equals space.
6
Application – IP Session Data

Source     Destination  Bytes  Duration  Protocol
18.6.7.1   19.7.3.2     40K    28        http
10.6.2.3   12.3.4.8     20K    18        ftp
11.1.0.6   11.6.8.2     58K    22        http
12.3.1.5   14.7.0.1     30K    32        …
…          …            …      …         …

AT&T collects 100+ GB of NetFlow data every day.
7
Application – IP Session Data
AT&T needs to process a massive stream of network data:
- Traffic estimation: what fraction of network IP addresses are active? (distinct-elements computation)
- Traffic analysis: what are the 100 IP addresses with the most traffic? (frequent-items computation)
- Security / denial of service: are there any IP addresses witnessing a spike in traffic? (skewness computation)
8
Application – Secure Data Mining
For medical research, hospitals wish to mine their joint data. Distance approximation is useful in many mining algorithms, e.g., classification and clustering.
Patient confidentiality imposes strict laws on what information can be shared: mining cannot leak anything sensitive.
9
Issues
- Exact vs. approximate solution
- Efficiency: communication complexity, round complexity
- Security: neither party learns more than what the solution and his/her input imply about the other party's input
10
Initial Observations

               Exact              Approximate
Deterministic  Ω(n) (folklore)    Ω(n) (folklore)
Randomized     Ω(n) [KS, R]       ?

To cope with the Ω(n) communication bound, we look for randomized approximation algorithms.
11
Previous Results
Goal: output D' such that for all x, y: Pr[D(x,y) ≤ D'(x,y) ≤ (1+ε)·D(x,y)] ≥ 2/3.

Hamming distance:
- Communication: upper bound 1/ε² [FM79, BJKST02, …]; lower bound 1/ε (folklore)
- Private communication: upper bound n^{1/2} [FIMNSW01]; lower bound 1/ε

L₂:
- Communication: upper bound 1/ε² [AMS96]; lower bound 1/ε (folklore)
- Private communication: upper bound n (via SFE); lower bound 1/ε

L_p, p > 2:
- Communication: upper bound n^{1−1/(p−1)} [AMS96, CK04, G05]; lower bound n^{1−2/p} [AMS96, BJKS02, CKS03]
- Private communication: upper bound n (via SFE); lower bound n^{1−2/p}
12
Our Results [IW03, W04, IW05, IW06]

Hamming distance:
- Communication: upper bound 1/ε²; new lower bound Ω(1/ε²) for 1-round protocols (was 1/ε)
- Private communication: new upper bound O(1/ε²) with O(1) rounds (was n^{1/2}); lower bound 1/ε

L₂:
- Communication: upper bound 1/ε²; new lower bound Ω(1/ε²) for 1-round protocols (was 1/ε)
- Private communication: new upper bound O(1/ε²) with O(1) rounds (was n); lower bound 1/ε

L_p, p > 2:
- Communication: new upper bound O(n^{1−2/p}), 1-round (was n^{1−1/(p−1)}); lower bound n^{1−2/p}
- Private communication: upper bound still open (SFE gives n); lower bound n^{1−2/p}
13
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
14
Private L₂ Estimation
We improve the n^{1/2} upper bound to 1/ε² for private L₂, and our protocol uses O(1) rounds.
- Optimal up to suppressed logarithmic factors.
- Also holds for Hamming distance.
There was speculation that private approximation is much harder than non-private approximation; we refute this speculation.
15
Security Definition
What does privacy mean for distance computation? Alice does not learn anything about y other than what follows from her input x and D(x,y). This is the minimal requirement.
What does privacy mean for distance approximation? A first attempt: Alice does not learn anything about y other than what follows from x and the approximation D'(x,y). Does this work? Not sufficient!
16
Security Definition
Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n. Suppose Σ = {0,1}.
Set the LSB of D'(x,y) to be y_n, and the remaining bits of D'(x,y) to agree with those of D(x,y).
Then D'(x,y) is a ±1 approximation, but Alice learns y_n, which doesn't follow from x and D(x,y).
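A tiny illustration of this counterexample (hypothetical helper, not from the talk): the output is always within ±1 of the exact distance, yet its low bit hands Alice the bit y_n.

```python
def leaky_approx(D, y_n):
    """Clear the LSB of the exact distance D and replace it with Bob's bit y_n.
    |leaky_approx(D, y_n) - D| <= 1, but the output reveals y_n exactly."""
    return (D & ~1) | y_n

for D in range(4):
    print(D, leaky_approx(D, 0), leaky_approx(D, 1))
```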
17
Security Definition
New requirement: what does privacy mean for distance approximation? Alice and Bob don't learn anything about each other's input other than what follows from their own input and the exact distance D(x,y).
Implication: D'(x,y) is determined by D(x,y) and the randomness.
How do we model the power of the cheating parties?
18
Security Models
Semi-honest: parties follow their instructions, but try to learn more than what is prescribed.
Malicious: parties deviate from the protocol arbitrarily:
- use a different input
- force the other party to output a wrong answer
- abort before the other party learns the answer
It is difficult to achieve security in the malicious model…
19
Reductions – Yao, GMW, NN
A protocol secure in the semi-honest model yields a protocol secure in the malicious model, with the efficiency of the new protocol equal to the efficiency of the old protocol.
So it suffices to design protocols in the semi-honest model: the parties follow the instructions of the protocol, and we don't need to worry about "weird" behavior. We just ensure neither party learns anything about the other's input except what follows from the exact distance.
20
Our Protocol
Running example: Alice has x = e₁, Bob has y = e₂.
A first try: randomly sample a few coordinates j, compute (x_j − y_j)², and scale to estimate ||x − y||₂².
Problem: with high probability, all samples return 0, so the estimate is 0.
A second try: randomly rotate the vectors over ℝ^n (e₁ ↦ Me₁, e₂ ↦ Me₂), then try the sampling approach. Since M is a rotation, ||Mx − My||₂² = ||x − y||₂², and now the mass is "spread out", so sampling is effective (see the sketch below).
Problem: neither party can learn the samples, since with knowledge of M this reveals extra information.
Solution: we build a private sub-protocol to output an estimate from the samples, without revealing the samples.
The parties need to agree on the rotation M; this can be done with low communication using a PRG. Thus, the correctness and desired efficiency of the protocol are easy to verify.
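A minimal, non-private sketch of the rotate-then-sample idea (names and constants are illustrative; the real protocol never reveals the samples and derives M from a PRG rather than from a shared seed and an explicit n×n matrix):

```python
import numpy as np

def rotate_and_sample_estimate(x, y, num_samples, seed=0):
    """Estimate ||x - y||_2^2: apply a shared random rotation M, sample a few
    coordinates of M(x - y), and rescale.  Non-private illustration only."""
    n = len(x)
    rng = np.random.default_rng(seed)            # stands in for shared PRG output
    # Random rotation: orthonormal factor of a Gaussian matrix
    M, _ = np.linalg.qr(rng.standard_normal((n, n)))
    z = M @ (np.asarray(x, float) - np.asarray(y, float))
    j = rng.integers(0, n, size=num_samples)     # sampled coordinates
    return n * np.mean(z[j] ** 2)                # E[n * z_j^2] = ||x - y||_2^2

# Alice has e_1, Bob has e_2: direct sampling would almost always see 0,
# but after the rotation the mass is spread out and a few samples suffice.
n = 512
x, y = np.zeros(n), np.zeros(n)
x[0], y[1] = 1.0, 1.0
print(rotate_and_sample_estimate(x, y, num_samples=200))  # close to 2.0
```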
21
Private Sub-protocol
Problem: Alice learns (My)_j for some j (Bob's case is similar).
Solution: use an oblivious masked sampling protocol [FIMNSW]:
- Alice learns (My)_j ⊕ b for a random mask b, and Bob has b.
- Alice does not learn j.
22
Private Sub-protocol
Alice (x ∈ Σ^n): computes Mx and creates mask a.
Bob (y ∈ Σ^n): computes My and creates mask b.
They run the oblivious masked sampling protocol (both know M):
- Alice gets b ⊕ (My)_j for an unknown j.
- Bob gets a ⊕ (Mx)_j for an unknown j.
23
Private Sub-protocol
Alice has mask a and gets b ⊕ (My)_j for an unknown j; Bob has mask b and gets a ⊕ (Mx)_j for an unknown j.
From these shares, a low-communication private protocol computes (M(x−y))_j², and since j is random,
E_j[(M(x−y))_j²] = ||Mx − My||₂²/n = ||x − y||₂²/n.
24
Private Sub-protocol
Thus, the expectation depends only on the length!
1. Let T be an upper bound on ||x − y||₂².
2. The protocol outputs a bit c with Pr[c = 1] = n·(M(x−y))_j²/T ≈ ||x − y||₂²/T ≤ 1.
3. Since c is a bit, it is determined by its expectation.
Repeat a few times to get tight concentration. If most repetitions return c = 0, adjust T and repeat (see the sketch below).
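A minimal, non-private sketch of this loop on z = M(x − y) (in the real protocol each bit c comes from a secure computation on the masked shares; the halving schedule and constants below are my own illustrative choices):

```python
import numpy as np

def bit_estimate(z, T, reps, rng):
    """Sample j uniformly and emit c = 1 with probability n * z_j^2 / T,
    so E[T * mean(c)] = ||z||_2^2 when no clipping occurs (clipping is
    rare after the random rotation has spread out the mass)."""
    n = len(z)
    j = rng.integers(0, n, size=reps)
    p = np.clip(n * z[j] ** 2 / T, 0.0, 1.0)
    c = rng.random(reps) < p
    return T * c.mean()

def adaptive_estimate(z, reps=2000, seed=1):
    """Halve the upper bound T while most bits come back 0, so the final T
    is tight and the relative error of the estimate is small."""
    rng = np.random.default_rng(seed)
    T = float(len(z) * np.max(z ** 2))           # guarantees probabilities <= 1
    if T == 0.0:
        return 0.0
    est = bit_estimate(z, T, reps, rng)
    while est < T / 4:                           # mostly zeros: T is too generous
        T /= 2
        est = bit_estimate(z, T, reps, rng)
    return est
```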
25
Wrapup
We give an O(1)-round, 1/ε²-communication private protocol for the L₂ distance.
- Optimal up to suppressed logarithmic factors.
Details:
- The randomness is not truly random: it comes from a pseudorandom generator against non-uniform machines.
- The parties have bounded precision.
26
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
27
L_p Estimation for p > 2
We improve the n^{1−1/(p−1)} communication upper bound to n^{1−2/p}, and our protocol is 1-round.
Achieving this privately is still an open problem.
28
L_p Estimation for p > 2
Problem: rotation doesn't work for p > 2. For example, a 45° rotation:

Vector            L₂²   L₄⁴
(1, 0)            1     1
(1/√2, 1/√2)      1     1/2

Rotation preserves the L₂ norm but changes the L₄ norm. It is not clear how to "re-randomize" L_p for p > 2. We need a new approach…
29
L_p Estimation for p > 2
Alice holds x ∈ {1, …, m}^n, Bob holds y ∈ {1, …, m}^n.
Strategy:
1. Classify the coordinates |x_j − y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip} (see the sketch below).
We will approximate ||x − y||_p^p to within a constant factor. One source of error: the s_i are approximate. Another source: the bucket values are approximate. Overall, the estimate is still within a constant factor.
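A minimal sketch of the bucketing estimate, using exact bucket sizes in place of the estimated s_i (the output formula Σ_i s_i·2^{ip} is my reconstruction of the elided "Output" expression):

```python
import math
from collections import Counter

def bucketed_lp_estimate(x, y, p):
    """Approximate ||x - y||_p^p by classifying |x_j - y_j| into buckets
    [2^i, 2^(i+1)) and summing s_i * 2^(i*p).  Since every value in bucket i
    lies within a factor 2 of 2^i, the sum is within a 2^p factor."""
    s = Counter()
    for a, b in zip(x, y):
        d = abs(a - b)
        if d > 0:
            s[int(math.log2(d))] += 1        # bucket i with 2^i <= d < 2^(i+1)
    est = sum(cnt * 2 ** (i * p) for i, cnt in s.items())
    exact = sum(abs(a - b) ** p for a, b in zip(x, y))
    return est, exact

print(bucketed_lp_estimate([9, 1, 5, 7], [1, 1, 2, 3], p=3))  # (584, 603)
```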
30
Estimating Bucket Sizes
Remaining problem: estimate s_i = the number of coordinates with |x_j − y_j| in the range [2^i, 2^{i+1}).
Is this easy? No! One can show we need Ω(n) communication if …
Sometimes! [CCF-C] can help estimate s_i when i is large.
Our approach: whenever s_i is hard to estimate, we can detect this and set it to 0; otherwise, we estimate it.
Problem: aren't we undercounting? Answer: no! Hard s_i don't matter!
31
The CountSketch Protocol
[CCF-C] gives a 1-round, B-communication protocol which computes all j for which (x_j − y_j)² ≥ ||x − y||₂²/B.
Intuition: we can detect very large coordinates, where "large" is with respect to the L₂ norm (an L_p-to-L₂ connection, useful when s_i is large).
Looks promising! If s_i = O(1), we can compute s_i with O(n^{1−2/p}) communication.
32
Random Restriction
We would like to estimate s_i, given that we can efficiently output all coordinates j for which (x_j − y_j)² ≥ ||x − y||₂²/B.
Ideas? Not so obvious if s_i is large. Randomly restrict to a ≈ 1/s_i fraction of the coordinates j!
33
Random Restriction
Example with p = 3. The values |x_j − y_j| come in three groups:
- Θ(n) coordinates of value 1: contributes Θ(n) to ||x − y||₃³.
- n^{1/2} coordinates of value n^{1/4}: contributes n^{1/2}·(n^{1/4})³ = n^{5/4} to ||x − y||₃³.
- 1 coordinate of value n^{1/3}: contributes 1·(n^{1/3})³ = n to ||x − y||₃³.
The middle group dominates, but the CountSketch protocol cannot detect it. The reason is that each value in the middle group is small, but the group itself is large.
34
Random Restriction
We randomly restrict to n^{1/2} coordinates. Now the middle group has ≈ 1 surviving coordinate of value n^{1/4}, while the restricted vector has squared L₂ norm only ≈ n^{1/2}, so CountSketch can detect it (see the simulation below).
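A small simulation of this effect, checking the detectability condition v_j² ≥ ||v||₂²/B directly instead of running CountSketch (B = 1000 and the keep-probability are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1_000_000, 1000

# |x - y| with three groups: ~n coordinates of value 1,
# n^(1/2) of value n^(1/4), and one coordinate of value n^(1/3)
z = np.ones(n)
z[: int(n ** 0.5)] = n ** 0.25
z[0] = n ** (1 / 3)
rng.shuffle(z)

def detectable(v, B):
    """Count coordinates a CountSketch-style protocol can recover:
    those j with v_j^2 >= ||v||_2^2 / B."""
    return int(np.sum(v ** 2 >= np.dot(v, v) / B))

print(detectable(z, B))               # 1: only the n^(1/3) coordinate stands out
keep = rng.random(n) < 5 / n ** 0.5   # restrict to ~5 * n^(1/2) coordinates
print(detectable(z[keep], B))         # the few surviving middle-group
                                      # coordinates are now detectable
```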
35
Recap
Algorithm:
1. Classify the coordinates |x_j − y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip}.
Subroutine for step 2:
1. Randomly restrict to n/2, n/4, n/8, … coordinates.
2. For each restriction, use CountSketch to retrieve the largest elements, and classify them into groups.
3. Scale back to estimate s_i.
Guarantee: either you estimate s_i well, or s_i is tiny.
36
Wrapup
We give a 1-round, n^{1−2/p}-communication protocol.
- Optimal due to lower bounds [AMS, BJKS, CKS].
- Yields an optimal n^{1−2/p}-space streaming algorithm (resolves [AMS]).
Lots of details:
- Naive use of [CCF-C] requires more than one round, but we get one round.
- The randomness needed for the restrictions cannot be pure for the streaming algorithm; we use a PRG.
37
My Other Work
- Algorithms: longest common/increasing subsequence, computational biology, clustering
- Complexity theory: graph spanners, locally decodable codes
- Cryptography: broadcast encryption, torus-based crypto, PIR, inference control, practical secure function evaluation
38
Thank you!
39
The [CCF-C] Protocol
Alice and Bob share a random linear map h: [n] → {−1, 1}.
Alice computes Σ_j x_j h(j) and Bob computes Σ_j y_j h(j), giving
R = Σ_j (x − y)_j h(j) = Σ_j x_j h(j) − Σ_j y_j h(j).
Then E[h(i)·R] = Σ_j (x − y)_j·E[h(i)h(j)] = x_i − y_i.
Repeat many times to reduce the variance of the estimator.
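A minimal sketch of this estimator (sign repetitions with plain averaging; the full [CCF-C] CountSketch also hashes coordinates into buckets and takes medians to save communication):

```python
import numpy as np

def ccfc_estimate(x, y, i, reps=5000, seed=0):
    """Estimate (x - y)_i from shared random signs h: Alice sends
    sum_j x_j h(j), Bob sends sum_j y_j h(j), and E[h(i) * R] = x_i - y_i
    because E[h(i)h(j)] is 1 if i == j and 0 otherwise."""
    rng = np.random.default_rng(seed)
    n = len(x)
    ests = np.empty(reps)
    for r in range(reps):                        # repeat to reduce variance
        h = rng.choice([-1.0, 1.0], size=n)      # shared random linear map
        R = np.dot(x, h) - np.dot(y, h)          # R = sum_j (x - y)_j h(j)
        ests[r] = h[i] * R
    return ests.mean()

x = np.array([5.0, 1.0, 0.0, 2.0])
y = np.array([1.0, 1.0, 3.0, 2.0])
print(ccfc_estimate(x, y, i=0))                  # close to 4.0 = x_0 - y_0
```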