1
Sketching in Adversarial Environments, or: Sublinearity and Cryptography
Moni Naor
Joint work with Ilya Mironov and Gil Segev
2
Comparing Streams
How to compare data streams without storing them?
Step 1: Compress data on-line into sketches (S_A and S_B)
Step 2: Interact using only the sketches
Goal: Minimize sketches, update time, and communication
3
Comparing Streams
How to compare data streams that cannot be stored?
Real-life applications: massive data sets, on-line data, ...
Highly efficient solutions assuming shared randomness
4
Comparing Streams
How to compare data streams that cannot be stored?
Is shared randomness a reasonable assumption?
No guarantees when set adversarially
Inputs may be adversarially chosen depending on the randomness
Plagiarism detection
5
The Adversarial Sketch Model
Communication complexity
Massive data sets: sketching, streaming
“Adversarial” factors: no secrets, adversarially-chosen inputs
6
Goal: Compute f(A,B)
Sketch phase (small sketches, fast updates):
An adversary chooses the inputs of the parties
Provided as on-line sequences of insert and delete operations
No shared secrets
The parties are not allowed to communicate
Any public information is known to the adversary in advance
Adversary is computationally all-powerful
Interaction phase (low communication & computation)
7
Our Results
Equality testing: A, B ⊆ [N] of size at most K, error probability ε
If we had public randomness: sketches of size O(log(1/ε)), with similar update time, communication and computation
Lower Bound: Equality testing in the adversarial sketch model requires sketches of size (K · log(N/K))^{1/2}
8
Our Results
Equality testing: A, B ⊆ [N] of size at most K, error probability ε
Lower Bound: Equality testing in the adversarial sketch model requires sketches of size (K · log(N/K))^{1/2}
Upper Bound: Explicit and efficient protocol
Sketches of size (K · polylog(N) · log(1/ε))^{1/2}
Update time, communication and computation polylog(N)
9
Our Results
Symmetric difference approximation: A, B ⊆ [N] of size at most K
Goal: approximate |A Δ B| with error probability ε
Upper Bound: (1 + ρ)-approximation for any constant ρ
Sketches of size (K · polylog(N) · log(1/ε))^{1/2}
Update time, communication and computation polylog(N)
Explicit construction: polylog(N)-approximation
10
Outline
Lower bound
Equality testing (main tool: incremental encoding; explicit construction using dispersers)
Symmetric difference approximation
Summary & open problems
11
Simultaneous Messages Model
Two parties hold inputs x and y; each sends a single message to a referee, who outputs f(x,y)
12
Simultaneous Messages Model
Lower Bound [NS96, BK97]: Equality testing in the private-coin SM model requires communication (K · log(N/K))^{1/2}
The messages in the SM model correspond to the sketches in the adversarial sketch model
13
Outline
Lower bound
Equality testing (main tool: incremental encoding; explicit construction using dispersers)
Symmetric difference approximation
Summary & open problems
14
Simultaneous Equality Testing
x is encoded as C(x), y as C(y)
Each codeword of length K is arranged as a K^{1/2} × K^{1/2} matrix
Communication K^{1/2}
15
First Attempt
One sketch is a row of C(A) (row = 3), the other a column of C(B) (col = 2); they are compared at entry (3,2)
Sketches of size K^{1/2}
Problem: update time K^{1/2}
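A minimal sketch of the row/column comparison described on this slide. It assumes the codewords are already given as flat lists whose length is a perfect square; the error-correcting code that produces them is not modelled, and the function names are hypothetical.

```python
import math

def _side(codeword):
    """Side length of the square matrix view of a flat codeword."""
    side = math.isqrt(len(codeword))
    if side * side != len(codeword):
        raise ValueError("codeword length must be a perfect square")
    return side

def row_sketch(codeword, row):
    """Sketch kept by the first party: one full row of the matrix view."""
    side = _side(codeword)
    return codeword[row * side:(row + 1) * side]

def col_sketch(codeword, col):
    """Sketch kept by the second party: one full column of the matrix view."""
    side = _side(codeword)
    return codeword[col::side]

def entries_agree(row, row_vals, col, col_vals):
    """Compare the two codewords at the single entry (row, col).

    row_vals[col] is the first codeword at (row, col);
    col_vals[row] is the second codeword at (row, col).
    """
    return row_vals[col] == col_vals[row]
```

Each sketch holds only K^{1/2} symbols, but a single update to the input may change every symbol of the stored row or column, which is the update-time problem the slide points out.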
16
Incrementality vs. Distance
High distance: For every distinct A, B ⊆ [N] of size at most K, d(C(A),C(B)) > 1 - ε (constant)
Incrementality: Given C(S) and x ∈ [N], the encodings of S ∪ {x} and S \ {x} are obtained by modifying very few (logarithmically many) entries
Impossible to achieve both properties simultaneously with Hamming distance
17
Incremental Encoding
S is encoded as C(S)_1, ..., C(S)_r
d(C(A),C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)), where d_H is the normalized Hamming distance
r = 1: Hamming distance
Hope: Larger r will enable fast updates
r corresponds to the communication complexity of our protocol; want to keep r as small as possible
Explicit construction with r = log K: codeword size K · polylog(N), update time polylog(N)
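The distance measure on this slide can be computed directly. The following is an illustrative sketch only: an encoding is modelled as a list of r codewords (plain Python lists), and the function names are hypothetical.

```python
def hamming_fraction(u, v):
    """Normalized Hamming distance d_H between two equal-length codewords."""
    if len(u) != len(v) or not u:
        raise ValueError("codewords must be non-empty and of equal length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

def incremental_distance(enc_a, enc_b):
    """d(C(A),C(B)) = 1 - prod_i (1 - d_H(C(A)_i, C(B)_i))."""
    agreement = 1.0
    for u, v in zip(enc_a, enc_b):
        agreement *= 1.0 - hamming_fraction(u, v)
    return 1.0 - agreement
```

With r = 1 the product has a single factor and the measure reduces to the normalized Hamming distance, matching the slide's remark.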
18
Equality Protocol
Sketches are rows (3,1,1) of C(A)_1, C(A)_2, C(A)_3 and cols (2,3,1) of C(B)_1, C(B)_2, C(B)_3; the values at the intersecting entries are compared
Error probability: 1 - d(C(A),C(B)) = ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)) < ε
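A hedged sketch of the comparison step: one position is sampled in each of the r codeword pairs, and the inputs are declared equal only if all sampled entries match. As a simplifying assumption, a single random generator picks each position, whereas on the slide the position is determined by a row chosen by one party and a column chosen by the other. The function name is hypothetical.

```python
import random

def sampled_entries_all_match(enc_a, enc_b, rng):
    """Return True iff one uniformly sampled entry per codeword pair agrees.

    enc_a and enc_b are lists of r codewords of matching lengths.
    For distinct inputs, the chance of returning True is
    prod_i (1 - d_H(C(A)_i, C(B)_i)) = 1 - d(C(A), C(B)).
    """
    for u, v in zip(enc_a, enc_b):
        pos = rng.randrange(len(u))
        if u[pos] != v[pos]:
            return False
    return True
```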
19
The Encoding
Global encoding: map each element to several entries of each codeword; exploit “random-looking” graphs
Local encoding: resolve collisions separately in each entry; a simple solution when |A Δ B| is guaranteed to be small
20
The Local Encoding
Suppose that |A Δ B| ≤ ℓ
21
Missing Number Puzzle
Let S = {1,...,N} \ {i}; a random permutation of S arrives as a one-way stream
One number i is missing
Goal: Determine the missing number i using O(log N) bits
What if there are ℓ missing numbers? Can it be done using O(ℓ · log N) bits?
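One standard way to solve the single-missing-number case, shown here as an illustration (the slide does not spell out a solution): keep one running counter initialized to 1 + 2 + ... + N and subtract every streamed value. The counter never exceeds N(N+1)/2, so it fits in O(log N) bits.

```python
def find_missing(stream, n):
    """Return the one number in 1..n that does not appear in the stream."""
    remaining = n * (n + 1) // 2  # sum of 1..n
    for value in stream:
        remaining -= value
    return remaining
```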
22
The Local Encoding
Suppose that |A Δ B| ≤ ℓ
A simple & well-known solution:
Associate each x ∈ [N] with v(x) such that for any distinct x_1,...,x_ℓ the vectors v(x_1),...,v(x_ℓ) are linearly independent
C(S) = Σ_{x ∈ S} v(x)
If 1 ≤ |A Δ B| ≤ ℓ then C(A) ≠ C(B)
For example v(x) = (1, x, ..., x^{ℓ-1})
Size & update time O(ℓ · log N), independent of the size of the sets
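A minimal sketch of this local encoding with v(x) = (1, x, ..., x^{ℓ-1}). It assumes arithmetic modulo a prime p > N supplied by the caller (the slide does not fix the field), and elements taken from 1..N. The class and function names are hypothetical.

```python
class LocalEncoding:
    """Maintains C(S) = sum over x in S of (1, x, ..., x^(ell-1)) mod prime."""

    def __init__(self, ell, prime):
        self.ell = ell
        self.prime = prime
        self.acc = [0] * ell

    def _apply(self, x, sign):
        # Add or subtract v(x); costs ell multiplications per update.
        power = 1
        for j in range(self.ell):
            self.acc[j] = (self.acc[j] + sign * power) % self.prime
            power = (power * x) % self.prime

    def insert(self, x):
        self._apply(x, +1)

    def delete(self, x):
        self._apply(x, -1)

    def value(self):
        return tuple(self.acc)

def encode(elements, ell, prime):
    """Encode a whole set by inserting its elements one at a time."""
    enc = LocalEncoding(ell, prime)
    for x in elements:
        enc.insert(x)
    return enc.value()
```

The state is ℓ residues modulo p regardless of how many elements have been inserted, which reflects the slide's point that size and update time do not depend on the size of the sets.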
23
The Global Encoding
Each element of the universe of size N is mapped into several entries of each codeword C_1, C_2, C_3
The content of each entry is locally encoded
24
The Global Encoding
Each element of the universe of size N is mapped into several entries of each codeword
The content of each entry is locally encoded
The local guarantee: If 1 ≤ |C_i[y] ∩ (A Δ B)| ≤ ℓ then C(A) and C(B) differ on C_i[y]
Consider ℓ = 1: C(A) and C(B) differ at least on the entries that contain exactly one element of A Δ B (for example C_1[2])
25
The Global Encoding
Identify each codeword with a bipartite graph G = ([N], R, E)
For S ⊆ [N] define Γ(S,ℓ) ⊆ R as the set of all y ∈ R for which 1 ≤ |Γ(y) ∩ S| ≤ ℓ
(K, ε, ℓ)-Bounded-Neighbor Disperser: For any S ⊂ [N] such that K ≤ |S| ≤ 2K it holds that |Γ(S,ℓ)| > (1 - ε)|R|
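The definition can be checked mechanically on a small graph. This is an illustrative sketch, not a construction: the graph is assumed to be given as a dictionary from each left vertex to its right neighbors, and the function names are hypothetical.

```python
def bounded_neighbor_set(neighbors, subset, ell):
    """Right vertices y with between 1 and ell neighbors inside subset."""
    hits = {}
    for x in subset:
        for y in set(neighbors[x]):
            hits[y] = hits.get(y, 0) + 1
    return {y for y, count in hits.items() if 1 <= count <= ell}

def covers_fraction(neighbors, right_size, subset, ell, eps):
    """Check the disperser condition for one subset S.

    Returns True iff more than (1 - eps) * right_size right vertices
    have between 1 and ell neighbors in S.
    """
    covered = bounded_neighbor_set(neighbors, subset, ell)
    return len(covered) > (1 - eps) * right_size
```

A graph is a (K, ε, ℓ)-bounded-neighbor disperser only if this check passes for every S with K ≤ |S| ≤ 2K, so the helper tests a single subset rather than the whole property.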
26
The Global Encoding
r = log K codewords C_1, C_2, C_3, ...; each C_i is identified with a (2^i, ε, ℓ)-Bounded-Neighbor Disperser (BND)
For i = log_2 |A Δ B| we have d_H(C(A)_i, C(B)_i) > 1 - ε
In particular d(C(A),C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)) > 1 - ε
27
Constructing BNDs
Universe of size N, codeword of length M
Given N and K, want to optimize M, ℓ, ε and the left-degree D
(K, ε, ℓ)-Bounded-Neighbor Disperser: For any S ⊂ [N] such that K ≤ |S| ≤ 2K it holds that |Γ(S,ℓ)| > (1 - ε)|R|
Comparison table (columns: Optimal, Extractor, Disperser; rows: M, D, ℓ), entries as listed on the slide: 1, polylog(N), log(N/K), 2^{(log log N)^2}, K · log(N/K), K · 2^{(log log N)^2}, K, polylog(N), O(1)
28
Outline
Lower bound
Equality testing (main tool: incremental encoding; explicit construction using dispersers)
Symmetric difference approximation
Summary & open problems
29
Symmetric Difference Approximation
1. Sketch input streams into codewords: A into C(A)_1, ..., C(A)_k and B into C(B)_1, ..., C(B)_k
2. Compare s entries from each pair of codewords; d_i = # of differing entries sampled from the i-th pair
3. Output APX = (1 + ρ)^i for the maximal i s.t. d_i ≳ (1 - ε)s
Guarantee: |A Δ B| ≤ APX ≤ (1 + ρ) · (KD / ((1 - ε)M)) · |A Δ B|
The factor KD / ((1 - ε)M) is ∼ 1 in the non-explicit construction and polylog(N) in the explicit one
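Step 3 can be sketched as a small selection rule. Two assumptions are made here and are not taken from the slide: the approximate threshold "≳" is implemented as a plain ">=", and the function returns None when no index passes the threshold. The function name is hypothetical.

```python
def approximate_from_counts(diff_counts, samples, eps, rho):
    """Return (1 + rho)**i for the largest i with d_i >= (1 - eps) * samples.

    diff_counts[0] is d_1, diff_counts[1] is d_2, and so on.
    Returns None if no index qualifies (an assumption of this sketch).
    """
    best = None
    for i, d_i in enumerate(diff_counts, start=1):
        if d_i >= (1 - eps) * samples:
            best = i
    return None if best is None else (1 + rho) ** best
```

For example, with s = 10 samples per pair, ε = 0.1 and ρ = 1, the counts (10, 10, 3) select i = 2 and give APX = 4.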
30
Outline
Lower bound
Equality testing (main tool: incremental encoding; explicit construction using dispersers)
Symmetric difference approximation
Summary & open problems
31
Summary
Formalized a realistic model for computation over massive data sets: the adversarial sketch model
Communication complexity
Massive data sets: sketching, streaming
“Adversarial” factors: no secrets, adversarially-chosen inputs
32
Summary
Formalized a realistic model for computation over massive data sets
Incremental encoding: main technical contribution; additional applications?
S is encoded as C(S)_1, ..., C(S)_r with d(C(A),C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i))
Determined the complexity of two fundamental tasks: equality testing and symmetric difference approximation
33
Open Problems
Better explicit approximation for symmetric difference: our (1 + ρ)-approximation is non-explicit; the explicit approximation is polylog(N)
Approximating various similarity measures: L_p norms, resemblance, ...
The Power of Adversarial Sketching: characterizing the class of functions that can be “efficiently” computed in the adversarial sketch model (sublinear sketches, polylog updates)
Possible approach: public-coins to private-coins transformation that “preserves” the update time
34
Computational Assumptions
Better schemes using computational assumptions?
Equality testing: incremental collision-resistant hashing [BGG ’94]
Significantly smaller sketches
Existing constructions either have very long public descriptions, or rely on random oracles
Practical constructions without random oracles?
Symmetric difference approximation: not known, even with random oracles!
Thank you!
35
Pan-Privacy Model
Data is a stream of items; each item belongs to a user
Data of different users is interleaved arbitrarily
Curator sees items, updates internal state, and produces output at stream end
Pan-Privacy: For every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private
Can also consider multiple intrusions
36
Adjacency: User Level
Universe U of users whose data is in the stream; x ∈ U
Streams are x-adjacent if they have the same projections of users onto U \ {x}
Example: axbxcxdxxxex and abcdxe are x-adjacent; both project to abcde
Notion of “corresponding locations” in x-adjacent streams
U-adjacent: ∃ x ∈ U for which they are x-adjacent; simply “adjacent” if U is understood
Note: Streams of different lengths can be adjacent
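The adjacency definition translates into a short check. As a simplifying assumption, each stream item is identified with the user it belongs to (one character per item, as in the slide's example); the function names are hypothetical.

```python
def project_out(stream, user):
    """Drop every item belonging to the given user."""
    return [item for item in stream if item != user]

def x_adjacent(stream_a, stream_b, user):
    """True iff the streams agree once the given user's items are removed."""
    return project_out(stream_a, user) == project_out(stream_b, user)

def adjacent(stream_a, stream_b, universe):
    """True iff the streams are x-adjacent for some user x in the universe."""
    return any(x_adjacent(stream_a, stream_b, x) for x in universe)
```

Applied to the slide's example, `x_adjacent("axbxcxdxxxex", "abcdxe", "x")` holds because both streams project to abcde.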
37
Example: Stream Density or # Distinct Elements
Universe U of users; estimate how many distinct users in U appear in the data stream
Application: # distinct users who searched for “flu”
Ideas that don’t work:
Naïve: keep list of users that appeared (bad privacy and space)
Streaming: track random sub-sample of users (bad privacy); hash each user, track minimal hash (bad privacy)
38
Pan-Private Density Estimator
Inspired by randomized response
Store for each user x ∈ U a single bit b_x
Initially all b_x are drawn from distribution D_0: 0 w.p. ½, 1 w.p. ½
When encountering x, redraw b_x from distribution D_1: 0 w.p. ½ - ε, 1 w.p. ½ + ε
Final output: [(fraction of 1’s in table - ½)/ε] + noise
Pan-Privacy: if a user never appeared, the entry is drawn from D_0; if a user appeared any # of times, the entry is drawn from D_1
D_0 and D_1 are 4ε-differentially private
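A minimal sketch of the estimator as described on this slide. The slide does not specify the output noise, so it is left as a caller-supplied value defaulting to 0, which is an assumption of this sketch and is not private on its own. The class and method names are hypothetical.

```python
import random

class PanPrivateDensity:
    """One bit per user; bits are redrawn with a bias of eps on each appearance."""

    def __init__(self, universe, eps, rng=None):
        self.eps = eps
        self.rng = rng or random.Random()
        # Initial state: every bit drawn from D_0 (1 with probability 1/2).
        self.bits = {x: int(self.rng.random() < 0.5) for x in universe}

    def observe(self, x):
        # Redraw from D_1 (1 with probability 1/2 + eps), ignoring the old bit.
        self.bits[x] = int(self.rng.random() < 0.5 + self.eps)

    def estimate(self, noise=0.0):
        # The expected fraction of 1's is 1/2 + eps * density, so invert that.
        frac_ones = sum(self.bits.values()) / len(self.bits)
        return (frac_ones - 0.5) / self.eps + noise
```

Because each bit is redrawn from scratch, the stored state reveals only whether an entry came from D_0 or D_1, never how many times a user appeared.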