1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006 http://www.ee.technion.ac.il/courses/049011
2 Data Streams (cont.)
3 Outline Distinct elements; L_p norms. Notation: for integers a < b, [a,b] = {a, a+1, …, b}.
4 Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02] Input: a vector x ∈ [1,m]^n. Goal: find D = the number of distinct elements of x. Exact algorithms: need Ω(m) bits of space. Deterministic algorithms: need Ω(m) bits of space. Approximate randomized algorithms: O(log m) bits of space.
5 Distinct Elements, 1st Attempt Let M >> m². Pick a “random hash function” h: [1,m] → [1,M]: h(1),…,h(m) are chosen uniformly and independently from [1,M]. Since M >> m², the probability of collisions is tiny.
1. min ← M
2. for i = 1 to n do
3.   read x_i from the stream
4.   if h(x_i) < min, min ← h(x_i)
5. output M/min
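A minimal runnable sketch of this first attempt, simulating the truly random hash function with a lazily filled table of uniform values; the function name, the choice M = m³, and the example stream are illustrative assumptions, not part of the slides:

```python
import random

def distinct_estimate_ideal(stream, m):
    """First attempt: idealized random hash; estimate D by M / min h(x_i)."""
    M = m ** 3                      # M >> m^2, so collisions among hash values are unlikely
    h = {}                          # lazily simulates a truly random function [1,m] -> [1,M]
    minimum = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        minimum = min(minimum, h[x])
    return M / minimum

# Example with D = 4 distinct elements:
# print(distinct_estimate_ideal([3, 1, 4, 1, 5, 3, 3], m=10))
```

Storing the simulated hash table is exactly the O(m log M) space problem discussed on the next slide; the point of the sketch is only the estimator itself.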
6 Distinct Elements: Analysis Space: O(log M) = O(log m) for min, but O(m log M) = O(m log m) for h. Too much! Worse than the naïve O(m)-space algorithm. Next: show how to use more “space-efficient” hash functions.
7 Small Families of Hash Functions H = { h | h: [1,m] → [1,M] }: a family of hash functions with |H| = O(m^c) for some constant c. Therefore, each h ∈ H can be represented in O(log m) bits. Need H to be “explicit”: given the representation of h, we can compute h(x) efficiently for any x. How do we make sure H has the “random-like” properties of truly random hash functions?
8 Universal Hash Functions [Carter, Wegman 79] H is a 2-universal family of hash functions if: for all x ≠ y ∈ [1,m] and for all z,w ∈ [1,M], when choosing h from H uniformly at random, Pr[h(x) = z and h(y) = w] = 1/M². Conclusions: for each x, h(x) is uniform in [1,M]; for all x ≠ y, h(x) and h(y) are independent; h(1),…,h(m) is a sequence of uniform pairwise-independent random variables. k-universal families: straightforward generalization.
9 Construction of a Universal Family Suppose M is a prime power. Then [1,M] can be viewed as the finite field F_M, and [1,m] can be viewed as elements of F_M. H = { h_{a,b} | a,b ∈ F_M } is defined by h_{a,b}(x) = ax + b. Note: |H| = M². If x ≠ y ∈ F_M and z,w ∈ F_M, then h_{a,b}(x) = z and h_{a,b}(y) = w iff ax + b = z and ay + b = w. Since x ≠ y, this system has a unique solution (a,b). Hence, Pr_{a,b}[h_{a,b}(x) = z and h_{a,b}(y) = w] = 1/M².
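A minimal sketch of this construction, taking M to be a prime (rather than a general prime power) so that arithmetic mod M already gives the field F_M; the class name and interface are illustrative:

```python
import random

class UniversalHash:
    """h_{a,b}(x) = (a*x + b) mod p: a 2-universal family over the field F_p.

    Only the pair (a, b) is stored, i.e. O(log M) bits, unlike a truly
    random function, which needs a table of m values."""

    def __init__(self, p):
        self.p = p                       # p plays the role of M; assumed prime here
        self.a = random.randrange(p)
        self.b = random.randrange(p)

    def __call__(self, x):
        return (self.a * x + self.b) % self.p
```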
10 Distinct Elements, 2nd Attempt Use a 2-universal hash function rather than a truly random hash function. Space: O(log m) for tracking the minimum, plus O(log m) for storing the hash function. Correctness: Part 1: h(a_1),…,h(a_D) are still uniform in [1,M], and linearity of expectation holds regardless of whether Z_1,…,Z_k are independent or not. Part 2: h(a_1),…,h(a_D) are still pairwise independent; main point: the variance of a sum of pairwise-independent variables is additive: Var[Z_1 + … + Z_k] = Var[Z_1] + … + Var[Z_k].
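Plugging the 2-universal family into the earlier sketch gives the O(log m)-space version; the prime modulus below is an arbitrary illustrative choice assumed to satisfy M >> m²:

```python
def distinct_estimate_universal(stream, m):
    """Second attempt: the same estimator, but with a 2-universal hash.

    Total space is O(log m): the pair (a, b) plus the running minimum."""
    M = 1_000_003                    # a prime; assumed >> m^2 (fine for, say, m <= 1000)
    h = UniversalHash(M)             # the 2-universal family sketched above
    minimum = M
    for x in stream:
        minimum = min(minimum, h(x))
    return M / minimum
```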
11 Distinct Elements, Better Approximation So far we had a factor-6 approximation. How do we get a better one? (1 + ε)-approximation algorithm: find the t = O(1/ε²) smallest hash values, rather than just the smallest one. If v is the largest among these, output tM/v. Space: O((1/ε²)·log m). Better algorithm: O(1/ε² + log m).
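A hedged sketch of this refinement, again using the hypothetical UniversalHash helper from above; a max-heap keeps the t smallest distinct hash values seen so far:

```python
import heapq

def distinct_estimate_eps(stream, m, eps):
    """(1+eps)-approximation: keep the t smallest distinct hash values, output t*M/v."""
    M = 1_000_003                        # a prime, assumed >> m^2
    t = max(1, int(1 / eps ** 2))
    h = UniversalHash(M)
    kept = []                            # max-heap via negation: the t smallest hashes so far
    in_kept = set()
    for x in stream:
        v = h(x)
        if v in in_kept:
            continue
        if len(kept) < t:
            heapq.heappush(kept, -v)
            in_kept.add(v)
        elif v < -kept[0]:               # v beats the current t-th smallest value
            evicted = -heapq.heappushpop(kept, -v)
            in_kept.discard(evicted)
            in_kept.add(v)
    v = -kept[0]                         # largest among the t smallest hash values
    return t * M / v
```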
12 L_p Norms Input: an integer vector x ∈ [-m,+m]^n. Goal: find ||x||_p = (Σ_i |x_i|^p)^{1/p}, the L_p norm of x. Popular instantiations: L_2: Euclidean distance; L_1: Manhattan distance; L_∞: max_i |x_i|; L_0: # of non-zeros (with the conventions x^0 = 1 for x ≠ 0 and 0^0 = 0), not a norm. Data stream algorithm: when the entries of x arrive directly, this can be done trivially in O(log m) space.
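A small sketch of that trivial streaming computation when the entries of x arrive one by one (the function name and conventions are illustrative):

```python
def lp_norm_stream(entries, p):
    """Trivial streaming L_p norm: only a single running accumulator is kept."""
    if p == 0:
        return sum(1 for x in entries if x != 0)      # L_0: number of non-zeros
    if p == float("inf"):
        return max(abs(x) for x in entries)           # L_inf: maximum magnitude
    total = sum(abs(x) ** p for x in entries)
    return total ** (1 / p)

# print(lp_norm_stream([-2, 5, -2], p=1))   # 9
```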
13 L_p Norms: The “Cash Register” Model Input: a sequence X of N pairs (i_1,a_1),…,(i_N,a_N), where each i_j ∈ {1,…,n} and each a_j ∈ [-m,m]. Ex: X = (1,3), (3,-2), (1,-5), (2,4), (2,1). For each i = 1,…,n, let S_i = { j | i_j = i }. Ex: S_1 = {1,3}, S_2 = {4,5}, S_3 = {2}. Define: x_i = Σ_{j ∈ S_i} a_j. Ex: x_1 = -2, x_2 = 5, x_3 = -2. Goal: find ||x||_p = the L_p norm of x.
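To make the model concrete, here is a short offline helper that materializes x from the update pairs, run on the slide's example (the streaming algorithms that follow avoid ever building x explicitly):

```python
from collections import defaultdict

def cash_register_vector(pairs, n):
    """x_i = sum of a_j over all updates (i_j, a_j) with i_j = i.
    This offline version uses O(n) space, which is exactly what we want to avoid."""
    x = defaultdict(int)
    for i, a in pairs:
        x[i] += a
    return [x[i] for i in range(1, n + 1)]

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
# print(cash_register_vector(X, n=3))   # [-2, 5, -2], so ||x||_1 = 9
```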
14 L_p Norms in the “Cash Register” Model: Applications Standard L_p norms. L_p distances: input: two vectors x,y ∈ [-m,+m]^n (interleaved arbitrarily); goal: find ||x − y||_p. Frequency moments: input: a vector X ∈ [1,n]^N, e.g. X = (1 2 3 1 1 2); for each i = 1,…,n, define x_i = frequency of i in X, e.g. x_1 = 3, x_2 = 2, x_3 = 1; goal: output ||x||_p. Special cases: p = ∞: most frequent element; p = 0: distinct elements.
15 L_p Norms: State of the Art Results
0 < p ≤ 2: O(log n · log m) space algorithm [Indyk 00].
2 < p < ∞: O(n^{1−2/p} log m) space algorithm [Indyk, Woodruff 05]; Ω(n^{1−2/p−o(1)}) space lower bound [Saks, Sun 02], [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03].
p = ∞: O(n) space algorithm [Alon, Matias, Szegedy 96]; Ω(n) space lower bound [Alon, Matias, Szegedy 96].
p = 0 (distinct elements): O(log n + 1/ε²) space algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]; Ω(log n + 1/ε²) space lower bound [Alon, Matias, Szegedy 96], [Indyk, Woodruff 03].
16 Stable Distributions D: a distribution on ℝ; x ∈ ℝ^n; p ∈ (0,2]. The distribution D_x: Z_1,…,Z_n are i.i.d. random variables with distribution D, and D_x is the distribution of Σ_i x_i·Z_i. The distribution D_{p,x}: Z is a random variable with distribution D, and D_{p,x} is the distribution of ||x||_p·Z. Definition: D is p-stable if for every x, D_x = D_{p,x}. Examples: p = 2: standard normal distribution; p = 1: Cauchy distribution; other p's: no closed-form pdf.
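A quick empirical sanity check of 1-stability (not from the slides): for i.i.d. standard Cauchy Z_i, the combination Σ_i x_i·Z_i should be distributed like ||x||_1·Z, so the medians of the absolute values of the two samples should agree:

```python
import math
import random

def check_1_stability(x, trials=100_000):
    """Compare median|sum_i x_i Z_i| with median|(sum_i |x_i|) * Z| for Cauchy Z_i."""
    std_cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))  # inverse-CDF sampling
    combo = [abs(sum(xi * std_cauchy() for xi in x)) for _ in range(trials)]
    scaled = [abs(sum(abs(xi) for xi in x) * std_cauchy()) for _ in range(trials)]
    median = lambda s: sorted(s)[len(s) // 2]
    return median(combo), median(scaled)

# print(check_1_stability([3, -2, 5]))   # both values should be close to ||x||_1 = 10
```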
17 Indyk’s Algorithm For simplicity, assume p = 1. Input: a sequence X = (i_1,a_1),…,(i_N,a_N). Output: a value z s.t. with probability ≥ 1 − δ, (1 − ε)·||x||_1 ≤ z ≤ (1 + ε)·||x||_1. “Cauchy hash function”: h: [1,n] → ℝ, where h(1),…,h(n) are i.i.d. with the Cauchy distribution. In practice, use bounded precision.
18 Indyk’s Algorithm, 1st Attempt
1. k ← O(1/ε² · log(1/δ))
2. generate k Cauchy hash functions h_1,…,h_k
3. for t = 1,…,k do
4.   A_t ← 0
5. for j = 1,…,N do
6.   read (i_j, a_j) from the data stream
7.   for t = 1,…,k do
8.     A_t ← A_t + a_j · h_t(i_j)
9. output median(|A_1|,…,|A_k|)
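A hedged Python sketch of this attempt (Cauchy values sampled with the inverse-CDF tan trick; the full hash tables are stored explicitly, which is exactly the space problem addressed on the later slides; parameter choices and names are illustrative):

```python
import math
import random
import statistics

def indyk_l1_estimate(pairs, n, eps=0.1, delta=0.05):
    """Estimate ||x||_1 in the cash-register model via k Cauchy sketches A_1..A_k."""
    k = int((4 / eps ** 2) * math.log(1 / delta)) + 1
    # Each "Cauchy hash function" h_t is a table of n i.i.d. standard Cauchy values.
    h = [[math.tan(math.pi * (random.random() - 0.5)) for _ in range(n + 1)]
         for _ in range(k)]
    A = [0.0] * k
    for i, a in pairs:                   # stream of updates (i_j, a_j)
        for t in range(k):
            A[t] += a * h[t][i]
    return statistics.median(abs(At) for At in A)

# Running example: x = (-2, 5, -2), so ||x||_1 = 9
X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
# print(indyk_l1_estimate(X, n=3))
```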
19 Correctness Analysis Fix some t ∈ [1,k]. What value does A_t have at the end of the execution? A_t = Σ_j a_j·h_t(i_j) = Σ_i x_i·h_t(i). Recall: h_t(1),…,h_t(n) are i.i.d. with a 1-stable distribution. Therefore, A_t is distributed the same as ||x||_1·Z, where Z is a random variable with the Cauchy distribution.
20 Correctness Analysis (cont.) Z_1,…,Z_k: i.i.d. random variables with the Cauchy distribution. Output of the algorithm: median(|A_1|,…,|A_k|). Same as: median(||x||_1·|Z_1|,…,||x||_1·|Z_k|) = ||x||_1·median(|Z_1|,…,|Z_k|). Conclusion: it is enough to show that Pr[1 − ε ≤ median(|Z_1|,…,|Z_k|) ≤ 1 + ε] ≥ 1 − δ.
21 Correctness Analysis (cont.) Claim: Let Z be distributed Cauchy. Then, Pr[|Z| ≤ 1] = 1/2, i.e., the median of |Z| is 1. Proof: The cdf of the Cauchy distribution is F(z) = 1/2 + (1/π)·arctan(z). Therefore, Pr[|Z| ≤ 1] = F(1) − F(−1) = (2/π)·arctan(1) = 1/2. Claim: Let Z be distributed Cauchy. For any sufficiently small ε > 0, Pr[|Z| ≤ 1 − ε] ≤ 1/2 − ε/4 and Pr[|Z| ≤ 1 + ε] ≥ 1/2 + ε/4.
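For completeness, the two claims written out from the Cauchy cdf (a standard calculation, sketched here with the constant 1/4 matching the bound used on the next slide):

```latex
\[
F(z)=\tfrac12+\tfrac1\pi\arctan z,
\qquad
\Pr\bigl[|Z|\le s\bigr]=F(s)-F(-s)=\tfrac2\pi\arctan s .
\]
\[
\Pr\bigl[|Z|\le 1\bigr]=\tfrac2\pi\cdot\tfrac\pi4=\tfrac12,
\qquad
\Pr\bigl[|Z|\le 1-\varepsilon\bigr]
  =\tfrac12-\tfrac2\pi\bigl(\arctan 1-\arctan(1-\varepsilon)\bigr)
  \le \tfrac12-\tfrac\varepsilon4,
\]
since the density of $|Z|$, namely $\tfrac{2}{\pi(1+s^2)}$, is at least $\tfrac1\pi \ge \tfrac14$ on $[1-\varepsilon,\,1]$.
```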
22 Correctness Analysis (cont.) Claim: Let Z_1,…,Z_k be k = O(1/ε² · log(1/δ)) i.i.d. Cauchy random variables. Then, Pr[1 − ε ≤ median(|Z_1|,…,|Z_k|) ≤ 1 + ε] ≥ 1 − δ. Proof: For j = 1,…,k, let Y_j = 1 if |Z_j| < 1 − ε and Y_j = 0 otherwise. Then, median(|Z_1|,…,|Z_k|) < 1 − ε iff Σ_j Y_j ≥ k/2, and E[Σ_j Y_j] ≤ k/2 − kε/4. By the Chernoff-Hoeffding bound, Pr[Σ_j Y_j ≥ k/2] < δ/2. A similar analysis shows: Pr[median(|Z_1|,…,|Z_k|) > 1 + ε] < δ/2.
23 Space Analysis Space used: k = O(1/ε² · log(1/δ)) times: A_t: O(log m) bits; h_t: O(n log m) bits. Too much! This time we really need h_t(1),…,h_t(n) to be totally independent; otherwise, the resulting distribution is not stable, so we cannot use universal hashing. What can we do?
24 Pseudo-Random Generators for Space-Bounded Computations [Nisan 90] Notation: U_k = a random sequence of k bits. An S-space, R-random-bits randomized algorithm A: uses at most S bits of space; uses at most R random bits; accesses its random bits sequentially. A(x, U_R): the (random) output of A on input x. Nisan’s pseudo-random generator: G: {0,1}^{S·log R} → {0,1}^R s.t. for every S-space, R-random-bits randomized algorithm A and for every input x, A(x, U_R) has almost the same distribution as A(x, G(U_{S·log R})).
25 Space Analysis Suppose the input stream is guaranteed to come in the following order: first all pairs of the form (1,*); then all pairs of the form (2,*); …; finally all pairs of the form (n,*). Then we can generate the values h_t(1),…,h_t(n) on the fly, and there is no need to store them: O(log m) bits suffice to store the hash function. Therefore, for such input streams, Indyk’s algorithm uses O(log m) bits of space and O(n log m) random bits.
26 Space Analysis (cont.) Conclusion: For “ordered” input streams, Indyk’s algorithm is an O(log m)-space, O(n log m)-random-bits randomized algorithm, so we can use Nisan’s generator: h_t can now be generated from only O(log m · log n) random bits, and the space needed is O(log n · log m) bits. Crucial observation: Indyk’s algorithm does not depend on the order of the input stream. Conclusion: If we generate the Cauchy hash functions using Nisan’s generator, then Indyk’s algorithm will work even for “unordered” streams.
27 Wrapping Up Space used: k = O(1/ε² · log(1/δ)) times: A_t: O(log m) bits; h_t: O(log n · log m) bits (using Nisan’s generator). Total: O((1/ε²) · log(1/δ) · log n · log m) bits.
28 End of Lecture 13