1 Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 12, June 18, 2006
http://www.ee.technion.ac.il/courses/049011
2 Data Streams
3 Outline
- The data stream model
- Approximate counting
- Distinct elements
- Frequency moments
4 The Data Stream Model
- f: A^n → B, where A and B are arbitrary sets and n is a positive integer (think of n as large).
- Given x ∈ A^n, each entry x_i is called an "element". Typically, A and B are "small" (constant-size) sets.
- Goal: given x ∈ A^n, compute f(x). Frequently, an approximation of f(x) suffices. Usually, we will use randomization.
- Streaming access to the input: the algorithm reads the input in "sequential passes". In each pass, x is read in the order x_1, x_2, ..., x_n.
- Impossible: random access, going backwards.
- Possible: storing portions of x (or other functions of x) in memory.
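As a small illustration of the access model, here is a minimal Python sketch (the function and the toy stream are mine, not from the lecture): the input is consumed exactly once, in order, and only a small summary is kept in memory.

```python
def one_pass_summary(stream):
    """A single sequential pass: each element is seen exactly once, in order,
    and only a small amount of state (here, a count of 1's) is stored."""
    state = 0
    for x in stream:        # x_1, x_2, ..., x_n; no random access, no going back
        if x == 1:
            state += 1
    return state

# The stream may be generated on the fly and never stored in full.
print(one_pass_summary(iter([1, 0, 1, 1, 0])))  # -> 3
```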
5 Complexity Measures
- Space. Objective: use as little memory as possible. Note: with unlimited space, the data stream model is the same as the standard RAM model. Ideally, at most O(log^c n) for some constant c.
- Number of passes. Objective: use as few passes as possible. Ideally, only a single pass; usually, no more than a constant number of passes.
- Running time. Objective: use as little time as possible. Ideally, at most O(n log^c n) for some constant c.
6 Motivation
- Types of large data sets:
  - Pre-stored: kept on magnetic or optical media (tapes, disks, DVDs, ...).
  - Generated on the fly: data feeds, streaming media, packet streams, ...
- Access to large data sets:
  - Random access: costly (if the data is pre-stored), infeasible (if the data is generated on the fly).
  - Streaming access: the only feasible option.
- Resources:
  - Memory: the primary bottleneck.
  - Number of passes: a few (if the data is pre-stored), a single pass (if the data is generated on the fly).
  - Time: cannot be more than quasi-linear.
7 Approximate Counting [Morris 77, Flajolet 85]
- Input: a bit string x ∈ {0,1}^n.
- Goal: find H = the number of 1's in x.
- Naïve solution: just count them! This uses O(log H) bits of space.
- Can we do better?
  - Answer 1: No! Information theory implies an Ω(log H) lower bound.
  - Answer 2: Yes! But only approximately: output the power of 1+ε closest to H.
    Note: the number of possible outputs is O(log_{1+ε} H) = O((1/ε) log H), hence only O(log log H + log(1/ε)) bits of space suffice.
8 Approximate Counting (ε = 1)
  k ← 0
  for i = 1 to n do
    if x_i = 1, then with probability 1/2^k, set k ← k + 1
  output 2^k - 1

General idea: the expected number of 1's needed to increment k to k + 1 is 2^k.
- k = 0 → k = 1: after seeing 1 one
- k = 1 → k = 2: after seeing 2 additional 1's
- k = 2 → k = 3: after seeing 4 additional 1's
- ...
- k = i-1 → k = i: after seeing 2^{i-1} additional 1's
Therefore, we expect k to reach i after seeing 1 + 2 + 4 + ... + 2^{i-1} = 2^i - 1 ones.
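A Python rendering of this pseudocode (function and variable names are mine, not from the slides); the only state kept between elements is the counter k, so the space used is O(log log H) bits:

```python
import random

def morris_counter(bits):
    """Approximate counting with epsilon = 1: on each 1 seen, increment k with
    probability 1/2^k; return 2^k - 1 as the estimate of the number of 1's."""
    k = 0
    for x in bits:
        if x == 1 and random.random() < 1.0 / (2 ** k):
            k += 1
    return 2 ** k - 1

# Quick sanity check: the estimate is unbiased, so its average over many
# independent runs should be close to the true count of 1's.
stream = [1] * 1000 + [0] * 500
avg = sum(morris_counter(stream) for _ in range(2000)) / 2000
print(avg)  # typically close to 1000
```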
9 Approximate Counting: Analysis
For m = 0, ..., H, let K_m = the value of the counter after seeing m 1's.
For i = 0, ..., m, let p_{m,i} = Pr[K_m = i].

Recursion:
- p_{0,0} = 1
- p_{m,0} = 0, for m = 1, ..., H
- p_{m,i} = p_{m-1,i} · (1 - 1/2^i) + p_{m-1,i-1} · 1/2^{i-1}, for m = 1, ..., H and i = 1, ..., m-1
- p_{m,m} = p_{m-1,m-1} · 1/2^{m-1}, for m = 1, ..., H
10 Approximate Counting: Analysis
Define C_m = 2^{K_m}.
Lemma: E[C_m] = m + 1.
Therefore, C_H - 1 is an unbiased estimator for H. One can also show that Var[C_H] is small, and therefore w.h.p. H/2 ≤ C_H - 1 ≤ 2H.
Proof of lemma: by induction on m.
- Basis: E[C_0] = 1, E[C_1] = 2.
- Induction step: suppose m ≥ 2 and E[C_{m-1}] = m.
11 Approximate Counting: Analysis
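The inductive step can be written out as follows (a reconstruction that conditions on the counter value after m-1 ones and uses the recursion from slide 9 together with the hypothesis E[C_{m-1}] = m):

```latex
\begin{align*}
E[C_m] &= \sum_{i} \Pr[K_{m-1}=i]\, E\!\left[2^{K_m}\mid K_{m-1}=i\right] \\
       &= \sum_{i} p_{m-1,i}\left(\tfrac{1}{2^{i}}\cdot 2^{\,i+1}
            + \bigl(1-\tfrac{1}{2^{i}}\bigr)\cdot 2^{i}\right) \\
       &= \sum_{i} p_{m-1,i}\,(2^{i}+1)
        \;=\; E[C_{m-1}] + 1 \;=\; m + 1 .
\end{align*}
```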
12 Better Approximation
So far we have a factor-2 approximation. How do we obtain a 1+ε approximation?
  k ← 0
  for i = 1 to n do
    if x_i = 1, then with probability 1/(1+ε)^k, set k ← k + 1
  output ((1+ε)^k - 1)/ε
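The corresponding simulation, parameterized by ε (again a sketch with names of my own choosing): the update probability becomes 1/(1+ε)^k and the output ((1+ε)^k - 1)/ε.

```python
import random

def morris_counter_eps(bits, eps):
    """Approximate counting with base (1 + eps): increment k with probability
    1/(1+eps)^k, and return ((1+eps)^k - 1)/eps as the estimate."""
    k = 0
    for x in bits:
        if x == 1 and random.random() < 1.0 / ((1.0 + eps) ** k):
            k += 1
    return ((1.0 + eps) ** k - 1.0) / eps

stream = [1] * 1000
avg = sum(morris_counter_eps(stream, 0.1) for _ in range(2000)) / 2000
print(avg)  # close to 1000 on average; a single run concentrates better as eps shrinks
```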
13 Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
- Input: a vector x ∈ {1,2,...,m}^n.
- Goal: find D = the number of distinct elements of x.
  Example: if x = (1,2,3,1,2,3), then D = 3.
- Naïve solution: use a bit vector of size m and track the values that appear at least once. This uses O(m) bits of space.
- Can we do better?
  - Answer 1: No! Computing the exact number requires Ω(m) bits of space. (Information theory gives only Ω(log m); the Ω(m) bound needs communication complexity arguments.)
  - Answer 2: Yes! But only approximately, using only O(log m) bits of space.
14 Estimating the Size of a Random Set
Suppose we choose D << M^{1/2} elements uniformly and independently from {1,...,M}: X_1, X_2, ..., X_D, each uniformly chosen from {1,...,M}.
For each k = 1,...,D, how many elements of {1,...,M} do we expect to be smaller than min{X_1,...,X_k}?
- k = 1: we expect M/2 elements to be less than X_1
- k = 2: we expect M/3 elements to be less than min{X_1,X_2}
- k = 3: we expect M/4 elements to be less than min{X_1,X_2,X_3}
- ...
- k = D: we expect M/(D+1) elements to be less than min{X_1,...,X_D}
Conversely, suppose S is a set of randomly chosen elements from {1,...,M} whose size is unknown. Then, if t = min S, we can estimate |S| as M/t - 1.
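A quick Monte Carlo check of this heuristic (a throwaway sketch, not from the lecture): the estimate M/min - 1 is not unbiased, but it lands within a small constant factor of D with probability comfortably above 2/3, which is all the lemma on slide 16 needs.

```python
import random
import statistics

def estimate_set_size(M, D):
    """Draw D elements uniformly from {1,...,M}; estimate |S| = D as M/min - 1."""
    t = min(random.randint(1, M) for _ in range(D))
    return M / t - 1

M, D = 10**8, 100                                          # D << sqrt(M), as the slide assumes
runs = [estimate_set_size(M, D) for _ in range(2000)]
print(statistics.median(runs))                             # within a small factor of D
print(sum(D / 6 <= r <= 6 * D for r in runs) / len(runs))  # fraction in [D/6, 6D], typically ~0.84
```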
15 Distinct Elements, 1st Attempt
Let M >> m^2. Pick a random "hash function" h: {1,...,m} → {1,...,M}; that is, h(1),...,h(m) are chosen uniformly and independently from {1,...,M}. Since M >> m^2, the probability of collisions is tiny.
  min ← M
  for i = 1 to n do
    if h(x_i) < min, min ← h(x_i)
  output M/min
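A Python sketch of this first attempt (the fully random hash function is simulated by lazily memoizing a uniform value per symbol, which is an illustration device only; storing h compactly is exactly the issue slides 21-24 address):

```python
import random

def distinct_elements_first_attempt(stream, m):
    """Estimate the number of distinct elements as M/min, where min is the
    smallest hash value seen over the stream."""
    M = m ** 2 * 100            # M >> m^2, so hash collisions are unlikely
    h = {}                      # memoized "fully random" hash; simulation device only
    smallest = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        smallest = min(smallest, h[x])
    return M / smallest

stream = [1, 2, 3, 1, 2, 3] * 1000   # D = 3 distinct values
print(distinct_elements_first_attempt(stream, m=10))  # a constant-factor estimate of 3
```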
16 Distinct Elements: Analysis
- Space: O(log M) = O(log m). (Not quite; we'll discuss this later.)
- Correctness: let a_1,...,a_D be the distinct values among x_1,...,x_n. Then S = { h(a_1),...,h(a_D) } is a set of D random and independent elements of {1,...,M}. Note: min = min S, so the algorithm outputs M/(min S).
- Lemma: with probability at least 2/3, D/6 ≤ M/min ≤ 6D.
17 Distinct Elements: Correctness
Part 1: show that Pr[M/min > 6D] ≤ 1/6, i.e., that Pr[min < M/(6D)] ≤ 1/6.
Define for k = 1,...,D: Z_k = 1 if h(a_k) < M/(6D), and Z_k = 0 otherwise.
Define: Z = Z_1 + ... + Z_D.
Note: min < M/(6D) if and only if Z ≥ 1, and E[Z] = Σ_k Pr[h(a_k) < M/(6D)] ≈ D · 1/(6D) = 1/6.
18 Markov's Inequality
Let X ≥ 0 be a non-negative random variable and let t > 1. Then: Pr[X ≥ t·E[X]] ≤ 1/t.
Need to show: Pr[Z ≥ 1] ≤ 1/6.
By Markov's inequality (with t = 1/E[Z]), Pr[Z ≥ 1] = Pr[Z ≥ t·E[Z]] ≤ 1/t = E[Z] ≈ 1/6.
19 Distinct Elements: Correctness
Part 2: show that Pr[M/min < D/6] ≤ 1/6, i.e., that Pr[min > 6M/D] ≤ 1/6.
Define for k = 1,...,D: Y_k = 1 if h(a_k) ≤ 6M/D, and Y_k = 0 otherwise.
Define: Y = Y_1 + ... + Y_D.
Note: min > 6M/D if and only if Y = 0, and E[Y] = Σ_k Pr[h(a_k) ≤ 6M/D] ≈ D · 6/D = 6.
20 Chebyshev's Inequality
Let X be an arbitrary random variable and let t > 0. Then: Pr[|X - E[X]| ≥ t] ≤ Var[X]/t^2.
Need to show: Pr[Y = 0] ≤ 1/6.
By Chebyshev's inequality, Pr[Y = 0] ≤ Pr[|Y - E[Y]| ≥ E[Y]] ≤ Var[Y]/(E[Y])^2.
By the independence of Y_1,...,Y_D: Var[Y] = Σ_k Var[Y_k] ≤ Σ_k E[Y_k] = E[Y].
Hence, Pr[Y = 0] ≤ 1/E[Y] ≈ 1/6.
21 How to Store the Hash Function?
- How many bits are needed to represent a random hash function h: [m] → [M]? O(m log M) = O(m log m) bits, which is more than the naïve algorithm!
- Solution: use "small" families of hash functions.
  - H will be a family of functions h: [m] → [M] with |H| = O(m^c) for some constant c.
  - Each h ∈ H can be represented in O(log m) bits.
  - Need H to be "explicit": given the representation of h, we can compute h(x) efficiently for any x.
- How do we make sure H has the "random-like" properties of totally random hash functions?
22 Universal Hash Functions [Carter, Wegman 79]
H is a 2-universal family of hash functions if: for all x ≠ y ∈ [m] and for all z,w ∈ [M], when h is chosen from H at random, Pr[h(x) = z and h(y) = w] = 1/M^2.
Conclusions:
- For each x, h(x) is uniform in [M].
- For all x ≠ y, h(x) and h(y) are independent.
- h(1),...,h(m) is a sequence of uniform, pairwise-independent random variables.
k-universal families: a straightforward generalization.
23 Construction of a Universal Family
Suppose m = M and m is a prime power. [m] can then be identified with the finite field F_m.
Each pair of elements a,b ∈ F_m defines one hash function in H, so |H| = |F_m|^2 = m^2:
  h_{a,b}(x) = ax + b   (operations in F_m)
Note: if x ≠ y ∈ [m] and z,w ∈ [m], then h_{a,b}(x) = z and h_{a,b}(y) = w iff
  ax + b = z
  ay + b = w
Since x ≠ y, this system has a unique solution (a,b), and thus if we choose a,b at random, the probability of hitting the solution is exactly 1/m^2.
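A small Python sketch of this family, taking the field to be F_p for a prime p (a special case of the prime-power construction, so field arithmetic is just arithmetic mod p); a sampled function is stored as only the two coefficients a, b:

```python
import random

P = 2_147_483_647          # a prime; here m = M = P

def sample_hash():
    """Draw a, b uniformly from F_P and return h_{a,b}(x) = (a*x + b) mod P.
    Storing the function costs only the two coefficients: O(log P) bits."""
    a = random.randrange(P)
    b = random.randrange(P)
    return lambda x: (a * x + b) % P

h = sample_hash()
print(h(17), h(42))        # uniform, pairwise-independent values in {0,...,P-1}
```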
24 Distinct Elements, 2nd Attempt
Use a random hash function from a 2-universal family of hash functions rather than a totally random hash function.
Space:
- O(log m) for tracking the minimum
- O(log m) for storing the hash function
Correctness:
- Part 1: h(a_1),...,h(a_D) are still uniform in [M], and linearity of expectation holds regardless of whether Z_1,...,Z_D are independent or not.
- Part 2: h(a_1),...,h(a_D) are still uniform in [M]. Main point: the variance of pairwise-independent variables is additive: Var[Y_1 + ... + Y_D] = Var[Y_1] + ... + Var[Y_D].
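Why pairwise independence is enough here: expanding the variance of the sum, every cross term is a covariance of a pair of the Y_k's, and pairwise independence makes each covariance vanish:

```latex
\mathrm{Var}\!\left[\sum_{k=1}^{D} Y_k\right]
  = \sum_{k=1}^{D}\mathrm{Var}[Y_k] \;+\; \sum_{j\neq k}\mathrm{Cov}[Y_j,Y_k]
  = \sum_{k=1}^{D}\mathrm{Var}[Y_k],
\qquad
\mathrm{Cov}[Y_j,Y_k] = E[Y_jY_k]-E[Y_j]\,E[Y_k] = 0 .
```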
25 Distinct Elements, Better Approximation
So far we have a factor-6 approximation. How do we get a better one?
1+ε approximation algorithm:
- Find the t = O(1/ε^2) smallest hash values, rather than just the smallest one.
- If v is the largest among these, output tM/v.
Space: O((1/ε^2) · log m).
A better algorithm achieves O(1/ε^2 + log m).
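A sketch of this t-smallest-values variant (it reuses the simulated fully random hash from the earlier sketch purely for illustration, and the constant in t = 96/ε^2 is mine, not from the lecture):

```python
import heapq
import random

def distinct_elements_t_smallest(stream, m, eps):
    """Keep the t = O(1/eps^2) smallest distinct hash values seen; if v is the
    largest of them, output t*M/v as the distinct-elements estimate."""
    t = int(96 / eps ** 2)               # illustrative constant, not from the lecture
    M = m ** 2 * 100                     # M >> m^2, so hash collisions are unlikely
    h = {}                               # simulated fully random hash (illustration only)
    heap, in_heap = [], set()            # max-heap (negated) holding the t smallest values
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        v = h[x]
        if v in in_heap:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            in_heap.add(v)
        elif v < -heap[0]:               # v beats the current t-th smallest: swap it in
            in_heap.remove(-heapq.heappop(heap))
            heapq.heappush(heap, -v)
            in_heap.add(v)
    if len(heap) < t:                    # fewer than t distinct values: the count is exact
        return len(heap)
    return t * M / (-heap[0])

stream = [random.randint(1, 5000) for _ in range(200_000)]
print(distinct_elements_t_smallest(stream, m=5000, eps=0.25))  # close to ~5000 distinct values
```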
26 Frequency Moments [Alon, Matias, Szegedy 96]
- Input: a vector x ∈ {1,2,...,m}^n.
- Goal: find F_k, the k-th frequency moment of x: F_k = Σ_j f_j^k, where for each j ∈ {1,...,m}, f_j = the number of occurrences of j in x.
  Example: if x = (1,1,1,2,2,3), then f_1 = 3, f_2 = 2, f_3 = 1.
- Examples:
  - F_1 = n (counting)
  - F_0 = number of distinct elements
  - F_2 = a measure of "pairwise collisions"
  - F_k = a measure of "k-wise collisions"
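To make the definition concrete, here is the offline computation of f_j and F_k (this is just the definition, not a streaming algorithm; it uses O(m) space):

```python
from collections import Counter

def frequency_moment(x, k):
    """F_k = sum over symbols j of f_j^k, where f_j is the count of j in x.
    For k = 0, each symbol that appears contributes 1, so F_0 = # distinct."""
    freqs = Counter(x)                   # f_j for every j that occurs in x
    return sum(f ** k for f in freqs.values())

x = (1, 1, 1, 2, 2, 3)
print(frequency_moment(x, 1))  # 6  = n
print(frequency_moment(x, 0))  # 3  = number of distinct elements
print(frequency_moment(x, 2))  # 14 = 3^2 + 2^2 + 1^2
```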
27 Frequency Moments: Data Stream Algorithms
- F_0: O(1/ε^2 + log m)
- F_1: O(log log n + log(1/ε))
- F_2: O(1/ε^2 · (log m + log n))
- F_k, k > 2: O(1/ε^2 · m^{1-2/k})
28 End of Lecture 12