Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)
CS 361A
Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]
Theorem: Let E be an estimator for D(X) that examines r < n values of X, possibly in an adaptive and randomized order. Then, for any γ, E must have ratio error at least sqrt( ((n-r)/(2r)) · ln(1/γ) ) with probability at least γ.
Example
–Say, r = n/5 and γ = 1/2
–Then the error is at least sqrt(2 ln 2) ≈ 1.18, i.e., roughly 20%, with probability 1/2
Scenario Analysis
Scenario A:
–all values in X are identical (say V)
–D(X) = 1
Scenario B:
–distinct values in X are {V, W1, …, Wk}
–V appears n-k times
–each Wi appears once
–the Wi's are randomly distributed
–D(X) = k+1
Proof
Little Birdie – one of Scenarios A or B only
Suppose
–E examines elements X(1), X(2), …, X(r) in that order
–the choice of X(i) can be randomized and depend arbitrarily on the values of X(1), …, X(i-1)
Lemma: P[ X(i)=V | X(1)=X(2)=…=X(i-1)=V ] >= 1 - k/(n-i+1)
Why?
–Seeing only V's gives no information on whether we are in Scenario A or B
–In Scenario B the Wi values are randomly distributed, so at most k of the n-i+1 unexamined positions are non-V
Proof (continued)
Define EV – the event {X(1)=X(2)=…=X(r)=V}
Then P[EV] >= prod_{i=1..r} (1 - k/(n-i+1)) >= (1 - k/(n-r))^r >= e^{-2rk/(n-r)}
The last inequality holds because 1-x >= e^{-2x} for 0 <= x <= 1/2 (assuming k <= (n-r)/2)
Proof (conclusion)
Choose k = ((n-r)/(2r)) · ln(1/γ) to obtain P[EV] >= γ
Thus:
–Scenario A: EV always holds
–Scenario B: EV holds with probability at least γ
Suppose
–E returns estimate Z when EV happens (its view is identical in both scenarios)
–Scenario A: D(X)=1
–Scenario B: D(X)=k+1
–So Z must have ratio error at least sqrt(k+1) > sqrt( ((n-r)/(2r)) · ln(1/γ) ) in one of the scenarios, with probability at least γ
Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])
Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U={1,2,…,u}, computes a number d' using O(log u) memory bits, such that the probability that max(d'/d, d/d') > c is at most 2/c.
Algorithm:
–A bit vector BV represents the set of levels seen
–Let b be the smallest integer s.t. 2^b > u; let F = GF(2^b); pick random r, s from F
–For each a in A, compute h(a) = r·a + s; if its binary representation ends in exactly k zeros (…10…0), set the k'th bit of BV
–Pr[h(a) ends in exactly k zeros] = 2^{-(k+1)}, so bit k tends to be set once roughly 2^{k+1} distinct values have been seen
–The estimate is 2^{max bit set}
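The bit-vector algorithm above can be sketched in code. This is a minimal illustration, not the original construction: instead of the GF(2^b) hash it uses an affine hash modulo a large prime (an assumption), which gives roughly the same trailing-zero distribution.

```python
import random

def fm_sketch(stream, seed=0):
    """One Flajolet-Martin-style distinct-count sketch.

    Hashes each element with a random affine map modulo a large
    prime (a stand-in for the slide's GF(2^b) hash), records the
    number of trailing zeros of each hash in a bit vector BV, and
    returns 2^(max bit set) as the estimate.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1                  # prime modulus, assumed > u
    r = rng.randrange(1, p)
    s = rng.randrange(p)
    bv = 0                             # bit k set iff some h(a) ends in k zeros
    for a in stream:
        h = (r * a + s) % p
        k = (h & -h).bit_length() - 1 if h else 61   # trailing zeros of h
        bv |= 1 << k
    return 2 ** (bv.bit_length() - 1) if bv else 0   # 2^(max bit set)
```

A single sketch is only accurate up to the constant c of the theorem; in practice one takes a median over several independent sketches.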
Randomized Approximation (2) (based on [Indyk-Motwani 1998])
Algorithm SM – for a fixed t, is D(X) noticeably larger than t?
–Choose a hash function h: U → [1..t]
–Initialize answer to NO
–For each element x of X, if h(x) = t, set answer to YES
Theorem:
–If D(X) < t, P[SM outputs NO] > 0.25
–If D(X) > 2t, P[SM outputs NO] < 0.136 ≈ 1/e^2
Analysis
Let Y be the set of distinct elements of X
SM(X) = NO iff no element of Y hashes to t
P[a given element hashes to t] = 1/t
Thus P[SM(X) = NO] = (1 - 1/t)^{|Y|}
Since |Y| = D(X):
–If D(X) < t, P[SM(X) = NO] > (1 - 1/t)^t > 0.25 (for t >= 2)
–If D(X) > 2t, P[SM(X) = NO] < (1 - 1/t)^{2t} < 1/e^2 ≈ 0.136
Observe – we need only 1 bit of memory!
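The one-bit SM test can be sketched as follows. Assumptions: integer elements, and a random affine hash modulo a large prime standing in for h: U → [1..t].

```python
import random

def sm(stream, t, seed=0):
    """Algorithm SM: one-bit test for 'is D(X) noticeably larger than t?'.

    Hashes each element into [1..t] via a random affine map (an
    assumption; any pairwise-independent hash works) and answers
    YES iff some element lands in bucket t.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    r, s = rng.randrange(1, p), rng.randrange(p)
    for x in stream:
        if ((r * x + s) % p) % t + 1 == t:   # h(x) == t
            return True                       # YES
    return False                              # NO: only 1 bit of state needed
```

With D(X) far above t the answer is almost always YES; with D(X) below t it is NO with constant probability, exactly the gap the analysis exploits.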
Boosting Accuracy
With 1 bit we can probabilistically distinguish D(X) < t from D(X) > 2t
Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
Running O(log n) instances in parallel, for t = 1, 2, 4, 8, …, n, estimates D(X) within a factor of 2
The choice of factor 2 is arbitrary – using a factor (1+ε) grid reduces the error to ε
EXERCISE – verify that we can estimate D(X) within factor (1±ε) with probability (1-δ) using space O((1/ε) · log n · log(1/δ)) bits
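A rough sketch of the boosted estimator, under the same affine-hash assumption as before. The acceptance threshold 0.8 sits between the 0.75 and 1 - 1/e^2 ≈ 0.864 bounds from the analysis. For clarity it re-scans the stream once per instance; a true streaming version would run all instances in a single pass.

```python
import random

def estimate_distinct(stream, n, copies=30, seed=0):
    """Boosted distinct-count estimate: for each t = 1, 2, 4, ..., n
    run `copies` independent SM tests and accept t if the fraction
    of YES answers exceeds 0.8. Return 2t for the largest accepted t.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    est, t = 1, 1
    while t <= n:
        yes = 0
        for _ in range(copies):
            r, s = rng.randrange(1, p), rng.randrange(p)
            # one SM instance: YES iff some element hashes to bucket t
            yes += any(((r * x + s) % p) % t + 1 == t for x in stream)
        if yes / copies > 0.8:
            est = 2 * t
        t *= 2
    return est
```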
Sampling: Basics
Idea: a small random sample S of the data often represents all the data well
–For a fast approximate answer, apply the query to S and "scale" the result
–E.g., R.a is {0,1} and S is a 20% sample:
 select count(*) from R where R.a = 0 becomes select 5 * count(*) from S where S.a = 0
–In the figure's example, 2 sampled rows have R.a = 0, so Est. count = 5*2 = 10; Exact count = 10
Leverage the extensive literature on confidence intervals for sampling
–The actual answer is within the interval [a,b] with a given probability
–E.g., 54,000 ± 600 with prob 90%
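The scale-up trick can be sketched directly. The table, predicate, and Bernoulli 20% sample below are illustrative assumptions, not from the slides.

```python
import random

def scaled_count(table, predicate, sample_rate=0.2, seed=0):
    """Estimate `count(*) where predicate` from a random sample,
    scaling by 1/sample_rate (the 'select 5 * count(*)' trick
    for a 20% sample)."""
    rng = random.Random(seed)
    # Bernoulli sample: keep each row independently with prob sample_rate
    sample = [row for row in table if rng.random() < sample_rate]
    hits = sum(1 for row in sample if predicate(row))
    return round(hits / sample_rate)
```

On a table of 10,000 rows, half with a = 0, the estimate lands near the exact count of 5,000, with the deviation governed by the usual sampling confidence intervals.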
Sampling versus Counting
Observe
–A count is merely an abstraction – we usually need subsequent analytics
–Data tuples – X is merely one of many attributes
–Databases – selection predicates, join results, …
–Networking – need to combine distributed streams
Single-pass approaches
–Good accuracy
–But give only a count – cannot handle such extensions
Sampling-based approaches
–Keep actual data – can address the extensions
–But face the strong negative result above
Distinct Sampling for Streams [Gibbons 2001]
Best of both worlds
–Good accuracy
–Maintains a "distinct sample" over the stream
–Handles the distributed setting
Basic idea
–Hash – assigns a random "priority" to each domain value
–Track the highest-priority values seen
–Keep a random sample of tuples for each such value
–Gives relative error ε with probability 1-δ
Hash Function
Domain U = [0..m-1]
Hashing
–Pick random A, B from U, with A > 0
–g(x) = Ax + B (mod m)
–h(x) = number of leading 0s in the binary representation of g(x)
Clearly – 0 <= h(x) <= log m
Fact – Pr[h(x) >= l] = 2^{-l}
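This level hash is easy to sketch. One assumption: leading zeros are counted within a b-bit representation, where b is the number of bits needed for values in [0, m-1].

```python
import random

def make_level_hash(m, seed=0):
    """Build h(x) = number of leading 0s in the b-bit binary
    representation of g(x) = (A*x + B) mod m, with random A > 0, B.
    Pr[h(x) >= l] is about 2^-l, which is what level sampling uses.
    """
    rng = random.Random(seed)
    A = rng.randrange(1, m)
    B = rng.randrange(m)
    b = (m - 1).bit_length()          # bits needed for values in [0, m-1]
    def h(x):
        g = (A * x + B) % m
        return b - g.bit_length()     # leading zeros in a b-bit string
    return h
```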
Overall Idea
The hash assigns a random "level" to each domain value; compute the level of every stream element
Invariant
–Current level – cur_lev
–Sample S – all distinct values scanned so far of level at least cur_lev
Observe
–A random hash yields a random sample of the distinct values
–For each sampled value we can also keep a sample of its tuples
Algorithm DS (Distinct Sample)
Parameters – memory size M
Initialize – cur_lev ← 0; S ← empty
For each input x
–L ← h(x)
–If L >= cur_lev then add x to S
–If |S| > M then delete from S all values of level cur_lev and set cur_lev ← cur_lev + 1
Return |S| · 2^{cur_lev}
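Algorithm DS as a runnable sketch, combining it with the leading-zeros hash from the previous slide. The eviction loop (repeating until |S| <= M) and the dict representation are my assumptions about details the slide leaves implicit.

```python
import random

def distinct_sample_estimate(stream, m, M, seed=0):
    """Algorithm DS sketch: keep all distinct values of level >=
    cur_lev; on overflow, evict the current level and raise
    cur_lev. Returns |S| * 2^cur_lev as the distinct-count estimate.
    """
    rng = random.Random(seed)
    A = rng.randrange(1, m)
    B = rng.randrange(m)
    b = (m - 1).bit_length()
    def h(x):                                   # level = leading zeros of g(x)
        return b - ((A * x + B) % m).bit_length()
    cur_lev = 0
    S = {}                                      # value -> level
    for x in stream:
        L = h(x)
        if L >= cur_lev:
            S[x] = L
        while len(S) > M:                       # evict until within memory
            S = {v: l for v, l in S.items() if l > cur_lev}
            cur_lev += 1
    return len(S) * 2 ** cur_lev
```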
Analysis
Invariant – S contains exactly the values x seen so far with h(x) >= cur_lev
By construction, Pr[h(x) >= cur_lev] = 2^{-cur_lev}
Thus E[|S| · 2^{cur_lev}] = D(X)
EXERCISE – verify the deviation bound
Hot list queries
Why is it interesting:
–Top ten – best-seller lists
–Load balancing
–Caching policies
Hot list queries
Let's use sampling
(Figure: a long stream of random characters, from which a few positions are sampled)
Hot list queries
The question is:
–How to sample if we don't know our sample size (the stream length, and hence the right sampling rate) in advance?
Gibbons & Matias' algorithm
Maintain a hotlist of ⟨value, count⟩ pairs together with a sampling probability p, initially p = 1.0
(Figure: after the produced values c a a b d b a d d, the hotlist holds counts for the values a, b, c, d)
Gibbons & Matias' algorithm
(Figure: a new value e arrives while the hotlist is full)
Need to replace one value
Gibbons & Matias' algorithm
Multiply p by some amount f (here f = 0.75, so p = 0.75)
For each unit of each count, throw a biased coin with heads probability f
Replace the counts by the number of heads seen
(Figure: the counts shrink accordingly)
Gibbons & Matias' algorithm
Replace a value whose count has dropped to zero with the new value e
count/p is an estimate of the number of times a value has been seen
E.g., if the value 'a' has count 4 at p = 0.75, it has been seen about 4/p ≈ 5.33 times
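The four figures above can be put together as one runnable reconstruction. The core policy (admit new values with probability p; on overflow multiply p by f and subsample every count with biased coins) follows the slides, but the overflow loop and variable names are my assumptions.

```python
import random

def counting_sample(stream, capacity, f=0.75, seed=0):
    """Gibbons-Matias-style hotlist sketch (a reconstruction).

    Maintains <value, count> pairs and a sampling probability p.
    Values already in the hotlist are always counted; new values
    enter with probability p; when the hotlist overflows, p is
    multiplied by f and each unit of each count survives a coin
    flip with heads probability f. Returns (estimates, p), where
    estimates[v] = count[v] / p approximates v's true frequency.
    """
    rng = random.Random(seed)
    p = 1.0
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif rng.random() < p:
            counts[x] = 1
        while len(counts) > capacity:       # make room by subsampling
            p *= f
            for v in list(counts):
                counts[v] = sum(rng.random() < f for _ in range(counts[v]))
                if counts[v] == 0:          # value evicted
                    del counts[v]
    return {v: c / p for v, c in counts.items()}, p
```

Frequent values survive the subsampling rounds with high probability, so their count/p estimates stay accurate while rare values are evicted.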
Counters
How many bits do we need to count?
–Prefix codes
–Approximate counters
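One classic "approximate counter" in this spirit is Morris's probabilistic counter, which needs only about log log n bits. A minimal sketch; the estimator 2^c - 1 is the standard unbiased choice, not stated on the slide.

```python
import random

class MorrisCounter:
    """Morris's approximate counter: store only the exponent c,
    incrementing it with probability 2^-c, and report 2^c - 1.
    E[2^c - 1] equals the true count, using O(log log n) bits.
    """
    def __init__(self, seed=0):
        self.c = 0
        self.rng = random.Random(seed)

    def increment(self):
        # increment the stored exponent with probability 2^-c
        if self.rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1
```

A single counter has high variance; averaging several independent counters concentrates the estimate.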
Rarity
Paul goes fishing. There are many different fish species, U={1,…,u}
Paul catches one fish at a time: a_t ∈ U
C_t[j] = |{a_i | a_i = j, i <= t}| is the number of times he has caught species j up to time t
Species j is rare at time t if it appears only once
ρ[t] = |{j | C_t[j] = 1}| / u
Rarity
Why is it interesting?
Again, let's use sampling
U = {1,2,3,4,5,6,7,8,9,10,11,12,…,u}
Choose a random subset of species, say U' = {4, 9, 13, 18, 24}
For each sampled species track X_t[i] = |{j | a_j = U'[i], j <= t}|, the number of times U'[i] has been caught up to time t
Again, let's use sampling
Reminder: ρ[t] = |{j | C_t[j] = 1}| / u
Estimator: ρ'[t] = |{i | X_t[i] = 1}| / k, where k = |U'|
Rarity
But ρ[t] needs to be at least about 1/k for the sample to be likely to contain any rare species at all, i.e., to get a good estimator.
Min-wise independent hash functions
A family H of hash functions [n] → [n] is called min-wise independent if for any X ⊆ [n] and x ∈ X:
Pr_{h ∈ H}[ h(x) = min h(X) ] = 1/|X|
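Min-wise hashing addresses the rarity problem above: for each of k hash functions, track the stream element with minimum hash and count its occurrences. Since a min-wise family picks a uniformly random distinct element, the fraction of the k minima that appeared exactly once estimates the fraction of distinct species that are rare. A minimal sketch, assuming integer species ids and affine hashes (which are only approximately min-wise).

```python
import random

def rarity_estimate(stream, k=50, seed=0):
    """Min-hash rarity sketch: for each of k (approximately)
    min-wise hash functions, keep [min hash, element, count] for
    the element of minimum hash seen so far. Returns the fraction
    of the k tracked minima that appeared exactly once.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    funcs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    best = [None] * k                 # per function: [min hash, element, count]
    for x in stream:
        for i, (a, b) in enumerate(funcs):
            hx = (a * x + b) % p
            if best[i] is None or hx < best[i][0]:
                best[i] = [hx, x, 1]  # new minimum: counting starts afresh
            elif best[i][1] == x:
                best[i][2] += 1       # another occurrence of the tracked min
    return sum(b[2] == 1 for b in best) / k
```

The final minimum's count is exact, because the overall minimum element takes over the slot at its first occurrence and every later occurrence is counted.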