Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)
CS 361A
Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]
Theorem: Let E be an estimator for D(X) that examines r < n values of X, possibly in an adaptive and randomized order. Then, for any γ, E must have ratio error at least sqrt( ((n-r)/(2r)) · ln(1/γ) ) with probability at least γ.
Example
–Say, r = n/5 and γ = 1/2
–Then the error is at least sqrt(2 ln 2) ≈ 1.18, i.e., roughly 20%, with probability 1/2
Scenario Analysis
Scenario A:
–all values in X are identical (say V)
–D(X) = 1
Scenario B:
–distinct values in X are {V, W1, …, Wk}
–V appears n-k times
–each Wi appears once
–the Wi's are randomly distributed
–D(X) = k+1
Proof
Little Birdie – one of Scenarios A or B only
Suppose
–E examines elements X(1), X(2), …, X(r) in that order
–the choice of X(i) can be randomized and depend arbitrarily on the values of X(1), …, X(i-1)
Lemma: P[ X(i)=V | X(1)=X(2)=…=X(i-1)=V ] >= 1 - k/(n-i+1)
Why?
–Seeing only V's gives no information on whether we are in Scenario A or B
–In Scenario B the Wi values are randomly distributed, so at most k of the n-i+1 unexamined positions are non-V
Proof (continued)
Define EV – the event {X(1)=X(2)=…=X(r)=V}
Then P[EV] >= prod_{i=1..r} (1 - k/(n-i+1)) >= (1 - k/(n-r))^r >= e^{-2rk/(n-r)}
The last inequality holds because 1-x >= e^{-2x} for 0 <= x <= 1/2 (assuming k <= (n-r)/2)
Proof (conclusion)
Choose k = ((n-r)/(2r)) · ln(1/γ) to obtain P[EV] >= γ
Thus:
–Scenario A: EV always holds
–Scenario B: EV holds with probability at least γ
Suppose
–E returns estimate Z when EV happens (its view is identical in both scenarios)
–Scenario A: D(X)=1
–Scenario B: D(X)=k+1
–So Z must have ratio error at least sqrt(k+1) > sqrt( ((n-r)/(2r)) · ln(1/γ) ) in one of the scenarios, with probability at least γ
Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])
Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U={1,2,…,u}, computes a number d' using O(log u) memory bits, such that the probability that max(d'/d, d/d') > c is at most 2/c.
Algorithm:
–A bit vector BV represents the set of levels seen
–Let b be the smallest integer s.t. 2^b > u; let F = GF(2^b); pick random r, s from F
–For each a in A, compute h(a) = r·a + s; if its binary representation ends in exactly k zeros (…10…0), set the k'th bit of BV
–Pr[h(a) ends in exactly k zeros] = 2^{-(k+1)}, so bit k tends to be set once roughly 2^{k+1} distinct values have been seen
–The estimate is 2^{max bit set}
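The bit-vector algorithm above can be sketched in code. This is a minimal illustration, not the original construction: instead of the GF(2^b) hash it uses an affine hash modulo a large prime (an assumption), which gives roughly the same trailing-zero distribution.

```python
import random

def fm_sketch(stream, seed=0):
    """One Flajolet-Martin-style distinct-count sketch.

    Hashes each element with a random affine map modulo a large
    prime (a stand-in for the slide's GF(2^b) hash), records the
    number of trailing zeros of each hash in a bit vector BV, and
    returns 2^(max bit set) as the estimate.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1                  # prime modulus, assumed > u
    r = rng.randrange(1, p)
    s = rng.randrange(p)
    bv = 0                             # bit k set iff some h(a) ends in k zeros
    for a in stream:
        h = (r * a + s) % p
        k = (h & -h).bit_length() - 1 if h else 61   # trailing zeros of h
        bv |= 1 << k
    return 2 ** (bv.bit_length() - 1) if bv else 0   # 2^(max bit set)
```

A single sketch is only accurate up to the constant c of the theorem; in practice one takes a median over several independent sketches.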
Randomized Approximation (2) (based on [Indyk-Motwani 1998])
Algorithm SM – for a fixed t, is D(X) noticeably larger than t?
–Choose a hash function h: U → [1..t]
–Initialize answer to NO
–For each element x of X, if h(x) = t, set answer to YES
Theorem:
–If D(X) < t, P[SM outputs NO] > 0.25
–If D(X) > 2t, P[SM outputs NO] < 0.136 ≈ 1/e^2
Analysis
Let Y be the set of distinct elements of X
SM(X) = NO iff no element of Y hashes to t
P[a given element hashes to t] = 1/t
Thus P[SM(X) = NO] = (1 - 1/t)^{|Y|}
Since |Y| = D(X):
–If D(X) < t, P[SM(X) = NO] > (1 - 1/t)^t > 0.25 (for t >= 2)
–If D(X) > 2t, P[SM(X) = NO] < (1 - 1/t)^{2t} < 1/e^2 ≈ 0.136
Observe – we need only 1 bit of memory!
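The one-bit SM test can be sketched as follows. Assumptions: integer elements, and a random affine hash modulo a large prime standing in for h: U → [1..t].

```python
import random

def sm(stream, t, seed=0):
    """Algorithm SM: one-bit test for 'is D(X) noticeably larger than t?'.

    Hashes each element into [1..t] via a random affine map (an
    assumption; any pairwise-independent hash works) and answers
    YES iff some element lands in bucket t.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    r, s = rng.randrange(1, p), rng.randrange(p)
    for x in stream:
        if ((r * x + s) % p) % t + 1 == t:   # h(x) == t
            return True                       # YES
    return False                              # NO: only 1 bit of state needed
```

With D(X) far above t the answer is almost always YES; with D(X) below t it is NO with constant probability, exactly the gap the analysis exploits.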
Boosting Accuracy
With 1 bit we can probabilistically distinguish D(X) < t from D(X) > 2t
Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
Running O(log n) instances in parallel, for t = 1, 2, 4, 8, …, n, estimates D(X) within a factor of 2
The choice of factor 2 is arbitrary – using a factor (1+ε) grid reduces the error to ε
EXERCISE – verify that we can estimate D(X) within factor (1±ε) with probability (1-δ) using space O((1/ε) · log n · log(1/δ)) bits
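A rough sketch of the boosted estimator, under the same affine-hash assumption as before. The acceptance threshold 0.8 sits between the 0.75 and 1 - 1/e^2 ≈ 0.864 bounds from the analysis. For clarity it re-scans the stream once per instance; a true streaming version would run all instances in a single pass.

```python
import random

def estimate_distinct(stream, n, copies=30, seed=0):
    """Boosted distinct-count estimate: for each t = 1, 2, 4, ..., n
    run `copies` independent SM tests and accept t if the fraction
    of YES answers exceeds 0.8. Return 2t for the largest accepted t.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    est, t = 1, 1
    while t <= n:
        yes = 0
        for _ in range(copies):
            r, s = rng.randrange(1, p), rng.randrange(p)
            # one SM instance: YES iff some element hashes to bucket t
            yes += any(((r * x + s) % p) % t + 1 == t for x in stream)
        if yes / copies > 0.8:
            est = 2 * t
        t *= 2
    return est
```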
Sampling: Basics
Idea: a small random sample S of the data often represents all the data well
–For a fast approximate answer, apply the query to S and "scale" the result
–E.g., R.a is {0,1} and S is a 20% sample:
 select count(*) from R where R.a = 0 becomes select 5 * count(*) from S where S.a = 0
–In the figure's example, 2 sampled rows have R.a = 0, so Est. count = 5*2 = 10; Exact count = 10
Leverage the extensive literature on confidence intervals for sampling
–The actual answer is within the interval [a,b] with a given probability
–E.g., 54,000 ± 600 with prob 90%
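The scale-up trick can be sketched directly. The table, predicate, and Bernoulli 20% sample below are illustrative assumptions, not from the slides.

```python
import random

def scaled_count(table, predicate, sample_rate=0.2, seed=0):
    """Estimate `count(*) where predicate` from a random sample,
    scaling by 1/sample_rate (the 'select 5 * count(*)' trick
    for a 20% sample)."""
    rng = random.Random(seed)
    # Bernoulli sample: keep each row independently with prob sample_rate
    sample = [row for row in table if rng.random() < sample_rate]
    hits = sum(1 for row in sample if predicate(row))
    return round(hits / sample_rate)
```

On a table of 10,000 rows, half with a = 0, the estimate lands near the exact count of 5,000, with the deviation governed by the usual sampling confidence intervals.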
Sampling versus Counting
Observe
–A count is merely an abstraction – we usually need subsequent analytics
–Data tuples – X is merely one of many attributes
–Databases – selection predicates, join results, …
–Networking – need to combine distributed streams
Single-pass approaches
–Good accuracy
–But give only a count – cannot handle such extensions
Sampling-based approaches
–Keep actual data – can address the extensions
–But face the strong negative result above
Distinct Sampling for Streams [Gibbons 2001]
Best of both worlds
–Good accuracy
–Maintains a "distinct sample" over the stream
–Handles the distributed setting
Basic idea
–Hash – assigns a random "priority" to each domain value
–Track the highest-priority values seen
–Keep a random sample of tuples for each such value
–Gives relative error ε with probability 1-δ
Hash Function
Domain U = [0..m-1]
Hashing
–Pick random A, B from U, with A > 0
–g(x) = Ax + B (mod m)
–h(x) = number of leading 0s in the binary representation of g(x)
Clearly – 0 <= h(x) <= log m
Fact – Pr[h(x) >= l] = 2^{-l}
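This level hash is easy to sketch. One assumption: leading zeros are counted within a b-bit representation, where b is the number of bits needed for values in [0, m-1].

```python
import random

def make_level_hash(m, seed=0):
    """Build h(x) = number of leading 0s in the b-bit binary
    representation of g(x) = (A*x + B) mod m, with random A > 0, B.
    Pr[h(x) >= l] is about 2^-l, which is what level sampling uses.
    """
    rng = random.Random(seed)
    A = rng.randrange(1, m)
    B = rng.randrange(m)
    b = (m - 1).bit_length()          # bits needed for values in [0, m-1]
    def h(x):
        g = (A * x + B) % m
        return b - g.bit_length()     # leading zeros in a b-bit string
    return h
```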
Overall Idea
The hash assigns a random "level" to each domain value; compute the level of every stream element
Invariant
–Current level – cur_lev
–Sample S – all distinct values scanned so far of level at least cur_lev
Observe
–A random hash yields a random sample of the distinct values
–For each sampled value we can also keep a sample of its tuples
Algorithm DS (Distinct Sample)
Parameters – memory size M
Initialize – cur_lev ← 0; S ← empty
For each input x
–L ← h(x)
–If L >= cur_lev then add x to S
–If |S| > M then delete from S all values of level cur_lev and set cur_lev ← cur_lev + 1
Return |S| · 2^{cur_lev}
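Algorithm DS as a runnable sketch, combining it with the leading-zeros hash from the previous slide. The eviction loop (repeating until |S| <= M) and the dict representation are my assumptions about details the slide leaves implicit.

```python
import random

def distinct_sample_estimate(stream, m, M, seed=0):
    """Algorithm DS sketch: keep all distinct values of level >=
    cur_lev; on overflow, evict the current level and raise
    cur_lev. Returns |S| * 2^cur_lev as the distinct-count estimate.
    """
    rng = random.Random(seed)
    A = rng.randrange(1, m)
    B = rng.randrange(m)
    b = (m - 1).bit_length()
    def h(x):                                   # level = leading zeros of g(x)
        return b - ((A * x + B) % m).bit_length()
    cur_lev = 0
    S = {}                                      # value -> level
    for x in stream:
        L = h(x)
        if L >= cur_lev:
            S[x] = L
        while len(S) > M:                       # evict until within memory
            S = {v: l for v, l in S.items() if l > cur_lev}
            cur_lev += 1
    return len(S) * 2 ** cur_lev
```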
Analysis
Invariant – S contains exactly the values x seen so far with h(x) >= cur_lev
By construction, Pr[h(x) >= cur_lev] = 2^{-cur_lev}
Thus E[|S| · 2^{cur_lev}] = D(X)
EXERCISE – verify the deviation bound
Hot list queries
Why is it interesting:
–Top ten – best-seller lists
–Load balancing
–Caching policies
Hot list queries
Let's use sampling
(Figure: a long stream of random characters, from which a few positions are sampled)
Hot list queries
The question is:
–How to sample if we don't know our sample size (the stream length, and hence the right sampling rate) in advance?
Gibbons & Matias' algorithm
Maintain a hotlist of ⟨value, count⟩ pairs together with a sampling probability p, initially p = 1.0
(Figure: after the produced values c a a b d b a d d, the hotlist holds counts for the values a, b, c, d)
Gibbons & Matias' algorithm
(Figure: a new value e arrives while the hotlist is full)
Need to replace one value
Gibbons & Matias' algorithm
Multiply p by some amount f (here f = 0.75, so p = 0.75)
For each unit of each count, throw a biased coin with heads probability f
Replace the counts by the number of heads seen
(Figure: the counts shrink accordingly)
Gibbons & Matias' algorithm
Replace a value whose count has dropped to zero with the new value e
count/p is an estimate of the number of times a value has been seen
E.g., if the value 'a' has count 4 at p = 0.75, it has been seen about 4/p ≈ 5.33 times
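The four figures above can be put together as one runnable reconstruction. The core policy (admit new values with probability p; on overflow multiply p by f and subsample every count with biased coins) follows the slides, but the overflow loop and variable names are my assumptions.

```python
import random

def counting_sample(stream, capacity, f=0.75, seed=0):
    """Gibbons-Matias-style hotlist sketch (a reconstruction).

    Maintains <value, count> pairs and a sampling probability p.
    Values already in the hotlist are always counted; new values
    enter with probability p; when the hotlist overflows, p is
    multiplied by f and each unit of each count survives a coin
    flip with heads probability f. Returns (estimates, p), where
    estimates[v] = count[v] / p approximates v's true frequency.
    """
    rng = random.Random(seed)
    p = 1.0
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif rng.random() < p:
            counts[x] = 1
        while len(counts) > capacity:       # make room by subsampling
            p *= f
            for v in list(counts):
                counts[v] = sum(rng.random() < f for _ in range(counts[v]))
                if counts[v] == 0:          # value evicted
                    del counts[v]
    return {v: c / p for v, c in counts.items()}, p
```

Frequent values survive the subsampling rounds with high probability, so their count/p estimates stay accurate while rare values are evicted.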
Counters
How many bits do we need to count?
–Prefix codes
–Approximate counters
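One classic "approximate counter" in this spirit is Morris's probabilistic counter, which needs only about log log n bits. A minimal sketch; the estimator 2^c - 1 is the standard unbiased choice, not stated on the slide.

```python
import random

class MorrisCounter:
    """Morris's approximate counter: store only the exponent c,
    incrementing it with probability 2^-c, and report 2^c - 1.
    E[2^c - 1] equals the true count, using O(log log n) bits.
    """
    def __init__(self, seed=0):
        self.c = 0
        self.rng = random.Random(seed)

    def increment(self):
        # increment the stored exponent with probability 2^-c
        if self.rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1
```

A single counter has high variance; averaging several independent counters concentrates the estimate.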
Rarity
Paul goes fishing. There are many different fish species, U={1,…,u}
Paul catches one fish at a time: a_t ∈ U
C_t[j] = |{a_i | a_i = j, i <= t}| is the number of times he has caught species j up to time t
Species j is rare at time t if it appears only once
ρ[t] = |{j | C_t[j] = 1}| / u
Rarity
Why is it interesting?
Again, let's use sampling
U = {1,2,3,4,5,6,7,8,9,10,11,12,…,u}
Choose a random subset of species, say U' = {4, 9, 13, 18, 24}
For each sampled species track X_t[i] = |{j | a_j = U'[i], j <= t}|, the number of times U'[i] has been caught up to time t
Again, let's use sampling
Reminder: ρ[t] = |{j | C_t[j] = 1}| / u
Estimator: ρ'[t] = |{i | X_t[i] = 1}| / k, where k = |U'|
Rarity
But ρ[t] needs to be at least about 1/k for the sample to be likely to contain any rare species at all, i.e., to get a good estimator.
Min-wise independent hash functions
A family H of hash functions [n] → [n] is called min-wise independent if for any X ⊆ [n] and x ∈ X:
Pr_{h ∈ H}[ h(x) = min h(X) ] = 1/|X|
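Min-wise hashing addresses the rarity problem above: for each of k hash functions, track the stream element with minimum hash and count its occurrences. Since a min-wise family picks a uniformly random distinct element, the fraction of the k minima that appeared exactly once estimates the fraction of distinct species that are rare. A minimal sketch, assuming integer species ids and affine hashes (which are only approximately min-wise).

```python
import random

def rarity_estimate(stream, k=50, seed=0):
    """Min-hash rarity sketch: for each of k (approximately)
    min-wise hash functions, keep [min hash, element, count] for
    the element of minimum hash seen so far. Returns the fraction
    of the k tracked minima that appeared exactly once.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    funcs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    best = [None] * k                 # per function: [min hash, element, count]
    for x in stream:
        for i, (a, b) in enumerate(funcs):
            hx = (a * x + b) % p
            if best[i] is None or hx < best[i][0]:
                best[i] = [hx, x, 1]  # new minimum: counting starts afresh
            elif best[i][1] == x:
                best[i][2] += 1       # another occurrence of the tracked min
    return sum(b[2] == 1 for b in best) / k
```

The final minimum's count is exact, because the overall minimum element takes over the slot at its first occurrence and every later occurrence is counted.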