Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)

Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]

Theorem: Let E be an estimator for D(X) examining r < n values in X, possibly in an adaptive and randomized order. Then, for any γ < 1, E must have relative (ratio) error at least sqrt( ((n−r)/(2r)) · ln(1/γ) ) with probability at least γ.

Example
– Say, r = n/5
– Error ≈ 20% with probability 1/2

Scenario Analysis

Scenario A:
– all values in X are identical (say V)
– D(X) = 1

Scenario B:
– distinct values in X are {V, W_1, …, W_k}
– V appears n−k times
– each W_i appears once
– the W_i's are randomly distributed
– D(X) = k+1

Proof

Little birdie – assume E is even told that the input is one of Scenarios A or B only.

Suppose
– E examines elements X(1), X(2), …, X(r) in that order
– the choice of X(i) could be randomized and depend arbitrarily on the values of X(1), …, X(i−1)

Lemma: P[ X(i)=V | X(1)=X(2)=…=X(i−1)=V ] ≥ 1 − k/(n−i+1)

Why?
– Having seen only V's, E has no information on whether we are in Scenario A or B
– The W_i values are randomly distributed among the remaining n−i+1 positions

Proof (continued)

Define EV – the event {X(1)=X(2)=…=X(r)=V}. In Scenario B:

P[EV] = Π_{i=1..r} P[ X(i)=V | X(1)=…=X(i−1)=V ]
      ≥ Π_{i=1..r} (1 − k/(n−i+1))
      ≥ (1 − k/(n−r+1))^r
      ≥ e^{−2rk/(n−r+1)}

The last inequality holds because 1 − x ≥ e^{−2x} for 0 ≤ x ≤ 1/2.

Proof (conclusion)

Choose k = ((n−r+1)/(2r)) · ln(1/γ) to obtain P[EV] ≥ γ.

Thus:
– Scenario A ⇒ P[EV] = 1
– Scenario B ⇒ P[EV] ≥ γ

Suppose
– E returns estimate Z when EV happens
– Scenario A ⇒ D(X) = 1
– Scenario B ⇒ D(X) = k+1
– Since E cannot distinguish the two scenarios when EV happens, the best it can do is return Z = sqrt(k+1), which errs by a ratio of sqrt(k+1) in both scenarios
– So Z must have worst-case ratio error ≥ sqrt(k+1) ≈ sqrt( ((n−r)/(2r)) · ln(1/γ) )

Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])

Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U={1,2,…,u}, computes a number d' using O(log u) memory bits, such that the probability that max(d'/d, d/d') > c is at most 2/c.

Algorithm:
– A bit vector BV will represent the set
– Let b be the smallest integer s.t. 2^b > u. Let F = GF(2^b). Let r, s be random from F
– For a in A, let h(a) = r·a + s; write it in binary as 101****10…0 and let k be the number of trailing zeros
– Set the k'th bit of BV (Pr[h(a) has exactly k trailing zeros] ≈ 2^{−(k+1)})
– The estimate is 2^{max bit set}
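As a concrete illustration, here is a minimal Python sketch of this estimator. For brevity it hashes with an affine map modulo a large prime rather than over GF(2^b) (an assumption, not the slide's exact field arithmetic); the trailing-zero bookkeeping and the 2^{max bit set} estimate follow the slide.

    import random

    def fm_estimate(stream):
        # Affine hash; a large prime modulus stands in for GF(2^b) (assumption).
        p = 2 ** 61 - 1
        r = random.randrange(1, p)
        s = random.randrange(p)
        bv = 0                                     # the bit vector BV
        for a in stream:
            h = (r * a + s) % p
            # k = number of trailing zeros in the binary representation of h
            k = (h & -h).bit_length() - 1 if h else p.bit_length()
            bv |= 1 << k                           # set the k'th bit
        return 2 ** (bv.bit_length() - 1) if bv else 0   # 2^(max bit set)

For example, fm_estimate([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]) returns a constant-factor estimate of the 7 distinct values, using only the bit vector and the two hash coefficients as state.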

Randomized Approximation (2) (based on [Indyk-Motwani 1998])

Algorithm SM – for a fixed t, is D(X) >> t?
– Choose a hash function h: U → [1..t]
– Initialize the answer to NO
– For each x_i, if h(x_i) = t, set the answer to YES

Theorem:
– If D(X) < t/4, P[SM outputs YES] < 0.25
– If D(X) > 2t, P[SM outputs NO] < 1/e^2

Analysis

Let Y be the set of distinct elements of X.

SM(X) = NO ⟺ no element of Y hashes to t

P[an element hashes to t] = 1/t

Thus P[SM(X) = NO] = (1 − 1/t)^{|Y|}

Since |Y| = D(X):
– If D(X) < t/4, P[SM(X) = YES] = 1 − (1 − 1/t)^{D(X)} ≤ D(X)/t < 0.25
– If D(X) > 2t, P[SM(X) = NO] = (1 − 1/t)^{D(X)} < (1 − 1/t)^{2t} ≈ 1/e^2

Observe – we need only 1 bit of memory!
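A sketch of SM in Python, assuming the hash is drawn from an affine family mod a prime (the slide leaves the family unspecified); note that the whole state is the single YES/NO bit.

    import random

    def sm_test(stream, t, p=2**61 - 1):
        r = random.randrange(1, p)
        s = random.randrange(p)
        answer = False                             # the single bit of state: NO
        for x in stream:
            if ((r * x + s) % p) % t == t - 1:     # "hashes to bucket t" of [1..t]
                answer = True                      # YES
        return answer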

Boosting Accuracy

With 1 bit we can probabilistically distinguish D(X) < t/4 from D(X) > 2t.

Running O(log 1/δ) instances in parallel (taking the majority answer) reduces the error probability to any δ > 0.

Running O(log n) copies in parallel for t = 1, 2, 4, 8, …, n lets us estimate D(X) within a factor of 2.

The choice of factor 2 is arbitrary – using a factor of (1+ε) reduces the error to ε.

EXERCISE – Verify that we can estimate D(X) within factor (1±ε) with probability (1−δ) using O((1/ε) · log n · log(1/δ)) instances, each needing O(log u) bits for its hash function.
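A sketch of the boosting step, reusing sm_test above: for each t in a geometric sequence, take a majority vote over O(log 1/δ) independent runs, and report the first t at which the majority answer turns NO. Materializing the stream into a list and the fixed number of copies are illustrative assumptions; a true one-pass version runs all instances in parallel over the stream.

    def estimate_distinct(stream, n, copies=15):
        data = list(stream)                        # illustrative: re-read the stream
        t = 1
        while t <= n:
            yes = sum(sm_test(data, t) for _ in range(copies))
            if yes <= copies // 2:                 # majority says D(X) is not >> t
                return t                           # D(X) is within a constant factor of t
            t *= 2
        return n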

Sampling: Basics

Idea: a small random sample S of the data often represents all the data well.
– For a fast approximate answer, apply the query to S and "scale" the result
– E.g., R.a is {0,1} and S is a 20% sample:

select count(*) from R where R.a = 0
→ select 5 * count(*) from S where S.a = 0

(Figure: the R.a column with the sampled rows highlighted in red; 2 sampled rows have a = 0, so est. count = 5*2 = 10, and the exact count is also 10.)

Leverage the extensive literature on confidence intervals for sampling: the actual answer is within the interval [a,b] with a given probability, e.g., 54,000 ± 600 with prob ≥ 90%.

Sampling versus Counting

Observe
– A count is merely an abstraction – we usually need subsequent analytics
– Data tuples – X is merely one of many attributes
– Databases – selection predicates, join results, …
– Networking – need to combine distributed streams

Single-pass approaches
– Good accuracy
– But give only a count – cannot handle these extensions

Sampling-based approaches
– Keep actual data – can address the extensions
– But face the strong negative result above

Distinct Sampling for Streams [Gibbons 2001]

Best of both worlds
– Good accuracy
– Maintains a "distinct sample" over the stream
– Handles the distributed setting

Basic idea
– A hash assigns a random "priority" to each domain value
– Track the highest-priority values seen
– Keep a random sample of the tuples for each such value
– Relative error ε with probability 1 − δ

Hash Function

Domain U = [0..m−1]

Hashing
– Random A, B from U, with A > 0
– g(x) = Ax + B (mod m)
– h(x) = number of leading 0s in the binary representation of g(x)

Clearly – h(x) ∈ [0 .. log m]

Fact: P[h(x) ≥ l] = 2^{−l}
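A sketch of this level hash in Python, assuming m is a power of two so that P[h(x) ≥ l] = 2^{−l} holds exactly:

    import random

    def make_level_hash(m):
        bits = m.bit_length() - 1                  # m assumed to be a power of two
        A = random.randrange(1, m)                 # random A > 0
        B = random.randrange(m)
        def h(x):
            g = (A * x + B) % m                    # g(x) = Ax + B (mod m)
            return bits - g.bit_length()           # leading zeros of g(x) in a bits-wide word
        return h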

Overall Idea

Hash ⇒ a random "level" for each domain value; compute the level of each stream element.

Invariant
– Current level – cur_lev
– Sample S – all distinct values scanned so far of level at least cur_lev

Observe
– Random hash ⇒ random sample of the distinct values
– For each sampled value ⇒ can keep a sample of its tuples

Algorithm DS (Distinct Sample)

Parameters – memory size M

Initialize – cur_lev ← 0; S ← empty

For each input x
– L ← h(x)
– If L ≥ cur_lev then add x to S
– If |S| > M, delete from S all values of level cur_lev and set cur_lev ← cur_lev + 1

Return |S| · 2^{cur_lev}
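A sketch of Algorithm DS in Python, building on make_level_hash above (the repeat-on-overflow loop is a small safeguard beyond the slide's single eviction step, in case evicting one level does not free enough room):

    def distinct_sample(stream, M, m=2**32):
        h = make_level_hash(m)
        cur_lev, S = 0, {}                         # S maps value -> level
        for x in stream:
            L = h(x)
            if L >= cur_lev:
                S[x] = L
            while len(S) > M:                      # overflow: evict the lowest level
                S = {v: l for v, l in S.items() if l > cur_lev}
                cur_lev += 1
        return len(S) * 2 ** cur_lev               # estimate of D(X)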

Analysis

Invariant – S contains exactly those values x seen so far with h(x) ≥ cur_lev.

By construction, P[h(x) ≥ cur_lev] = 2^{−cur_lev}.

Thus, for a fixed level, E[|S|] = D(X) · 2^{−cur_lev}, so |S| · 2^{cur_lev} estimates D(X).

EXERCISE – verify the deviation bound.

Hot list queries

Why is it interesting?
– Top ten – best-seller lists
– Load balancing
– Caching policies

Hot list queries

Let's use sampling.

(Figure: a long random stream of characters with a few sampled positions highlighted.)

Hot list queries

The question is: how do we sample if we don't know our sample size in advance?

Gibbons & Matias' algorithm

(Figure: a hotlist of values with counters, built from the produced stream c a a b d b a d d, with sampling probability p = 1.0.) A produced value that is already in the hotlist has its counter incremented; a value not in the hotlist is added if there is room.

Gibbons & Matias' algorithm

A new value 'e' arrives but the hotlist is full – we need to replace one value.

Gibbons & Matias' algorithm

Multiply p by some amount f (here f = 0.75, so p = 0.75). For each unit of each counter, throw a biased coin with heads probability f, and replace each count by the number of heads seen.

Gibbons & Matias' algorithm

Replace a value whose count dropped to zero with the new value 'e'. At any point, count/p is an estimate of the number of times a value has been seen; e.g., the value 'a' with count 4 has been seen about 4/p = 5.33 times.
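A sketch of the algorithm in Python. The admission rule for brand-new values (insert with probability p) and the exact eviction bookkeeping are assumptions filling in what the slide figures animate; the coin-flipping step and the count/p estimate follow the slides.

    import random

    def hotlist(stream, capacity, f=0.75):
        p, counts = 1.0, {}
        for x in stream:
            if x in counts:
                counts[x] += 1                     # already sampled: just count
            elif random.random() < p:              # admit a new value w.p. p (assumption)
                while len(counts) >= capacity:
                    p *= f                         # lower the sampling probability
                    for v in list(counts):
                        # each unit of the old count survives as a "head" w.p. f
                        counts[v] = sum(random.random() < f for _ in range(counts[v]))
                        if counts[v] == 0:
                            del counts[v]          # zero-count values free a slot
                counts[x] = 1
        return {v: c / p for v, c in counts.items()}   # count/p estimates frequency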

Counters

How many bits do we need in order to count?
– Prefix codes
– Approximate counters
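The slide only names "approximate counters"; as one concrete instance (an assumption that this is the intended construction), here is a sketch of the classic Morris counter, which counts n events using only O(log log n) bits of state:

    import random

    def morris_count(n_events):
        c = 0                                      # stores roughly log2 of the count
        for _ in range(n_events):
            if random.random() < 2.0 ** (-c):
                c += 1                             # increment w.p. 2^-c
        return 2 ** c - 1                          # unbiased estimate of n_events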

Rarity

Paul goes fishing. There are many different fish species: U = {1,…,u}.

Paul catches one fish at a time: a_t ∈ U.

C_t[j] = |{a_i | a_i = j, i ≤ t}| is the number of times species j has been caught up to time t.

Species j is rare at time t if it appears only once.

ρ[t] = |{j | C_t[j] = 1}| / u

Rarity

Why is it interesting?

Again, let's use sampling

U = {1,2,3,4,5,6,7,8,9,10,11,12,…,u}

Pick a random sample of k species, e.g., U' = {4,9,13,18,24}.

X_t[i] = |{j | a_j = U'[i], j ≤ t}| – the number of times the i'th sampled species has been caught up to time t.

Again, let's use sampling

Reminder: X_t[i] = |{j | a_j = U'[i], j ≤ t}| and ρ[t] = |{j | C_t[j] = 1}| / u.

Estimate the rarity on the sample:

ρ'[t] = |{i | X_t[i] = 1}| / k
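A sketch of the sampled rarity estimator: track only the k sampled species and report the fraction seen exactly once (function and variable names are illustrative):

    import random
    from collections import Counter

    def rarity_estimate(stream, u, k):
        sampled = set(random.sample(range(1, u + 1), k))   # U': k random species
        counts = Counter()
        for a in stream:
            if a in sampled:
                counts[a] += 1                             # X_t[i] for sampled species
        return sum(1 for s in sampled if counts[s] == 1) / k   # rho'[t]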

Rarity

But ρ[t] needs to be at least about 1/k for the sample to be likely to contain any rare species, i.e., to get a good estimator.

Min-wise independent hash functions

A family H of hash functions [n] → [n] is called min-wise independent if for any X ⊆ [n] and x ∈ X:

P_{h∈H}[ h(x) = min h(X) ] = 1/|X|
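To illustrate how such a family is used (e.g., to draw near-uniform samples of distinct values, as needed for the rarity estimator above), here is a sketch that stands in a random affine map mod a prime for a truly min-wise independent family – only approximately min-wise independent, an assumption for brevity, since exact families are much larger:

    import random

    def make_minwise_hash(p=2**61 - 1):
        a = random.randrange(1, p)
        b = random.randrange(p)
        return lambda x: (a * x + b) % p           # approximately min-wise (assumption)

    # Each element of X minimizes h with probability ~1/|X|, so the k smallest
    # hash values give ~k uniform samples of the distinct values.
    h = make_minwise_hash()
    X = {3, 1, 4, 5, 9, 2, 6}
    sample = sorted(X, key=h)[:3]                  # 3 "random" distinct values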