CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lectures 16 & 17 (Nov 16 and 28, 2005) Synopses, Samples, and Sketches Rajeev Motwani

CS 361A 2 Game Plan for Week  Last Class Models for Streaming/Massive Data Sets Negative results for Exact Distinct Values Hashing for Approximate Distinct Values  Today Synopsis Data Structures Sampling Techniques Frequency Moments Problem Sketching Techniques Finding High-Frequency Items

CS 361A 3 Synopsis Data Structures  Synopses Webster – a condensed statement or outline (as of a narrative or treatise) CS 361A – succinct data structure that lets us answer queries efficiently  Synopsis Data Structures “Lossy” Summary (of a data stream) Advantages – fits in memory + easy to communicate Disadvantage – lossiness implies approximation error Negative Results → best we can do Key Techniques – randomization and hashing

CS 361A 4 Numerical Examples  Approximate Query Processing [AQUA/Bell Labs] Database Size – 420 MB Synopsis Size – 420 KB (0.1%) Approximation Error – within 10% Running Time – 0.3% of time for exact query  Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald] Data Size – 10^9 items Synopsis Size – 1249 items Approximation Error – within 1%

CS 361A 5 Synopses  Desiderata Small Memory Footprint Quick Update and Query Provable, low-error guarantees Composable – for distributed scenario  Applicability? General-purpose – e.g. random samples Specific-purpose – e.g. distinct values estimator  Granularity? Per database – e.g. sample of entire table Per distinct value – e.g. customer profiles Structural – e.g. GROUP-BY or JOIN result samples

CS 361A 6 Examples of Synopses  Synopses need not be fancy! Simple Aggregates – e.g. mean/median/max/min Variance?  Random Samples Aggregates on small samples represent entire data Leverage extensive work on confidence intervals  Random Sketches – structured samples  Tracking High-Frequency Items

CS 361A 7 Random Samples

CS 361A 8 Types of Samples  Oblivious sampling – at item level o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]  Value-based sampling – e.g. distinct-value samples  Structured samples – e.g. join sampling Naïve approach – keep samples of each relation Problem – sample-of-join ≠ join-of-samples Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99] what if A sampled from L and B from R? [figure: relations L and R and their foreign-key join]

CS 361A 9 Basic Scenario  Goal maintain uniform sample of item-stream  Sampling Semantics? Coin flip o select each item with probability p o easy to maintain o undesirable – sample size is unbounded Fixed-size sample without replacement o Our focus today Fixed-size sample with replacement o Show – can generate from previous sample  Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]

CS 361A 10 Reservoir Sampling [Vitter]  Input – stream of items X_1, X_2, X_3, …  Goal – maintain uniform random sample S of size n (without replacement) of stream so far  Reservoir Sampling Initialize – include first n elements in S Upon seeing item X_t o Add X_t to S with probability n/t o If added, evict random previous item
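A minimal Python sketch of the reservoir-sampling update described above (variable names follow the slide; this is an illustration, not Vitter's optimized algorithm):

```python
import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample S of size n (without replacement) of the stream so far."""
    S = []
    for t, x in enumerate(stream, start=1):      # t = number of items seen so far
        if t <= n:
            S.append(x)                          # initialize: keep the first n items
        elif random.random() < n / t:            # include X_t with probability n/t
            S[random.randrange(n)] = x           # if added, evict a random previous item
    return S

sample = reservoir_sample(range(1_000_000), 5)   # e.g. a uniform 5-item sample of a long stream
```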

CS 361A 11 Analysis  Correctness? Fact: At each instant, |S| = n Theorem: At time t, any X_i ∈ S with probability n/t Exercise – prove via induction on t  Efficiency? Let N be stream size  Naïve implementation → N coin flips → time O(N) Remark: Verify this is optimal.

CS 361A 12 Improving Efficiency  Random variable J_t – number of items jumped over after time t  Idea – generate J_t and skip that many items  Cumulative Distribution Function – F(s) = P[J_t ≤ s], for t>n & s≥0  [figure: stream X_1 … X_14 with the items inserted into sample S marked (n=3); jumps J_3 = 2 and J_9 = 4]

CS 361A 13 Analysis  Number of calls to RANDOM()? one per insertion into sample this is optimal!  Generating J_t ? Pick random number U ∈ [0,1] Find smallest j such that U ≤ F(j) How? o Linear scan → O(N) time o Binary search with Newton’s interpolation → O(n^2 (1 + polylog N/n)) time  Remark – see paper for optimal algorithm
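To illustrate the jump idea, here is a rough sketch that generates J_t by inverting the CDF with the linear-scan option from the slide. The tail probability P[J_t > s] = ∏_{i=1}^{s+1} (1 − n/(t+i)) used below is derived from the slide's definition of J_t and is an assumption of this sketch; Vitter's paper gives the faster search and the optimal algorithm.

```python
import random

def next_jump(t, n):
    """Sample J_t = number of items jumped over after time t (t >= n).
    Linear scan for the smallest s with U <= F(s), where
    F(s) = 1 - prod_{i=1..s+1} (1 - n/(t+i))."""
    u = random.random()
    s, tail = 0, 1.0
    while True:
        tail *= 1.0 - n / (t + s + 1)    # tail = P[J_t > s]
        if u <= 1.0 - tail:              # u <= F(s)
            return s
        s += 1

def reservoir_sample_with_jumps(items, n):
    """Reservoir sampling that skips directly to the next inserted item."""
    S, t = list(items[:n]), n
    while True:
        t += next_jump(t, n) + 1         # position (1-based) of the next item inserted
        if t > len(items):
            return S
        S[random.randrange(n)] = items[t - 1]
```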

CS 361A 14 Sampling over Sliding Windows [Babcock-Datar-Motwani]  Sliding Window W – last w items in stream  Model – item X_t expires at time t+w  Why? Applications may require ignoring stale data Type of approximation Only way to define JOIN over streams  Goal – Maintain uniform sample of size n of sliding window

CS 361A 15 Reservoir Sampling?  Observe any item in sample S will expire eventually must replace with random item of current window  Problem no access to items in W-S storing entire window requires O(w) memory  Oversampling Backing sample B – select each item with probability … sample S – select n items from B at random upon expiry in S → replenish from B Claim: n < |B| < n log w with high probability

CS 361A 16 Index-Set Approach  Pick random index set I = {i_1, …, i_n} ⊆ {0,1, …, w-1}  Sample S – items X_i with i ∈ I (mod w) in current window  Example Suppose – w=2, n=1, and I={1} Then – sample is always X_i with odd i  Memory – only O(n)  Observe S is uniform random sample of each window But sample is periodic (union of arithmetic progressions) Correlation across successive windows  Problems Correlation may hurt in some applications Some data (e.g. time-series) may be periodic

CS 361A 17 Chain-Sample Algorithm  Idea Fix expiry problem in Reservoir Sampling Advance planning for expiry of sampled items Focus on sample size 1 – keep n independent such samples  Chain-Sampling Add X_t to S with probability 1/min{t,w} – evict earlier sample Initially – standard Reservoir Sampling up to time w Pre-select X_t’s replacement X_r ∈ W_{t+w} = {X_{t+1}, …, X_{t+w}} o X_t expires → must replace from W_{t+w} o At time r, save X_r and pre-select its own replacement → building “chain” of potential replacements Note – if evicting earlier sample, discard its “chain” as well
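A sketch of chain-sampling for sample size 1 (keep n independent copies for sample size n), following the steps above; the deque of captured replacements and the pending pre-selected index are implementation choices of this sketch, not prescribed by the slide:

```python
import random
from collections import deque

def chain_sample(stream, w):
    """Chain-sampling over a sliding window of the last w items (sample size 1).
    Yields the current sampled item after each arrival."""
    sample_t = sample_x = None
    chain = deque()          # already-captured future replacements: (arrival_time, item)
    pending_r = None         # pre-selected arrival time of the next replacement to capture
    for t, x in enumerate(stream, start=1):
        if random.random() < 1.0 / min(t, w):
            sample_t, sample_x = t, x                  # X_t becomes the sample
            chain.clear()                              # discard the old chain
            pending_r = random.randint(t + 1, t + w)   # pre-select its replacement in W_{t+w}
        elif t == pending_r:
            chain.append((t, x))                       # capture the pre-selected replacement
            pending_r = random.randint(t + 1, t + w)   # and pre-select the next link
        if sample_t is not None and sample_t <= t - w: # sample expired:
            sample_t, sample_x = chain.popleft()       # advance along the chain
        yield sample_x
```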

CS 361A 18 Example

CS 361A 19 Expectation for Chain-Sample  T(x) = E[chain length for X_t at time t+x]  E[chain length] = T(w) ≤ e  E[memory required for sample size n] = O(n)

CS 361A 20 Tail Bound for Chain-Sample  Chain = “hops” of total length at most w  Chain of h hops → ordered (h+1)-partition of w h hops of total length less than w plus, remainder  Each partition has probability w^{-h}  Number of partitions – at most (ew/h)^h (choosing the h hop lengths out of w)  h = O(log w) → probability of a partition is O(w^{-c})  Thus – memory O(n log w) with high probability

CS 361A 21 Comparison of Algorithms  Chain-Sample beats Oversample: Expected memory – O(n) vs O(n log w) High-probability memory bound – both O(n log w) Oversample may have sample size shrink below n!  Memory bounds (Expected / High-Probability): Periodic – O(n) / O(n); Oversample – O(n log w) / O(n log w); Chain-Sample – O(n) / O(n log w)

CS 361A 22 Sketches and Frequency Moments

CS 361A 23 Generalized Stream Model  Input Element (i,a) a copies of domain-value i increment to ith dimension of m by a; a need not be an integer  Negative value – captures deletions  [figure: frequency vector (m_0, m_1, m_2, m_3, m_4) for the data stream 2, 0, 1, 3, 1, 2, 4, …]

CS 361A 24 Example  [figure: frequency vector (m_0, …, m_4) before and after two updates – on seeing element (i,a) = (2,2), m_2 increases by 2; on seeing element (i,a) = (1,-1), m_1 decreases by 1]

CS 361A 25 Frequency Moments  Input Stream values from U = {0,1,…,N-1} frequency vector m = (m_0, m_1, …, m_{N-1})  Kth Frequency Moment F_k(m) = Σ_i m_i^k F_0 : number of distinct values (Lecture 15) F_1 : stream size F_2 : Gini index, self-join size, Euclidean norm F_k : for k>2, measures skew, sometimes useful F_∞ : maximum frequency  Problem – estimation in small space  Sketches – randomized estimators
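The definitions above, spelled out on a small frequency vector (no streaming involved; the vector m is a made-up example):

```python
def frequency_moment(m, k):
    """F_k(m) = sum_i m_i^k for a frequency vector m."""
    return sum(mi ** k for mi in m)

m = [2, 1, 3, 0, 1]                      # example frequency vector (m_0, ..., m_4)
F0 = sum(1 for mi in m if mi != 0)       # number of distinct values = 4
F1 = frequency_moment(m, 1)              # stream size = 7
F2 = frequency_moment(m, 2)              # self-join size / Gini index = 15
F_inf = max(m)                           # maximum frequency = 3
```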

CS 361A 26 Naive Approaches  Space N – counter m_i for each distinct value i  Space O(1) if input sorted by i single counter recycled when new i value appears  Goal Allow arbitrary input Use small (logarithmic) space Settle for randomization/approximation

CS 361A 27 Sketching F_2  Random Hash h(i): {0,1,…,N-1} → {-1,1}  Define Z_i = h(i)  Maintain X = Σ_i m_i Z_i Easy for update streams (i,a) – just add aZ_i to X  Claim: X^2 is unbiased estimator for F_2 Proof: E[X^2] = E[(Σ_i m_i Z_i)^2] = E[Σ_i m_i^2 Z_i^2] + E[Σ_{i≠j} m_i m_j Z_i Z_j] = Σ_i m_i^2 E[Z_i^2] + Σ_{i≠j} m_i m_j E[Z_i] E[Z_j] = Σ_i m_i^2 = F_2  Last Line? – Z_i^2 = 1 and E[Z_i] = 0 as uniform{-1,1} from independence
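A toy version of the single-counter sketch above. For clarity the signs Z_i are drawn lazily and memoized in a dictionary, which uses Θ(N) space; the point of the next few slides is to replace this with a bounded-independence hash so the space becomes logarithmic:

```python
import random

class F2Sketch:
    """One AMS counter X = sum_i m_i Z_i with Z_i uniform in {-1, +1}; X**2 estimates F_2."""
    def __init__(self):
        self.X = 0
        self._Z = {}

    def _z(self, i):
        if i not in self._Z:
            self._Z[i] = random.choice((-1, 1))   # memoized sign for value i
        return self._Z[i]

    def update(self, i, a=1):
        self.X += a * self._z(i)                  # element (i, a): add a * Z_i to the counter

    def estimate(self):
        return self.X ** 2                        # unbiased estimator of F_2
```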

CS 361A 28 Estimation Error?  Chebyshev bound: P[|Y - E[Y]| ≥ λE[Y]] ≤ Var[Y] / (λ^2 E[Y]^2)  Define Y = X^2  E[Y] = E[X^2] = Σ_i m_i^2 = F_2  Observe E[X^4] = E[(Σ m_i Z_i)^4] = E[Σ m_i^4 Z_i^4] + 4E[Σ m_i m_j^3 Z_i Z_j^3] + 6E[Σ m_i^2 m_j^2 Z_i^2 Z_j^2] + 12E[Σ m_i m_j m_k^2 Z_i Z_j Z_k^2] + 24E[Σ m_i m_j m_k m_l Z_i Z_j Z_k Z_l] = Σ m_i^4 + 6Σ_{i<j} m_i^2 m_j^2  By definition Var[Y] = E[Y^2] – E[Y]^2 = E[X^4] – E[X^2]^2 = [Σ m_i^4 + 6Σ_{i<j} m_i^2 m_j^2] – [Σ m_i^4 + 2Σ_{i<j} m_i^2 m_j^2] = 4Σ_{i<j} m_i^2 m_j^2 ≤ 2E[X^2]^2 = 2F_2^2 Why?

CS 361A 29 Estimation Error?  Chebyshev bound → P[relative estimation error > λ] ≤ Var[X^2] / (λ^2 F_2^2) ≤ 2/λ^2  Problem – What if we want λ really small?  Solution Compute s = 8/λ^2 independent copies of X Estimator Y = mean(X_i^2) Variance reduces by factor s → P[relative estimation error > λ] ≤ 2/(sλ^2) = 1/4

CS 361A 30 Boosting Technique  Algorithm A: Randomized λ-approximate estimator f P[(1-λ)f* ≤ f ≤ (1+λ)f*] = 3/4  Heavy Tail Problem: estimator may take values f*–z, f*, f*+z with probabilities 1/16, 3/4, 3/16 – so the mean can be far off even though each estimate is usually good  Boosting Idea O(log 1/ε) independent estimates from A(X) Return median of estimates  Claim: P[median is λ-approximate] > 1-ε Proof: P[specific estimate is λ-approximate] = ¾ Bad event only if >50% estimates not λ-approximate Binomial tail – probability less than ε
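A sketch of the averaging-plus-median recipe from the last two slides, written generically so it can wrap the F2Sketch above. The constants (s = 8/λ² copies per group, ⌈log 1/ε⌉ groups) follow the slides; everything else is an implementation choice of this sketch:

```python
import math
import statistics

def median_of_means(make_sketch, updates, lam, eps):
    """Average s = 8/lam^2 independent sketches to shrink variance, then take the
    median of O(log 1/eps) such averages to shrink the failure probability."""
    s = math.ceil(8 / lam ** 2)
    g = max(1, math.ceil(math.log(1 / eps)))
    groups = [[make_sketch() for _ in range(s)] for _ in range(g)]
    for i, a in updates:
        for group in groups:
            for sk in group:
                sk.update(i, a)
    means = [sum(sk.estimate() for sk in group) / s for group in groups]
    return statistics.median(means)

# e.g. estimate F_2 of the update stream (2,2), (1,-1), (0,5) to within ~50% w.p. >= 0.9
est = median_of_means(F2Sketch, [(2, 2), (1, -1), (0, 5)], lam=0.5, eps=0.1)
```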

CS 361A 31 Overall Space Requirement  Observe Let m = Σ_i m_i Each hash needs O(log m)-bit counter s = 8/λ^2 hash functions for each estimator O(log 1/ε) such estimators  Total O(λ^{-2} log(1/ε) log m) bits  Question – Space for storing hash function?

CS 361A 32 Sketching Paradigm  Random Sketch: inner product of frequency vector m = (m_0, m_1, …, m_{N-1}) with random vector Z (currently, uniform {-1,1})  Observe Linearity → Sketch(m_1) ± Sketch(m_2) = Sketch(m_1 ± m_2) Ideal for distributed computing  Observe Suppose: Given i, can efficiently generate Z_i Then: can maintain sketch for update streams Problem o Must generate Z_i = h(i) on first appearance of i o Need Ω(N) memory to store h explicitly o Need Ω(N) random bits

CS 361A 33 Two birds, One stone  Pairwise Independent Z_1, Z_2, …, Z_n for all Z_i and Z_k, P[Z_i = x, Z_k = y] = P[Z_i = x]·P[Z_k = y] property E[Z_i Z_k] = E[Z_i]·E[Z_k]  Example – linear hash function Seed S = (a, b) from [0..p-1], where p is prime Z_i = h(i) = ai + b (mod p)  Claim: Z_1, Z_2, …, Z_n are pairwise independent Z_i = x and Z_k = y → x = ai + b (mod p) and y = ak + b (mod p) fixing i, k, x, y → unique solution for a, b P[Z_i = x, Z_k = y] = 1/p^2 = P[Z_i = x]·P[Z_k = y]  Memory/Randomness: n log p → 2 log p
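A sketch of the linear hash family from the slide. The mapping to {−1,+1} by parity, the concrete prime, and the degree-3 polynomial mentioned for the 4-wise independence needed later are standard extensions assumed here, not spelled out on the slide:

```python
import random

P = 2_147_483_647    # a prime larger than the domain size N (assumption: N < P)

class LinearHash:
    """Pairwise-independent hash h(i) = a*i + b (mod p); the seed (a, b) is 2 log p bits."""
    def __init__(self, p=P):
        self.p = p
        self.a = random.randrange(p)
        self.b = random.randrange(p)

    def __call__(self, i):
        return (self.a * i + self.b) % self.p

def sign_of(h, i):
    """Fold a hash value to {-1, +1} by its low-order bit (close to, but not exactly,
    uniform since p is odd).  For the 4-wise independence required by the variance
    bound, use a random degree-3 polynomial mod p instead of a linear one."""
    return 1 if h(i) & 1 else -1
```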

CS 361A 34 Wait a minute!  Doesn’t pairwise independence screw up proofs?  No – E[X^2] calculation only has degree-2 terms  But – what about Var[X^2]?  Need 4-wise independence

CS 361A 35 Application – Join-Size Estimation  Given Join attribute frequencies f_1 and f_2 Join size = f_1·f_2  Define – X_1 = f_1·Z and X_2 = f_2·Z  Choose – Z as 4-wise independent & uniform {-1,1}  Exercise: Show, as before, E[X_1 X_2] = f_1·f_2 Var[X_1 X_2] ≤ 2 |f_1|^2 |f_2|^2 Hint: a·b ≤ |a|·|b|

CS 361A 36 Bounding Error Probability  Using s copies of X’s & taking their mean Y Pr[|Y - f_1·f_2| ≥ λ f_1·f_2] ≤ Var(Y) / (λ^2 (f_1·f_2)^2) ≤ 2|f_1|^2|f_2|^2 / (sλ^2 (f_1·f_2)^2) = 2 / (sλ^2 cos^2 θ)  Bounding error probability? Need – s > 2/(λ^2 cos^2 θ)  Memory? – O(log(1/ε) · cos^{-2}θ · λ^{-2} (log N + log m))  Problem To choose s – need a-priori lower bound on cos θ = f_1·f_2 / (|f_1|·|f_2|) What if cos θ really small?

CS 361A 37 Sketch Partitioning  [figure: dom(R1.A) × dom(R2.B) – unpartitioned: self-join(R1.A)·self-join(R2.B) = 205·205 ≈ 42K; after partitioning: 200·5 + 200·5 = 2K]  Idea for dealing with the |f_1|^2|f_2|^2/(f_1·f_2)^2 issue – partition the domain into regions where the self-join sizes are smaller, to compensate for small join-size (cos θ)

CS 361A 38 Sketch Partitioning  Idea intelligently partition join-attribute space need coarse statistics on stream build independent sketches for each partition  Estimate = Σ partition sketches  Variance = Σ partition variances

CS 361A 39 Sketch Partitioning  Partition Space Allocation? Can solve optimally, given domain partition  Optimal Partition: Find K-partition to minimize  Results Dynamic Programming – optimal solution for single join NP-hard – for queries with multiple joins

CS 361A 40 F_k for k > 2  Assume – stream length m is known (Exercise: Show can fix with log m space overhead by repeated-doubling estimate of m.)  Choose – random stream item a_p with p uniform from {1,2,…,m}  Suppose – a_p = v ∈ {0,1,…,N-1}  Count subsequent frequency of v: r = |{q : q ≥ p, a_q = v}|  Define X = m(r^k – (r-1)^k)

CS 361A 41 Example  Stream 7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8  m = 20  p = 9  a_p = 5  r = 3
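A sketch of the basic estimator applied to the example stream (0-based indexing in the code; positions and values match the slide). Averaging many independent copies, as in the next slide, concentrates the estimate around F_k:

```python
import random

def fk_basic_estimator(stream, k):
    """One AMS basic estimator for F_k: pick a uniformly random position p,
    let r = #{q >= p : a_q = a_p}, and return X = m * (r^k - (r-1)^k)."""
    m = len(stream)
    p = random.randrange(m)
    v = stream[p]
    r = sum(1 for q in range(p, m) if stream[q] == v)
    return m * (r ** k - (r - 1) ** k)

stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]   # m = 20
# with p = 9 (1-based) the slide gets a_p = 5 and r = 3, so X = 20 * (3^k - 2^k)
avg = sum(fk_basic_estimator(stream, 3) for _ in range(10_000)) / 10_000  # ~ F_3
```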

CS 361A 42 F_k for k > 2  E[X] = F_k – summing over the m equally likely choices of p, the terms m(r^k – (r-1)^k) telescope to Σ_v m_v^k  Var(X) ≤ kN^{1–1/k} F_k^2  Bounded Error Probability → s = O(kN^{1–1/k} / λ^2)  Boosting → memory bound O(kN^{1–1/k} λ^{-2} (log 1/ε)(log N + log m))

CS 361A 43 Frequency Moments  F_0 – distinct values problem (Lecture 15)  F_1 – sequence length for case with deletions, use Cauchy distribution  F_2 – self-join size/Gini index (Today)  F_k for k > 2 omitting grungy details can achieve space bound O(kN^{1–1/k} λ^{-2} (log 1/ε)(log N + log m))  F_∞ – maximum frequency

CS 361A 44 Communication Complexity  Cooperatively compute function f(A,B) Minimize bits communicated Unbounded computational power  Communication Complexity C(f) – bits exchanged by optimal protocol Π  Protocols? 1-way versus 2-way deterministic versus randomized  C_δ(f) – randomized complexity for error probability δ  [figure: ALICE with input A, BOB with input B]

CS 361A 45 Streaming & Communication Complexity  Stream Algorithm → 1-way communication protocol  Simulation Argument Given – algorithm S computing f over streams Alice – initiates S, providing A as input stream prefix Communicates to Bob – S’s state after seeing A Bob – resumes S, providing B as input stream suffix  Theorem – Stream algorithm’s space requirement is at least the communication complexity C(f)

CS 361A 46 Example: Set Disjointness  Set Disjointness (DIS) A, B subsets of {1,2,…,N} Output – whether A ∩ B is empty  Theorem: C_δ(DIS) = Ω(N), for any δ < 1/2

CS 361A 47 Lower Bound for F_∞ Theorem: Fix ε < 1/3, δ < 1/2. Any stream algorithm S whose output F̂ satisfies P[(1-ε)F_∞ ≤ F̂ ≤ (1+ε)F_∞] > 1-δ needs Ω(N) space Proof Claim: S → 1-way protocol for DIS (on any sets A and B) Alice streams set A to S Communicates S’s state to Bob Bob streams set B to S Observe F_∞ = 1 if A, B disjoint and F_∞ ≥ 2 otherwise, so relative error ε < 1/3 → DIS solved exactly! P[error] < δ < ½ → Ω(N) space

CS 361A 48 Extensions  Observe Used only 1-way communication in proof C_δ(DIS) bound was for arbitrary communication Exercise – extend lower bound to multi-pass algorithms  Lower Bound for F_k, k > 2 Need to increase gap beyond 2 Multiparty Set Disjointness – t players Theorem: Fix ε, δ < 1/2 and k > 5. Any stream algorithm S whose output F̂ satisfies P[(1-ε)F_k ≤ F̂ ≤ (1+ε)F_k] > 1-δ needs Ω(N^{1-(2+δ)/k}) space Implies Ω(N^{1/2}) even for multi-pass algorithms

CS 361A 49 Tracking High-Frequency Items

CS 361A 50 Problem 1 – Top-K List [Charikar-Chen-Farach-Colton]  The Google Problem Return list of k most frequent items in stream Motivation search engine queries, network traffic, … Remember Saw lower bound recently! Solution Data structure Count-Sketch → maintaining count-estimates of high-frequency elements

CS 361A 51 Definitions  Notation Assume {1, 2, …, N} in order of frequency m_i is frequency of i-th most frequent element m = Σ m_i is number of elements in stream  FindCandidateTop Input: stream S, int k, int p Output: list of p elements containing top k Naive sampling gives solution with p = O(m log k / m_k)  FindApproxTop Input: stream S, int k, real ε Output: list of k elements, each of frequency m_i > (1-ε) m_k Naive sampling gives no solution

CS 361A 52 Main Idea  Consider single counter X hash function h(i): {1, 2,…,N} → {-1,+1}  Input element i → update counter X += Z_i = h(i)  For each r, use X·Z_r as estimator of m_r Theorem: E[X·Z_r] = m_r Proof X = Σ_i m_i Z_i E[X·Z_r] = E[Σ_i m_i Z_i Z_r] = Σ_i m_i E[Z_i Z_r] = m_r E[Z_r^2] = m_r Cross-terms cancel

CS 361A 53 Finding Max Frequency Element  Problem – var[X] = F_2 = Σ_i m_i^2  Idea – t counters, independent 4-wise hashes h_1,…,h_t  Use t = O(log m · Σ_i m_i^2 / (ε m_1)^2)  Claim: New Variance < Σ_i m_i^2 / t = (ε m_1)^2 / log m  Overall Estimator repeat + median of averages with high probability, approximate m_1  [figure: t counters with hashes h_1, …, h_t : i → {+1, –1}]

CS 361A 54 Problem with “Array of Counters”  Variance – dominated by highest frequency  Estimates for less-frequent elements like k corrupted by higher frequencies variance >> m_k  Avoiding Collisions? spread out high frequency elements replace each counter with hashtable of b counters

CS 361A 55 Count Sketch  Hash Functions 4-wise independent hashes h_1,...,h_t and s_1,…,s_t hashes independent of each other  Data structure: hashtables of counters X(r,c)  [figure: t hashtables with b counters each (columns 1, 2, …, b); s_r : i → {1,..., b} picks the counter, h_r : i → {+1, -1} picks the sign]

CS 361A 56 Overall Algorithm  s_r(i) – one of b counters in rth hashtable  Input i → for each r, update X(r, s_r(i)) += h_r(i)  Estimator(m_i) = median_r { X(r, s_r(i)) · h_r(i) }  Maintain heap of k top elements seen so far  Observe Not completely eliminated collision with high frequency items Few of estimates X(r, s_r(i)) · h_r(i) could have high variance Median not sensitive to these poor estimates
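A compact Count-Sketch following the update and estimator rules above. The hash values are memoized in dictionaries for readability; the construction in the slides uses 4-wise independent hash families instead:

```python
import random
import statistics

class CountSketch:
    """t hashtables of b counters. Update: X(r, s_r(i)) += h_r(i).
    Estimate(i): median over r of X(r, s_r(i)) * h_r(i)."""
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.X = [[0] * b for _ in range(t)]
        self._s = [dict() for _ in range(t)]   # s_r : i -> {0, ..., b-1}
        self._h = [dict() for _ in range(t)]   # h_r : i -> {-1, +1}

    def _sr(self, r, i):
        if i not in self._s[r]:
            self._s[r][i] = random.randrange(self.b)
        return self._s[r][i]

    def _hr(self, r, i):
        if i not in self._h[r]:
            self._h[r][i] = random.choice((-1, 1))
        return self._h[r][i]

    def update(self, i, a=1):
        for r in range(self.t):
            self.X[r][self._sr(r, i)] += a * self._hr(r, i)

    def estimate(self, i):
        return statistics.median(self.X[r][self._sr(r, i)] * self._hr(r, i)
                                 for r in range(self.t))
```

Feeding every arriving item to update and keeping a heap of the k items with the largest estimate values gives the Top-K procedure sketched on the slide.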

CS 361A 57 Avoiding Large Items  b > O(k) → with probability Ω(1), no collision with top-k elements  t hashtables represent independent trials  Need log(m/δ) trials to estimate with probability 1-δ  Also need – small variance for colliding small elements  Claim: P[variance due to small items in each estimate ≤ (Σ_{i>k} m_i^2)/b] = Ω(1)  Final bound b = O(k + Σ_{i>k} m_i^2 / (ε m_k)^2)

CS 361A 58 Final Results  Zipfian Distribution: m_i ∝ 1/i^z for Zipf parameter z [Power Law]  FindApproxTop [k + (Σ_{i>k} m_i^2) / (ε m_k)^2] · log(m/δ) Roughly: sampling bound with frequencies squared Zipfian – gives improved results  FindCandidateTop Zipf parameter 0.5 → O(k log N log m) Compare: sampling bound O((kN)^{0.5} log k)

CS 361A 59 Problem 2 – Elephants-and-Ants [Manku-Motwani]  Identify items whose current frequency exceeds support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]

CS 361A 60 Algorithm 1: Lossy Counting Step 1: Divide the stream into ‘windows’ Window-size w is function of support s – specify later… [figure: stream divided into Window 1, Window 2, Window 3]

CS 361A 61 Lossy Counting in Action... [figure: counts accumulate within a window, starting from an empty table] At window boundary, decrement all counters by 1

CS 361A 62 Lossy Counting (continued) At window boundary, decrement all counters by 1 [figure: surviving counters carried into the next window]

CS 361A 63 Error Analysis How much do we undercount? If current size of stream = N and window-size w = 1/ε then # windows = εN, and each counter is decremented at most once per window → frequency error ≤ number of windows = εN Rule of thumb: Set ε = 10% of support s Example: Given support frequency s = 1%, set error frequency ε = 0.1%

CS 361A 64 Output: Elements with counter values exceeding (s-ε)N Approximation guarantees Frequencies underestimated by at most εN No false negatives False positives have true frequency at least (s–ε)N Putting it all together… How many counters do we need?  Worst case bound: (1/ε) log(εN) counters  Implementation details…
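A sketch of the simplified Lossy Counting described on these slides (fixed windows of size w = 1/ε, decrement-and-drop at each boundary); the dictionary representation and the strict output threshold are choices of this sketch:

```python
from collections import defaultdict

def lossy_counting(stream, s, eps):
    """Report items whose counter exceeds (s - eps) * N; counts are undercounted by <= eps*N."""
    w = int(1 / eps)                 # window size
    counts = defaultdict(int)
    N = 0
    for x in stream:
        N += 1
        counts[x] += 1
        if N % w == 0:               # window boundary: decrement all counters, drop zeros
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return {x: c for x, c in counts.items() if c > (s - eps) * N}
```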

CS 361A 65 Number of Counters?  Window size w = 1/ε  Number of windows m = εN  n_i – # counters alive over last i windows  Fact: Σ_{j=1}^{i} j·n_j ≤ i·w for every i  Claim: Counter must average 1 increment/window to survive  # active counters = Σ_i n_i ≤ Σ_i w/i ≈ (1/ε) log(εN)

CS 361A 66 Enhancements Frequency Errors For counter (X, c), true frequency in [c, c+εN] Trick: Track number of windows t counter has been active For counter (X, c, t), true frequency in [c, c+t-1] Batch Processing Decrements after k windows If (t = 1), no error!

CS 361A 67 Algorithm 2: Sticky Sampling  Create counters by sampling  Maintain exact counts thereafter What is sampling rate? [figure: stream with sampled items receiving counters]

CS 361A 68 Sticky Sampling (continued) For finite stream of length N Sampling rate = 2/(εN) · log(1/(sδ)) Same Rule of thumb: Set ε = 10% of support s Example: Given support threshold s = 1%, set error threshold ε = 0.1% set failure probability δ = 0.01% Output: Elements with counter values exceeding (s-ε)N Same error guarantees as Lossy Counting but probabilistic Approximation guarantees (probabilistic) Frequencies underestimated by at most εN No false negatives False positives have true frequency at least (s-ε)N δ = probability of failure

CS 361A 69 Number of counters? Finite stream of length N Sampling rate: 2/(εN) · log(1/(sδ)) Independent of N Infinite stream with unknown N Gradually adjust sampling rate In either case, Expected number of counters = (2/ε) log(1/(sδ))
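A rough sketch of Sticky Sampling for a finite stream of known length N, using the sampling rate from the previous slide; the infinite-stream version, which gradually adjusts the rate, is omitted here:

```python
import math
import random

def sticky_sampling(stream, s, eps, delta):
    """Create a counter for an uncounted item with probability `rate`; count exactly thereafter.
    Report items whose counter exceeds (s - eps) * N."""
    items = list(stream)
    N = len(items)
    rate = min(1.0, 2.0 / (eps * N) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in items:
        if x in counts:
            counts[x] += 1                    # maintain exact counts thereafter
        elif random.random() < rate:
            counts[x] = 1                     # create counters by sampling
    return {x: c for x, c in counts.items() if c > (s - eps) * N}
```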

CS 361A 70 References – Synopses  Synopsis data structures for massive data sets. Gibbons and Matias. DIMACS.  Tracking Join and Self-Join Sizes in Limited Storage. Alon, Gibbons, Matias, and Szegedy. PODS.  Join Synopses for Approximate Query Answering. Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD.  Random Sampling for Histogram Construction: How much is enough? Chaudhuri, Motwani, and Narasayya. SIGMOD.  Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Manku, Rajagopalan, and Lindsay. SIGMOD.  Space-efficient online computation of quantile summaries. Greenwald and Khanna. SIGMOD.

CS 361A 71 References – Sampling  Random Sampling with a Reservoir. Vitter. Transactions on Mathematical Software 11(1):37-57 (1985).  On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering (1999).  On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD.  Congressional Samples for Approximate Answering of Group-By Queries. Acharya, Gibbons, and Poosala. SIGMOD.  Overcoming Limitations of Sampling for Aggregation Queries. Chaudhuri, Das, Datar, Motwani and Narasayya. ICDE.  A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Chaudhuri, Das and Narasayya. SIGMOD 01.  Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA.  Sampling algorithms: lower bounds and applications. Bar-Yossef, Kumar, and Sivakumar. STOC.

CS 361A 72 References – Sketches  Probabilistic counting algorithms for data base applications. Flajolet and Martin. JCSS (1985).  The space complexity of approximating the frequency moments. Alon, Matias, and Szegedy. STOC.  Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB.  Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP.  An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS.  Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS.