CS 361A (Advanced Data Structures and Algorithms)
Lectures 16 & 17 (Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani

Game Plan for Week
- Last Class
  - Models for Streaming/Massive Data Sets
  - Negative results for Exact Distinct Values
  - Hashing for Approximate Distinct Values
- Today
  - Synopsis Data Structures
  - Sampling Techniques
  - Frequency Moments Problem
  - Sketching Techniques
  - Finding High-Frequency Items

Synopsis Data Structures
- Synopses
  - Webster: a condensed statement or outline (as of a narrative or treatise)
  - CS 361A: succinct data structure that lets us answer queries efficiently
- Synopsis Data Structures
  - "Lossy" summary (of a data stream)
  - Advantages: fits in memory + easy to communicate
  - Disadvantage: lossiness implies approximation error
  - Negative results ⇒ best we can do
  - Key techniques: randomization and hashing

Numerical Examples
- Approximate Query Processing [AQUA/Bell Labs]
  - Database size: 420 MB
  - Synopsis size: 420 KB (0.1%)
  - Approximation error: within 10%
  - Running time: 0.3% of time for exact query
- Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
  - Data size: $10^9$ items
  - Synopsis size: 1249 items
  - Approximation error: within 1%

Synopses
- Desiderata
  - Small memory footprint
  - Quick update and query
  - Provable, low-error guarantees
  - Composable: for distributed scenario
- Applicability?
  - General-purpose: e.g. random samples
  - Specific-purpose: e.g. distinct values estimator
- Granularity?
  - Per database: e.g. sample of entire table
  - Per distinct value: e.g. customer profiles
  - Structural: e.g. GROUP-BY or JOIN result samples

Examples of Synopses
- Synopses need not be fancy!
  - Simple aggregates: e.g. mean/median/max/min
  - Variance?
- Random Samples
  - Aggregates on small samples represent entire data
  - Leverage extensive work on confidence intervals
- Random Sketches
  - structured samples
- Tracking High-Frequency Items

Random Samples

Types of Samples
- Oblivious sampling: at item level
  - Limitations [Bar-Yossef-Kumar-Sivakumar STOC 01]
- Value-based sampling: e.g. distinct-value samples
- Structured samples: e.g. join sampling
  - Naïve approach: keep samples of each relation
  - Problem: sample-of-join ≠ join-of-samples
  - Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
  - What if A is sampled from L and B from R?
  - [Figure: relations L and R over join values A and B]

Basic Scenario
- Goal: maintain uniform sample of item-stream
- Sampling semantics?
  - Coin flip
    - select each item with probability p
    - easy to maintain
    - undesirable: sample size is unbounded
  - Fixed-size sample without replacement
    - Our focus today
  - Fixed-size sample with replacement
    - Show: can generate from previous sample
- Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]

Reservoir Sampling [Vitter]
- Input: stream of items $X_1, X_2, X_3, \dots$
- Goal: maintain uniform random sample S of size n (without replacement) of stream so far
- Reservoir Sampling
  - Initialize: include first n elements in S
  - Upon seeing item $X_t$
    - Add $X_t$ to S with probability n/t
    - If added, evict random previous item
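The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here.

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Uniform sample of size n (without replacement) of the stream seen so far."""
    rng = rng or random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            sample.append(item)                 # initialize with the first n items
        elif rng.random() < n / t:              # admit X_t with probability n/t
            sample[rng.randrange(n)] = item     # evict a uniformly random resident
    return sample
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct values, each stream item being retained with probability 10/1000.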

Analysis
- Correctness?
  - Fact: at each instant, |S| = n
  - Theorem: at time t, any $X_i \in S$ with probability n/t
  - Exercise: prove via induction on t
- Efficiency?
  - Let N be stream size
  - Expected number of insertions into S: $n + \sum_{t=n+1}^{N} n/t = O(n(1 + \log(N/n)))$
  - Remark: verify this is optimal
- Naïve implementation ⇒ N coin flips ⇒ time O(N)

Improving Efficiency
- Random variable $J_t$: number of items jumped over after time t
- Idea: generate $J_t$ and skip that many items
- Cumulative distribution function: $F(s) = P[J_t \le s]$, for t>n and s≥0
- [Figure: stream $X_1, \dots, X_{14}$ with the items inserted into sample S marked (n=3); $J_3 = 2$, $J_9 = 4$]

Analysis
- Number of calls to RANDOM()?
  - one per insertion into sample
  - this is optimal!
- Generating $J_t$?
  - Pick random number $U \in [0,1]$
  - Find smallest j such that $U \le F(j)$
  - How?
    - Linear scan ⇒ O(N) time
    - Binary search with Newton's interpolation ⇒ $O(n^2 (1 + \text{polylog}(N/n)))$ time
- Remark: see paper for optimal algorithm
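A sketch of the linear-scan option, under the assumption that the skip distribution is the one induced by the basic rule (each item $X_j$ is admitted with probability n/j), so that $F(s) = 1 - \prod_{j=t+1}^{t+s+1} (1 - n/j)$. The names `next_skip` and `reservoir_sample_with_skips` are hypothetical, and the input is taken as a list for simplicity. This is not the optimal algorithm from the paper.

```python
import random

def next_skip(t, n, rng):
    """Draw J_t by linear scan: smallest s with U <= F(s)."""
    u = rng.random()
    s = 0
    prob_all_skipped = 1.0 - n / (t + 1)        # P[J_t > 0]
    while 1.0 - prob_all_skipped < u:           # F(s) < U, keep scanning
        s += 1
        prob_all_skipped *= 1.0 - n / (t + s + 1)
    return s

def reservoir_sample_with_skips(items, n, rng=None):
    """Reservoir sampling that draws one skip length per insertion."""
    rng = rng or random.Random()
    sample = list(items[:n])
    t = n
    while True:
        t += next_skip(t, n, rng) + 1           # 1-based index of next inserted item
        if t > len(items):
            return sample
        sample[rng.randrange(n)] = items[t - 1]
```

Each insertion costs one skip draw plus one eviction draw, so the work spent on random numbers is proportional to the number of insertions rather than to N.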

Sampling over Sliding Windows [Babcock-Datar-Motwani]
- Sliding window W: last w items in stream
- Model: item $X_t$ expires at time t+w
- Why?
  - Applications may require ignoring stale data
  - Type of approximation
  - Only way to define JOIN over streams
- Goal: maintain uniform sample of size n of sliding window

Reservoir Sampling?
- Observe
  - any item in sample S will expire eventually
  - must replace it with a random item of the current window
- Problem
  - no access to items in W-S
  - storing entire window requires O(w) memory
- Oversampling
  - Backing sample B: select each item with probability [expression missing]
  - Sample S: select n items from B at random
  - Upon expiry in S ⇒ replenish from B
  - Claim: n < |B| < n log w with high probability

Index-Set Approach
- Pick random index set $I = \{i_1, \dots, i_n\} \subseteq \{0, 1, \dots, w-1\}$
- Sample S: items $X_i$ with $i \in \{i_1, \dots, i_n\} \pmod{w}$ in current window
- Example
  - Suppose w=2, n=1, and I={1}
  - Then the sample is always $X_i$ with odd i
- Memory: only O(n)
- Observe
  - S is a uniform random sample of each window
  - But the sample is periodic (union of arithmetic progressions)
  - Correlation across successive windows
- Problems
  - Correlation may hurt in some applications
  - Some data (e.g. time-series) may be periodic

Chain-Sample Algorithm
- Idea
  - Fix expiry problem in Reservoir Sampling
  - Advance planning for expiry of sampled items
  - Focus on sample size 1: keep n independent such samples
- Chain-Sampling
  - Add $X_t$ to S with probability 1/min{t,w}: evict earlier sample
  - Initially: standard Reservoir Sampling up to time w
  - Pre-select $X_t$'s replacement $X_r \in W_{t+w} = \{X_{t+1}, \dots, X_{t+w}\}$
    - $X_t$ expires ⇒ must replace from $W_{t+w}$
    - At time r, save $X_r$ and pre-select its own replacement ⇒ building "chain" of potential replacements
  - Note: if evicting earlier sample, discard its "chain" as well
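One way to read the chain-sampling rules for a single sample (sample size 1) is sketched below. The generator name `chain_sample` and its bookkeeping (`chain`, `next_idx`) are assumptions made for illustration; item $X_t$ is treated as expired once its index is at most t-w.

```python
import random

def chain_sample(stream, w, rng=None):
    """Yield (t, sample) where sample is drawn from the last w items."""
    rng = rng or random.Random()
    chain = []        # (index, value) pairs; chain[0] is the current sample
    next_idx = None   # index pre-selected to replace the tail of the chain
    for t, x in enumerate(stream, start=1):
        if rng.random() < 1.0 / min(t, w):
            chain = [(t, x)]                       # new sample, discard old chain
            next_idx = rng.randint(t + 1, t + w)   # pre-select its replacement
        elif t == next_idx:
            chain.append((t, x))                   # save the pre-selected item
            next_idx = rng.randint(t + 1, t + w)   # and pre-select its successor
        if chain[0][0] <= t - w:                   # current sample has expired
            chain.pop(0)                           # promote its replacement
        yield t, chain[0][1]
```

To maintain a sample of size n, one would run n independent copies of this generator, as the slide suggests.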

Example
- Stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3

Expectation for Chain-Sample
- $T(x) = E[\text{chain length for } X_t \text{ at time } t+x]$
- $E[\text{chain length}] = T(w) \le e \approx 2.718$
- E[memory required for sample size n] = O(n)

Tail Bound for Chain-Sample
- Chain = "hops" of total length at most w
- Chain of h hops ⇒ ordered (h+1)-partition of w
  - h hops of total length less than w
  - plus, remainder
- Each partition has probability $w^{-h}$
- Number of partitions: [expression missing]
- h = O(log w) ⇒ probability of a partition is $O(w^{-c})$
- Thus: memory O(n log w) with high probability

Comparison of Algorithms
- Chain-Sample beats Oversample:
  - Expected memory: O(n) vs O(n log w)
  - High-probability memory bound: both O(n log w)
  - Oversample may have sample size shrink below n!

Algorithm      Expected      High-Probability
Periodic       O(n)          O(n)
Oversample     O(n log w)    O(n log w)
Chain-Sample   O(n)          O(n log w)

Sketches and Frequency Moments

Generalized Stream Model
- Input element (i,a)
  - a copies of domain-value i
  - increment to i-th dimension of m by a
  - a need not be an integer
- Negative value: captures deletions
- Data stream: 2, 0, 1, 3, 1, 2, 4, ...
- [Figure: bar chart of the frequencies $m_0, m_1, m_2, m_3, m_4$]

Example
- After the stream 2, 0, 1, 3, 1, 2, 4: $(m_0, m_1, m_2, m_3, m_4) = (1, 2, 2, 1, 1)$
- On seeing element (i,a) = (2,2): $(1, 2, 4, 1, 1)$
- On seeing element (i,a) = (1,-1): $(1, 1, 4, 1, 1)$

Frequency Moments
- Input stream
  - values from U = {0,1,…,N-1}
  - frequency vector $m = (m_0, m_1, \dots, m_{N-1})$
- k-th frequency moment $F_k(m) = \sum_i m_i^k$
  - $F_0$: number of distinct values (Lecture 15)
  - $F_1$: stream size
  - $F_2$: Gini index, self-join size, Euclidean norm
  - $F_k$: for k>2, measures skew, sometimes useful
  - $F_\infty$: maximum frequency
- Problem: estimation in small space
- Sketches: randomized estimators
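As a baseline for the estimators that follow, the moments can be computed exactly when one counter per value is affordable. The sketch below replays the stream 2, 0, 1, 3, 1, 2, 4 from the generalized stream model, followed by the updates (2,2) and (1,-1). The helper names `apply_updates` and `frequency_moment` are illustrative, and frequencies are assumed to be non-negative.

```python
from collections import Counter

def apply_updates(updates):
    """Build the frequency vector m from (i, a) updates."""
    m = Counter()
    for i, a in updates:
        m[i] += a
    return m

def frequency_moment(m, k):
    """Exact F_k of a frequency vector with non-negative entries."""
    values = [v for v in m.values() if v != 0]
    if k == 0:
        return len(values)                 # number of distinct values
    if k == float("inf"):
        return max(values)                 # maximum frequency
    return sum(v ** k for v in values)

updates = [(v, 1) for v in [2, 0, 1, 3, 1, 2, 4]] + [(2, 2), (1, -1)]
m = apply_updates(updates)
print([frequency_moment(m, k) for k in (0, 1, 2, float("inf"))])  # [5, 8, 20, 4]
```

This uses space proportional to the number of distinct values, which is exactly what the sketching techniques aim to avoid.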

Naive Approaches
- Space N: counter $m_i$ for each distinct value i
- Space O(1)
  - if input sorted by i
  - single counter recycled when new i value appears
- Goal
  - Allow arbitrary input
  - Use small (logarithmic) space
  - Settle for randomization/approximation

Sketching F_2
- Random hash $h(i): \{0, 1, \dots, N-1\} \to \{-1, 1\}$
- Define $Z_i = h(i)$
- Maintain $X = \sum_i m_i Z_i$
  - Easy for update streams (i,a): just add $a Z_i$ to X
- Claim: $X^2$ is an unbiased estimator for $F_2$
- Proof:
  $E[X^2] = E[(\sum_i m_i Z_i)^2]$
  $= E[\sum_i m_i^2 Z_i^2] + E[\sum_{i \ne j} m_i m_j Z_i Z_j]$
  $= \sum_i m_i^2 E[Z_i^2] + \sum_{i \ne j} m_i m_j E[Z_i] E[Z_j]$
  $= \sum_i m_i^2 + 0 = F_2$
- Last line? $Z_i^2 = 1$ and $E[Z_i] = 0$ since $Z_i$ is uniform on {-1,1}; $E[Z_i Z_j] = E[Z_i] E[Z_j]$ from independence

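Estimation Error?
- Chebyshev bound: $P[|Y - E[Y]| \ge \lambda E[Y]] \le Var[Y] / (\lambda^2 E[Y]^2)$
- Define $Y = X^2$ ⇒ $E[Y] = E[X^2] = \sum_i m_i^2 = F_2$
- Observe (sums over distinct indices; $\sum m_i^2 m_j^2$ over pairs $i < j$)
  $E[X^4] = E[(\sum m_i Z_i)^4]$
  $= E[\sum m_i^4 Z_i^4] + 4E[\sum m_i m_j^3 Z_i Z_j^3] + 6E[\sum m_i^2 m_j^2 Z_i^2 Z_j^2] + 12E[\sum m_i m_j m_k^2 Z_i Z_j Z_k^2] + 24E[\sum m_i m_j m_k m_l Z_i Z_j Z_k Z_l]$
  $= \sum m_i^4 + 6 \sum m_i^2 m_j^2$
- By definition
  $Var[Y] = E[Y^2] - E[Y]^2 = E[X^4] - E[X^2]^2$
  $= [\sum m_i^4 + 6 \sum m_i^2 m_j^2] - [\sum m_i^4 + 2 \sum m_i^2 m_j^2]$
  $= 4 \sum m_i^2 m_j^2 \le 2 E[X^2]^2 = 2 F_2^2$
- Why? Because $E[X^2]^2 = \sum m_i^4 + 2 \sum m_i^2 m_j^2 \ge 2 \sum m_i^2 m_j^2$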

Estimation Error?
- Chebyshev bound ⇒ $P[\text{relative estimation error} > \lambda] \le Var[Y] / (\lambda^2 E[Y]^2) \le 2/\lambda^2$
- Problem: what if we want λ really small?
- Solution
  - Compute $s = 8/\lambda^2$ independent copies of X
  - Estimator $Y = \text{mean}(X_i^2)$
  - Variance reduces by factor s
- ⇒ $P[\text{relative estimation error} > \lambda] \le 2/(s\lambda^2) = 1/4$

Boosting Technique
- Algorithm A: randomized λ-approximate estimator f
  - $P[(1-\lambda) f^* \le f \le (1+\lambda) f^*] \ge 3/4$
- Heavy tail problem: $P[f^*-z, f^*, f^*+z] = [1/16, 3/4, 3/16]$
- Boosting idea
  - O(log 1/ε) independent estimates from A(X)
  - Return median of estimates
- Claim: P[median is λ-approximate] > 1-ε
- Proof:
  - P[specific estimate is λ-approximate] ≥ 3/4
  - Bad event only if >50% of estimates are not λ-approximate
  - Binomial tail: probability less than ε
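The two-stage estimator (mean of s copies of $X^2$, then median over groups) might be sketched as follows. The class `F2Sketch` is hypothetical. For simplicity it stores fully independent random signs per value in a dictionary, which costs memory proportional to the number of distinct values; that storage cost is an assumption of this illustration, not part of the technique.

```python
import random
from statistics import mean, median

class F2Sketch:
    def __init__(self, s, groups, seed=0):
        self.s, self.groups = s, groups
        self.rng = random.Random(seed)
        self.signs = {}                       # value i -> one +/-1 sign per counter
        self.X = [0] * (s * groups)           # one counter X per (group, copy)

    def _z(self, i):
        if i not in self.signs:               # draw Z_i on first appearance of i
            self.signs[i] = [self.rng.choice((-1, 1)) for _ in self.X]
        return self.signs[i]

    def update(self, i, a=1):
        """Process stream element (i, a): add a * Z_i to every counter."""
        for c, z in enumerate(self._z(i)):
            self.X[c] += a * z

    def estimate(self):
        """Median over groups of the mean of X^2 within each group."""
        s = self.s
        means = [mean(x * x for x in self.X[g * s:(g + 1) * s])
                 for g in range(self.groups)]
        return median(means)
```

With `s` around $8/\lambda^2$ and `groups` around log(1/ε), this matches the parameter choices on the slides; negative `a` handles deletions because the sketch is linear in m.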

Overall Space Requirement
- Observe
  - Let $m = \sum m_i$
  - Each hash needs O(log m)-bit counter
  - $s = 8/\lambda^2$ hash functions for each estimator
  - O(log 1/ε) such estimators
- Total $O(\lambda^{-2} \log(1/\varepsilon) \log m)$ bits
- Question: space for storing hash function?

Sketching Paradigm
- Random sketch: inner product
  - frequency vector $m = (m_0, m_1, \dots, m_{N-1})$
  - random vector Z (currently, uniform {-1,1})
- Observe
  - Linearity ⇒ $\text{Sketch}(m_1) \pm \text{Sketch}(m_2) = \text{Sketch}(m_1 \pm m_2)$
  - Ideal for distributed computing
- Observe
  - Suppose: given i, can efficiently generate $Z_i$
  - Then: can maintain sketch for update streams
  - Problem
    - Must generate $Z_i = h(i)$ on first appearance of i
    - Need Ω(N) memory to store h explicitly
    - Need Ω(N) random bits

Two birds, One stone
- Pairwise independent $Z_1, Z_2, \dots, Z_n$
  - for all $Z_i$ and $Z_k$, $P[Z_i = x, Z_k = y] = P[Z_i = x] \cdot P[Z_k = y]$
  - property: $E[Z_i Z_k] = E[Z_i] \cdot E[Z_k]$
- Example: linear hash function
  - Seed S = (a, b) from [0..p-1], where p is prime
  - $Z_i = h(i) = ai + b \pmod{p}$
- Claim: $Z_1, Z_2, \dots, Z_n$ are pairwise independent
  - $Z_i = x$ and $Z_k = y$ ⇔ $x = ai + b \pmod{p}$ and $y = ak + b \pmod{p}$
  - fixing i, k, x, y ⇒ unique solution for a, b
  - $P[Z_i = x, Z_k = y] = 1/p^2 = P[Z_i = x] \cdot P[Z_k = y]$
- Memory/Randomness: n log p ⇒ 2 log p
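The pairwise-independence claim can be checked by brute force for a small prime: enumerating every seed (a, b) and recording the pair $(Z_i, Z_k)$ should hit each (x, y) exactly once. The helper names below are illustrative, not from the lecture.

```python
from collections import Counter
from itertools import product

def linear_hash(a, b, p):
    """h(i) = a*i + b (mod p) for the seed (a, b)."""
    return lambda i: (a * i + b) % p

def joint_distribution(i, k, p):
    """Count how many seeds (a, b) give each pair (h(i), h(k))."""
    counts = Counter()
    for a, b in product(range(p), repeat=2):
        h = linear_hash(a, b, p)
        counts[(h(i), h(k))] += 1
    return counts

counts = joint_distribution(i=2, k=5, p=7)
print(len(counts), set(counts.values()))  # 49 {1}
```

Every one of the $p^2 = 49$ value pairs arises from exactly one seed, so each has probability $1/p^2$, as the claim states.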

Wait a minute!
- Doesn't pairwise independence break the proofs?
- No: the $E[X^2]$ calculation only has degree-2 terms
- But: what about $Var[X^2]$?
- Need 4-wise independence

Application: Join-Size Estimation
- Given
  - Join attribute frequencies $f_1$ and $f_2$
  - Join size = $f_1 \cdot f_2$
- Define: $X_1 = f_1 \cdot Z$ and $X_2 = f_2 \cdot Z$
- Choose: Z as 4-wise independent and uniform {-1,1}
- Exercise: show, as before,
  - $E[X_1 X_2] = f_1 \cdot f_2$
  - $Var[X_1 X_2] \le 2 |f_1|^2 |f_2|^2$
  - Hint: $a \cdot b \le |a| \cdot |b|$

Bounding Error Probability
- Using s copies of X's and taking their mean Y
  $Pr[|Y - f_1 \cdot f_2| \ge \lambda f_1 \cdot f_2] \le Var(Y) / (\lambda^2 (f_1 \cdot f_2)^2)$
  $\le 2 |f_1|^2 |f_2|^2 / (s \lambda^2 (f_1 \cdot f_2)^2)$
  $= 2 / (s \lambda^2 \cos^2\theta)$
- Bounding error probability? Need $s > 2/(\lambda^2 \cos^2\theta)$
- Memory? $O(\log(1/\varepsilon) \cos^{-2}\theta \, \lambda^{-2} (\log N + \log m))$
- Problem
  - To choose s: need a-priori lower bound on $\cos\theta = f_1 \cdot f_2 / (|f_1| |f_2|)$
  - What if cos θ is really small?
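A sketch of the join-size estimator with a compact hash in place of stored signs. The sign hash here is a random cubic polynomial over a prime field, a standard way to get 4-wise independent values; taking the low-order bit as the sign introduces a negligible bias and is an assumption of this illustration. The names `make_sign_hash` and `estimate_join_size` are hypothetical, and join values are assumed to be integers.

```python
import random
from statistics import mean

P = (1 << 61) - 1   # a Mersenne prime, used as the field size (assumed choice)

def make_sign_hash(rng):
    """Random degree-3 polynomial mod P, mapped to a sign in {-1, +1}."""
    coef = [rng.randrange(P) for _ in range(4)]
    def z(i):
        v = 0
        for c in coef:                 # Horner evaluation of the polynomial
            v = (v * i + c) % P
        return 1 if v & 1 else -1
    return z

def estimate_join_size(f1, f2, s, seed=0):
    """Mean of s copies of X1 * X2, each copy using a fresh hash Z."""
    rng = random.Random(seed)
    products = []
    for _ in range(s):
        z = make_sign_hash(rng)
        x1 = sum(a * z(i) for i, a in f1.items())   # X1 = f1 . Z
        x2 = sum(a * z(i) for i, a in f2.items())   # X2 = f2 . Z
        products.append(x1 * x2)
    return mean(products)
```

In a streaming setting `x1` and `x2` would be maintained incrementally per update; the dictionaries here only stand in for the two streams.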

Sketch Partitioning
- [Figure: frequency histograms over dom(R1.A) and dom(R2.B), with frequencies 10, 2, 1]
- Without partitioning: self-join(R1.A) * self-join(R2.B) = 205 * 205 = 42K
- With two partitions: self-join(R1.A) * self-join(R2.B) + self-join(R1.A) * self-join(R2.B) = 200*5 + 200*5 = 2K
- Idea for dealing with the $|f_1|^2 |f_2|^2 / (f_1 \cdot f_2)^2$ issue: partition the domain into regions where the self-join size is smaller, to compensate for a small join size (cos θ)

Sketch Partitioning
- Idea
  - intelligently partition join-attribute space
  - need coarse statistics on stream
  - build independent sketches for each partition
- Estimate = Σ partition sketches
- Variance = Σ partition variances

Sketch Partitioning
- Partition space allocation?
  - Can solve optimally, given domain partition
- Optimal partition: find K-partition to minimize [expression missing]
- Results
  - Dynamic programming: optimal solution for single join
  - NP-hard: for queries with multiple joins

F_k for k > 2
- Assume: stream length m is known
  (Exercise: show this can be fixed with log m space overhead by a repeated-doubling estimate of m.)
- Choose: random stream item $a_p$ ⇒ p uniform from {1,2,…,m}
- Suppose: $a_p = v \in \{0, 1, \dots, N-1\}$
- Count subsequent frequency of v: $r = |\{q \mid q \ge p, a_q = v\}|$
- Define $X = m(r^k - (r-1)^k)$

Example
- Stream: 7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8
- m = 20
- p = 9
- $a_p = 5$
- r = 3
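The example can be replayed directly. The function below (a hypothetical helper, taking the whole stream as a list rather than processing it in one pass) computes $X = m(r^k - (r-1)^k)$ for a given position p, and the final check illustrates why X is unbiased: averaging X over all m choices of p telescopes to $F_k$.

```python
from collections import Counter

def fk_basic_estimate(stream, p, k):
    """X = m * (r^k - (r-1)^k) for 1-based position p."""
    m = len(stream)
    v = stream[p - 1]
    r = sum(1 for q in range(p - 1, m) if stream[q] == v)  # occurrences at q >= p
    return m * (r ** k - (r - 1) ** k)

stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
k = 3
print(fk_basic_estimate(stream, 9, k))  # 20 * (3^3 - 2^3) = 380

exact_fk = sum(c ** k for c in Counter(stream).values())
average = sum(fk_basic_estimate(stream, p, k)
              for p in range(1, len(stream) + 1)) / len(stream)
assert average == exact_fk
```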

F_k for k > 2
- $E[X] = F_k$: by summing over the m choices of stream element
- $Var(X) \le k N^{1-1/k} F_k^2$
- Bounded error probability ⇒ $s = O(k N^{1-1/k} / \lambda^2)$
- Boosting ⇒ memory bound $O(k N^{1-1/k} \lambda^{-2} (\log 1/\varepsilon)(\log N + \log m))$

Frequency Moments
- $F_0$: distinct values problem (Lecture 15)
- $F_1$: sequence length
  - for case with deletions, use Cauchy distribution
- $F_2$: self-join size/Gini index (Today)
- $F_k$ for k > 2
  - omitting grungy details
  - can achieve space bound $O(k N^{1-1/k} \lambda^{-2} (\log 1/\varepsilon)(\log N + \log m))$
- $F_\infty$: maximum frequency

Communication Complexity
- Setting: Alice holds input A, Bob holds input B
- Cooperatively compute function f(A,B)
  - Minimize bits communicated
  - Unbounded computational power
- Communication complexity C(f): bits exchanged by optimal protocol Π
- Protocols?
  - 1-way versus 2-way
  - deterministic versus randomized
- $C_\delta(f)$: randomized complexity for error probability δ

Streaming & Communication Complexity
- Stream algorithm ⇒ 1-way communication protocol
- Simulation argument
  - Given: algorithm S computing f over streams
  - Alice: initiates S, providing A as input stream prefix
  - Communicates to Bob: S's state after seeing A
  - Bob: resumes S, providing B as input stream suffix
- Theorem: a stream algorithm's space requirement is at least the communication complexity C(f)

Example: Set Disjointness
- Set Disjointness (DIS)
  - A, B subsets of {1,2,…,N}
  - Output: whether A ∩ B = ∅
- Theorem: $C_\delta(\text{DIS}) = \Omega(N)$, for any δ<1/2

Lower Bound for F_∞
- Theorem: Fix ε<1/3, δ<1/2. Any stream algorithm S with $P[(1-\varepsilon) F_\infty < S < (1+\varepsilon) F_\infty] > 1-\delta$ needs Ω(N) space
- Proof
  - Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
  - Alice streams set A to S
  - Communicates S's state to Bob
  - Bob streams set B to S
  - Observe: $F_\infty$ is 1 if A and B are disjoint, and 2 otherwise
  - Relative error ε<1/3 ⇒ DIS solved exactly!
  - P[error] < δ < 1/2 ⇒ Ω(N) space

Extensions
- Observe
  - Used only 1-way communication in proof
  - $C_\delta(\text{DIS})$ bound was for arbitrary communication
  - Exercise: extend lower bound to multi-pass algorithms
- Lower bound for $F_k$, k>2
  - Need to increase gap beyond 2
  - Multiparty Set Disjointness: t players
  - Theorem: Fix ε, δ < … and k > 5. Any stream algorithm S with $P[(1-\varepsilon) F_k < S < (1+\varepsilon) F_k] > 1-\delta$ needs $\Omega(N^{1-(2+\delta)/k})$ space
  - Implies $\Omega(N^{1/2})$ even for multi-pass algorithms

Tracking High-Frequency Items

Problem 1: Top-K List [Charikar-Chen-Farach-Colton]
- The Google Problem
  - Return list of k most frequent items in stream
- Motivation
  - search engine queries, network traffic, …
- Remember
  - Saw lower bound recently!
- Solution
  - Data structure Count-Sketch ⇒ maintaining count-estimates of high-frequency elements

Definitions
- Notation
  - Assume {1, 2, …, N} in order of frequency
  - $m_i$ is frequency of i-th most frequent element
  - $m = \sum m_i$ is number of elements in stream
- FindCandidateTop
  - Input: stream S, int k, int p
  - Output: list of p elements containing top k
  - Naive sampling gives solution with $p = O(m \log k / m_k)$
- FindApproxTop
  - Input: stream S, int k, real ε
  - Output: list of k elements, each of frequency $m_i > (1-\varepsilon) m_k$
  - Naive sampling gives no solution

Main Idea
- Consider
  - single counter X
  - hash function $h(i): \{1, 2, \dots, N\} \to \{-1, +1\}$
- Input element i ⇒ update counter $X \mathrel{+}= Z_i = h(i)$
- For each r, use $X Z_r$ as estimator of $m_r$
- Theorem: $E[X Z_r] = m_r$
- Proof
  - $X = \sum_i m_i Z_i$
  - $E[X Z_r] = E[\sum_i m_i Z_i Z_r] = \sum_i m_i E[Z_i Z_r] = m_r E[Z_r^2] = m_r$
  - Cross-terms cancel

Finding Max Frequency Element
- Problem: $var[X] = F_2 = \sum_i m_i^2$
- Idea: t counters, independent 4-wise hashes $h_1, \dots, h_t$, each $h_r: i \to \{+1, -1\}$
- Use $t = O(\log m \cdot \sum m_i^2 / (\varepsilon m_1)^2)$
- Claim: new variance $< \sum m_i^2 / t = (\varepsilon m_1)^2 / \log m$
- Overall estimator
  - repeat + median of averages
  - with high probability, approximate $m_1$

Problem with "Array of Counters"
- Variance: dominated by highest frequency
- Estimates for less-frequent elements like k
  - corrupted by higher frequencies
  - variance >> $m_k$
- Avoiding collisions?
  - spread out high frequency elements
  - replace each counter with hashtable of b counters

Count Sketch
- Hash functions
  - 4-wise independent hashes $h_1, \dots, h_t$ and $s_1, \dots, s_t$
  - hashes independent of each other
  - $s_r: i \to \{1, \dots, b\}$ and $h_r: i \to \{+1, -1\}$, for r = 1, …, t
- Data structure: t hashtables of counters X(r,c), each with buckets 1, 2, …, b

Overall Algorithm
- $s_r(i)$: one of b counters in r-th hashtable
- Input i ⇒ for each r, update $X(r, s_r(i)) \mathrel{+}= h_r(i)$
- Estimator($m_i$) = $\text{median}_r \{ X(r, s_r(i)) \, h_r(i) \}$
- Maintain heap of k top elements seen so far
- Observe
  - Collisions with high-frequency items are not completely eliminated
  - A few of the estimates $X(r, s_r(i)) \, h_r(i)$ could have high variance
  - Median is not sensitive to these poor estimates
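A compact sketch of the data structure and its update and estimate rules. The class `CountSketch` is illustrative: both hash families are built from random cubic polynomials over a prime field (an assumed construction for 4-wise independence, with the low-order bit used as the sign), buckets are numbered from 0 rather than 1, items are assumed to be integers, and the heap of top-k candidates is omitted.

```python
import random
from statistics import median

P = (1 << 61) - 1   # prime field size (assumed choice)

class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.X = [[0] * b for _ in range(t)]      # t hashtables of b counters
        self.h_coef = [[rng.randrange(P) for _ in range(4)] for _ in range(t)]
        self.s_coef = [[rng.randrange(P) for _ in range(4)] for _ in range(t)]

    @staticmethod
    def _poly(coef, i):
        v = 0
        for c in coef:                            # Horner evaluation mod P
            v = (v * i + c) % P
        return v

    def _sign(self, r, i):                        # h_r(i) in {-1, +1}
        return 1 if self._poly(self.h_coef[r], i) & 1 else -1

    def _bucket(self, r, i):                      # s_r(i) in {0, ..., b-1}
        return self._poly(self.s_coef[r], i) % self.b

    def update(self, i, a=1):
        for r in range(self.t):
            self.X[r][self._bucket(r, i)] += a * self._sign(r, i)

    def estimate(self, i):
        return median(self.X[r][self._bucket(r, i)] * self._sign(r, i)
                      for r in range(self.t))
```

For instance, after `cs.update(7, 1000)` followed by 100 single updates to other items, `cs.estimate(7)` stays within 100 of 1000 regardless of how the hashes fall, since each colliding item perturbs a counter by at most 1.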

Avoiding Large Items
- b > O(k) ⇒ with probability Ω(1), no collision with top-k elements
- t hashtables represent independent trials
- Need log m/δ trials to estimate with probability 1-δ
- Also need: small variance for colliding small elements
- Claim: $P[\text{variance due to small items in each estimate} < (\sum_{i>k} m_i^2)/b] = \Omega(1)$
- Final bound $b = O(k + \sum_{i>k} m_i^2 / (\varepsilon m_k)^2)$

Final Results
- Zipfian distribution: $m_i \propto 1/i^{\alpha}$ [Power Law]
- FindApproxTop
  - $[k + (\sum_{i>k} m_i^2) / (\varepsilon m_k)^2] \log m/\delta$
  - Roughly: sampling bound with frequencies squared
  - Zipfian: gives improved results
- FindCandidateTop
  - Zipf parameter 0.5
  - O(k log N log m)
  - Compare: sampling bound $O((kN)^{0.5} \log k)$

Problem 2: Elephants-and-Ants [Manku-Motwani]
- Identify items whose current frequency exceeds support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]

Algorithm 1: Lossy Counting
- Step 1: Divide the stream into "windows" (Window 1, Window 2, Window 3, ...)
- Window size w is a function of support s (specified later)

Lossy Counting in Action
- Start with an empty set of counters and count the items of the first window
- At window boundary, decrement all counters by 1

Lossy Counting (continued)
- Add the items of the next window to the surviving counters
- At window boundary, decrement all counters by 1

Error Analysis
- How much do we undercount?
  - If current size of stream = N and window size w = 1/ε, then frequency error ≤ # windows = εN
- Rule of thumb: set ε = 10% of support s
- Example: given support frequency s = 1%, set error frequency ε = 0.1%

Putting it all together…
- Output: elements with counter values exceeding (s-ε)N
- Approximation guarantees
  - Frequencies underestimated by at most εN
  - No false negatives
  - False positives have true frequency at least (s-ε)N
- How many counters do we need?
  - Worst case bound: (1/ε) log εN counters
  - Implementation details…
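The window-based version of Lossy Counting described on these slides might be sketched as below. The function names are illustrative, the window size w = 1/ε is passed as an integer, and counters that reach zero are dropped (an assumption consistent with the counter bound, though the slides do not spell it out).

```python
def lossy_count(stream, w):
    """Count items; at each window boundary decrement every counter by 1."""
    counters = {}
    for t, x in enumerate(stream, start=1):
        counters[x] = counters.get(x, 0) + 1
        if t % w == 0:                                   # window boundary
            counters = {y: c - 1 for y, c in counters.items() if c > 1}
    return counters

def frequent_items(counters, s, eps, N):
    """Items whose counter exceeds (s - eps) * N."""
    return [x for x, c in counters.items() if c > (s - eps) * N]

# 'a' occupies every other position; the remaining items are all distinct.
stream = ['a' if j % 2 == 0 else j for j in range(100)]
counters = lossy_count(stream, w=10)                     # eps = 1/w = 0.1
print(counters)                                          # {'a': 40}
print(frequent_items(counters, s=0.3, eps=0.1, N=100))   # ['a']
```

Here the true frequency of 'a' is 50 and its counter is 40, an undercount of exactly εN = 10, the worst case allowed by the guarantee.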

Number of Counters?
- Window size w = 1/ε
- Number of windows m = εN
- $n_i$: # counters alive over last i windows
- Fact: [expression missing]
- Claim: [expression missing]
  - Counter must average 1 increment/window to survive
- ⇒ # active counters ≤ (1/ε) log εN

Enhancements
- Frequency errors
  - For counter (X, c), true frequency in [c, c+εN]
- Trick: track number of windows t the counter has been active
  - For counter (X, c, t), true frequency in [c, c+t-1]
  - If t = 1, no error!
- Batch processing
  - Decrements after k windows

Algorithm 2: Sticky Sampling
- Create counters by sampling
- Maintain exact counts thereafter
- What is sampling rate?
- [Figure: stream with items 34, 15, 30, 28, 31, 41, 23, 35, 19]

Sticky Sampling (continued)
- For finite stream of length N: sampling rate = $(2/(\varepsilon N)) \log(1/(\delta s))$, where δ = probability of failure
- Same rule of thumb: set ε = 10% of support s
- Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
- Output: elements with counter values exceeding (s-ε)N
- Same error guarantees as Lossy Counting, but probabilistic
- Approximation guarantees (probabilistic)
  - Frequencies underestimated by at most εN
  - No false negatives
  - False positives have true frequency at least (s-ε)N

Number of Counters?
- Finite stream of length N
  - Sampling rate: $(2/(\varepsilon N)) \log(1/(\delta s))$
- Infinite stream with unknown N
  - Gradually adjust sampling rate
- In either case, expected number of counters = $(2/\varepsilon) \log(1/(\delta s))$, independent of N
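For the finite-stream case, Sticky Sampling can be sketched as follows. The function name `sticky_sample` is hypothetical, the logarithm is taken to be natural (an assumption), and the rate is capped at 1 for short streams. The gradual rate adjustment for unknown N is not shown.

```python
import math
import random

def sticky_sample(stream, N, s, eps, delta, rng=None):
    """Create counters by sampling; count exactly once a counter exists."""
    rng = rng or random.Random()
    rate = min(1.0, (2.0 / (eps * N)) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1                 # exact counting after creation
        elif rng.random() < rate:
            counts[x] = 1                  # counter created by sampling
    return counts

def frequent_items(counts, s, eps, N):
    """Items whose counter exceeds (s - eps) * N."""
    return [x for x, c in counts.items() if c > (s - eps) * N]
```

Since each stream item creates a counter with probability at most `rate`, the expected number of counters is at most `rate * N`, which equals $(2/\varepsilon) \log(1/(\delta s))$ and does not depend on N.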

