
1 CS 361A (Advanced Data Structures and Algorithms) – Lectures 16 & 17 (Nov 16 and 28, 2005): Synopses, Samples, and Sketches – Rajeev Motwani

2 Game Plan for Week
- Last Class: models for streaming/massive data sets; negative results for exact distinct values; hashing for approximate distinct values
- Today: synopsis data structures; sampling techniques; the frequency moments problem; sketching techniques; finding high-frequency items

3 Synopsis Data Structures
- Synopses: Webster – a condensed statement or outline (as of a narrative or treatise); CS 361A – a succinct data structure that lets us answer queries efficiently
- Synopsis data structures: a "lossy" summary (of a data stream); advantages – fits in memory and is easy to communicate; disadvantage – lossiness implies approximation error; negative results tell us the best we can do; key techniques – randomization and hashing

4 Numerical Examples
- Approximate query processing [AQUA/Bell Labs]: database size – 420 MB; synopsis size – 420 KB (0.1%); approximation error – within 10%; running time – 0.3% of the time for the exact query
- Histograms/quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]: data size – 10^9 items; synopsis size – 1249 items; approximation error – within 1%

5 Synopses
- Desiderata: small memory footprint; quick update and query; provable, low-error guarantees; composable – for distributed scenarios
- Applicability? General-purpose – e.g. random samples; special-purpose – e.g. distinct-values estimators
- Granularity? Per database – e.g. a sample of an entire table; per distinct value – e.g. customer profiles; structural – e.g. samples of GROUP-BY or JOIN results

6 Examples of Synopses
- Synopses need not be fancy! Simple aggregates – e.g. mean/median/max/min; variance?
- Random samples: aggregates on small samples represent the entire data; leverage extensive work on confidence intervals
- Random sketches: "structured" samples
- Tracking high-frequency items

7 Random Samples

8 Types of Samples
- Oblivious sampling – at the item level; limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]
- Value-based sampling – e.g. distinct-value samples
- Structured samples – e.g. join sampling: naive approach – keep samples of each relation; problem – sample-of-join ≠ join-of-samples; foreign-key joins [Chaudhuri-Motwani-Narasayya SIGMOD 99]; what if A is sampled from L and B from R?
[Figure: join of relations L and R on attributes A and B]

9 Basic Scenario
- Goal: maintain a uniform sample of an item stream
- Sampling semantics? Coin flip – select each item with probability p; easy to maintain; undesirable – the sample size is unbounded. Fixed-size sample without replacement – our focus today. Fixed-size sample with replacement – show that it can be generated from the previous sample.
- Non-uniform samples [Chaudhuri-Motwani-Narasayya]

10 Reservoir Sampling [Vitter]
- Input – stream of items X_1, X_2, X_3, …
- Goal – maintain a uniform random sample S of size n (without replacement) of the stream so far
- Reservoir sampling: initialize by including the first n elements in S; upon seeing item X_t, add X_t to S with probability n/t, and if added, evict a random previous item. (A minimal code sketch follows.)
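A minimal Python sketch of this loop (the function and variable names are illustrative, not from the slides):

```python
import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample of size n over a stream (Vitter)."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            sample.append(x)                    # the first n items always enter S
        elif random.random() < n / t:           # include X_t with probability n/t
            sample[random.randrange(n)] = x     # evict a random previous item
    return sample

# Example: a uniform sample of 3 items from a stream of 1000
print(reservoir_sample(range(1000), 3))
```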

11 Analysis
- Correctness? Fact: at each instant, |S| = n. Theorem: at time t, any X_i is in S with probability n/t. Exercise – prove this by induction on t.
- Efficiency? Let N be the stream size. Naive implementation – N coin flips, hence time O(N). Remark: verify whether this is optimal.

12 Improving Efficiency
- Random variable J_t – the number of items jumped over after time t
- Idea – generate J_t directly and skip that many items
- Cumulative distribution function – F(s) = P[J_t ≤ s], for t > n and s ≥ 0
[Figure: items X_1, …, X_14 with those inserted into the sample S marked (n = 3); example skips J_3 = 2 and J_9 = 4]

13 Analysis
- Number of calls to RANDOM()? One per insertion into the sample – this is optimal!
- Generating J_t? Pick a random number U ∈ [0,1] and find the smallest j such that U ≤ F(j). How? Linear scan → O(N) time; binary search with Newton's interpolation → O(n^2 (1 + polylog N/n)) time. (A linear-scan sketch follows.)
- Remark – see the paper for the optimal algorithm
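A linear-scan sketch of generating the skip J_t by inverting this CDF; the closed form used for F is my reconstruction from the definition of J_t (items t+1, …, t+s are all skipped and item t+s+1 is included), not copied from the slide:

```python
import random

def next_skip(t, n):
    """Generate J_t, the number of items skipped after time t, by a linear scan.
    Assumes t >= n and F(s) = P[J_t <= s] = 1 - prod_{i=t+1}^{t+s+1} (1 - n/i)."""
    u = random.random()
    s = 0
    tail = 1.0                       # P[items t+1 .. t+s+1 are all skipped]
    while True:
        tail *= 1.0 - n / (t + s + 1)
        if u <= 1.0 - tail:          # smallest s with F(s) >= u
            return s
        s += 1
```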

14 Sampling over Sliding Windows [Babcock-Datar-Motwani]
- Sliding window W – the last w items in the stream
- Model – item X_t expires at time t + w
- Why? Applications may require ignoring stale data; it is a type of approximation; it is the only way to define JOIN over streams
- Goal – maintain a uniform sample of size n of the sliding window

15 Reservoir Sampling?
- Observe: any item in the sample S will eventually expire and must be replaced with a random item of the current window
- Problem: no access to the items in W − S; storing the entire window requires O(w) memory
- Oversampling: backing sample B – select each item independently with a suitable fixed probability; sample S – select n items from B at random; upon an expiry in S, replenish from B
- Claim: n < |B| < n log w with high probability

16 Index-Set Approach
- Pick a random index set I = {i_1, …, i_n} ⊆ {0, 1, …, w−1}
- Sample S – the items X_i with (i mod w) ∈ I in the current window
- Example: suppose w = 2, n = 1, and I = {1}; then the sample is always the X_i with odd i
- Memory – only O(n)
- Observe: S is a uniform random sample of each window, but the sample is periodic (a union of arithmetic progressions), so there is correlation across successive windows
- Problems: correlation may hurt in some applications; some data (e.g. time series) may itself be periodic
(A small code sketch follows.)
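A small sketch of the index-set sampler, assuming the window length w and an index set I of size n as above (class and method names are illustrative):

```python
import random

class IndexSetSampler:
    """Keep the items whose arrival position mod w falls in a fixed random
    index set I of size n; this is the 'periodic' sample of the sliding window."""
    def __init__(self, w, n):
        self.w = w
        self.index_set = set(random.sample(range(w), n))
        self.slots = {}                      # index in I -> latest item at that position

    def insert(self, t, item):
        """Process the item arriving at time t (0-indexed)."""
        pos = t % self.w
        if pos in self.index_set:
            self.slots[pos] = item           # overwrites the expired item of the previous window

    def current_sample(self):
        return list(self.slots.values())
```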

17 Chain-Sample Algorithm
- Idea: fix the expiry problem of reservoir sampling by planning ahead for the expiry of sampled items; focus on sample size 1 and keep n independent such samples
- Chain-sampling: add X_t to S with probability 1/min{t, w}, evicting the earlier sample (up to time w this is just standard reservoir sampling). Pre-select X_t's replacement X_r with r uniform in {t+1, …, t+w}: when X_t expires, it must be replaced from W_{t+w} = {X_{t+1}, …, X_{t+w}}; at time r, save X_r and pre-select its own replacement, building a "chain" of potential replacements.
- Note – when evicting the earlier sample, discard its chain as well. (A code sketch follows the example slide.)

18 Example
[Figure: a worked example of chain-sampling on a short numeric stream, showing the sampled item and its chain of pre-selected replacements]
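A sketch of chain-sampling for a single sample, following the description two slides back; run n independent copies for a sample of size n. Details such as the generator interface are illustrative:

```python
import random
from collections import deque

def chain_sample(stream, w):
    """One chain-sample over a sliding window of the last w items
    [Babcock-Datar-Motwani]; yields the current sample after each arrival."""
    chain = deque()      # (arrival time, item); the head is the current sample
    next_pick = None     # pre-selected arrival time of the next link in the chain
    for t, x in enumerate(stream, start=1):
        while chain and chain[0][0] <= t - w:
            chain.popleft()                        # expired head: its stored successor takes over
        if t == next_pick:
            chain.append((t, x))                   # save the pre-selected replacement
            next_pick = random.randint(t + 1, t + w)   # and pre-select its own successor
        if random.random() < 1.0 / min(t, w):
            chain.clear()                          # new sample evicts the old one and its chain
            chain.append((t, x))
            next_pick = random.randint(t + 1, t + w)
        yield chain[0][1]
```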

19 Expectation for Chain-Sample
- T(x) = E[chain length for X_t at time t + x]
- E[chain length] = T(w) ≈ e ≈ 2.718
- E[memory required for sample size n] = O(n)

20 Tail Bound for Chain-Sample
- A chain consists of "hops" of total length at most w
- A chain of h hops corresponds to an ordered (h+1)-partition of w: h hops of total length less than w, plus the remainder
- Each such partition has probability w^{-h}
- The number of such partitions is at most w^h / h!, so for h = Θ(log w) the probability of any particular chain length h is O(w^{-c})
- Thus – memory O(n log w) with high probability

21 Comparison of Algorithms
- Chain-Sample beats Oversample: expected memory O(n) vs O(n log w); high-probability memory bound – both O(n log w); moreover, Oversample's sample size may shrink below n!

  Algorithm     | Expected memory | High-probability memory
  Periodic      | O(n)            | O(n)
  Oversample    | O(n log w)      | O(n log w)
  Chain-Sample  | O(n)            | O(n log w)

22 Sketches and Frequency Moments

23 Generalized Stream Model
- Input element (i, a): a copies of domain value i, i.e. an increment of a to the ith coordinate of the frequency vector m; a need not be an integer
- Negative values of a capture deletions
- Example data stream: 2, 0, 1, 3, 1, 2, 4, …
[Figure: resulting frequency vector (m_0, m_1, m_2, m_3, m_4) = (1, 2, 2, 1, 1)]

24 Example
- Starting from (m_0, …, m_4) = (1, 2, 2, 1, 1): on seeing element (i, a) = (2, 2), m_2 grows from 2 to 4, giving (1, 2, 4, 1, 1)
- On then seeing element (i, a) = (1, −1), m_1 drops from 2 to 1, giving (1, 1, 4, 1, 1)

25 Frequency Moments
- Input: a stream of values from U = {0, 1, …, N−1} with frequency vector m = (m_0, m_1, …, m_{N−1})
- kth frequency moment: F_k(m) = Σ_i m_i^k
- F_0: number of distinct values (Lecture 15); F_1: stream size; F_2: Gini index, self-join size, squared Euclidean norm; F_k for k > 2: measures skew, sometimes useful; F_∞: maximum frequency
- Problem – estimation in small space
- Sketches – randomized estimators

26 Naive Approaches
- Space N – one counter m_i for each distinct value i
- Space O(1) if the input is sorted by i – a single counter, recycled whenever a new value of i appears
- Goal: allow arbitrary input; use small (logarithmic) space; settle for randomization/approximation

27 Sketching F_2
- Random hash h(i): {0, 1, …, N−1} → {−1, +1}; define Z_i = h(i)
- Maintain X = Σ_i m_i Z_i – easy for update streams (i, a): just add a·Z_i to X
- Claim: X^2 is an unbiased estimator for F_2.
  Proof: E[X^2] = E[(Σ_i m_i Z_i)^2] = E[Σ_i m_i^2 Z_i^2] + E[Σ_{i≠j} m_i m_j Z_i Z_j] = Σ_i m_i^2 E[Z_i^2] + Σ_{i≠j} m_i m_j E[Z_i] E[Z_j] = Σ_i m_i^2 + 0 = F_2
- Last line? Z_i^2 = 1 and E[Z_i] = 0 since Z_i is uniform on {−1, +1}, and E[Z_i Z_j] = E[Z_i] E[Z_j] by independence
(A minimal code sketch of this estimator follows.)
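A minimal Python sketch of this F_2 estimator; storing the signs in a dict stands in for a proper 4-wise independent hash family and is purely illustrative:

```python
import random

class F2Sketch:
    """Single counter X = sum_i m_i * Z_i with Z_i uniform in {-1, +1};
    X**2 is an unbiased estimate of F_2 = sum_i m_i**2."""
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.sign = {}               # stand-in for a 4-wise independent hash h(i)
        self.x = 0.0

    def update(self, i, a=1):
        """Process stream element (i, a): add a copies of value i."""
        if i not in self.sign:
            self.sign[i] = self.rng.choice((-1, 1))
        self.x += a * self.sign[i]

    def estimate(self):
        return self.x ** 2

sk = F2Sketch(seed=1)
for v in [2, 0, 1, 3, 1, 2, 4]:      # the example stream from the earlier slide
    sk.update(v)
print(sk.estimate())                 # one (noisy) estimate of F_2; combine many copies in practice
```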

28 Estimation Error?
- Chebyshev bound: P[|Y − E[Y]| ≥ λ E[Y]] ≤ Var[Y] / (λ^2 E[Y]^2)
- Define Y = X^2; then E[Y] = E[X^2] = Σ_i m_i^2 = F_2
- Observe E[X^4] = E[(Σ_i m_i Z_i)^4] = E[Σ m_i^4 Z_i^4] + 4E[Σ m_i m_j^3 Z_i Z_j^3] + 6E[Σ m_i^2 m_j^2 Z_i^2 Z_j^2] + 12E[Σ m_i m_j m_k^2 Z_i Z_j Z_k^2] + 24E[Σ m_i m_j m_k m_l Z_i Z_j Z_k Z_l] = Σ m_i^4 + 6 Σ_{i<j} m_i^2 m_j^2 (every term containing an odd power of some Z_i has zero expectation)
- By definition Var[Y] = E[Y^2] − E[Y]^2 = E[X^4] − E[X^2]^2 = [Σ m_i^4 + 6 Σ_{i<j} m_i^2 m_j^2] − [Σ m_i^4 + 2 Σ_{i<j} m_i^2 m_j^2] = 4 Σ_{i<j} m_i^2 m_j^2 ≤ 2 E[X^2]^2 = 2 F_2^2
- Why the last inequality? Because 4 Σ_{i<j} m_i^2 m_j^2 ≤ 2 (Σ_i m_i^2)^2

29 Estimation Error?
- Chebyshev bound: P[relative estimation error > λ] ≤ Var[X^2] / (λ^2 F_2^2) ≤ 2/λ^2
- Problem – what if we want λ really small?
- Solution: compute s = 8/λ^2 independent copies of X and use the estimator Y = mean(X_i^2); the variance shrinks by a factor of s, so P[relative estimation error > λ] ≤ 2/(s λ^2) = 1/4

30 Boosting Technique
- Algorithm A: a randomized λ-approximate estimator f with P[(1−λ) f* ≤ f ≤ (1+λ) f*] ≥ 3/4
- Heavy-tail problem: e.g. the estimate may take values f*−z, f*, f*+z with probabilities 1/16, 3/4, 3/16 – the mean can be pulled far off by the tail even though most estimates are fine
- Boosting idea: take O(log 1/ε) independent estimates from A and return their median
- Claim: P[median is λ-approximate] > 1 − ε.
  Proof: P[a specific estimate is λ-approximate] = 3/4; the bad event occurs only if more than 50% of the estimates are not λ-approximate; by a binomial tail bound, this has probability less than ε
(A small boosting sketch follows.)
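Combining the last two slides, a small sketch of the mean-then-median boosting; `basic_estimator` is a placeholder for any zero-argument callable returning one basic estimate, e.g. one X^2 value:

```python
import statistics

def boosted_estimate(basic_estimator, s, trials):
    """Average s independent basic estimates to shrink the variance, repeat for
    `trials` = O(log 1/eps) independent groups, and return the median of the means."""
    group_means = [
        statistics.mean(basic_estimator() for _ in range(s))
        for _ in range(trials)
    ]
    return statistics.median(group_means)
```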

31 Overall Space Requirement
- Observe: let m = Σ_i m_i; each counter X needs O(log m) bits; there are s = 8/λ^2 copies per estimator and O(log 1/ε) estimators
- Total: O(λ^{-2} (log 1/ε)(log m)) bits
- Question – how much space is needed to store the hash functions?

32 Sketching Paradigm
- Random sketch: the inner product of the frequency vector m = (m_0, m_1, …, m_{N−1}) with a random vector Z (so far, uniform over {−1, +1})
- Observe linearity: Sketch(m_1) ± Sketch(m_2) = Sketch(m_1 ± m_2) – ideal for distributed computing
- Observe: if, given i, we can efficiently generate Z_i, then we can maintain the sketch for update streams
- Problem: we must generate Z_i = h(i) on the first appearance of i, but storing h explicitly needs Ω(N) memory and Ω(N) random bits

33 Two Birds, One Stone
- Pairwise independence of Z_1, Z_2, …, Z_n: for all Z_i and Z_k, P[Z_i = x, Z_k = y] = P[Z_i = x]·P[Z_k = y]; this gives the property E[Z_i Z_k] = E[Z_i]·E[Z_k]
- Example – linear hash function: seed (a, b) drawn from [0 .. p−1], where p is prime; Z_i = h(i) = ai + b (mod p)
- Claim: Z_1, Z_2, …, Z_n are pairwise independent. If Z_i = x and Z_k = y then x = ai + b (mod p) and y = ak + b (mod p); fixing i, k, x, y yields a unique solution for (a, b), so P[Z_i = x, Z_k = y] = 1/p^2 = P[Z_i = x]·P[Z_k = y]
- Memory/randomness: from n log p down to 2 log p
(A small code sketch of this hash family follows.)
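A small sketch of this linear hash family; the extra reduction to a ±1 sign is my addition for illustration (for the F_2 variance bound one would use a 4-wise independent family, e.g. a random degree-3 polynomial mod p):

```python
import random

class LinearHash:
    """Pairwise-independent hash h(i) = (a*i + b) mod p with seed (a, b)."""
    def __init__(self, p=2_147_483_647, rng=random):   # p must be prime (here 2^31 - 1)
        self.p = p
        self.a = rng.randrange(p)    # seed drawn uniformly from [0 .. p-1]
        self.b = rng.randrange(p)

    def __call__(self, i):
        return (self.a * i + self.b) % self.p

    def sign(self, i):
        """Reduce the hash value to {-1, +1} (illustrative extra step)."""
        return 1 if self(i) % 2 == 0 else -1
```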

34 Wait a Minute!
- Doesn't pairwise independence screw up the proofs?
- No – the E[X^2] calculation has only degree-2 terms
- But what about Var[X^2]? That calculation has degree-4 terms
- Need 4-wise independence

35 Application – Join-Size Estimation
- Given the join-attribute frequency vectors f_1 and f_2, the join size is f_1 · f_2
- Define X_1 = f_1 · Z and X_2 = f_2 · Z
- Choose Z to be 4-wise independent and uniform over {−1, +1}
- Exercise: show, as before, that E[X_1 X_2] = f_1 · f_2 and Var[X_1 X_2] ≤ 2 |f_1|^2 |f_2|^2. Hint: a·b ≤ |a|·|b|

36 Bounding Error Probability
- Using s copies of the X's and taking their mean Y:
  P[|Y − f_1·f_2| ≥ λ f_1·f_2] ≤ Var(Y) / (λ^2 (f_1·f_2)^2) ≤ 2 |f_1|^2 |f_2|^2 / (s λ^2 (f_1·f_2)^2) = 2 / (s λ^2 cos^2 θ)
- Bounding the error probability requires s > 2 / (λ^2 cos^2 θ)
- Memory? O((log 1/ε) · cos^{-2} θ · λ^{-2} · (log N + log m))
- Problem: choosing s needs an a-priori lower bound on cos θ = f_1·f_2 / (|f_1| |f_2|); what if cos θ is really small?

37 Sketch Partitioning
- Idea for dealing with the |f_1|^2 |f_2|^2 / (f_1·f_2)^2 issue – partition the domain into regions where the self-join sizes are small enough to compensate for a small join size (cos θ)
[Figure: frequency distributions over dom(R1.A) and dom(R2.B). With one sketch, self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K; partitioning the domain into two regions and sketching each separately gives 200 × 5 + 200 × 5 = 2K.]

38 Sketch Partitioning
- Idea: intelligently partition the join-attribute space (this needs coarse statistics on the stream) and build independent sketches for each partition
- Estimate = Σ of the partition sketches
- Variance = Σ of the partition variances

39 Sketch Partitioning
- Partition space allocation? Can be solved optimally, given the domain partition
- Optimal partition: find the K-partition that minimizes the total variance of the combined estimator
- Results: dynamic programming gives the optimal solution for a single join; the problem is NP-hard for queries with multiple joins

40 F_k for k > 2
- Assume the stream length m is known. (Exercise: show this assumption can be removed with log m space overhead via a repeated-doubling estimate of m.)
- Choose a random stream position p, uniform over {1, 2, …, m}; suppose a_p = v ∈ {0, 1, …, N−1}
- Count the subsequent frequency of v: r = |{q : q ≥ p, a_q = v}|
- Define X = m (r^k − (r−1)^k). (A code sketch follows the example on the next slide.)

41 Example
- Stream: 7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8
- m = 20; p = 9; a_p = 5; r = 3 (the value 5 occurs at positions 9, 11, and 14), so X = 20 (3^k − 2^k)
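A minimal sketch of this F_k estimator for a stored stream; a true streaming version would pick the position p on the fly (e.g. by reservoir sampling), so this offline form is only meant to make the definition concrete:

```python
import random

def fk_basic_estimate(stream, k):
    """One sample of the estimator X = m * (r**k - (r-1)**k); E[X] = F_k.
    Average many copies and take medians to control the error."""
    m = len(stream)
    p = random.randrange(m)                              # uniform position in the stream
    v = stream[p]
    r = sum(1 for q in range(p, m) if stream[q] == v)    # frequency of v from p onward
    return m * (r ** k - (r - 1) ** k)

# One noisy sample of an F_3 estimate on the example stream above
stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]
print(fk_basic_estimate(stream, k=3))
```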

42 F_k for k > 2
- Var(X) ≤ k N^{1−1/k} F_k^2 (obtained by summing over the m equally likely choices of the stream position)
- Bounded error probability → s = O(k N^{1−1/k} / λ^2) copies
- Boosting → memory bound O(k N^{1−1/k} λ^{-2} (log 1/ε)(log N + log m))

43 Frequency Moments
- F_0 – the distinct-values problem (Lecture 15)
- F_1 – sequence length; for the case with deletions, use the Cauchy distribution
- F_2 – self-join size / Gini index (today)
- F_k for k > 2 – omitting grungy details, one can achieve the space bound O(k N^{1−1/k} λ^{-2} (log 1/ε)(log N + log m))
- F_∞ – maximum frequency

44 Communication Complexity
- Alice (input A) and Bob (input B) cooperatively compute a function f(A, B), minimizing the bits communicated; computational power is unbounded
- Communication complexity C(f) – bits exchanged by the optimal protocol Π
- Protocols? 1-way versus 2-way; deterministic versus randomized
- C_δ(f) – randomized communication complexity for error probability δ

45 Streaming and Communication Complexity
- A stream algorithm yields a 1-way communication protocol
- Simulation argument: given an algorithm S computing f over streams, Alice initiates S, feeding A as the prefix of the input stream; she communicates S's state after seeing A to Bob; Bob resumes S, feeding B as the suffix of the input stream
- Theorem – a stream algorithm's space requirement is at least the communication complexity C(f)

46 Example: Set Disjointness
- Set Disjointness (DIS): A and B are subsets of {1, 2, …, N}; the output is 1 if A ∩ B ≠ ∅ and 0 otherwise
- Theorem: C_δ(DIS) = Ω(N), for any δ < 1/2

47 Lower Bound for F_∞
Theorem: Fix ε < 1/3 and δ < 1/2. Any stream algorithm S whose estimate F̂ satisfies P[(1−ε) F_∞ ≤ F̂ ≤ (1+ε) F_∞] > 1−δ needs Ω(N) space.
Proof:
- Claim: S gives a 1-way protocol for DIS (on any sets A and B) – Alice streams the elements of A into S, communicates S's state to Bob, and Bob streams the elements of B into S
- Observe: F_∞ = 1 if A and B are disjoint and F_∞ = 2 otherwise, so a relative error ε < 1/3 distinguishes the two cases – DIS is solved exactly!
- The protocol errs with probability < 1/2, so it must exchange C_δ(DIS) = Ω(N) bits, hence S needs Ω(N) space

48 Extensions
- Observe: the proof used only 1-way communication, while the C_δ(DIS) bound holds for arbitrary communication. Exercise – extend the lower bound to multi-pass algorithms.
- Lower bound for F_k, k > 2: need to increase the frequency gap beyond 2; use Multiparty Set Disjointness with t players
- Theorem: Fix ε and δ. Any stream algorithm S whose estimate F̂ satisfies P[(1−ε) F_k ≤ F̂ ≤ (1+ε) F_k] > 1−δ needs Ω(N^{1−(2+δ)/k}) space
- This implies Ω(N^{1/2}) space even for multi-pass algorithms

49 Tracking High-Frequency Items

50 Problem 1 – Top-K List [Charikar-Chen-Farach-Colton]
- The Google problem: return a list of the k most frequent items in the stream
- Motivation: search-engine queries, network traffic, …
- Remember – we saw a lower bound for this recently!
- Solution: the Count-Sketch data structure, which maintains count estimates of the high-frequency elements

51 Definitions
- Notation: assume {1, 2, …, N} is ordered by frequency; m_i is the frequency of the ith most frequent element; m = Σ_i m_i is the number of elements in the stream
- FindCandidateTop: input – stream S, int k, int p; output – a list of p elements containing the top k. Naive sampling gives a solution with p = O(m log k / m_k)
- FindApproxTop: input – stream S, int k, real ε; output – a list of k elements, each of frequency m_i > (1−ε) m_k. Naive sampling gives no solution

52 Main Idea
- Consider a single counter X and a hash function h(i): {1, 2, …, N} → {−1, +1}
- On input element i, update the counter: X += Z_i, where Z_i = h(i)
- For each r, use X·Z_r as an estimator of m_r
- Theorem: E[X Z_r] = m_r.
  Proof: X = Σ_i m_i Z_i, so E[X Z_r] = E[Σ_i m_i Z_i Z_r] = Σ_i m_i E[Z_i Z_r] = m_r E[Z_r^2] = m_r, since the cross terms cancel

53 Finding the Max-Frequency Element
- Problem – Var[X] = F_2 = Σ_i m_i^2
- Idea – use t counters with independent 4-wise hashes h_1, …, h_t, each mapping i to {+1, −1}
- Use t = O(log m · Σ_i m_i^2 / (ε m_1)^2)
- Claim: the new variance is < Σ_i m_i^2 / t = (ε m_1)^2 / log m
- Overall estimator: repeat, take the median of averages; with high probability this approximates m_1

54 Problem with an "Array of Counters"
- Variance – dominated by the highest frequencies
- Estimates for less-frequent elements, such as the kth, are corrupted by the higher frequencies: variance >> m_k
- Avoiding collisions? Spread out the high-frequency elements – replace each counter with a hashtable of b counters

55 Count Sketch
- Hash functions: 4-wise independent hashes h_1, …, h_t and s_1, …, s_t, all independent of each other, with s_r: i → {1, …, b} and h_r: i → {+1, −1}
- Data structure: t hashtables of b counters each, X(r, c) for r ∈ {1, …, t} and c ∈ {1, …, b}

56 Overall Algorithm
- s_r(i) selects one of the b counters in the rth hashtable
- On input i: for each r, update X(r, s_r(i)) += h_r(i)
- Estimator(m_i) = median_r { X(r, s_r(i)) · h_r(i) }
- Maintain a heap of the top k elements seen so far
- Observe: collisions with high-frequency items are not completely eliminated, so a few of the estimates X(r, s_r(i)) · h_r(i) may have high variance; the median, however, is not sensitive to these poor estimates
(A minimal code sketch follows.)
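A minimal Python sketch of Count-Sketch as described above; Python's salted built-in `hash` stands in for the 4-wise independent families, and the top-k heap is omitted, so this is an illustration rather than the construction from the paper:

```python
import statistics

class CountSketch:
    """t hashtables of b counters; the estimate of m_i is the median over rows r
    of X(r, s_r(i)) * h_r(i)."""
    def __init__(self, t=5, b=1024):
        self.t, self.b = t, b
        self.table = [[0] * b for _ in range(t)]

    def _bucket(self, r, i):
        return hash((r, "bucket", i)) % self.b                  # stand-in for s_r(i)

    def _sign(self, r, i):
        return 1 if hash((r, "sign", i)) % 2 == 0 else -1       # stand-in for h_r(i)

    def update(self, i, count=1):
        for r in range(self.t):
            self.table[r][self._bucket(r, i)] += count * self._sign(r, i)

    def estimate(self, i):
        return statistics.median(
            self.table[r][self._bucket(r, i)] * self._sign(r, i)
            for r in range(self.t)
        )
```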

57 Avoiding Large Items
- b > O(k) → with probability Ω(1), no collision with the top-k elements
- The t hashtables represent independent trials; log(m/δ) trials suffice to estimate with probability 1−δ
- Also needed – small variance from the colliding small elements
- Claim: P[variance due to small items in each estimate ≤ (Σ_{i>k} m_i^2)/b] = Ω(1)
- Final bound: b = O(k + Σ_{i>k} m_i^2 / (ε m_k)^2)

58 Final Results
- Zipfian distribution: m_i ∝ 1/i^z [power law]
- FindApproxTop: space O([k + (Σ_{i>k} m_i^2) / (ε m_k)^2] · log(m/δ)) – roughly the sampling bound with the frequencies squared; Zipfian input gives improved results
- FindCandidateTop: for Zipf parameter 0.5, space O(k log N log m); compare the sampling bound O((kN)^{0.5} log k)

59 Problem 2 – Elephants-and-Ants [Manku-Motwani]
- Identify items in the stream whose current frequency exceeds a support threshold s = 0.1%. [Jacobson 2000, Estan-Verghese 2001]

60 Algorithm 1: Lossy Counting
- Step 1: divide the stream into windows; the window size w is a function of the support s – specified later
[Figure: a stream split into Window 1, Window 2, Window 3, …]

61 Lossy Counting in Action…
- Start with an empty set of counters; within a window, increment a counter for each arriving item, creating the counter if it does not exist
- At each window boundary, decrement all counters by 1
[Figure: counter values before and after a window boundary]

62 Lossy Counting (continued)
- At the window boundary, after decrementing all counters by 1, drop any counter that reaches 0; the surviving counters carry over into the next window
[Figure: counters carried over into the next window]

63 Error Analysis
- How much do we undercount? If the current stream size is N and the window size is w = 1/ε, then the number of windows is εN, so each counter is decremented at most εN times → frequency error ≤ εN
- Rule of thumb: set ε to 10% of the support s. Example: given support frequency s = 1%, set the error frequency ε = 0.1%

64 Putting It All Together…
- Output: elements with counter values exceeding (s − ε)N
- Approximation guarantees: frequencies are underestimated by at most εN; no false negatives; false positives have true frequency at least (s − ε)N
- How many counters do we need? Worst-case bound: (1/ε) log(εN) counters
- Implementation details… (A code sketch follows.)
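A minimal sketch of this windowed Lossy Counting scheme with window size w = 1/ε; the (X, c, t) bookkeeping from the enhancements slide is omitted:

```python
def lossy_counting(stream, epsilon, support):
    """Return items whose counter exceeds (support - epsilon) * N.
    Counts underestimate true frequencies by at most epsilon * N; no false negatives."""
    w = int(1 / epsilon)                       # window size
    counters, n = {}, 0
    for x in stream:
        n += 1
        counters[x] = counters.get(x, 0) + 1
        if n % w == 0:                         # window boundary
            for key in list(counters):
                counters[key] -= 1             # decrement every counter by 1
                if counters[key] == 0:
                    del counters[key]          # drop counters that reach 0
    threshold = (support - epsilon) * n
    return {x: c for x, c in counters.items() if c > threshold}

# Example: items above 1% support with 0.1% error
# heavy = lossy_counting(my_stream, epsilon=0.001, support=0.01)
```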

65 Number of Counters?
- Window size w = 1/ε; number of windows m = εN
- n_i – the number of counters alive over exactly the last i windows
- Fact: a counter alive over the last i windows received at least i increments during those windows, and the last j windows contain only jw items, so Σ_{i≤j} i·n_i ≤ jw for every j
- Claim: a counter must average at least 1 increment per window to survive
- This yields Σ_i n_i ≤ (1/ε) log(εN) active counters

66 Enhancements
- Frequency errors: for a counter (X, c), the true frequency lies in [c, c + εN]
- Trick: track the number of windows t over which the counter has been active; for a counter (X, c, t), the true frequency lies in [c, c + t − 1]
- Batch processing: perform the decrements only every k windows
- If t = 1, there is no error!

67 Algorithm 2: Sticky Sampling
- Create counters by sampling the stream; once an item has a counter, maintain its exact count thereafter
- What is the right sampling rate?
[Figure: a stream with sampled items and their counters]

68 Sticky Sampling (continued)
- For a finite stream of length N, sampling rate = (2/(εN)) · log(1/(sδ)), where δ is the probability of failure
- Same rule of thumb: set ε to 10% of the support s. Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
- Output: elements with counter values exceeding (s − ε)N
- Approximation guarantees (probabilistic) – the same as for Lossy Counting: frequencies underestimated by at most εN; no false negatives; false positives have true frequency at least (s − ε)N

69 Number of Counters?
- Finite stream of length N: sampling rate (2/(εN)) log(1/(sδ))
- Infinite stream with unknown N: gradually adjust the sampling rate
- In either case, the expected number of counters is (2/ε) log(1/(sδ)) – independent of N
(A code sketch follows.)
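A minimal sketch of Sticky Sampling for a finite stream of known length N, using the sampling rate from the previous slide; the gradual rate adjustment for unknown N is omitted:

```python
import math
import random

def sticky_sampling(stream, N, support, epsilon, delta):
    """Sample items into counters, then count sampled items exactly.
    Returns items whose counter exceeds (support - epsilon) * N."""
    rate = (2.0 / (epsilon * N)) * math.log(1.0 / (support * delta))
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1                   # exact counting once a counter exists
        elif random.random() < rate:
            counters[x] = 1                    # create a counter by sampling
    threshold = (support - epsilon) * N
    return {x: c for x, c in counters.items() if c > threshold}
```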

70 References – Synopses
- Synopsis Data Structures for Massive Data Sets. Gibbons and Matias. DIMACS 1999.
- Tracking Join and Self-Join Sizes in Limited Storage. Alon, Gibbons, Matias, and Szegedy. PODS 1999.
- Join Synopses for Approximate Query Answering. Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
- Random Sampling for Histogram Construction: How Much is Enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
- Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
- Space-Efficient Online Computation of Quantile Summaries. Greenwald and Khanna. SIGMOD 2001.

71 References – Sampling
- Random Sampling with a Reservoir. Vitter. ACM Transactions on Mathematical Software 11(1):37-57, 1985.
- On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering, 1999.
- On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
- Congressional Samples for Approximate Answering of Group-By Queries. Acharya, Gibbons, and Poosala. SIGMOD 2000.
- Overcoming Limitations of Sampling for Aggregation Queries. Chaudhuri, Das, Datar, Motwani, and Narasayya. ICDE 2001.
- A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Chaudhuri, Das, and Narasayya. SIGMOD 2001.
- Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
- Sampling Algorithms: Lower Bounds and Applications. Bar-Yossef, Kumar, and Sivakumar. STOC 2001.

72 References – Sketches
- Probabilistic Counting Algorithms for Data Base Applications. Flajolet and Martin. JCSS, 1985.
- The Space Complexity of Approximating the Frequency Moments. Alon, Matias, and Szegedy. STOC 1996.
- Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
- Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
- An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
- Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.

