Data Stream Algorithms: Frequency Moments


Data Stream Algorithms: Frequency Moments
Graham Cormode, graham@research.att.com

Outline
- Introduction to frequency moments and sketches
- Count-Min sketch for F∞ and frequent items
- AMS sketch for F2
- Estimating F0
- Extensions: higher frequency moments; combined frequency moments

Last Time
- Introduced data streams and data stream models
- Focused on a stream defining a frequency distribution
- Sampling to draw a uniform sample from the stream
- Entropy estimation, based on sampling

This Time: Frequency Moments
- Given a stream of updates, let fi be the number of times that item i is seen in the stream
- Define the k'th frequency moment of the stream as Fk = Σi (fi)^k
- "The Space Complexity of Approximating the Frequency Moments" by Alon, Matias and Szegedy (STOC 1996) studied this problem, and was awarded the Gödel Prize in 2005
- It set the pattern for much of the streaming-algorithms work that followed
- Frequency moments are at the core of many streaming problems

Frequency Moments
- F0: counts 1 whenever fi ≠ 0, i.e. the number of distinct items
- F1: the length of the stream; easy
- F2: the sum of the squares of the frequencies; the self-join size
- Fk: related to statistical moments of the distribution
- F∞: dominated by the largest fi; finds the largest frequency
- Different techniques are needed for each one: mostly sketch techniques, which compute a certain kind of random linear projection of the stream

Sketches
- Not every problem can be solved with sampling
- Example: counting how many distinct items are in the stream. If a large fraction of the items aren't sampled, we can't tell whether they are all the same or all different
- Other techniques take advantage of the fact that the algorithm can "see" all the data even if it can't "remember" it all
- (To me) a sketch is a linear transform of the input: model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix, i.e. a linear projection

Trivial Example of a Sketch
[Figure: two example bit streams, 1 0 1 1 1 0 1 0 1 … and 1 0 1 1 0 0 1 0 1 …]
- Test whether two (asynchronous) binary streams are equal: the equality distance d=(x,y) is 0 iff x = y, and 1 otherwise
- To test in small space: pick a random hash function h
- Test h(x) = h(y): a small chance of a false positive, no chance of a false negative
- Compute h(x) and h(y) incrementally as new bits arrive (Karp-Rabin: h(x) = Σi xi·2^i mod p for a random prime p), as in the sketch below
- Exercise: extend to real-valued vectors in the update model
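A minimal sketch of the incremental equality test, in Python. Assumptions: the modulus P is fixed here for illustration, whereas the scheme as described picks a random prime; the weights 2^i mod P follow the slide.

```python
P = 2**61 - 1  # a Mersenne prime, standing in for the random prime p

class EqualityFingerprint:
    """Maintains h(x) = sum_i x_i * 2^i mod P as the bits of x arrive."""
    def __init__(self):
        self.h = 0
        self.pow2 = 1  # 2^i mod P, where i is the index of the next bit
    def append_bit(self, b):
        self.h = (self.h + b * self.pow2) % P
        self.pow2 = (self.pow2 * 2) % P

fx, fy = EqualityFingerprint(), EqualityFingerprint()
for bx, by in zip([1, 0, 1, 1, 1, 0], [1, 0, 1, 1, 0, 0]):
    fx.append_bit(bx)
    fy.append_bit(by)
print(fx.h == fy.h)  # False: the streams differ; equal fingerprints
                     # would imply x = y with high probability
```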

Next: Count-Min sketch for F∞ and frequent items

Count-Min Sketch
- A simple sketch idea that can be used as the basis of many different stream-mining tasks
- Model the input stream as a vector x of dimension U
- Creates a small summary as an array of w × d counters
- Uses d hash functions to map vector entries to [1..w]
- Works on arrivals-only streams and on arrivals-and-departures streams
[Figure: the summary is a d × w array of counters CM[i,j]]

CM Sketch Structure
- On update (j, +c), entry j is mapped to one bucket per row: CM[k, hk(j)] += c for each row k
- Parameters: d = log 1/δ rows, each of width w = 2/ε
- Merge two sketches by entry-wise summation
- Estimate x[j] by taking mink CM[k, hk(j)]
- Guarantees error less than εF1, in size O(1/ε · log 1/δ); the probability of larger error is less than δ [Cormode, Muthukrishnan '04]
A minimal implementation is sketched below.
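A minimal Count-Min sketch following the parameters above (w = 2/ε, d = log 1/δ). The pairwise-independent hashes here are random linear functions mod a prime, a standard choice; the exact constants and the demo stream are illustrative.

```python
import math
import random

class CountMin:
    def __init__(self, eps, delta, seed=1):
        self.w = math.ceil(2 / eps)               # width w = 2/eps
        self.d = math.ceil(math.log2(1 / delta))  # depth d = log 1/delta
        self.p = (1 << 31) - 1                    # prime for the hash family
        rnd = random.Random(seed)
        self.ab = [(rnd.randrange(1, self.p), rnd.randrange(self.p))
                   for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, k, j):
        a, b = self.ab[k]
        return ((a * j + b) % self.p) % self.w

    def update(self, j, c=1):          # process arrival (j, +c)
        for k in range(self.d):
            self.table[k][self._h(k, j)] += c

    def estimate(self, j):             # min over rows: never underestimates
        return min(self.table[k][self._h(k, j)] for k in range(self.d))

    def merge(self, other):            # entry-wise summation of two sketches
        for k in range(self.d):
            for b in range(self.w):
                self.table[k][b] += other.table[k][b]

cm = CountMin(eps=0.1, delta=0.01)
for item in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]:
    cm.update(item)
print(cm.estimate(5))  # true count 3; answer lies in [3, 3 + eps*F1] w.h.p.
```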

Approximation
- Approximate x'[j] = mink CM[k, hk(j)]
- Analysis: in the k'th row, CM[k, hk(j)] = x[j] + Xk,j, where Xk,j = Σ_{i≠j : hk(i)=hk(j)} x[i]
- E[Xk,j] = Σ_{i≠j} x[i] · Pr[hk(i) = hk(j)] ≤ (1/w) · Σi x[i] = εF1/2, by pairwise independence of hk
- Pr[Xk,j ≥ εF1] = Pr[Xk,j ≥ 2·E[Xk,j]] ≤ 1/2, by the Markov inequality
- So Pr[x'[j] ≥ x[j] + εF1] = Pr[∀k: Xk,j > εF1] ≤ (1/2)^(log 1/δ) = δ
- Final result: with certainty x[j] ≤ x'[j], and with probability at least 1-δ, x'[j] < x[j] + εF1

Applications of CM to F∞
- The CM sketch lets us estimate fi for any i; F∞ asks us to find maxi fi
- Slow way: test every i after creating the sketch
- Faster way: test every i as it is seen in the stream, and remember the largest estimated value (see the sketch below)
- Alternate way: keep a binary tree over the domain of input items, where each node corresponds to a subset; keep sketches of all nodes at the same level; descend the tree to find large frequencies, discarding branches with low frequency
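A hedged sketch of the "faster way", reusing the CountMin class from the earlier example; the stream below is an illustrative placeholder.

```python
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
cm = CountMin(eps=0.1, delta=0.01)
best_item, best_est = None, 0
for item in stream:
    cm.update(item)
    est = cm.estimate(item)      # test each item as it arrives
    if est > best_est:
        best_item, best_est = item, est
print(best_item, best_est)       # approximates argmax/max f_i, i.e. F_infinity
```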

Count-Min Exercises
1. The median of a distribution is the item j such that the sum of the frequencies of lexicographically smaller items is ½F1. Use the CM sketch to find the (approximate) median.
2. Assume the input frequencies follow a Zipf distribution, so that the i'th largest frequency is proportional to i^(-z) for z > 1. Show that the CM sketch only needs size O(ε^(-1/z)) to give the same guarantee.
3. Suppose we have arrival and departure streams where the frequencies of items are allowed to be negative. Extend the CM sketch analysis to estimate these frequencies (note: the Markov argument no longer works).
4. How can we find the large absolute frequencies when some are negative?

Next: AMS sketch for F2

F2 Estimation
- The AMS sketch (for Alon-Matias-Szegedy) was proposed in 1996
- Allows estimation of F2, the second frequency moment
- Used at the heart of many streaming and non-streaming mining applications: it achieves dimensionality reduction
- Here we describe the AMS sketch by generalizing the CM sketch: use extra hash functions g1 ... g_(log 1/δ): {1...U} → {+1, -1}
- Now, given update (j, +c), set CM[k, hk(j)] += c·gk(j)
- The sketch remains a linear projection of the stream

F2 Analysis
- Parameters: d = 8 log 1/δ rows of width w = 4/ε²; on update (j, +c), row k adds c·gk(j) to bucket CM[k, hk(j)]
- Estimate F2 = mediank Σi CM[k,i]²
- Each row's result is Σi g(i)²·fi² + Σ_{h(i)=h(j), i≠j} 2·g(i)·g(j)·fi·fj
- But g(i)² = (+1)² = (-1)² = 1, and Σi fi² = F2
- g(i)·g(j) has a ½ chance of being +1 or -1: its expectation is 0
A minimal implementation is sketched below.
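A minimal AMS sketch with the parameters above (w = 4/ε², d = 8 log 1/δ). The four-wise independent g is taken from a random cubic polynomial mod a prime, reduced to a sign bit; this is a standard construction, but the details here are illustrative rather than taken from the slides.

```python
import math
import random
import statistics

class AMSSketch:
    def __init__(self, eps, delta, seed=1):
        self.w = math.ceil(4 / eps**2)
        self.d = 8 * math.ceil(math.log2(1 / delta))
        self.p = (1 << 61) - 1
        rnd = random.Random(seed)
        self.hc = [(rnd.randrange(1, self.p), rnd.randrange(self.p))
                   for _ in range(self.d)]           # pairwise h per row
        self.gc = [[rnd.randrange(self.p) for _ in range(4)]
                   for _ in range(self.d)]           # four-wise g per row
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, k, j):
        a, b = self.hc[k]
        return ((a * j + b) % self.p) % self.w

    def _g(self, k, j):                 # maps item j to +1 or -1
        a, b, c, e = self.gc[k]
        v = (((a * j + b) * j + c) * j + e) % self.p
        return 1 if v % 2 else -1

    def update(self, j, c=1):           # process arrival (j, +c)
        for k in range(self.d):
            self.table[k][self._h(k, j)] += c * self._g(k, j)

    def estimate_f2(self):              # median over rows of sum of squares
        return statistics.median(sum(v * v for v in row)
                                 for row in self.table)

ams = AMSSketch(eps=0.2, delta=0.05)
for item in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]:
    ams.update(item)
print(ams.estimate_f2())  # true F2 = 4 + 4 + 1 + 9 + 1 + 1 + 1 = 21
```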

F2 Variance
- The expectation of each row's estimate is exactly F2
- The variance of row k is Vark = E[(Σ_buckets b CM[k,b]²)²] - F2²
- A good exercise in algebra: expand this sum and simplify. Many terms are zero in expectation because of factors like g(a)g(b)g(c)g(d) (degree at most 4)
- This requires the hash function g to be four-wise independent: it behaves uniformly over subsets of size four or smaller. Such hash functions are easy to construct
- The row variance can finally be bounded by F2²/w; Chebyshev with w = 4/ε² gives failure probability ¼
- How do we amplify this to a small probability δ of failure?

Tail Inequalities for Sums
- We can derive stronger bounds on tail probabilities for the sum of independent Bernoulli trials via the Chernoff bound
- Let X1, ..., Xm be independent Bernoulli trials with Pr[Xi = 1] = p (and Pr[Xi = 0] = 1-p). Let X = Σ_{i=1}^m Xi, and let μ = mp be the expectation of X
- Then, for any 0 < δ ≤ 1, one standard form of the bound gives Pr[|X - μ| > δμ] < 2·exp(-μδ²/4)

Applying the Chernoff Bound
- Each row gives an estimate that is within ε relative error with probability p > ¾
- Take d repetitions and return the median. Why the median? Bad estimates are either too small or too large, so the good estimates form a contiguous group "in the middle"
- At least d/2 estimates must be bad for the median to be bad
- Apply the Chernoff bound to the d independent estimates with p = 3/4: Pr[more than d/2 bad estimates] < 2·exp(-d/8)
- So we set d = O(log 1/δ) to give failure probability δ
- The same outline is used many times in data streams; the toy simulation below illustrates it
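A toy simulation of the amplification argument: each repetition is "good" independently with probability 3/4, and the median fails only when at least d/2 repetitions are bad. The trial counts are arbitrary.

```python
import random

def median_failure_rate(d, trials=100_000, p_good=0.75, seed=1):
    rnd = random.Random(seed)
    fails = 0
    for _ in range(trials):
        bad = sum(rnd.random() > p_good for _ in range(d))
        fails += (bad >= d / 2)   # median is bad iff half the copies are bad
    return fails / trials

for d in (1, 8, 16, 32):
    print(d, median_failure_rate(d))  # shrinks roughly like exp(-c*d)
```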

Aside on Independence
- Full independence is expensive in a streaming setting: if the hash functions are fully independent over n items, then we need Ω(n) space to store their description
- Pairwise and four-wise independent hash functions can be described in a constant number of words
- The F2 algorithm uses a careful mix of limited and full independence: each hash function is four-wise independent over all n items, and each repetition is fully independent of all the others, but there are only O(log 1/δ) repetitions

AMS Sketch Exercises
1. Let x and y be binary streams of length n, and define the Hamming distance H(x,y) = |{i : x[i] ≠ y[i]}|. Show how to use AMS sketches to approximate H(x,y). Extend to strings drawn from an arbitrary alphabet.
2. The inner product of two strings x, y is x·y = Σ_{i=1}^n x[i]·y[i]. Use AMS sketches to estimate x·y. Hint: try computing the inner product of the sketches. Show the estimator is unbiased (correct in expectation). What form does the error in the approximation take? Use Count-Min sketches for the same problem and compare the errors. Is it possible to build a (1±ε) approximation of x·y?

Next: Estimating F0

F0 Estimation
- F0 is the number of distinct items in the stream, a fundamental quantity with many applications
- Early algorithms by Flajolet and Martin [1983] gave a nice hashing-based solution, but the analysis assumed fully independent hash functions
- We describe a generalized version of the FM algorithm, due to Bar-Yossef et al., that needs only pairwise independence

F0 Algorithm
- Let [1...m] be the domain of stream elements: each item in the stream is drawn from [1...m]
- Pick a random hash function h: [m] → [m³]; with probability at least 1 - 1/m, there are no collisions under h
- For each stream item i, compute h(i), and track the t distinct items achieving the smallest values of h(i). Note: if the same i is seen many times, h(i) stays the same
- Let vt be the t'th smallest value of h(i) seen. If F0 < t, give the exact answer; else estimate F'0 = t·m³/vt
- Intuition: vt/m³ ≈ the fraction of the hash domain [0, m³] occupied by the t smallest hash values
A minimal implementation is sketched below.
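A minimal version of the algorithm: hash into [m³] with a pairwise-independent linear hash and keep the t smallest distinct hash values. The fixed prime P and the O(t) scan per update are illustrative simplifications, not part of the algorithm as stated.

```python
import random

def estimate_f0(stream, m, t, seed=1):
    rnd = random.Random(seed)
    M = m**3
    P = (1 << 61) - 1                       # prime > m^3 for the hash family
    a, b = rnd.randrange(1, P), rnd.randrange(P)
    h = lambda j: ((a * j + b) % P) % M     # pairwise-independent into [0, M)
    smallest = set()                        # t smallest distinct hash values
    for j in stream:
        v = h(j)
        if v in smallest:
            continue                        # repeats hash to the same value
        if len(smallest) < t:
            smallest.add(v)
        elif v < max(smallest):
            smallest.remove(max(smallest))
            smallest.add(v)
    if len(smallest) < t:
        return len(smallest)                # fewer than t distinct: exact
    return t * M / max(smallest)            # F0' = t * m^3 / v_t

gen = random.Random(7)
stream = [gen.randrange(1000) for _ in range(5000)]
print(estimate_f0(stream, m=1000, t=64), len(set(stream)))  # estimate vs truth
```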

Analysis of the F0 Algorithm
- Suppose F'0 = t·m³/vt > (1+ε)·F0, i.e. the estimate is too high
- Then, viewing the stream as a set S ⊆ [m], we have |{s ∈ S : h(s) < t·m³/((1+ε)·F0)}| > t
- Because ε < 1, we have t·m³/((1+ε)·F0) ≤ (1-ε/2)·t·m³/F0
- Pr[h(s) < (1-ε/2)·t·m³/F0] ≈ (1/m³)·(1-ε/2)·t·m³/F0 = (1-ε/2)·t/F0 (this analysis outline hides some rounding issues)
[Figure: number line from 0 to m³ marking vt and t·m³/((1+ε)·F0)]

Chebyshev Analysis
- Let Y be the number of items hashing below t·m³/((1+ε)·F0)
- E[Y] = F0 · Pr[h(s) < t·m³/((1+ε)·F0)] = (1-ε/2)·t
- For each item, the variance of the indicator event is p(1-p) < p, so Var[Y] = Σ_{s∈S} Var[h(s) < t·m³/((1+ε)·F0)] < (1-ε/2)·t
- We may sum the variances because of pairwise independence
- Now apply Chebyshev: Pr[Y > t] ≤ Pr[|Y - E[Y]| > εt/2] ≤ 4·Var[Y]/(ε²t²) ≤ 4/(ε²t)
- Setting t = O(1/ε²) (e.g. t = 36/ε²) makes this probability at most 1/9

Completing the Analysis
- We have shown Pr[F'0 > (1+ε)·F0] < 1/9
- Pr[F'0 < (1-ε)·F0] < 1/9 can be shown similarly: too few items hash below a certain value
- So Pr[(1-ε)·F0 ≤ F'0 ≤ (1+ε)·F0] > 7/9, a good estimate
- Amplify this probability: repeat O(log 1/δ) times in parallel with different choices of the hash function h, and take the median of the estimates; the analysis is as before

F0 Issues
- Space cost: store t hash values, so O(1/ε² · log m) bits; this can be improved to O(1/ε² + log m) with additional tricks
- Time cost: check whether the hash value h(i) < vt, and update vt and the list of t smallest values if h(i) is not already present
- Total time is O(log 1/ε + log m) per item in the worst case

Range Efficiency
- Sometimes the input is specified as a stream of ranges [a,b], meaning: insert all the items a, a+1, a+2, ..., b
- Trivial solution: just insert each item in the range
- Range-efficient F0 [Pavan, Tirthapura '05]: start with an algorithm for F0 based on pairwise hash functions; the key problem is to track which items hash into a certain range. Dives into the hash functions to divide and conquer over ranges
- Range-efficient F2 [Calderbank et al. '05; Rusu, Dobra '06]: start with sketches for F2, which sum hash values, and design new hash functions so that range sums are fast

F0 Exercises
1. Suppose the stream consists of a sequence of insertions and deletions. Design an algorithm to approximate F0 of the current set. What happens when some frequencies are negative?
2. Give an algorithm to find F0 of the most recent W arrivals.
3. Use F0 algorithms to approximate max-dominance: given a stream of pairs (i, x(i)), approximate Σi max{x(i) : (i, x(i)) appears in the stream}.

Next: extensions to higher frequency moments and combined frequency moments

Higher Frequency Moments
- Fk for k > 2: use the sampling trick, as with entropy [Alon et al. '96]
- Uniformly pick a position in the stream of length n, and let i be the item there
- Set r = how many times that item appears from that position onwards (inclusive)
- Set the estimate F'k = n·(r^k - (r-1)^k)
- E[F'k] = (1/n)·n·[(f1^k - (f1-1)^k) + ((f1-1)^k - (f1-2)^k) + ... + (1^k - 0^k)] + ... = f1^k + f2^k + ... = Fk, since each inner sum telescopes
- Var[F'k] ≤ (1/n)·n²·[(f1^k - (f1-1)^k)² + ...]; various bounds let us bound the variance by k·m^(1-1/k)·Fk²
- Repeat k·m^(1-1/k) times in parallel to reduce the variance; the total space needed is O(k·m^(1-1/k)) machine words
A single copy of the estimator is sketched below.
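A hedged sketch of one copy of the sampling estimator; the full algorithm runs many copies in parallel and combines them. For simplicity the stream is a list, so the length n is known up front (a true streaming version would reservoir-sample the position).

```python
import random

def fk_single_estimate(stream, k, rnd):
    n = len(stream)
    pos = rnd.randrange(n)                  # uniform position in the stream
    item = stream[pos]
    r = sum(1 for x in stream[pos:] if x == item)  # occurrences from pos on
    return n * (r**k - (r - 1)**k)

rnd = random.Random(1)
stream = [1, 2, 1, 3, 1, 2, 1]
copies = [fk_single_estimate(stream, 3, rnd) for _ in range(2000)]
print(sum(copies) / len(copies))  # approximates F3 = 4^3 + 2^3 + 1^3 = 73
```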

Improvements
- [Coppersmith, Kumar '04]: generalize the F2 approach. E.g. for F3: set p = m, and hash items onto {1 - 1/p, -1/p} with probability {1/p, 1 - 1/p} respectively (each hash value then has expectation zero). Compute the cube of the sum of the hash values of the stream. This is correct in expectation, and the variance is bounded by O(m·F3²)
- [Indyk, Woodruff '05; Bhuvanagiri et al. '06]: optimal solutions by extracting the different frequencies. Use hashing to sample subsets of the items and their fi's, and combine these to build the correct estimator. The cost is O(m^(1-2/k) · poly-log(m, n, 1/ε)) space

Combined Frequency Moments
- Consider network traffic data: it defines a communication graph, e.g. edge = (source, destination), or edge = (source:port, dest:port)
- This defines a (directed) multigraph; we are interested in the underlying (support) graph on n nodes
- We want to focus on the number of distinct communication pairs, not the volume of communication
- So we want to compute moments of F0 values...

Multigraph Problems
- Let G[i,j] = 1 if (i,j) appears in the stream, i.e. there is an edge from i to j; there are m distinct edges in total
- Let di = Σ_{j=1}^n G[i,j]: the degree of node i
- Find aggregates of the di's: estimate heavy di's (people who talk to many others); estimate frequency moments (the number of distinct di values, the sum of their squares); range sums of di's (subnet traffic)

F∞ of F0 Using CM-FM
- Find the i's such that di > φ·Σi di: this finds the people that talk to many others
- The Count-Min sketch only uses additions, so we can apply it here by replacing each counter with a distinct-count (F0) sketch of the items mapped to that bucket; a toy version is sketched below
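A toy version of the CM-FM idea: a Count-Min structure where each counter is replaced by a distinct counter. An exact set stands in for the F0 sketch in each bucket, to show the structure; a real implementation would put an FM/KMV sketch there instead, and the parameters below are arbitrary.

```python
import random

class CMFM:
    def __init__(self, w=64, d=4, seed=1):
        rnd = random.Random(seed)
        self.p = (1 << 31) - 1
        self.ab = [(rnd.randrange(1, self.p), rnd.randrange(self.p))
                   for _ in range(d)]
        self.w = w
        self.buckets = [[set() for _ in range(w)] for _ in range(d)]

    def _h(self, k, i):
        a, b = self.ab[k]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, j):                  # edge (i, j) arrives
        for k in range(len(self.ab)):
            self.buckets[k][self._h(k, i)].add(j)

    def degree_estimate(self, i):            # distinct neighbours of node i
        return min(len(self.buckets[k][self._h(k, i)])
                   for k in range(len(self.ab)))

g = CMFM()
for edge in [(1, 2), (1, 3), (1, 2), (1, 4), (2, 3), (2, 3)]:
    g.update(*edge)
print(g.degree_estimate(1))  # d_1 = 3 distinct neighbours, despite repeats
```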

Accuracy for F∞ of F0
- Focus on point-query accuracy: estimating di
- One can prove that the estimate has only a small bias in expectation
- The analysis is similar to the original CM sketch analysis, but now has to take account of the F0 estimation of the counts
- This gives a bound of O(1/ε³ · poly-log(n)) space: the product of the sizes of the two sketches
- It remains to fully understand other combinations of frequency moments, e.g. F2(F0), F2(F2), etc.

Exercises / Problems
1. (Research problem) Read, understand and simplify the analysis of the optimal Fk estimation algorithms.
2. Take the sampling Fk algorithm and combine it with F0 estimators to approximate Fk of the node degrees.
3. Why can't we use the sketch approach for F2 of node degrees? Show where the analysis breaks down.
4. (Research problem) What can be computed for other combinations of frequency moments, e.g. F2 of F2 values?

Summary
- Introduction to frequency moments and sketches
- Count-Min sketch for F∞ and frequent items
- AMS sketch for F2
- Estimating F0
- Extensions: higher frequency moments; combined frequency moments