1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006

2 Data Streams

3 Outline
- The data stream model
- Approximate counting
- Distinct elements
- Frequency moments

4 The Data Stream Model
f: A^n → B
- A, B arbitrary sets
- n: a positive integer (think of n as large)
- Given x ∈ A^n, each entry x_i is called an "element"
- Typically, A and B are "small" (constant-size) sets
Goal: given x ∈ A^n, compute f(x)
- Frequently, an approximation of f(x) suffices
- Usually, we will use randomization
Streaming access to input
- The algorithm reads the input in "sequential passes"
- In each pass x is read in the order x_1, x_2, …, x_n
- Impossible: random access, going backwards
- Possible: storing portions of x (or other functions of x) in memory

5 Complexity Measures
Space
- Objective: use as little memory as possible
- Note: if we allow unlimited space, the data stream model is the same as the standard RAM model
- Ideally, at most O(log^c n) for some constant c
Number of passes
- Objective: use as few passes as possible
- Ideally, only a single pass
- Usually, no more than a constant number of passes
Running time
- Objective: use as little time as possible
- Ideally, at most O(n log^c n) for some constant c

6 Motivation
Types of large data sets:
- Pre-stored: stored on magnetic or optical media (tapes, disks, DVDs, …)
- Generated on the fly: data feeds, streaming media, packet streams, …
Access to large data sets:
- Random access: costly (if the data is pre-stored), infeasible (if the data is generated on the fly)
- Streaming access: the only feasible option
Resources:
- Memory: the primary bottleneck
- Number of passes: a few (if the data is pre-stored), a single pass (if the data is generated on the fly)
- Time: cannot be more than quasi-linear

7 Approximate Counting [Morris 77, Flajolet 85]
Input: a bit string x ∈ {0,1}^n
Goal: find H = the number of 1's in x
Naïve solution: just count them!
- O(log H) bits of space
Can we do better?
Answer 1: No!
- Information theory implies an Ω(log H) lower bound
Answer 2: Yes! But only approximately:
- Output the closest power of 1+ε to H
- Note: the number of possible outputs is O(log_{1+ε} H) = O((1/ε) log H)
- Hence, only O(log log H + log(1/ε)) bits of space suffice

8 Approximate Counting (ε = 1)
k ← 0
for i = 1 to n do
- if x_i = 1, then with probability 1/2^k set k ← k + 1
output 2^k - 1
General idea:
- The expected number of 1's needed to increment k to k + 1 is 2^k
- k = 0 → k = 1: after seeing 1 one
- k = 1 → k = 2: after seeing 2 additional 1's
- k = 2 → k = 3: after seeing 4 additional 1's
- …
- k = i-1 → k = i: after seeing 2^{i-1} additional 1's
- Therefore, we expect k to become i after seeing 2^0 + 2^1 + … + 2^{i-1} = 2^i - 1 ones
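The pseudocode above translates directly into a few lines of Python. This is a minimal illustrative sketch, not part of the original slides: the stream is any iterable of 0/1 values, and `random.random()` plays the role of the coin with bias 1/2^k.

```python
import random

def morris_count(bits):
    """Approximate the number of 1's in a bit stream (the epsilon = 1 variant).

    Only the counter k is stored (O(log log H) bits); the returned value
    2^k - 1 is an unbiased estimate of the true count H.
    """
    k = 0
    for b in bits:
        if b == 1 and random.random() < 1.0 / (2 ** k):
            k += 1
    return 2 ** k - 1

# Example: a stream with 1000 ones; the estimate is typically within a factor of 2.
print(morris_count([1] * 1000))
```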

9 Approximate Counting: Analysis
For m = 0,…,H, let K_m = the value of the counter after seeing m 1's.
For i = 0,…,m, let p_{m,i} = Pr[K_m = i].
Recursion:
- p_{0,0} = 1
- p_{m,0} = 0, for m = 1,…,H
- p_{m,i} = p_{m-1,i} (1 - 1/2^i) + p_{m-1,i-1} · 1/2^{i-1}, for m = 1,…,H and i = 1,…,m-1
- p_{m,m} = p_{m-1,m-1} · 1/2^{m-1}, for m = 1,…,H

10 Approximate Counting: Analysis
Define: C_m = 2^{K_m}
Lemma: E[C_m] = m + 1
Therefore, C_H - 1 is an unbiased estimator for H.
One can also show that Var[C_H] is small, and therefore w.h.p. H/2 ≤ C_H - 1 ≤ 2H.
Proof of lemma: by induction on m.
- Basis: E[C_0] = 1, E[C_1] = 2.
- Induction step: suppose m ≥ 2 and E[C_{m-1}] = m.

11 Approximate Counting: Analysis
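The calculation that appeared on this slide does not survive in the transcript. A reconstruction of the standard inductive step (my notation, not verbatim from the slides): conditioning on K_{m-1}, the m-th 1 doubles C = 2^K with probability 1/2^{K_{m-1}} and leaves it unchanged otherwise, so

E[C_m | K_{m-1}] = (1/2^{K_{m-1}}) · 2^{K_{m-1}+1} + (1 - 1/2^{K_{m-1}}) · 2^{K_{m-1}} = 2^{K_{m-1}} + 1 = C_{m-1} + 1.

Taking expectations, E[C_m] = E[C_{m-1}] + 1 = m + 1, which completes the induction.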

12 Better Approximation
So far, a factor-2 approximation. How do we obtain a 1+ε approximation?
k ← 0
for i = 1 to n do
- if x_i = 1, then with probability 1/(1+ε)^k set k ← k + 1
output ((1+ε)^k - 1)/ε
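A minimal Python sketch of this 1+ε variant (illustrative only; the slides give pseudocode, not this code). It generalizes `morris_count` above by replacing the base 2 with 1+ε.

```python
import random

def morris_count_eps(bits, eps):
    """Morris counter with base (1 + eps); returns ((1+eps)^k - 1) / eps."""
    k = 0
    for b in bits:
        if b == 1 and random.random() < 1.0 / ((1 + eps) ** k):
            k += 1
    return ((1 + eps) ** k - 1) / eps

# Example: with eps = 0.1 the estimate concentrates much closer to the true count.
print(morris_count_eps([1] * 1000, eps=0.1))
```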

13 Distinct Elements [Flajolet, Martin 85] [Alon, Matias, Szegedy 96] [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
Input: a vector x ∈ {1,2,…,m}^n
Goal: find D = the number of distinct elements of x
- Example: if x = (1,2,3,1,2,3), then D = 3
Naïve solution: use a bit vector of size m, and track the values that appear at least once
- O(m) bits of space
Can we do better?
Answer 1: No!
- If we want the exact number, Ω(m) bits of space are needed
- Information theory gives only Ω(log m); the Ω(m) bound needs communication complexity arguments
Answer 2: Yes! But only approximately:
- Use only O(log m) bits of space

14 Estimating the Size of a Random Set
Suppose we choose D << M^{1/2} elements uniformly and independently from {1,…,M}:
- X_1 is uniformly chosen from {1,…,M}
- X_2 is uniformly chosen from {1,…,M}
- …
- X_D is uniformly chosen from {1,…,M}
For each k = 1,…,D, how many elements of {1,…,M} do we expect to be smaller than min{X_1,…,X_k}?
- k = 1: we expect M/2 elements to be less than X_1
- k = 2: we expect M/3 elements to be less than min{X_1,X_2}
- k = 3: we expect M/4 elements to be less than min{X_1,X_2,X_3}
- …
- k = D: we expect M/(D+1) elements to be less than min{X_1,…,X_D}
Conversely, suppose S is a set of randomly chosen elements from {1,…,M} whose size is unknown.
Then, if t = min S, we can estimate |S| as M/t - 1.
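A quick sanity check of this estimator in Python (illustrative, not from the slides): draw D = 100 random elements from {1,…,M} and estimate the sample size from its minimum alone.

```python
import random

def estimate_set_size(sample, M):
    """Estimate |S| from min(S) alone, using the heuristic |S| ≈ M/min(S) - 1."""
    return M / min(sample) - 1

M, D = 10**9, 100                 # D << sqrt(M)
sample = [random.randint(1, M) for _ in range(D)]
print(estimate_set_size(sample, M))   # typically within a small constant factor of 100
```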

15 Distinct Elements, 1st Attempt
Let M >> m^2.
Pick a random "hash function" h: {1,…,m} → {1,…,M}:
- h(1),…,h(m) are chosen uniformly and independently from {1,…,M}
- Since M >> m^2, the probability of collisions is tiny
min ← M
for i = 1 to n do
- if h(x_i) < min, then min ← h(x_i)
output M/min
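A Python sketch of this first attempt (illustrative only). The truly random hash function is simulated by memoizing fresh random values in a dictionary, which of course uses more space than the streaming algorithm is allowed; the point is only the min-tracking logic.

```python
import random

def distinct_elements_min(stream, m):
    """Single-pass estimate of the number of distinct elements (1st attempt)."""
    M = 100 * m * m          # M >> m^2, so hash collisions are unlikely
    h = {}                   # lazily materialized "truly random" function
    smallest = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        smallest = min(smallest, h[x])
    return M / smallest

# True D = 1000; the output is within a factor of 6 of D with probability >= 2/3.
print(distinct_elements_min(list(range(1, 1001)) * 2, m=1000))
```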

16 Distinct Elements: Analysis
Space: O(log M) = O(log m)
- Not quite; we'll discuss this later.
Correctness:
- Let a_1,…,a_D be the distinct values among x_1,…,x_n
- S = {h(a_1),…,h(a_D)} is a set of D random and independent elements of {1,…,M}
- Note: min = min S
- The algorithm outputs M/(min S)
Lemma: With probability at least 2/3, D/6 ≤ M/min ≤ 6D.

17 Distinct Elements: Correctness
Part 1: show that Pr[min < M/(6D)] ≤ 1/6 (equivalently, M/min ≤ 6D with probability at least 5/6).
Define for k = 1,…,D: Z_k = 1 if h(a_k) < M/(6D), and Z_k = 0 otherwise.
Define: Z = Z_1 + … + Z_D.
Note: min < M/(6D) if and only if Z ≥ 1, and E[Z] = Σ_k Pr[h(a_k) < M/(6D)] ≤ D · 1/(6D) = 1/6.

18 Markov's Inequality
X ≥ 0: a non-negative random variable, t > 1.
Then: Pr[X ≥ t · E[X]] ≤ 1/t.
Need to show: Pr[Z ≥ 1] ≤ 1/6.
By Markov's inequality (with t = 6, and using 6 · E[Z] ≤ 1): Pr[Z ≥ 1] ≤ Pr[Z ≥ 6 · E[Z]] ≤ 1/6.

19 Distinct Elements: Correctness
Part 2: show that Pr[min > 6M/D] ≤ 1/6 (equivalently, M/min ≥ D/6 with probability at least 5/6).
Define for k = 1,…,D: Y_k = 1 if h(a_k) < 6M/D, and Y_k = 0 otherwise.
Define: Y = Y_1 + … + Y_D.
Note: min > 6M/D if and only if Y = 0, and E[Y] = Σ_k Pr[h(a_k) < 6M/D] ≈ 6.

20 Chebyshev's Inequality
X: an arbitrary random variable, λ > 0.
Then: Pr[|X - E[X]| ≥ λ] ≤ Var[X] / λ^2.
Need to show: Pr[Y = 0] ≤ 1/6.
By Chebyshev's inequality, Pr[Y = 0] ≤ Pr[|Y - E[Y]| ≥ E[Y]] ≤ Var[Y] / E[Y]^2.
By independence of Y_1,…,Y_D: Var[Y] = Σ_k Var[Y_k] ≤ Σ_k E[Y_k] = E[Y].
Hence, Pr[Y = 0] ≤ 1/E[Y] ≈ 1/6.

21 How to Store the Hash Function?
How many bits are needed to represent a random hash function h: [m] → [M]?
- O(m log M) = O(m log m) bits
- More than the naïve algorithm!
Solution: use "small" families of hash functions
- H will be a family of functions h: [m] → [M]
- |H| = O(m^c) for some constant c
- Each h ∈ H can be represented using O(log m) bits
- Need H to be "explicit": given the representation of h, we can compute h(x) efficiently for any x
- How do we make sure H has the "random-like" properties of totally random hash functions?

22 Universal Hash Functions [Carter, Wegman 79]
H is a 2-universal family of hash functions if:
- For all x ≠ y ∈ [m] and for all z,w ∈ [M], when h is chosen from H at random, Pr[h(x) = z and h(y) = w] = 1/M^2
Conclusions:
- For each x, h(x) is uniform in [M]
- For all x ≠ y, h(x) and h(y) are independent
- h(1),…,h(m) is a sequence of uniform, pairwise-independent random variables
k-universal families: a straightforward generalization

23 Construction of a Universal Family
Suppose m = M and m is a prime power. [m] can then be identified with the finite field F_m.
Each pair of elements a,b ∈ F_m defines one hash function in H:
- h_{a,b}(x) = ax + b (operations in F_m)
- |H| = |F_m|^2 = m^2
Note: if x ≠ y ∈ [m] and z,w ∈ [m], then h_{a,b}(x) = z and h_{a,b}(y) = w iff ax + b = z and ay + b = w.
Since x ≠ y, this system has a unique solution (a,b), and thus if we choose a,b at random the probability of hitting that solution is exactly 1/m^2.
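A small Python sketch of such a family (illustrative; the function name `random_universal_hash` is mine). The slide works over a finite field F_m for a prime power m; here I take the common shortcut of a prime modulus p, which gives the same pairwise-independence guarantee over Z_p.

```python
import random

def random_universal_hash(p):
    """Draw h_{a,b}(x) = (a*x + b) mod p at random from a 2-universal family over Z_p.

    p must be prime; only a and b (O(log p) bits each) need to be stored.
    """
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda x: (a * x + b) % p

h = random_universal_hash(2_147_483_647)   # a prime larger than the domain size
print(h(42), h(43))
```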

24 Distinct Elements, 2nd Attempt
Use a random hash function from a 2-universal family of hash functions rather than a totally random hash function.
Space:
- O(log m) for tracking the minimum
- O(log m) for storing the hash function
Correctness:
- Part 1: h(a_1),…,h(a_D) are still uniform in [M]; linearity of expectation holds regardless of whether Z_1,…,Z_D are independent or not.
- Part 2: h(a_1),…,h(a_D) are still uniform in [M]; the main point is that the variance of a sum of pairwise-independent variables is additive: Var[Y] = Σ_k Var[Y_k].

25 Distinct Elements, Better Approximation
So far we had a factor-6 approximation. How do we get a better one?
A 1+ε approximation algorithm:
- Track the t = O(1/ε^2) smallest hash values, rather than just the smallest one.
- If v is the largest among these, output tM/v.
Space: O((1/ε^2) log m)
- A better algorithm achieves O(1/ε^2 + log m).
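A Python sketch of this t-smallest-values idea (illustrative only; `distinct_elements_kmv` is my name, and it reuses the hypothetical `random_universal_hash` helper above). A max-heap keeps the t smallest hash values seen so far.

```python
import heapq

def distinct_elements_kmv(stream, t, p=2_147_483_647):
    """Estimate the number of distinct elements from the t smallest hash values."""
    h = random_universal_hash(p)
    heap = []            # max-heap (via negation) holding the t smallest hash values
    seen = set()         # hash values currently in the heap, to skip duplicates
    for x in stream:
        v = h(x)
        if v in seen:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            seen.add(v)
        elif v < -heap[0]:
            seen.discard(-heapq.heappushpop(heap, -v))   # evict the current largest
            seen.add(v)
    if len(heap) < t:    # fewer than t distinct hash values: the count is exact
        return len(heap)
    v_max = -heap[0]     # largest of the t smallest hash values
    return t * p / v_max

# True D = 1000; with t = 100 the estimate is much tighter than the single-minimum version.
print(distinct_elements_kmv(range(1, 1001), t=100))
```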

26 Frequency Moments [Alon, Matias, Szegedy 96]
Input: a vector x ∈ {1,2,…,m}^n
Goal: find F_k = the k-th frequency moment of x, F_k = Σ_{j=1}^{m} f_j^k, where for each j ∈ {1,…,m}, f_j = the number of occurrences of j in x
- Example: if x = (1,1,1,2,2,3) then f_1 = 3, f_2 = 2, f_3 = 1
Examples:
- F_1 = n (counting)
- F_0 = number of distinct elements
- F_2 = a measure of "pairwise collisions"
- F_k = a measure of "k-wise collisions"
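For reference, a naive non-streaming Python computation of F_k straight from the definition (illustrative only); the streaming algorithms summarized on the next slide approximate the same quantities in sublinear space.

```python
from collections import Counter

def frequency_moment(x, k):
    """Compute F_k = sum over values j of f_j^k, where f_j is the count of j in x."""
    return sum(f ** k for f in Counter(x).values())

x = [1, 1, 1, 2, 2, 3]
print(frequency_moment(x, 0), frequency_moment(x, 1), frequency_moment(x, 2))  # 3, 6, 14
```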

27 Frequency Moments: Data Stream Algorithms
F_0: O(1/ε^2 + log m)
F_1: O(log log n + log(1/ε))
F_2: O((1/ε^2)(log m + log n))
F_k, k > 2: O((1/ε^2) m^{1-2/k})

28 End of Lecture 12