Finding Frequent Items in Data Streams. Moses Charikar (Princeton Un., Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers Un., Google Inc.).

Presentation transcript:

Finding Frequent Items in Data Streams. Moses Charikar (Princeton Un., Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers Un., Google Inc.). Presented by Amir Rothschild.

Presenting:
- A 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. The algorithm achieves especially good space bounds for Zipfian distributions.
- A 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.

Definitions:
- Data stream: S = q_1, q_2, …, q_n, where each q_i belongs to a set of objects O = {o_1, …, o_m}.
- Object o_i appears n_i times in S.
- Order the o_i so that n_1 ≥ n_2 ≥ … ≥ n_m.
- f_i = n_i / n.

The first problem: FindApproxTop(S, k, ε)
- Input: stream S, int k, real ε.
- Output: k elements from S such that:
  - every element o_i in the output has n_i > (1-ε)·n_k;
  - the output contains every item with n_i > (1+ε)·n_k.

Clarifications:
- This is not the problem discussed last week!
- The sampling algorithm does not give any bounds for this version of the problem.

Hash functions
- We say that h is a pairwise independent hash function if h is chosen at random from a family H of functions h: O -> B such that, for every pair of distinct objects x ≠ y and every pair of values a, b in B:
  Pr_{h in H}[h(x) = a and h(y) = b] = 1/|B|².
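To make this concrete, here is a minimal sketch (not from the slides) of one standard pairwise independent family for integer keys, h(x) = ((a·x + b) mod p) mod m; the particular prime p is just an illustrative choice.

```python
import random

class PairwiseHash:
    """Sketch of the classic h(x) = ((a*x + b) mod p) mod m family for integer keys.

    With a, b drawn uniformly from Z_p (a != 0), the map x -> (a*x + b) mod p is
    pairwise independent over {0, ..., p-1}; the final "mod m" shrinks the range
    at the cost of only approximate uniformity, which is fine in practice.
    """
    def __init__(self, m, p=(1 << 61) - 1):  # p: an illustrative Mersenne prime
        self.m, self.p = m, p
        self.a = random.randrange(1, p)
        self.b = random.randrange(p)

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.m

# A sign function into {+1, -1}, as used below, can be built the same way with m = 2:
h = PairwiseHash(2)
s = lambda x: +1 if h(x) else -1
```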

Let’s start with some intuition…
Idea:
- Let s be a hash function from objects to {+1, -1}, and let c be a counter.
- For each q_i in the stream, update c += s(q_i).
- Estimate n_i as c·s(o_i) (since E[c·s(o_i)] = n_i).
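A minimal sketch of this single-counter estimator (an illustration, not from the slides); the sign function here is a stand-in built from Python's hash with a random seed.

```python
import random

def make_sign_hash():
    """A random-looking function from objects to {+1, -1} (illustrative stand-in for s)."""
    seed = random.getrandbits(64)
    return lambda x: 1 if hash((seed, x)) % 2 else -1

def single_counter_estimate(stream, o):
    """One signed counter: an unbiased but very high-variance estimate of n_o."""
    s = make_sign_hash()
    c = 0
    for q in stream:
        c += s(q)            # each arrival adds +1 or -1 to the single counter
    return c * s(o)          # E[c * s(o)] = n_o: all other items cancel in expectation

# Example: the object 2 occurs 4 times in the stream 1,2,2,2,3,2
print(single_counter_estimate([1, 2, 2, 2, 3, 2], 2))
```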

Realization: a worked example on the stream o_1, o_2, o_2, o_2, o_3, o_2, showing the ±1 values of several sign functions s_1, …, s_4 and the resulting counter values (table omitted).

Claim: E[c·s(o_i)] = n_i.
Proof:
- For each element o_j other than o_i, s(o_j)·s(o_i) = -1 w.p. 1/2 and s(o_j)·s(o_i) = +1 w.p. 1/2.
- So o_j adds +n_j to the counter w.p. 1/2 and -n_j w.p. 1/2, and hence has no influence on the expectation.
- o_i, on the other hand, adds +n_i to the counter w.p. 1 (since s(o_i)·s(o_i) = +1).
- So the expectation (average) is +n_i.
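For completeness, a short worked version of this expectation computation (my rendering of the argument above, using pairwise independence of s):

```latex
\mathbb{E}[c \cdot s(o_i)]
  = \mathbb{E}\Big[ s(o_i) \sum_{j} n_j \, s(o_j) \Big]
  = n_i \,\mathbb{E}\big[s(o_i)^2\big] + \sum_{j \neq i} n_j \,\mathbb{E}\big[s(o_i)\, s(o_j)\big]
  = n_i \cdot 1 + \sum_{j \neq i} n_j \cdot 0
  = n_i .
```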

That’s not enough:
- The variance is very high.
- O(m) objects have estimates that are wrong by more than the variance.

First attempt to fix the algorithm…
- Keep t independent sign hash functions S_j and t different counters C_j.
- For each element q_i in the stream: for each j in {1, 2, …, t} do C_j += S_j(q_i).
- Take the mean or the median of the estimates C_j·S_j(o_i) to estimate n_i.
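A minimal sketch of this repeated-counter version (again an illustration, reusing the make_sign_hash helper from the previous sketch):

```python
from statistics import median

def t_counter_estimate(stream, o, t=5):
    """t independent signed counters; estimate n_o by the median of the t estimates."""
    signs = [make_sign_hash() for _ in range(t)]
    counters = [0] * t
    for q in stream:
        for j in range(t):
            counters[j] += signs[j](q)   # every counter is updated by every arrival
    # one estimate per counter; the median damps the effect of a few bad ones
    return median(counters[j] * signs[j](o) for j in range(t))
```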

Still not enough
- Collisions with high-frequency elements like o_1 can spoil most estimates of lower-frequency elements, such as o_k.

The solution!!!
- Divide & Conquer:
- Don’t let each element update every counter.
- More precisely: replace each counter with a hash table of b counters, and have each item update only one counter per hash table.

Presenting the CountSketch algorithm… Let’s start working…

CountSketch data structure: t hash tables T_1, …, T_t with b buckets each, with associated hash functions h_1, …, h_t and sign functions S_1, …, S_t (diagram omitted).

The CountSketch data structure
- Define the CountSketch d.s. as follows:
- Let t and b be parameters with values determined later.
- h_1, …, h_t – hash functions O -> {1, 2, …, b}.
- T_1, …, T_t – arrays of b counters.
- S_1, …, S_t – hash functions from objects O to {+1, -1}.
- From now on, define: h_i[o_j] := T_i[h_i(o_j)].

The d.s. supports 2 operations:
- Add(q): for each i in {1, …, t}, h_i[q] += S_i(q).
- Estimate(q): return median over i of h_i[q]·S_i(q).
- Why median and not mean?
  - In order to show the median is close to reality it’s enough to show that ½ of the estimates are good.
  - The mean, on the other hand, is very sensitive to outliers.
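Putting the pieces together, here is a compact Python sketch of this data structure (an illustration of the description above; Python's built-in hash with per-row random seeds stands in for the pairwise independent hash/sign families, and the optional delta argument of add is an extra convenience, not in the slides, used later for the change-detection algorithm).

```python
import random
from statistics import median

class CountSketch:
    """t hash tables of b signed counters, as described above (illustrative sketch)."""
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]
        # per-row random seeds stand in for the hash and sign function families
        self.seeds = [(random.getrandbits(64), random.getrandbits(64)) for _ in range(t)]

    def _bucket(self, i, q):
        return hash((self.seeds[i][0], q)) % self.b

    def _sign(self, i, q):
        return 1 if hash((self.seeds[i][1], q)) % 2 else -1

    def add(self, q, delta=1):
        # delta = +1 for a normal arrival; signed updates are allowed for convenience
        for i in range(self.t):
            self.tables[i][self._bucket(i, q)] += delta * self._sign(i, q)

    def estimate(self, q):
        return median(self.tables[i][self._bucket(i, q)] * self._sign(i, q)
                      for i in range(self.t))
```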

Finally, the algorithm:
- Keep a CountSketch d.s. C and a heap of the top k elements.
- Given a data stream q_1, …, q_n, for each j = 1, …, n:
  - C.Add(q_j);
  - If q_j is in the heap, increment its count.
  - Else, if C.Estimate(q_j) > the smallest estimated count in the heap, add q_j to the heap (if the heap is full, evict the object with the smallest estimated count).
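A sketch of this main loop, reusing the illustrative CountSketch class above; for brevity a plain dict stands in for the heap of candidates, which does not change the logic described in the slide.

```python
def find_approx_top(stream, k, t=10, b=1000):
    """Return the k items with the largest estimated counts (sketch of the loop above)."""
    C = CountSketch(t, b)
    top = {}                          # candidate item -> estimated count (stands in for the heap)
    for q in stream:
        C.add(q)
        if q in top:
            top[q] += 1               # exact increments for items already tracked
        else:
            est = C.estimate(q)
            if len(top) < k:
                top[q] = est
            else:
                worst = min(top, key=top.get)
                if est > top[worst]:
                    del top[worst]    # evict the smallest estimated count
                    top[q] = est
    return sorted(top, key=top.get, reverse=True)
```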

And now for the hard part: algorithm analysis

Definitions

Claims & Proofs

The CountSketch algorithm space complexity: with t = O(log(n/δ)) hash tables of b = O(k + Σ_{i>k} n_i² / (ε·n_k)²) counters each (δ being the failure probability), the total space is O((k + Σ_{i>k} n_i² / (ε·n_k)²) · log(n/δ)) counters.

Zipfian distribution Analysis of the CountSketch algorithm for Zipfian distribution

Zipfian distribution
- Zipfian(z): n_i = c / i^z for some constant c.
- This distribution is very common in human languages (useful in search engines).
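As a short worked restatement (not from the slides) tying this to the earlier notation, the constant c is fixed by the stream length n and the item probabilities are:

```latex
n_i = \frac{c}{i^z}, \qquad
f_i = \frac{n_i}{n} = \frac{i^{-z}}{\sum_{j=1}^{m} j^{-z}}, \qquad
\Pr_q[\, q = o_i \,] = f_i .
```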

(Plot of Pr_q(o_i = q) as a function of i for a Zipfian distribution.)

Observations
- The k most frequent elements can only be preceded (in the ordering by estimated counts) by elements j with n_j > (1-ε)·n_k.
- => Choosing l instead of k, so that n_{l+1} < (1-ε)·n_k, will ensure that our list includes the k most frequent elements.

Analysis for Zipfian distribution
- For this distribution the space complexity of the algorithm is O((k/ε²)·log(n/δ)) for z > 1/2, where:
  - Part 1 shows that l = O(k), and
  - Part 2 bounds the noise term Σ_{i>l} n_i² / (ε·n_l)².

Proof of the space bounds: Part 1, l=O(k)
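A brief reconstruction of the Part 1 argument (my sketch, assuming n_i = c/i^z and treating ε and z as constants):

```latex
n_{l+1} < (1-\varepsilon)\, n_k
\;\Longleftrightarrow\;
\frac{c}{(l+1)^z} < (1-\varepsilon)\,\frac{c}{k^z}
\;\Longleftrightarrow\;
l+1 > \frac{k}{(1-\varepsilon)^{1/z}} ,
```

so taking l = ⌈k·(1-ε)^{-1/z}⌉ suffices, which is O(k) for constant ε and z.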

Proof of the space bounds: Part 2
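A matching sketch of the Part 2 calculation (my reconstruction, assuming z > 1/2 and using l = O(k) from Part 1):

```latex
\sum_{i > l} n_i^2
  = c^2 \sum_{i > l} i^{-2z}
  \le c^2 \int_{l}^{\infty} x^{-2z}\, dx
  = \frac{c^2\, l^{1-2z}}{2z-1}
  = O\!\left(l \cdot n_l^2\right),
\qquad\text{so}\qquad
\frac{\sum_{i>l} n_i^2}{(\varepsilon\, n_l)^2}
  = O\!\left(\frac{l}{\varepsilon^2}\right)
  = O\!\left(\frac{k}{\varepsilon^2}\right).
```

Plugging this into the general space bound gives the O((k/ε²)·log(n/δ)) figure quoted above.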

Comparison of space requirements for random sampling vs. our algorithm

Yet another algorithm which uses CountSketch d.s. Finding items with largest frequency change

The problem
- Let n_o(S) be the number of occurrences of o in S.
- Given 2 streams S_1, S_2, find the items o such that the change |n_o(S_1) − n_o(S_2)| is maximal.
- 2-pass algorithm.

The algorithm – first pass
- First pass – only update the counters: for each q in S_1 do h_i[q] −= S_i(q) and for each q in S_2 do h_i[q] += S_i(q) (for every i), so that the sketch records the difference in counts between S_2 and S_1.

The algorithm – second pass
- Second pass – pass over S_1 and S_2 and: maintain a set A of the items with the largest estimated changes, keeping an exact change count for each item currently in A; when an item's estimated change exceeds the smallest estimated change in A, insert it, evicting that smallest item (a code sketch follows below).
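A hedged reconstruction of the two passes as just described (my sketch under the assumptions above, reusing the illustrative CountSketch class; the signed add makes the sketch record n_q(S_2) − n_q(S_1), and A maps each candidate to its estimated |change| and the exact change accumulated since its first occurrence in the second pass):

```python
def largest_change(S1, S2, k, t=10, b=1000):
    """Two passes: approximate the k items whose count changes most from S1 to S2."""
    C = CountSketch(t, b)

    # Pass 1: only update the counters.  Subtracting S1 and adding S2 makes
    # C.estimate(q) approximate n_q(S2) - n_q(S1).
    for q in S1:
        C.add(q, -1)
    for q in S2:
        C.add(q, +1)

    # Pass 2: keep a candidate set A.  Exact change counts are complete because a
    # candidate is only ever admitted at its first occurrence in this pass, and
    # evicted items are never re-added (the admission threshold only grows).
    A = {}   # item -> [estimated |change|, exact change accumulated so far]

    def process(q, delta):
        if q in A:
            A[q][1] += delta
            return
        est = abs(C.estimate(q))
        if len(A) < k:
            A[q] = [est, delta]
            return
        worst = min(A, key=lambda o: A[o][0])
        if est > A[worst][0]:
            del A[worst]              # once evicted, an item never re-enters A
            A[q] = [est, delta]

    for q in S1:
        process(q, -1)
    for q in S2:
        process(q, +1)

    return sorted(A, key=lambda o: abs(A[o][1]), reverse=True)[:k]
```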

Explanation
- Though A can change, items once removed are never added back.
- Thus accurate exact counts can be maintained for all objects currently in A.
- Space bounds for this algorithm are similar to those of the former, with n_k replaced by the k-th largest change |n_o(S_1) − n_o(S_2)|.