Student Seminar – Fall 2012: A Simple Algorithm for Finding Frequent Elements in Streams and Bags. Richard M. Karp, Scott Shenker and Christos H. Papadimitriou.


Student Seminar – Fall 2012: A Simple Algorithm for Finding Frequent Elements in Streams and Bags, by Richard M. Karp, Scott Shenker and Christos H. Papadimitriou. Presenter: Khitron Igal.

Overview: Introduction, Agenda, Pass 1, Pass 1 implementation, Pass 2, Summary.

Introduction

Motivation: network congestion monitoring, data mining, analysis of web query logs, ... Finding the high-frequency elements in a multiset, the so-called "iceberg query" or "hot list analysis".

On-line vs. off-line. An on-line algorithm is one that can work without saving all the input: it processes each input element on arrival (stream oriented). In contrast, an off-line algorithm needs space to save the whole input (bag oriented).

Performance. Because of the huge amount of data it is really important to reduce the time and space demands, so one-pass on-line analysis is preferable. Performance criteria: amortized time (time for a sequence divided by its length); worst-case time (on-line only: time per symbol occurrence, maximized over all occurrences in the sequence); number of passes; space.

Passes. If one on-line pass does not suffice, we will use more, but in many problems the number of passes should be minimal; we still will not save all the input. Consider, for example, the finding-frequent-elements problem over a whole hard disk: to save time it is much better to make each pass of the algorithm a single read-head route over the whole disk.

Problem Definitions. The input x = x[1] ... x[N] is a sequence of N symbols over an alphabet of size n, and 0 < θ < 1 is a threshold; f_x(a) denotes the number of occurrences of symbol a in x. The goal is to compute I(x, θ), the set of symbols occurring more than θN times in x.

History. N. Alon, Y. Matias, and M. Szegedy (1996) proposed an algorithm which calculates a few of the highest frequencies, without identifying the corresponding symbols, in one on-line pass. Attempts to find the fourth or further highest frequencies need dramatically growing time and space and become unprofitable.


Space bounds. Proposition 1: |I(x, θ)| ≤ 1/θ. Indeed, otherwise the symbols of I(x, θ) would account for more than (1/θ) · θN = N occurrences in a sequence of length N. Proposition 2: there is a straightforward one-pass on-line algorithm which uses O(n) memory words: just keep a counter for each alphabet symbol. Theorem 3: any one-pass on-line algorithm needs Ω(n log(N/n)) bits in the worst case. The proof will come later.
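Proposition 2's counter-per-symbol algorithm can be transcribed directly; this sketch (the function name frequent_offline is mine, and the input string is the example used later in the talk) keeps one counter per symbol seen, i.e. O(n) memory words:

```python
from collections import Counter

def frequent_offline(x, theta):
    # One counter per alphabet symbol that appears in x: O(n) words.
    counts = Counter(x)
    # Keep exactly the symbols occurring more than theta * N times.
    return {a for a, c in counts.items() if c > theta * len(x)}

print(frequent_offline("aabcbaadccd", 0.35))   # {'a'}  (4 > 0.35 * 11 = 3.85)
```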

Algorithm specifications. So a one-pass on-line algorithm needs much more than 1/θ space. We will present an algorithm which: uses O(1/θ) space; makes two passes; spends O(1) time per symbol occurrence, including in the worst case. The first pass creates a superset K of I(x, θ) with |K| ≤ 1/θ, possibly containing false positives. The second pass finds I(x, θ).


Pass 1 – Algorithm Description

Pass 1 – the code (generalizing on θ):

    x[1] ... x[N] is the input sequence
    K is a set of symbols, initially empty
    count[] is an array of integers indexed by K
    for i := 1, ..., N do {
        if (x[i] is in K) then
            count[x[i]] := count[x[i]] + 1
        else {
            insert x[i] in K
            set count[x[i]] := 1
        }
        if (|K| > 1/theta) then
            for all a in K do {
                count[a] := count[a] - 1
                if (count[a] = 0) then delete a from K
            }
    }
    output K

Pass 1 – example. x = aabcbaadccd, θ = 0.35, N = 11, θN = 3.85. Frequencies: f_x(a) = 4 > θN; f_x(c) = 3 < θN; f_x(b) = f_x(d) = 2 < θN. Since 1/θ ≈ 2.85, a decrement step fires whenever |K| ≥ 3. Result: K = {a, c}; a is truly frequent (+), c is a false positive (–).
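The pseudocode and the worked example can be checked with a short Python sketch; this is a direct transcription of the pass-1 pseudocode (the name pass1 is mine), not the O(1) worst-case implementation discussed later:

```python
def pass1(x, theta):
    # K is the key set of `count`; count[a] is a's current counter.
    count = {}
    for sym in x:
        count[sym] = count.get(sym, 0) + 1     # insert or increment
        if len(count) > 1 / theta:             # |K| exceeded 1/theta:
            for a in list(count):              # decrement every counter
                count[a] -= 1
                if count[a] == 0:              # drop zero-count symbols
                    del count[a]
    return count                               # superset K of I(x, theta)

k = pass1("aabcbaadccd", 0.35)
print(sorted(k))   # ['a', 'c']: a is truly frequent, c is a false positive
```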

Pass 1 – proof. Theorem 4: the algorithm computes a superset K of I(x, θ) with |K| ≤ 1/θ, using O(1/θ) memory and O(1) operations (including hashing operations) per occurrence in the worst case. Proof. Correctness, by contradiction: assume some a occurs more than θN times in x but is not in K at the end. Then every occurrence of a was cancelled by a decrement step, and each decrement step cancels more than 1/θ occurrences (one for each member of K at that moment). So in total more than θN · (1/θ) = N occurrences were cancelled, but there are only N, a contradiction. |K| ≤ 1/θ follows from the algorithm description, so O(1/θ) space suffices. For the O(1) runtime, see the implementation.


Hash. A hash table maps keys to their associated values. Our collision treatment is chaining: each slot of the bucket array is a pointer to a doubly linked list containing the elements that hashed to that location. For example, the hash function f(x) = x mod m, where m is the number of slots.
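A minimal chaining table can be sketched as follows; the class name, the default bucket count m = 7, and Python lists standing in for the doubly linked chains are illustrative choices, not from the slides:

```python
class ChainedHash:
    def __init__(self, m=7):                  # m buckets (illustrative size)
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _chain(self, key):                    # the chain this key hashes to
        return self.buckets[hash(key) % self.m]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)       # key present: overwrite
                return
        chain.append((key, value))            # else append to the chain

    def find(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None                           # not found

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return

h = ChainedHash()
h.insert('a', 4)
h.insert('c', 1)
print(h.find('a'))    # 4
```

Colliding keys simply share a chain: with m = 3, keys 1 and 4 land in the same bucket yet remain individually retrievable.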

Pass 1 implementation – try 1. Keep K as a hash table; this needs O(1/θ) memory and gives O(1) amortized operations per occurrence arrival: a constant number of operations per arrival excluding deletions, and each deletion is charged to a token deposited at the corresponding arrival. But this is not enough for the worst-case bound. Conclusion: we need a more sophisticated data structure.

Pass 1 – implementation demands. We now have a problem in data structures theory. We need to maintain a set K with a count for each member, and to support: incrementing by one the count of a given member of K; decrementing by one the counts of all elements of K, together with erasing all members whose count reaches 0.

Pass 1 – implementation. K stays as a hash table. We add a new linked list L whose p-th link points to a doubly linked list l_p of the members of K that have count p. A double pointer connects each element of l_p to the corresponding hash element, and each element of l_p also points to its counter in L. Deletions are done by a special garbage collector. Example: K = {a, c, d, g, h} with cnt(a) = 4, cnt(c) = cnt(d) = 1, cnt(g) = 3, cnt(h) = … (the slide's diagram of K and L is not reproduced here).

Pass 1 – time. Each symbol occurrence needs O(1) time for hash operations plus a constant number of list operations: insert the symbol as the first element of l_1; find the proper counter copy and move it from l_p to l_(p+1); create a new counter at the end of L; move the start of L forward. Deletions also fit the bound thanks to the garbage collector, which performs a constant number of operations each time.

Pass 1 – last try. This fits the time bound, but the space is O(1/θ + c), where c is the length of L. That is bad for, e.g., x = a^N, so we need a small improvement: empty elements of L are simply absent, and each non-empty element carries a length field giving its distance from the preceding neighbor, which still supports O(1)-time operations. The maximal length of L then becomes 1/θ, the same as the size bound of K, so the needed space is O(1/θ) in the worst case.
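The structure above can be approximated in Python. This is a simplified sketch under stated assumptions, not the slides' exact pointer structure: counts are stored relative to a global base, so the decrement-all becomes one base increment plus detaching the lowest bucket, and detached buckets are emptied lazily, a few elements per operation, playing the role of the garbage collector; dicts and sets stand in for the hash table K and the lists l_p.

```python
class Pass1Structure:
    def __init__(self, theta):
        self.cap = int(1 / theta)  # |K| must not exceed 1/theta
        self.base = 0              # number of decrement-all steps so far
        self.count = {}            # K: symbol -> stored count (real count = stored - base)
        self.bucket = {}           # stored count p -> set of symbols (the list l_p)
        self.garbage = []          # detached buckets, emptied lazily
        self.live = 0              # number of live symbols, i.e. |K|

    def _gc(self, steps=2):
        # Garbage collector: delete up to `steps` dead symbols, O(1) each.
        while self.garbage and steps:
            g = self.garbage[-1]
            if not g:
                self.garbage.pop()
            else:
                sym = g.pop()
                if self.count.get(sym, self.base + 1) <= self.base:
                    del self.count[sym]     # still dead: remove from K
            steps -= 1

    def arrive(self, sym):
        c = max(self.count.get(sym, self.base), self.base)
        if c > self.base:
            self.bucket[c].discard(sym)     # live symbol: leave bucket l_p
        else:
            self.live += 1                  # new or dead symbol: (re)enter K
        self.count[sym] = c + 1
        self.bucket.setdefault(c + 1, set()).add(sym)   # join l_(p+1)
        if self.live > self.cap:
            self.base += 1                  # decrement all counts at once
            dead = self.bucket.pop(self.base, set())    # lowest bucket dies
            self.live -= len(dead)
            self.garbage.append(dead)       # detach it; empty it later
        self._gc()

    def result(self):
        # Current K with its counters (not-yet-collected dead symbols filtered out).
        return {s: c - self.base for s, c in self.count.items() if c > self.base}

s = Pass1Structure(0.35)
for ch in "aabcbaadccd":
    s.arrive(ch)
print(sorted(s.result().items()))   # [('a', 1), ('c', 1)], as in the example
```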


Pass 2 – Algorithm description. We have a superset K with |K| ≤ 1/θ. Pass 2 maintains counters for the members of K only, and returns only those satisfying f_x(a) > θN.
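Putting the two passes together, again as a direct Python transcription of the slides' description (the function name frequent_elements is mine):

```python
def frequent_elements(x, theta):
    # Pass 1: build the candidate superset K, |K| <= 1/theta.
    count = {}
    for sym in x:
        count[sym] = count.get(sym, 0) + 1
        if len(count) > 1 / theta:
            for a in list(count):
                count[a] -= 1
                if count[a] == 0:
                    del count[a]
    # Pass 2: exact counters for the members of K only.
    exact = dict.fromkeys(count, 0)
    for sym in x:
        if sym in exact:
            exact[sym] += 1
    # Keep only the symbols with f_x(a) > theta * N.
    return {a for a, c in exact.items() if c > theta * len(x)}

print(frequent_elements("aabcbaadccd", 0.35))   # {'a'}: the false positive c is gone
```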

The proof. Theorem 3: any one-pass on-line algorithm needs Ω(n log(N/n)) bits in the worst case, when N > 4n > 16/θ (recall N ≫ n ≫ 1/θ). Proof: we exhibit an input that forces this much space. Suppose that at the middle of x no symbol has yet occurred θN times. At this moment the algorithm must remember the counter state of every symbol; otherwise it cannot distinguish two inputs that differ for some symbol: one in which the symbol ends up in I, and one in which it misses by a single occurrence (recall equivalence classes).

The proof – cont'd. It seems we must remember all n counters, but we can do slightly better: enumerate the set of all possible counter combinations and remember only the number of the current combination. If P is this set of distinguishable combinations, saving the current number takes log|P| bits. We now derive a lower bound for |P|.

|P| Lower Bound. (The derivation on this slide is a formula not captured in the transcript.)

Summary. We presented a simple two-pass algorithm for finding the frequent elements in streams and bags, using O(1/θ) space and O(1) worst-case time per symbol occurrence.