Tracking most frequent items dynamically. Article by G. Cormode and S. Muthukrishnan. Presented by Simon Kamenkovich.

Similar presentations
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Hashing.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
Embedding the Ulam metric into ℓ1. For the course "Advanced Data Structures". Αντώνης Αχιλλέως.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Finding Frequent Items in Data Streams. Moses Charikar (Princeton Un., Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers Un., Google).
Approximation Algorithms Chapter 14: Rounding Applied to Set Cover.
Fast Algorithms For Hierarchical Range Histogram Constructions
CS 410/510 Data Streams, Lecture 16: Data-Stream Sampling: Basic Techniques and Results. Kristin Tufte, David Maier.
©Silberschatz, Korth and Sudarshan, Database System Concepts. Chapter 12: Indexing and Hashing: Basic Concepts, Ordered Indices, B+-Tree Index Files, B-Tree.
Mining Data Streams.
Topological Sort and Hashing
Noga Alon Institute for Advanced Study and Tel Aviv University
Priority Queues And the amazing binary heap Chapter 20 in DS&PS Chapter 6 in DS&AA.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
11. Hash Tables. Hsu, Lih-Hsing, Computer Theory Lab. Chapter 11: Direct-address tables. Direct addressing is a simple technique that works well.
1 Distributed Streams Algorithms for Sliding Windows Phillip B. Gibbons, Srikanta Tirthapura.
1 How to Summarize the Universe: Dynamic Maintenance of Quantiles By: Anna C. Gilbert Yannis Kotidis S. Muthukrishnan Martin J. Strauss.
Lower and Upper Bounds on Obtaining History Independence Niv Buchbinder and Erez Petrank Technion, Israel.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Advanced Algorithms for Massive Datasets Basics of Hashing.
What's Hot and What's Not: Tracking Most Frequent Items Dynamically. G. Cormode and S. Muthukrishnan, Rutgers University. ACM Principles of Database Systems.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by.
Student Seminar – Fall 2012 A Simple Algorithm for Finding Frequent Elements in Streams and Bags RICHARD M. KARP, SCOTT SHENKER and CHRISTOS H. PAPADIMITRIOU.
1 By: MOSES CHARIKAR, CHANDRA CHEKURI, TOMAS FEDER, AND RAJEEV MOTWANI Presented By: Sarah Hegab.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
By Graham Cormode and Marios Hadjieleftheriou. Presented by Ankur Agrawal.
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
David Luebke, CS 332: Algorithms: Skip Lists, Hash Tables.
Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.
PODC: Distributed Computation of the Mode. Fabian Kuhn, Thomas Locher (ETH Zurich, Switzerland), Stefan Schmid (TU Munich, Germany).
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
1.1 CS220 Database Systems Indexing: Hashing Slides courtesy G. Kollios Boston University via UC Berkeley.
Hashing Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005.
Tirgul 11 Notes Hash tables –reminder –examples –some new material.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS ACKNOWLEDGEMENT: THESE SLIDES ARE ADAPTED FROM SLIDES PROVIDED WITH DATA STRUCTURES AND ALGORITHMS IN C++,
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
Output Perturbation with Query Relaxation By: XIAO Xiaokui and TAO Yufei Presenter: CUI Yingjie.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Mining Data Streams (Part 1)
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
Frequency Counts over Data Streams
Data Structures Binomial Heaps Fibonacci Heaps Haim Kaplan & Uri Zwick
The Variable-Increment Counting Bloom Filter
Finding Frequent Items in Data Streams
Streaming & sampling.
Probabilistic Robotics
Randomized Algorithms CS648
Indexing and Hashing Basic Concepts Ordered Indices
Approximate Frequency Counts over Data Streams
Range-Efficient Computation of F0 over Massive Data Streams
Dynamic Graph Algorithms
Heavy Hitters in Streams and Sliding Windows
By: Ran Ben Basat, Technion, Israel
Approximation and Load Shedding Sampling Methods
Lu Tang , Qun Huang, Patrick P. C. Lee
Dynamically Maintaining Frequent Items Over A Data Stream
Presentation transcript:


Motivation Most DB management systems maintain "hot items" statistics. Hot items are used as simple outliers in data mining, for anomaly detection in network applications, and in financial market data. The proposed algorithm handles deletions as well as insertions, and gives better guarantees than other existing methods.

Introduction Given: a stream of n integers in the range [1…m]. An item is hot if its frequency exceeds n/(k+1). So there may be up to k hot items, and there may be none.

Preliminaries If we are allowed O(m) space, then a simple heap will process each insert and delete in O(log m) time and find all the hot items in O(k log k). Lemma 1: Any algorithm which guarantees to find all and only items whose frequency is greater than n/(k+1) must store Ω(m) bits.

Small Tail Property

Prior works

Algorithm | Type | Time per item | Space
Misra-Gries | Deterministic | O(log k) amortized | O(k)
Frequent | Randomized | O(1) expected | O(k)
Lossy Counting | Deterministic | O(log(n/k)) | Ω(k log(n/k))
Charikar et al. | Randomized | | Ω((k/ε²) log n)
Gilbert et al. (quantiles) | Randomized | | Ω(k² log² n)

Base method Theorem 1: Calling DivideAndConquer(1, m, n/(k+1)) will output all and only hot items. A total of O(k log(m/k)) calls will be made to the oracle.

DivideAndConquer(l, r, thresh)
  if oracle(l, r) > thresh
    if (l = r) then output(l)
    else
      mid = (l + r)/2
      DivideAndConquer(l, mid, thresh)
      DivideAndConquer(mid + 1, r, thresh)
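With an exact range-count oracle (used here purely for illustration; the whole point of the paper is to approximate the oracle in small space), the recursion can be sketched as:

```python
def divide_and_conquer(oracle, l, r, thresh, out):
    """Recurse into [l, r] only while the range's total frequency
    exceeds thresh; single items that survive are the hot items."""
    if oracle(l, r) > thresh:
        if l == r:
            out.append(l)
        else:
            mid = (l + r) // 2
            divide_and_conquer(oracle, l, mid, thresh, out)
            divide_and_conquer(oracle, mid + 1, r, thresh, out)

# Toy oracle over exact counts for items 1..8; items 2 and 7 are hot.
counts = {1: 1, 2: 9, 3: 1, 4: 1, 5: 1, 6: 1, 7: 9, 8: 1}
oracle = lambda l, r: sum(counts.get(i, 0) for i in range(l, r + 1))
n, k = sum(counts.values()), 2       # n = 24, threshold n/(k+1) = 8
hot = []
divide_and_conquer(oracle, 1, 8, n / (k + 1), hot)
print(hot)                           # → [2, 7]
```

Each hot item contributes one root-to-leaf path of oracle calls, which is where the O(k log(m/k)) bound comes from.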

Oracle design
N. Alon et al.: requires O(km) space
Gilbert et al. (Random Subset Sums): requires O(k² log m log(k/δ)) space
Charikar et al.: requires O(k log m log(k/δ)) space
Cormode-Muthukrishnan: requires O(k log m log(k/δ)) space

Group Testing Idea: design a number of tests, each of which groups together a subset of the m items, in order to find up to k items which test positive. A majority item can be found, in the insertions-only case, in O(1) time and space by the algorithm of Boyer and Moore. General procedure: for each transaction on item i we determine the subsets S(i) it is included in, and increment or decrement the counters associated with those subsets.
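The Boyer-Moore majority-vote step mentioned above can be sketched in a few lines (insert-only; a single candidate and counter):

```python
def boyer_moore_majority(stream):
    """One-pass majority candidate with a single counter (insert-only).
    If a strict majority item exists, it ends up as the candidate."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate

print(boyer_moore_majority(['a', 'b', 'a', 'c', 'a', 'a']))  # → a
```

Note the guarantee is one-sided: if no majority exists the returned candidate is arbitrary, which is why the group-testing buckets below still need their counters checked before an item is output.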

Deterministic algorithm Each test includes half of the range [1..m], corresponding to the binary representations of the values.

int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then c[0] = c[0] + 1 else c[0] = c[0] − 1
  for j = 1 to log m do
    if (transtype = ins) then c[j] = c[j] + bit(j, i)
    else c[j] = c[j] − bit(j, i)

Deterministic algorithm (cont) Theorem 2: The above algorithm finds a majority item, if there is one, with O(log m) time per operation.

FindMajority(c[0 … log m])
  position = 0, t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then position = position + t
    t = 2 * t
  return position
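A runnable sketch of the two procedures together (my own class wrapper; the counter layout is as on the slides, with c[0] the total count and c[j] counting items whose j-th bit is set):

```python
class BitMajority:
    """log m + 1 counters supporting inserts and deletes; the majority
    item, if one exists, is reconstructed bit by bit."""
    def __init__(self, m):
        self.bits = max(1, m.bit_length())
        self.c = [0] * (self.bits + 1)

    def update(self, i, delta):              # delta = +1 insert, -1 delete
        self.c[0] += delta
        for j in range(1, self.bits + 1):
            if (i >> (j - 1)) & 1:           # j-th bit of i is set
                self.c[j] += delta

    def find_majority(self):
        """A majority item's j-th bit is set iff more than half of the
        current items have that bit set."""
        pos, t = 0, 1
        for j in range(1, self.bits + 1):
            if self.c[j] > self.c[0] / 2:
                pos += t
            t *= 2
        return pos

bm = BitMajority(m=16)
for x in [5, 9, 5, 5, 7, 5]:
    bm.update(x, +1)
bm.update(7, -1)                             # a deletion is handled too
print(bm.find_majority())                    # → 5 (4 of 5 remaining items)
```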

Randomized Algorithm

Coupon Collector Problem X: the number of trials required to collect at least one coupon of each type. Epoch i begins after the i-th success and ends with the (i+1)-th success. X_i: the number of trials in the i-th epoch. X_i is geometrically distributed with p_i = p(k−i)/k, where p is the probability that a coupon is good.
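Summing the expectations of the geometric epochs gives the standard coupon-collector bound (a routine derivation, not spelled out on the slides):

```latex
E[X] = \sum_{i=0}^{k-1} E[X_i]
     = \sum_{i=0}^{k-1} \frac{1}{p_i}
     = \frac{k}{p} \sum_{i=0}^{k-1} \frac{1}{k-i}
     = \frac{k}{p} H_k
     = O\!\left(\frac{k \log k}{p}\right)
```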

Using hash functions We don't store the sets explicitly (that would take O(m log k) space); instead we choose the sets in a pseudo-random fashion using hash functions: fix a prime P > 2k, and take a and b uniformly from [0 … P−1]. Choose T = log(k/δ) pairs (a, b); each pair defines 2k sets.

Data Structure The structure consists of log(k/δ) groups; each group contains 2k subsets, and each subset keeps log m counters.

ProcessItem

Initialize c[0 … 2Tk][0 … log m]; draw a[1 … T], b[1 … T]; c = 0 (c is the total number of current items)

ProcessItem(i, transtype, T, k)
  if (transtype = ins) then c = c + 1 else c = c − 1
  for x = 1 to T do
    index = 2k(x−1) + ((i*a[x] + b[x] mod P) mod 2k)   // they had 2(x−1)
    UpdateCounters(i, transtype, c[index])

Space used by the algorithm is O(k log(k/δ) log m).
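The bucket-index computation can be checked in isolation (P, a, b below are arbitrary illustrative choices subject to P being a prime > 2k, not values from the paper):

```python
P = 47                       # any prime > 2k works; here k = 4
k = 4
a, b = 23, 11                # in the algorithm these are drawn uniformly from [0, P-1]

def bucket(i, x):
    """Index of item i within group x: groups are consecutive blocks
    of 2k subsets, and the hash picks one of the 2k per group."""
    return 2 * k * (x - 1) + ((i * a + b) % P) % (2 * k)

# Every item lands in exactly one of the 2k buckets of each group.
for i in range(1, 20):
    assert 0 <= bucket(i, 1) - 0 < 2 * k
print(bucket(7, 1), bucket(7, 2))   # → 7 15
```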

Lemma Lemma 2: The probability of each hot item being in at least one good set is at least 1 − δ. Proof sketch: In each of the T repetitions we put a hot item into one of 2k buckets. If the other items in its bucket account for less than a 1/(k+1) fraction of the total count, the hot item is a majority in its bucket and can be found; otherwise it may not be. By the Markov inequality, the probability of failure in a single repetition is less than 1/2, so the probability that a given hot item fails in all T = log(k/δ) repetitions is at most δ/k, and the probability of any hot item failing is at most δ.

GroupTest

GroupTest(T, k, b)
  for i = 1 to 2Tk do                        // they had T here
    if c[i][0] > cb then
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > cb and c[i][0] − c[i][j] > cb) or
           (c[i][j] < cb and c[i][0] − c[i][j] < cb) then
          skip to next i                     // the bucket cannot be decoded
        if c[i][j] > cb then position = position + t
        t = 2 * t
      output position

Here cb is the threshold: the total count c times the fraction b.
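A self-contained sketch of the whole structure, combining ProcessItem and GroupTest (an assumed reimplementation: the class name, the choice of P, and the test stream are mine, not the authors' code):

```python
import random

class HotItems:
    """Randomized group-testing structure: T groups of 2k buckets,
    each bucket keeping log m + 1 counters."""
    def __init__(self, m, k, T, P=2**31 - 1, seed=0):
        rng = random.Random(seed)
        self.m, self.k, self.T, self.P = m, k, T, P
        self.bits = m.bit_length()
        self.a = [rng.randrange(1, P) for _ in range(T)]
        self.b = [rng.randrange(P) for _ in range(T)]
        self.n = 0                                  # total current count
        self.c = [[0] * (self.bits + 1) for _ in range(2 * T * k)]

    def process(self, i, delta):                    # delta = +1 / -1
        self.n += delta
        for x in range(self.T):
            idx = 2 * self.k * x + ((i * self.a[x] + self.b[x]) % self.P) % (2 * self.k)
            row = self.c[idx]
            row[0] += delta
            for j in range(1, self.bits + 1):
                if (i >> (j - 1)) & 1:
                    row[j] += delta

    def group_test(self, b=None):
        b = b if b is not None else 1.0 / (self.k + 1)
        thresh = self.n * b
        out = set()
        for row in self.c:
            if row[0] <= thresh:
                continue
            pos, t, ok = 0, 1, True
            for j in range(1, self.bits + 1):
                hi, lo = row[j] > thresh, row[0] - row[j] > thresh
                if hi == lo:                        # both or neither: not decodable
                    ok = False
                    break
                if hi:
                    pos += t
                t *= 2
            if ok:
                out.add(pos)
        return out

h = HotItems(m=1000, k=3, T=4)
for _ in range(100):
    h.process(42, +1)                               # one genuinely hot item
for i in range(1, 61):
    h.process(i, +1)                                # light background traffic
print(sorted(h.group_test()))
```

With n = 160 and k = 3 the threshold is 40, so item 42 (count 100) is the only hot item and should be recovered from its buckets.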

Algorithm properties Theorem 4: With probability at least 1 − δ, calling the GroupTest(log(k/δ), k, 1/(k+1)) procedure finds all the hot items using O(k log(k/δ) log m) space. The time for an update is O(log(k/δ) log m) and the time to list all hot items is O(k log(k/δ) log m). Corollary 1: If we have the small tail property, then we will output no items which are not hot. Proof: a bucket is decoded only when it contains a majority item, and under the small tail property only a hot item can be a majority in its bucket.

Algorithm properties (cont) Lemma 4: The output of the algorithm is the same for any reordering of the input data. Corollary 2: The set of counters created with T = log(k/δ) can be used to find hot items with parameter k' for any k' < k, with probability of success 1 − δ, by calling GroupTest(log(k/δ), k, 1/(k'+1)). Proof: follows from Lemma 2.

Experiments Definitions: The recall is the proportion of the hot items found by the method out of the total number of hot items. The precision is the proportion of the items identified by the algorithm that are actually hot, out of all output items. The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms. The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists.
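The two metrics above reduce to a few lines (a generic sketch over sets of item ids, not the authors' evaluation harness):

```python
def recall_precision(reported, truth):
    """Recall = fraction of true hot items found;
    precision = fraction of reported items that are truly hot."""
    tp = len(reported & truth)
    recall = tp / len(truth) if truth else 1.0
    precision = tp / len(reported) if reported else 1.0
    return recall, precision

# 2 of the 4 true hot items found; 2 of the 3 reported items are hot.
r, p = recall_precision(reported={1, 2, 5}, truth={1, 2, 3, 4})
print(r, round(p, 3))   # → 0.5 0.667
```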

Synthetic data (Recall) Zipf parameter for the hot items: 0 means uniformly distributed, 3 means highly skewed.

Synthetic data (Precision) GroupTesting required more memory and took longer to process each item.

Real Data (Recall) The real data was obtained from one of AT&T's networks for part of a day.

Real Data (Precision) Real data has no guarantee of having the small tail property….

Varying frequency at query time Other algorithms are designed around a fixed frequency threshold supplied in advance. The data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% down to 0.02%.

Conclusions and extensions A new method which can cope with dynamic datasets is proposed. It is interesting to try to use the algorithm to compare the differences in frequencies between different datasets. Can we find a combinatorial design that achieves the same properties, but with a deterministic construction, for maintaining hot items?

FIN