What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by Tal Sterenzy.

Similar presentations
1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.
Incremental Linear Programming Linear programming involves finding a solution to the constraints, one that maximizes the given linear function of variables.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Finding Frequent Items in Data Streams Moses Charikar Princeton Un., Google Inc. Kevin Chen UC Berkeley, Google Inc. Martin Farach-Colton Rutgers Un., Google.
Approximation Algorithms Chapter 14: Rounding Applied to Set Cover.
Fast Algorithms For Hierarchical Range Histogram Constructions
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Distribution and Revocation of Cryptographic Keys in Sensor Networks Amrinder Singh Dept. of Computer Science Virginia Tech.
Hash Tables Hash function h: search key → [0…B-1]. Buckets are blocks, numbered [0…B-1]. Big idea: If a record with search key K exists, then it must be.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Bounds on Code Length Theorem: Let l∗1, l∗2, …, l∗m be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L∗ be.
Yoshiharu Ishikawa (Nagoya University) Yoji Machida (University of Tsukuba) Hiroyuki Kitagawa (University of Tsukuba) A Dynamic Mobility Histogram Construction.
Noga Alon Institute for Advanced Study and Tel Aviv University
Adaptive Load Shedding for Mining Frequent Patterns from Data Streams Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong (DaWaK 2006) 2008/3/191Yi-Chun Chen.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006
Tracking most frequent items dynamically. Article by G.Cormode and S.Muthukrishnan. Presented by Simon Kamenkovich.
Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
UMass Lowell Computer Science Graduate Analysis of Algorithms Prof. Karen Daniels Spring, 2009 Lecture 3 Tuesday, 2/10/09 Amortized Analysis.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Advanced Algorithms for Massive Datasets Basics of Hashing.
What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.
EXPANDER GRAPHS Properties & Applications. Things to cover ! Definitions Properties Combinatorial, Spectral properties Constructions “Explicit” constructions.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
Lecture 10: Search Structures and Hashing
Student Seminar – Fall 2012 A Simple Algorithm for Finding Frequent Elements in Streams and Bags RICHARD M. KARP, SCOTT SHENKER and CHRISTOS H. PAPADIMITRIOU.
Randomized Algorithms - Treaps
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Ragesh Jaiswal Indian Institute of Technology Delhi Threshold Direct Product Theorems: a survey.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
By Graham Cormode and Marios Hadjieleftheriou Presented by Ankur Agrawal ( )
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
3. Counting Permutations Combinations Pigeonhole principle Elements of Probability Recurrence Relations.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
10/10/2012ISC239 Isabelle Bichindaritz1 Physical Database Design.
A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.
Alternative Wide Block Encryption For Discussion Only.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
De novo discovery of mutated driver pathways in cancer Discussion leader: Matthew Bernstein Scribe: Kun-Chieh Wang Computational Network Biology BMI 826/Computer.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Clustering Data Streams A presentation by George Toderici.
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
All Your Queries are Belong to Us: The Power of File-Injection Attacks on Searchable Encryption Yupeng Zhang, Jonathan Katz, Charalampos Papamanthou University.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
Frequency Counts over Data Streams
Random Testing: Theoretical Results and Practical Implications IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 2012 Andrea Arcuri, Member, IEEE, Muhammad.
The Variable-Increment Counting Bloom Filter
RE-Tree: An Efficient Index Structure for Regular Expressions
Streaming & sampling.
Probabilistic Robotics
Lecture 18: Uniformity Testing Monotonicity Testing
Randomized Algorithms CS648
Indexing and Hashing Basic Concepts Ordered Indices
Approximate Frequency Counts over Data Streams
Dynamic Graph Algorithms
By: Ran Ben Basat, Technion, Israel
Dynamically Maintaining Frequent Items Over A Data Stream
Presentation transcript:

What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by Tal Sterenzy

Motivation
A basic statistic on a database relation is which items are hot, i.e., occur frequently.
The goal is to dynamically maintain the hot items in the presence of insert and delete transactions.
Examples:
DBMS – keep statistics to improve performance.
Telecommunication networks – network connections start and end over time.

Overview
Definitions
Prior work
Algorithm description & analysis
Experimental results
Summary

Formal definition
Sequence of n transactions on m items [1…m].
n_i(t) – net occurrence of item i at time t: the number of times it has been inserted minus the number of times it has been deleted.
f_i(t) = n_i(t) / Σ_j n_j(t) – current frequency of item i at time t.
f*(t) = max_i f_i(t) – frequency of the most frequent item at time t.
The k most frequent items at time t are those with the k largest f_i(t).

Finding k hot items
k is a parameter.
Item i is a hot item if f_i(t) > 1/(k+1).
Hot items are frequent items that account for a significant fraction of the entire dataset.
There can be at most k hot items, and there can be none.
Assume the basic integrity constraint n_i(t) ≥ 0.
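For example, with k = 9 the threshold is 1/10: an item is hot exactly when it accounts for more than 10% of all current net transactions.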

Our algorithm
A highly efficient, randomized algorithm for maintaining hot items in a dynamically changing database.
It monitors the changes to the data distribution and maintains O(k log k log m) counters.
When queried, it finds all hot items in time O(k log k log m), with probability 1-δ.
There is no need to scan the underlying relation.

Small tail assumption
Restriction: let f_1 ≥ f_2 ≥ … ≥ f_m be the frequencies of the items, in sorted order.
A set of frequencies has a small tail if f_{k+1} + … + f_m < 1/(k+1).
If there are k hot items, then the small tail property holds.
If the small tail property holds, then some of the top k items might not be hot.
We shall analyze our solution in the presence and absence of this small tail property (STP).
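For instance, with k = 2 and frequencies (0.40, 0.35, 0.10, 0.10, 0.05), the tail sums to 0.25 < 1/3, so the STP holds; both top items exceed 1/3 and are hot.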

Prior work – why is it not adaptable?
All these algorithms keep counters that are incremented when an item is observed, and decremented or reallocated under certain circumstances.
They cannot be directly adapted to handle both insertions and deletions: after an item is inserted and then deleted, the state of the algorithm can differ from the state it would have reached had the item never appeared.
Work on dynamic data is sparse, and provides no guarantees for the fully dynamic case with deletions.

Our algorithm - idea
Do not keep counters for individual items, but rather for subsets of items.
Ideas from group testing: design a number of tests, each of which groups together some of the m items, in order to find up to k items which test positive.
Here: find the k items that are hot.
Minimize the number of tests, where each group consists of a subset of the items.

General procedure
For each transaction on item i, determine which subsets it is included in: S(i).
Each subset has a counter:
For an insertion: increment all the counters of S(i).
For a deletion: decrement all the counters of S(i).
The test is: does the counter exceed a threshold?
Identifying the hot items is done by combining test results from several groups.

The challenge is choosing the subsets
Bounding the number of required subsets.
Finding a concise representation of the groups.
Giving an efficient way to go from the results of the tests to the set of hot items.
Let's start with a simple case: k = 1 (frequency > 1/2) → a deterministic algorithm for maintaining the majority item.

Finding majority item
For insertions only, a majority item can be maintained in constant time and space.
To also handle deletions, keep log m + 1 counters:
One counter c[0] counts the items "alive": it is incremented on insert and decremented on delete.
The rest are labeled c[1] … c[log m], one per group.
Group j represents bit j in the binary representation of the item.
Each group consists of half of the items: those whose j-th bit is 1.

Finding majority item – cont.
bit(i,j) – reports the value of the j-th bit of the binary representation of i.
gt(i,j) – returns 1 if i > j, 0 otherwise.
Scheme:
Insertion of item i: increment c[0] and each counter c[j] such that bit(i,j) = 1, in time O(log m).
Deletion of item i: decrement c[0] and each counter c[j] such that bit(i,j) = 1, in time O(log m).
Query: if there is a majority item, it is given by Σ_{j=1..log m} 2^(j-1) · gt(c[j], c[0]/2), computed in time O(log m).

Finding majority item – cont.
Theorem: the algorithm finds a majority item, if there is one, in time O(log m) per operation.
The state of the data structure is equivalent whether there are I insertions and D deletions, or c = I - D insertions only.
In the case of insertions only, the majority is found.

UpdateCounters procedure
int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then diff = +1 else diff = -1
  c[0] = c[0] + diff
  for j = 1 to log m do
    if (transtype = ins) then c[j] = c[j] + bit(i,j)
    else c[j] = c[j] - bit(i,j)

FindMajority procedure
FindMajority(c[0 … log m])
  position = 0; t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then position = position + t
    t = 2 * t
  return position
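To make the scheme concrete, here is a minimal runnable sketch of the two procedures in Python (my own illustration, not the authors' code; the class name and the bit-width handling are choices of this sketch):

import math

class MajorityTracker:
    # Maintains a majority candidate under inserts and deletes using
    # log m + 1 counters (UpdateCounters / FindMajority from the slides).
    def __init__(self, m):
        self.bits = max(1, math.ceil(math.log2(m + 1)))
        self.c = [0] * (self.bits + 1)   # c[0] counts live items

    def update(self, i, insert=True):
        diff = 1 if insert else -1
        self.c[0] += diff
        for j in range(1, self.bits + 1):
            if (i >> (j - 1)) & 1:       # bit(i, j)
                self.c[j] += diff

    def find_majority(self):
        # Bit j of the majority item is 1 iff c[j] holds a strict majority.
        position, t = 0, 1
        for j in range(1, self.bits + 1):
            if self.c[j] > self.c[0] / 2:
                position += t
            t *= 2
        return position

# Usage: 5 is inserted twice and 3 once, so 5 is the majority item.
t = MajorityTracker(m=8)
t.update(5); t.update(5); t.update(3)
print(t.find_majority())  # prints 5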

Randomized constructions for finding hot items
Observation: if we select a subset with exactly one hot item, applying the majority algorithm to it will identify the hot item.
Definition: a subset is a good subset if it contains exactly one hot item.

How many subsets do we need?
Theorem: picking O(k log k) subsets, each formed by drawing m/k items uniformly from [1…m], means that with constant probability we have included k good subsets S_1 … S_k, each containing exactly one of the k hot items, a different one for each.
Proof idea: p – the probability that a given subset contains exactly one hot item – is constant; O(k log k) subsets then guarantee, with constant probability, that we have a good subset for each hot item (coupon collector's problem).

Coupon collector problem
p is the probability that a coupon is good.
X – the number of trials required to collect at least one of each type of coupon.
Epoch i begins after the i-th success and ends with the (i+1)-th success.
X_i – the number of trials in the i-th epoch.
X_i is distributed geometrically with p_i = p(k-i)/k, so E[X] = Σ_i k/(p(k-i)) = (k/p)·H_k = O(k log k) trials suffice in expectation.

Defining the groups with universal hash functions
The groups are chosen in a pseudo-random way using universal hash functions:
Fix a prime P > 2k; a and b are drawn uniformly from [0 … P-1].
Then set: h_{a,b}(x) = ((a·x + b) mod P) mod 2k.
Fact: over the choices of a and b, for x ≠ y: Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ 1/(2k).
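A small sketch of this hash family in Python (the default prime P = 2^31 - 1 is an assumption of this sketch, chosen large enough to exceed typical item universes as well as 2k):

import random

def make_group_hash(k, P=2**31 - 1):
    # Draw h(x) = ((a*x + b) mod P) mod 2k from the universal family.
    a = random.randrange(P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % (2 * k)

h = make_group_hash(k=10)
print(h(42))  # group index in [0, 2k) for item 42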

Choosing and updating the subsets
We choose T = log(k/δ) pairs of values (a,b), which creates 2kT = 2k·log(k/δ) subsets of items.
Processing an item i means:
Determine which of the T sets i belongs to.
For each one: update log m counters based on the bit representation of i.
If the set is good, this gives us the hot item.

Space requirements
Storing the values of a and b, each of size O(m): O(log(k/δ)·log m) space.
Number of counters: 2k·log(k/δ)·(log m + 1), i.e., log(k/δ) choices of a,b × 2k subsets × (log m + 1) counters.
Total space: O(k·log(k/δ)·log m).

Probability of each hot item being in at least one good subset is at least 1-δ
Consider one hot item i: in each of the T repetitions we put it in one of 2k groups.
The expected total frequency f of the other items in its group is at most (1/2k)·Σ_{j≠i} f_j ≤ (1 - f_i)/(2k) < 1/(2(k+1)).
If f < 1/(k+1), the majority will be found → success.
If f > 1/(k+1), the majority can't be found → failure.
Probability of failure < 1/2 (by the Markov inequality).
Probability of failing in all T repetitions < 2^(-T) = δ/k.
Probability of any of the k hot items failing is at most δ (union bound).

Detecting good subsets
Given a subset and its associated counters, it is possible to detect deterministically whether the subset is a good subset.
Proof: a subset can fail in two ways:
No hot items (assuming STP): then the group's total count stays below the threshold n/(k+1), so the test rejects it.
More than one hot item: there will be some bit j on which two hot items differ, so both c[j] > n/(k+1) and c[0] - c[j] > n/(k+1), which is detected.
→ a good subset is determined.

ProcessItem procedure
Initialize c[0 … 2Tk][0 … log m]
Draw a[1 … T], b[1 … T]; c = 0

ProcessItem(i, transtype, T, k)
  if (transtype = ins) then c = c + 1
  else c = c - 1
  for x = 1 to T do
    index = 2k(x-1) + (((i·a[x] + b[x]) mod P) mod 2k)
    UpdateCounters(i, transtype, c[index])

GroupTest procedure
GroupTest(T, k, b)
  for i = 1 to 2Tk do
    if c[i][0] > c·b then
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > c·b and c[i][0] - c[i][j] > c·b) then
          skip to next i
        if c[i][j] > c·b then position = position + t
        t = 2 * t
      output position
(Here c is the global net transaction count maintained by ProcessItem, and b is the threshold fraction, e.g. 1/(k+1).)
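Putting the pieces together, the following is a compact runnable sketch of the whole structure in Python (an illustration of the scheme described above, not the authors' implementation; the prime P, the flat group layout, and all identifiers are choices of this sketch):

import math
import random

class HotItemsSketch:
    # Group-testing structure: T hash functions, 2k groups per function,
    # and log m + 1 counters per group, as in the slides above.
    def __init__(self, m, k, delta, P=2**31 - 1):
        self.k = k
        self.P = P  # a convenient large prime (assumed > m and > 2k)
        self.T = max(1, math.ceil(math.log2(k / delta)))
        self.bits = max(1, math.ceil(math.log2(m + 1)))
        self.a = [random.randrange(P) for _ in range(self.T)]
        self.b = [random.randrange(P) for _ in range(self.T)]
        # c[g][0] = total count of group g; c[g][j] = count of its items
        # whose j-th bit is 1.
        self.c = [[0] * (self.bits + 1) for _ in range(2 * k * self.T)]
        self.n = 0  # net number of live transactions

    def _groups(self, i):
        # The T group indices that item i belongs to.
        for x in range(self.T):
            g = ((self.a[x] * i + self.b[x]) % self.P) % (2 * self.k)
            yield 2 * self.k * x + g

    def process(self, i, insert=True):
        # ProcessItem: assumes the integrity constraint n_i(t) >= 0.
        diff = 1 if insert else -1
        self.n += diff
        for g in self._groups(i):
            row = self.c[g]
            row[0] += diff
            for j in range(1, self.bits + 1):
                if (i >> (j - 1)) & 1:
                    row[j] += diff

    def group_test(self, k_prime=None):
        # GroupTest with threshold fraction 1/(k'+1); k' defaults to k.
        kp = self.k if k_prime is None else k_prime
        thresh = self.n / (kp + 1)
        hot = set()
        for row in self.c:
            if row[0] <= thresh:
                continue            # no hot candidate in this group
            position, good = 0, True
            for j in range(1, self.bits + 1):
                if row[j] > thresh and row[0] - row[j] > thresh:
                    good = False    # both halves heavy: >1 hot item, skip
                    break
                if row[j] > thresh:
                    position += 1 << (j - 1)
            if good and position:   # ignore the impossible item 0
                hot.add(position)
        return hot

# Usage: item 7 stays hot after a burst of inserts and some deletes.
s = HotItemsSketch(m=100, k=3, delta=0.05)
for _ in range(60):
    s.process(7)
for x in (12, 25, 33):
    s.process(x)
for _ in range(10):
    s.process(7, insert=False)
print(s.group_test())  # with probability >= 1 - delta, contains 7

Note that group_test can be re-run with a smaller parameter k' to use a higher threshold 1/(k'+1) on the same counters, matching the "Algorithm properties" slide below.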

Algorithm correctness
With probability at least 1-δ, calling the GroupTest(log(k/δ), k, 1/(k+1)) procedure finds all hot items.
Time to process an item: O(log(k/δ)·log m).
Time to answer a query for all hot items: O(k·log(k/δ)·log m).
With or without the STP, we are still guaranteed to include all hot items with high probability.
Without the STP, we might also output infrequent items.

Algorithm correctness – cont.
When will an infrequent item be output? (no STP)
A set with 2 or more hot items will be detected.
A set with one hot item will never cause a wrong output: even if some split not containing the hot item exceeds the threshold, this will be detected.
Only a set with no hot item in which, for every one of the log m splits, exactly one half exceeds the threshold will make the algorithm fail and output an infrequent item.

Algorithm properties
The set of counters created with T = log(k/δ) can be used to find hot items with parameter k' for any k' < k, with probability of success 1-δ, by calling GroupTest(log(k/δ), k, 1/(k'+1)).
Proof: as in the probability argument above for k hot items, the expected frequency of the other items landing in a hot item's group is at most 1/(2k) ≤ 1/(2(k'+1)), so the same Markov bound of 1/2 per repetition still applies.

Experiments
The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms.
The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists.
Recall is the proportion of the hot items found by the method out of the total number of hot items.
Precision is the proportion of truly hot items among all items output by the algorithm.

Synthetic data (Recall) Zipf parameter for the hot items: 0 – distributed uniformly, 3 – highly skewed.

Synthetic data (Precision) Zipf parameter for the hot items: 0 – distributed uniformly, 3 – highly skewed.

Real data (Recall) Real data was obtained from one of AT&T's networks for part of a day.

Real data (Precision) Real data has no guarantee of having the small tail property.

Varying frequency at query time The data structure was built for queries at the 0.5% level, but was then tested with query thresholds ranging from 10% to 0.02%.

Conclusions and extensions
A new method that can cope with dynamic datasets is proposed.
It would be interesting to use the algorithm to compare frequency differences between different datasets.
Can we find combinatorial designs that achieve the same properties, but with a deterministic construction, for maintaining hot items?