The Misra Gries Algorithm. Motivation Espionage The rest we monitor.

Slides:



Advertisements
Similar presentations
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.
Advertisements

Optimal Space Lower Bounds for All Frequency Moments David Woodruff MIT
A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.
Order Statistics Sorted
Ariel Rosenfeld Network Traffic Engineering. Call Record Analysis. Sensor Data Analysis. Medical, Financial Monitoring. Etc,
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
Randomized Algorithms Randomized Algorithms CS648 Lecture 2 Randomized Algorithm for Approximate Median Elementary Probability theory Lecture 2 Randomized.
An Improved Construction for Counting Bloom Filters Flavio Bonomi Michael Mitzenmacher Rina Panigrahy Sushil Singh George Varghese Presented by: Sailesh.
Randomized Algorithms Randomized Algorithms CS648 Lecture 6 Reviewing the last 3 lectures Application of Fingerprinting Techniques 1-dimensional Pattern.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006
Heavy hitter computation over data stream
CPSC 411, Fall 2008: Set 6 1 CPSC 411 Design and Analysis of Algorithms Set 6: Amortized Analysis Prof. Jennifer Welch Fall 2008.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
1 Distributed Streams Algorithms for Sliding Windows Phillip B. Gibbons, Srikanta Tirthapura.
Property Matching and Weighted Matching Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang.
Dynamic Text and Static Pattern Matching Amihood Amir Gad M. Landau Moshe Lewenstein Dina Sokol Bar-Ilan University.
Sublinear time algorithms Ronitt Rubinfeld Blavatnik School of Computer Science Tel Aviv University TexPoint fonts used in EMF. Read the TexPoint manual.
What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University.
Exact and Approximate Pattern in the Streaming Model Presented by - Tanushree Mitra Benny Porat and Ely Porat 2009 FOCS.
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
Student Seminar – Fall 2012 A Simple Algorithm for Finding Frequent Elements in Streams and Bags RICHARD M. KARP, SCOTT SHENKER and CHRISTOS H. PAPADIMITRIOU.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
06/10/2015Applied Algorithmics - week81 Combinatorial Group Testing  Much of the current effort of the Human Genome Project involves the screening of.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
1 Efficient Computation of Frequent and Top-k Elements in Data Streams.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
COT 5405 Spring 2009 Discussion Session 3. Introduction Our weekly discussion class are like organized TA office hours. Here we encourage students to.
CSC 211 Data Structures Lecture 13
PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Real time pattern matching Porat Benny Porat Ely Bar-Ilan University.
Jennifer Rexford Princeton University MW 11:00am-12:20pm Measurement COS 597E: Software Defined Networking.
06/12/2015Applied Algorithmics - week41 Non-periodicity and witnesses  Periodicity - continued If string w=w[0..n-1] has periodicity p if w[i]=w[i+p],
Facility Location in Dynamic Geometric Data Streams Christiane Lammersen Christian Sohler.
Data Stream Algorithms Lower Bounds Graham Cormode
@ Carnegie Mellon Databases 1 Finding Frequent Items in Distributed Data Streams Amit Manjhi V. Shkapenyuk, K. Dhamdhere, C. Olston Carnegie Mellon University.
June 16, 2004 PODS 1 Approximate Counts and Quantiles over Sliding Windows Arvind Arasu, Gurmeet Singh Manku Stanford University.
Communication Complexity Guy Feigenblat Based on lecture by Dr. Ely Porat Some slides where adapted from various sources Complexity course Computer science.
Lower bounds on data stream computations Seminar in Communication Complexity By Michael Umansky Instructor: Ronitt Rubinfeld.
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Rabin & Karp Algorithm. Rabin-Karp – the idea Compare a string's hash values, rather than the strings themselves. For efficiency, the hash value of the.
New Algorithms for Heavy Hitters in Data Streams David Woodruff IBM Almaden Joint works with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut,
Mining Data Streams (Part 1)
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
Stochastic Streams: Sample Complexity vs. Space Complexity
Updating SF-Tree Speaker: Ho Wai Shing.
The Stream Model Sliding Windows Counting 1’s
Finding Frequent Items in Data Streams
School of Computing Clemson University Fall, 2012
Streaming & sampling.
Outsourced Computation Verification
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
Approximate Frequency Counts over Data Streams
Range-Efficient Computation of F0 over Massive Data Streams
Recitation #4 Tel Aviv University 2017/2018 Slava Novgorodov
Heavy Hitters in Streams and Sliding Windows
By: Ran Ben Basat, Technion, Israel
Maintaining Stream Statistics over Sliding Windows
Presentation transcript:

The Misra Gries Algorithm

Motivation Espionage The rest we monitor

Motivation Viruses and malware

Motivation Monitoring internet traffic

Problem 2 50 BPS We can't store the whole input so We seek methods which requires small space

Synopsis (Summary) Structures A small summary of a large data set that (approximately) captures some statistics/properties we are interested in. Data Set D Synopsis d

Synopsis: Desired Properties  Easy to add an element  Mergeable : can create summary of union from summaries of data sets  Easy to delete elements  Flexible: supports multiple types of queries

The Data Stream Model

What can we do easily? 32,112,14,9, 37,83,115,2,

What can we do easily?

What can not be done easily? Compute median.

In Pattern Matching Pattern P is given. Text T streams in. Dueling algorithm, FFT - not streaming algorithms. KMP ?  Needs O(|P|) space. (good)  May work O(|P|) time on every token. (not so good)

BUT… Rabin-Karp Algorithm - streaming algorithm.  Needs O(1) space. (excellent!)  Works O(1) time on every token. (excellent!)  But result only good with high probability!

The Quality of an Algorithm’s Answer Quality of approximationProbability of result

The Quality of an Algorithm’s Answer Quality of approximationProbability of result

Finding Frequent Items

Frequent Items: Exact Solution Exact solution:  Create a counter for each distinct token on its first occurrence  When processing a token, increment the counter 32,12,14,32, 7,12,32,7, 6,12,4,

Deterministic Streaming Algorithm for Finding Frequent Items The Misra-Gries algorithm [1982] finds these frequent elements deterministically, using O(c log m) bits, in two passes.

Frequent Elements: Misra Gries ,12,14,32, 7,12,32,7, 6,12,4,

Frequent Elements: Misra Gries ,12,14,32, 7,12,32,7, 6,12,4, This is clearly an under-estimate. What can we say precisely?

Algorithm Analysis Space: c counters of log m + log n bits, i.e. O(c(log m + log n)) Time: O(log c) per token. Output quality?

Algorithm Analysis

Proof

Finding Frequent Elements How do we verify that all elements in the counters are frequent? -- Another pass.