Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.



Problem One: Missing Card
- I take one card from a deck of 52 and pass the rest to you. Suppose you have only a (very basic) calculator and a bad memory. How can you find out which card is missing with just one pass over the 51 cards?
- What if there are two missing cards?
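The slide leaves the puzzle open. One possible approach, sketched here as an assumption rather than as the intended answer, numbers the cards 1 to 52 and keeps a running sum (and, for two missing cards, also a running sum of squares). The function names and the integer encoding of cards are hypothetical.

```python
import math

DECK = range(1, 53)                      # assumed encoding: cards are the integers 1..52
FULL_SUM = sum(DECK)                     # 1378
FULL_SQ_SUM = sum(c * c for c in DECK)   # 48230


def find_missing_card(cards):
    """One pass, one running total: the missing card is the shortfall in the sum."""
    total = 0
    for c in cards:
        total += c
    return FULL_SUM - total


def find_two_missing_cards(cards):
    """One pass, two running totals: recover a + b and a^2 + b^2, then solve for a and b."""
    total, sq_total = 0, 0
    for c in cards:
        total += c
        sq_total += c * c
    s = FULL_SUM - total                 # a + b
    q = FULL_SQ_SUM - sq_total           # a^2 + b^2
    product = (s * s - q) // 2           # a * b
    root = math.isqrt(s * s - 4 * product)   # |a - b|
    return (s - root) // 2, (s + root) // 2
```

Both functions keep only a constant number of integers, which matches the "basic calculator and bad memory" constraint.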

A data stream algorithm …
- Makes one pass over the input data
- Uses a small amount of memory (much smaller than the input data)
- Computes something

Why do we need streaming algorithms?
- Networking
  - Often get to see the data only once
  - Don’t want to store the entire data
- Databases
  - Data is stored on disk; sequential scans are much faster
- Data stream algorithms have been a very active research area for the past 15 years
- Problems considered today
  - Missing card
  - Majority
  - Heavy hitters

Problem Two: Majority
- Given a sequence of items, find the majority if there is one
  - A A B C D B A A B B A A A A A A C C C D A B A A A
  - Answer: A
- Trivial if we have O(n) memory
- Can you do it with O(1) memory and two passes?
  - First pass: find the possible candidate
  - Second pass: compute its frequency and verify that it is > n/2
- How about one pass?
  - Unfortunately, no
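The two-pass scheme above can be sketched as follows. The slide does not say how the first pass finds its candidate; this sketch assumes the Boyer-Moore majority vote, which uses one stored item and one counter. The function names are hypothetical, and `items` must be re-iterable because it is scanned twice.

```python
def majority_candidate(items):
    """First pass: Boyer-Moore vote. Returns the only item that could be a majority."""
    candidate, count = None, 0
    for item in items:
        if count == 0:
            candidate, count = item, 1   # adopt a new candidate
        elif item == candidate:
            count += 1                   # a vote for the candidate
        else:
            count -= 1                   # cancel one vote against it
    return candidate


def majority(items):
    """Second pass: count the candidate and check that it occurs more than n/2 times."""
    candidate = majority_candidate(items)
    n, freq = 0, 0
    for item in items:
        n += 1
        if item == candidate:
            freq += 1
    return candidate if freq > n / 2 else None
```

On the slide's example sequence, `majority("A A B C D B A A B B A A A A A A C C C D A B A A A".split())` returns `"A"`, which occurs 14 times out of 25.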

Problem Three: Heavy hitters
- Problem: find all items with counts > φn, for some 0 < φ < 1
- Relaxation:
  - If an item has count > φn, it must be reported, together with its estimated count with (absolute) error < εn
  - If an item has count < (φ − ε)n, it must not be reported
  - For items in between, we don’t care
- In fact, we will solve the most difficult case, φ = ε
- Applications
  - Frequent IP addresses
  - Data mining

The algorithm [Metwally, Agrawal, Abbadi, 2006]
- Input parameters: number of counters m, stream S
- Algorithm:

    for each element e in S {
        if e is monitored {
            find the counter of e, counter_i;
            counter_i++;
        } else {
            find the element with least frequency, e_m, and denote its frequency min;
            replace e_m with e;
            assign the counter for e the value min + 1;
        }
    }
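A minimal runnable sketch of the pseudocode above (the algorithm is often called Space-Saving in the literature). The class and method names are hypothetical. Two simplifications are assumptions of this sketch: an unused slot is treated as a counter with value 0, and the minimum is found by a linear scan over the m counters instead of a faster priority structure.

```python
class SpaceSaving:
    """Keeps at most m (item, counter) pairs over a stream of insertions."""

    def __init__(self, m):
        if m < 1:
            raise ValueError("m must be at least 1")
        self.m = m
        self.counters = {}               # monitored item -> counter

    def update(self, e):
        if e in self.counters:
            self.counters[e] += 1        # e is monitored: increment its counter
        elif len(self.counters) < self.m:
            self.counters[e] = 1         # free slot: same as replacing a counter of 0
        else:
            # Evict the least-frequent monitored item and inherit its count.
            victim = min(self.counters, key=self.counters.get)
            min_count = self.counters.pop(victim)
            self.counters[e] = min_count + 1

    def heavy_hitters(self, threshold):
        """Monitored items whose counter exceeds the threshold, with their counters."""
        return {e: c for e, c in self.counters.items() if c > threshold}
```

For example, after feeding a stream of length n into `SpaceSaving(m)` with m = 1/ε, calling `heavy_hitters(eps * n)` returns every item whose true count exceeds εn, possibly along with some less frequent items whose counters were inflated.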

Properties of the Algorithm
- Actual count of a monitored item ≤ counter
- Actual count of a monitored item ≥ counter − min
- Actual count of an item not monitored ≤ min
  - Proof by induction
- The sum of the counters maintained is n
  - Why?
- So min ≤ n/m
- If we set m = 1/ε, it’s sufficient to solve the heavy hitter problem
  - Why?
- So the heavy hitter problem can be solved in O(1/ε) space
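One way to answer the two "Why?" questions above is the short derivation below. It is a sketch that assumes all m counters start at 0 and that items are reported when their counter exceeds φn; the symbols c_i (the i-th counter), c(e) (the counter of item e) and f(e) (the actual count of e) are introduced here for illustration.

```latex
\begin{align*}
&\textstyle\sum_{i=1}^{m} c_i = n
  && \text{each arrival adds exactly 1: either } c_i \to c_i + 1
     \text{, or } \min \text{ is replaced by } \min + 1 \\
&\min \;\le\; \tfrac{1}{m} \textstyle\sum_{i=1}^{m} c_i \;=\; \tfrac{n}{m}
  && \text{the minimum is at most the average} \\
&m = 1/\varepsilon \;\Rightarrow\; \min \le \varepsilon n \\
&f(e) > \varepsilon n \;\Rightarrow\; e \text{ is monitored}
  && \text{since unmonitored items have } f(e) \le \min \le \varepsilon n \\
&c(e) - \varepsilon n \;\le\; c(e) - \min \;\le\; f(e) \;\le\; c(e)
  && \text{so every counter has error at most } \varepsilon n \\
&f(e) < (\varphi - \varepsilon)\, n \;\Rightarrow\; c(e) \le f(e) + \varepsilon n < \varphi n
  && \text{so such an item is not reported}
\end{align*}
```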

How about deletions?
- Any deterministic algorithm needs Ω(n) space
  - Why?
  - In fact, Las Vegas randomization doesn’t help
- We will design a randomized algorithm that works with high probability
  - For any item x, we can estimate its actual count within error εn with probability 1 − δ, for any small constant δ

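The Count-Min sketch [Cormode, Muthukrishnan, 2003]
- A Count-Min (CM) sketch with parameters $(\varepsilon, \delta)$ is represented by a two-dimensional array counts with width $w$ and depth $d$
- Given parameters $(\varepsilon, \delta)$, set $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$, as in [Cormode, Muthukrishnan, 2003]. Each entry of the array is initially zero.
- $d$ hash functions $h_1, \ldots, h_d$, each mapping an item to one of the $w$ columns, are chosen uniformly at random from a 2-universal family
- For example, we can choose a prime number $p > u$, where $u$ is the size of the universe of items, and random $a_j, b_j$ for $j = 1, \ldots, d$. Define: $h_j(x) = ((a_j x + b_j) \bmod p) \bmod w$
- Property: for any $x \ne y$, $\Pr[h_j(x) = h_j(y)] \le 1/w$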

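Updating the sketch
- Update procedure: when item $x$ arrives, set $\text{counts}[j, h_j(x)] \leftarrow \text{counts}[j, h_j(x)] + 1$ for every $j = 1, \ldots, d$
- When item $x$ is deleted, do the same except changing $+1$ to $-1$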

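Estimating the count of x
- Let $a_x$ be the actual count of $x$ and $n = \sum_y a_y$. The estimated count is $\hat{a}_x = \min_{1 \le j \le d} \text{counts}[j, h_j(x)]$
- Theorem 1: $a_x \le \hat{a}_x$, and with probability at least $1 - \delta$, $\hat{a}_x \le a_x + \varepsilon n$

Proof
- We introduce indicator variables $I_{x,j,y}$, equal to 1 if $x \ne y$ and $h_j(x) = h_j(y)$, and 0 otherwise
- Define the variable $X_{x,j} = \sum_{y} I_{x,j,y} \, a_y$
- By construction, $\text{counts}[j, h_j(x)] = a_x + X_{x,j}$. Since all actual counts are non-negative, $X_{x,j} \ge 0$, so $\hat{a}_x = \min_j \text{counts}[j, h_j(x)] \ge a_x$

- For the other direction, observe that, with $w = \lceil e/\varepsilon \rceil$, $E[X_{x,j}] = \sum_{y} a_y \, E[I_{x,j,y}] \le n/w \le \varepsilon n / e$
- By the Markov inequality and the independence of the $d$ hash functions, with $d = \lceil \ln(1/\delta) \rceil$: $\Pr[\hat{a}_x > a_x + \varepsilon n] = \Pr[\forall j : X_{x,j} > \varepsilon n] \le \Pr[\forall j : X_{x,j} > e \, E[X_{x,j}]] \le e^{-d} \le \delta$ ■
- So, the Count-Min sketch has size $O\!\left(\frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$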
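The sketch, its update rule and its estimate can be put together as below. This is a minimal illustration, not the authors' implementation. It assumes items are integers in [0, u), the parameter choices w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ from [Cormode, Muthukrishnan, 2003], and hash coefficients a_j drawn from [1, p − 1] and b_j from [0, p − 1]. The class name, the `seed` argument and the `_next_prime` helper are hypothetical.

```python
import math
import random


def _next_prime(n):
    """Smallest prime strictly greater than n (trial division; fine for a sketch)."""
    candidate = max(n + 1, 2)
    while any(candidate % k == 0 for k in range(2, math.isqrt(candidate) + 1)):
        candidate += 1
    return candidate


class CountMinSketch:
    def __init__(self, eps, delta, universe_size, seed=None):
        if not (eps > 0 and 0 < delta < 1):
            raise ValueError("need eps > 0 and 0 < delta < 1")
        self.w = math.ceil(math.e / eps)            # width
        self.d = math.ceil(math.log(1 / delta))     # depth
        self.p = _next_prime(universe_size)         # prime p > u
        rng = random.Random(seed)
        # One (a_j, b_j) pair per row defines h_j(x) = ((a_j x + b_j) mod p) mod w.
        self.coeffs = [(rng.randrange(1, self.p), rng.randrange(0, self.p))
                       for _ in range(self.d)]
        self.counts = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, x):
        a, b = self.coeffs[j]
        return ((a * x + b) % self.p) % self.w

    def update(self, x, c=1):
        """Insert x with c = +1, delete it with c = -1."""
        for j in range(self.d):
            self.counts[j][self._h(j, x)] += c

    def estimate(self, x):
        """Minimum over the d rows; never below the true count if counts are non-negative."""
        return min(self.counts[j][self._h(j, x)] for j in range(self.d))
```

For instance, `CountMinSketch(0.1, 0.01, universe_size=1000)` allocates 5 rows of 28 counters. After a stream of insertions, `estimate(x)` is at least the true count of x, and exceeds it by more than εn only with probability at most δ.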