How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.


Problem Definition
► The universe: U = {0, …, |U| − 1}
► Number of records in the data set: ||A|| = N
► The data set can be thought of as an array: A[i] is the number of records with value i
► A_S is the number of records with values in S
► The Ф-quantiles of an ordered sequence of N data items are the values with rank kФN, for k = 1, …, 1/Ф
► Our goal is to compute ε-approximate Ф-quantiles: for each k, find a j_k such that (kФ − ε)N ≤ rank(j_k) ≤ (kФ + ε)N

Transactions
► Insert(i): A[i] ← A[i] + 1
► Delete(i): A[i] ← A[i] − 1
► Let
► ASSUME: The universe size |U| is known
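As a baseline for the two transactions, here is a minimal sketch of the exact array A that the summary is meant to approximate. The class name `FrequencyArray` and its methods are illustrative names, not taken from the slides.

```python
class FrequencyArray:
    """Exact array A over U = {0, ..., |U| - 1}; costs O(|U|) space."""

    def __init__(self, universe_size):
        self.counts = [0] * universe_size  # counts[i] plays the role of A[i]
        self.n = 0                         # N = ||A||

    def insert(self, i):
        # Insert(i): A[i] <- A[i] + 1
        self.counts[i] += 1
        self.n += 1

    def delete(self, i):
        # Delete(i): A[i] <- A[i] - 1 (only records that are present)
        if self.counts[i] == 0:
            raise ValueError("cannot delete a value that is not present")
        self.counts[i] -= 1
        self.n -= 1
```

Storing every A[i] is exactly the O(|U|) cost that the rest of the talk tries to avoid.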

The Main Algorithmic Result
► The RSS (Random Subset Sums) algorithm
► Space complexity
► Update: on every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► One pass over the data

Dyadic Intervals
► log(|U|) + 1 resolution levels j
► 2|U| − 1 dyadic intervals
► Example for |U| = 8:
  Level 0: I(0,0)
  Level 1: I(1,0) I(1,1)
  Level 2: I(2,0) I(2,1) I(2,2) I(2,3)
  Level 3: I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
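The indexing of the dyadic intervals can be sketched as follows, assuming |U| is a power of two and that I(j,k) is the k-th block of length |U|/2^j at level j (this is consistent with the example decomposition used in the talk). The function names are illustrative.

```python
def dyadic_interval(j, k, universe_size):
    """Return the inclusive range (lo, hi) covered by I(j, k)."""
    length = universe_size >> j          # each level-j interval has |U| / 2^j values
    return k * length, (k + 1) * length - 1


def all_dyadic_intervals(universe_size):
    """List every (j, k) pair; there are 2|U| - 1 of them."""
    levels = universe_size.bit_length() - 1   # log2(|U|)
    return [(j, k) for j in range(levels + 1) for k in range(1 << j)]
```

For |U| = 8 this yields 15 intervals, and `dyadic_interval(2, 2, 8)` is the range (4, 5).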

Arbitrary intervals
► Any interval can be written as a disjoint union of at most 2·log(|U|) dyadic intervals
► For example, A[0,6] = I(1,0) + I(2,2) + I(3,6)
► Intervals starting at 0 never use the same resolution level twice, so they need at most log(|U|) dyadic intervals
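One way to compute such a decomposition is a greedy left-to-right scan that always takes the largest aligned block that still fits. This is a sketch under the assumption that |U| is a power of two; `dyadic_cover` is an illustrative name, and the greedy strategy is one standard choice rather than something spelled out in the slides.

```python
def dyadic_cover(lo, hi, universe_size):
    """Return (j, k) pairs whose disjoint union is the inclusive range [lo, hi]."""
    levels = universe_size.bit_length() - 1   # log2(|U|)
    cover = []
    pos = lo
    while pos <= hi:
        # Shrink the block until it is aligned at pos and does not overshoot hi.
        size = universe_size
        while pos % size != 0 or pos + size - 1 > hi:
            size //= 2
        j = levels - (size.bit_length() - 1)  # level whose intervals have this size
        cover.append((j, pos // size))
        pos += size
    return cover
```

`dyadic_cover(0, 6, 8)` returns `[(1, 0), (2, 2), (3, 6)]`, matching the example above, while `dyadic_cover(1, 6, 8)` needs four intervals because it does not start at 0.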

Computing quantiles
► Assuming we have the number of records in each dyadic interval, we can efficiently compute the count of any arbitrary interval in A
► To compute the Ф-quantile for any k, we need a j_k s.t. A[0, j_k) < kФN < A[0, j_k + 1)
► Use binary search to find it
► Keeping all intervals exactly is costly: O(|U|) space
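The binary search can be sketched as below. Here `prefix_count(j)` is a hypothetical oracle returning A[0, j); in the real algorithm it would be an estimate built from dyadic intervals, while the usage note substitutes exact prefix sums. The search returns the smallest j whose prefix A[0, j + 1) reaches the target rank, which is one reasonable way to resolve ties.

```python
def find_quantile(prefix_count, universe_size, target_rank):
    """Smallest j in U with prefix_count(j + 1) >= target_rank.

    prefix_count(j) must return the number of records with value < j
    and be non-decreasing in j.
    """
    lo, hi = 0, universe_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(mid + 1) >= target_rank:
            hi = mid          # rank is already reached at mid, look left
        else:
            lo = mid + 1      # rank not reached yet, look right
    return lo
```

With `counts = [0, 2, 0, 1, 3, 0, 0, 2]` (N = 8) and `prefix_count = lambda j: sum(counts[:j])`, the median (target rank 4) is `find_quantile(prefix_count, 8, 4) == 4`, using about log|U| oracle calls.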

Random Subset Sums
► Consider the case j = log(|U|) (point intervals)
► Let S be a random subset of U
► Each u ∈ U has probability p = ½ of being in S
► E(|S|) = ½|U|
► Define: ||A_S|| = Σ_{i ∈ S} A[i]
► E(||A_S||) = ½||A|| = ½N

Estimating A[i]
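The formula on this slide did not survive extraction. A standard estimator that follows from E(||A_S||) = ½N is assumed here: for a random S that contains i, the quantity 2·||A_S|| − N has expectation A[i], because A[i] is always counted and every other value is counted with probability ½. The sketch below checks this by brute force over all subsets of a tiny universe; the function names are illustrative.

```python
from itertools import combinations


def subset_sum(counts, subset):
    """||A_S||: number of records whose value lies in the subset S."""
    return sum(counts[v] for v in subset)


def expected_estimate(counts, i):
    """Average of 2*||A_S|| - N over all subsets S of U that contain i."""
    universe = range(len(counts))
    n = sum(counts)
    estimates = []
    for size in range(len(counts) + 1):
        for subset in combinations(universe, size):
            if i in subset:
                estimates.append(2 * subset_sum(counts, subset) - n)
    return sum(estimates) / len(estimates)
```

For `counts = [5, 0, 2, 1]` the average is exactly 5.0 for i = 0 and 0.0 for i = 1. A single random S gives a noisy estimate, which is why many copies are averaged later in the talk.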

Improvement
► Instead of keeping random sets of points only, keep random sets of dyadic intervals at every resolution level
► We need a method of keeping a random set of j-resolution dyadic intervals (keeping it explicitly takes O(|U|) space)
► Instead of keeping the sets themselves, keep a small representation of them

Pseudorandom set generator
► We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½)
► Given a seed of size log(|U|) + 1
► Represent a set S of size O(|U|)
► Quickly test whether i ∈ S or not
► Use the Extended Hamming Code

Extended Hamming Code
► Given a seed, tells whether i ∈ S
► For example:
  ▪ |U| = 8
  ▪ Seed size: log|U| + 1 = 4
  ▪ G(seed, i) = seed × (i-th column of the generator matrix) mod 2
► Efficient to compute
► 3-wise independent
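A small sketch of the membership test G(seed, i). The slide does not show the generator matrix, so the column layout is an assumption: the usual extended Hamming construction, where column i is a leading 1 followed by the binary digits of i. All function names are illustrative.

```python
from itertools import product


def column(i, universe_size):
    """Assumed i-th generator column: a 1 followed by the bits of i."""
    bits = universe_size.bit_length() - 1          # log2(|U|)
    return [1] + [(i >> b) & 1 for b in range(bits)]


def in_set(seed, i, universe_size):
    """G(seed, i): inner product of seed and column i, mod 2."""
    col = column(i, universe_size)
    return sum(s * c for s, c in zip(seed, col)) % 2 == 1


def members(seed, universe_size):
    """Expand a seed into the explicit set S (for inspection only)."""
    return [i for i in range(universe_size) if in_set(seed, i, universe_size)]


def membership_counts(universe_size):
    """For each i, count how many of the possible seeds put i in S."""
    seed_len = universe_size.bit_length()          # log2(|U|) + 1
    return [sum(in_set(seed, i, universe_size)
                for seed in product((0, 1), repeat=seed_len))
            for i in range(universe_size)]
```

For |U| = 8, `members([0, 1, 0, 0], 8)` is `[1, 3, 5, 7]`, and `membership_counts(8)` is `[8] * 8`: each value belongs to S for exactly half of the 16 seeds, which matches p = ½.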

The Data Structure
► For each resolution level j, keep num_copies random subsets S of the dyadic intervals in that level (we only keep the representation: the seed)
► Keep ||A_S|| for each such S
► Maintain N = ||A||
► This gives S_1, …, S_num_copies per level

Upon Transactions
► Insert(i) / Delete(i):
  ▪ For each resolution level j:
    ▪ Locate the single I_{j,k} into which i falls (the high-order bits of i)
    ▪ Determine all S_ℓ containing I_{j,k}
    ▪ For each such S_ℓ, increase/decrease ||A_{S_ℓ}|| by 1
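The update procedure can be sketched as below. This is a minimal illustration, not the authors' implementation: the class name, the seed length of j + 1 bits at level j, the seed layout, and the use of Python's `random` module to draw seeds are all assumptions, and |U| is taken to be a power of two.

```python
import random


class RandomSubsetSums:
    """Per level j: num_copies seeds and one counter ||A_S|| per seed."""

    def __init__(self, universe_size, num_copies, rng_seed=0):
        self.levels = universe_size.bit_length() - 1      # log2(|U|)
        self.n = 0                                        # N = ||A||
        rng = random.Random(rng_seed)
        # seeds[j][c] is a bit list of length j + 1 (assumed layout).
        self.seeds = [[[rng.randint(0, 1) for _ in range(j + 1)]
                       for _ in range(num_copies)]
                      for j in range(self.levels + 1)]
        self.sums = [[0] * num_copies for _ in range(self.levels + 1)]

    @staticmethod
    def _contains(seed, k):
        # Membership test: seed times (1, bits of k), mod 2.
        col = [1] + [(k >> b) & 1 for b in range(len(seed) - 1)]
        return sum(s * c for s, c in zip(seed, col)) % 2 == 1

    def _update(self, i, delta):
        self.n += delta
        for j in range(self.levels + 1):
            k = i >> (self.levels - j)        # high-order bits of i pick I(j, k)
            for c, seed in enumerate(self.seeds[j]):
                if self._contains(seed, k):
                    self.sums[j][c] += delta  # adjust ||A_S|| for this subset

    def insert(self, i):
        self._update(i, +1)

    def delete(self, i):
        self._update(i, -1)
```

Because every counter is a plain sum, inserting a value and later deleting it leaves the structure exactly as if the value had never been seen, and the order of transactions does not matter.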

Estimating Quantiles: Dyadic Intervals
► Given a dyadic interval I = I_{j,k}
► There are num_copies sets of resolution j
► Quickly test each S_ℓ to check whether I ∈ S_ℓ, and if so use it to estimate ||A_I||
► Group all estimates into G groups of E elements
► For each group g, calculate the average of its estimates, A_{g,j,k}

Estimating Quantiles: Arbitrary intervals
► Given an interval I, write it as a disjoint union of dyadic intervals I_{j,k} (at most 2·log(|U|) of them; at most log(|U|) when I starts at 0)
► For each of the G groups, sum the averages A_{g,j,k} over all I_{j,k} comprising I
► Take the median of the G group sums as the final estimate of ||A_I||
► It is more convenient to refer to the result as an overestimate: ||A_I|| ≤ Ã_I ≤ ||A_I|| + εN

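[Figure: worked example with 3 dyadic intervals, E = 4 elements per group and G = 3 groups. Each group's estimates are averaged, the averages are summed over the dyadic intervals, and the median of the group sums is the interval's estimate.]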
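The average, sum, median combination can be sketched as follows. The input layout (one list per dyadic interval, holding G groups of E raw estimates each) and the function name are assumptions made for illustration.

```python
from statistics import mean, median


def combine_estimates(estimates):
    """estimates[d][g] is the list of E raw estimates for dyadic interval d, group g.

    Returns the median over groups of (sum over dyadic intervals of the group average).
    """
    num_groups = len(estimates[0])
    group_sums = []
    for g in range(num_groups):
        # Average inside the group for each dyadic interval, then add them up.
        group_sums.append(sum(mean(per_interval[g]) for per_interval in estimates))
    return median(group_sums)
```

For two dyadic intervals with G = 3 and E = 2, `combine_estimates([[[4, 6], [5, 5], [10, 0]], [[1, 3], [2, 4], [9, 9]]])` gives group sums 7, 8 and 14, so the result is 8. Averaging reduces the variance of each group, and the median guards against the occasional group that is far off.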

Analysis
► Lemma: The algorithm estimates each quantile to within εN with probability p > 1 − δ
► Proof:
  ▪ For a fixed resolution level j, let
  ▪ Then:

Analysis (cont.)

► We take G copies of Z and take the median
► By the Chernoff inequality,
► The binary search looked for a j_k such that
► We made log|U| checks in the binary search
► By the union bound, the probability that any of them failed is at most log|U| times the failure probability of a single check, i.e. δ

RSS Properties
► The algorithm may return a quantile value that was never seen in the input
► Changing the order of insertions and deletions doesn't affect the results
► The RSSs are composable: U can be split into many disjoint ranges, given some pre-agreed common random subsets

Extension: U is unknown
► Predict a range [0, u − 1] for U
► Upon insertion of i > u − 1, add another instance of RSS with range [u, u² − 1], and so on
► Because RSS is composable, we only have to join the results upon query
► Increased cost factor: log² log(|U|)
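A small sketch of how a value could be routed to its RSS instance. The slide only gives the first two ranges; the assumption here is that "and so on" means the upper bound keeps squaring ([u², u⁴ − 1] next), and `range_for` is an illustrative name.

```python
def range_for(value, u):
    """Return the inclusive range (lo, hi) of the RSS instance that holds value.

    Assumes u >= 2 and ranges [0, u-1], [u, u^2-1], [u^2, u^4-1], ...
    """
    lo, hi = 0, u - 1
    bound = u
    while value > hi:
        lo = bound
        bound = bound * bound   # square the bound for the next instance
        hi = bound - 1
    return lo, hi
```

With u = 4, the value 3 stays in (0, 3), the value 4 lands in (4, 15), and the value 16 lands in (16, 255). Under this assumption only about log log of the largest value instances are ever created.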

Experiments
► What is the median length of all active AT&T calls?
► When a call
  ▪ Starts: add its start timestamp
  ▪ Ends: delete its start timestamp
► 4 KB used for RSS
► Compared:
  ▪ RSS
  ▪ GK
  ▪ GK2

Number of Active Phone Calls Over Time

Error in Computation of Median Over Time

Average Error for Last 50 Snapshots, For Deciles

The End