Space-Efficient Online Computation of Quantile Summaries. SIGMOD '01. Michael Greenwald & Sanjeev Khanna. Presented by ellery.


Outline: Introduction; The summary data structure; Operations and algorithm; Tree representation; Analysis and experimental results; Conclusion.

Introduction: Space-efficient computation of quantile summaries of very large data sets in a single pass. Quantile queries: given a quantile ψ, return the value whose rank is ψN.

Example: a stream of N = 16 observations t_0, ..., t_15. After sorting: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12. The 0.5-quantile returns the element ranked 8 (0.5 × 16), which is 8. The 0.75-quantile returns the element ranked 12 (0.75 × 16), which is 10.
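As a baseline for what the summary approximates, the exact answer can be computed by sorting everything. This is a minimal sketch, not part of the slides; the function name `exact_quantile` is hypothetical, and it assumes ranks are 1-based with the rank rounded up.

```python
import math


def exact_quantile(data, q):
    """Return the element of rank ceil(q * N) (1-based) after sorting."""
    ordered = sorted(data)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# The 16 values from the example (arrival order does not matter here).
stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
median = exact_quantile(stream, 0.5)
third_quartile = exact_quantile(stream, 0.75)
```

This needs all N observations in memory, which is exactly the cost the quantile summary is meant to avoid.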

Requirements: (1) Explicit and tunable a priori guarantees on the precision of the approximation. (2) As small a memory footprint as possible. (3) Online: a single pass over the data. (4) Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations. (5) Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order).

ε-approximate: A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to be within the interval [r − εN, r + εN]. Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank is within [40, 60].

The Summary Data Structure: Let r_min(v) and r_max(v) denote the lower and upper bounds on the rank of v. Each tuple t_i = (v_i, g_i, Δ_i), where g_i = r_min(v_i) − r_min(v_{i−1}) and Δ_i = r_max(v_i) − r_min(v_i).

Example (ε = .01, N = 1800), shown as value {g, Δ} [r_min, r_max]: 192 {15, 2} [501, 503]; 201 {28, 7} [529, 536]; 204 {10, 1} [539, 540].
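The rank ranges in this example follow from the (g, Δ) pairs by a running sum: r_min is the sum of g up to and including the tuple, and r_max adds Δ. A minimal sketch under one assumption: the slide does not show the tuples to the left of 192, so `PRECEDING_G` (hypothetical) stands in for their total g, chosen so that r_min(192) = 501.

```python
# (value, g, delta) triples from the example.
tuples = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]

# Hypothetical: total g of all tuples left of 192, which the slide omits.
PRECEDING_G = 486


def rank_bounds(tuples, offset=0):
    """Return (value, r_min, r_max) for each tuple via a running sum of g."""
    bounds = []
    r_min = offset
    for value, g, delta in tuples:
        r_min += g                      # r_min(v_i) = r_min(v_{i-1}) + g_i
        bounds.append((value, r_min, r_min + delta))  # r_max = r_min + delta
    return bounds


bounds = rank_bounds(tuples, PRECEDING_G)
```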

Query: Sketch S is ε-approximate; that is, for each ψ ∈ (0, 1], there is a (v_i, r_min(v_i), r_max(v_i)) in S such that ψN − εN ≤ r_min(v_i) and r_max(v_i) ≤ ψN + εN. Then v_i is our answer for the ψ-quantile.

Corollary: If at any time n, the summary S(n) satisfies the property that max_i (g_i + Δ_i) ≤ 2εn, then we can answer any ψ-quantile query to within an εn precision.

Overview of Summary Data Structure: Quantile ψ = .29? Compute r = ψN = 522 and choose the best v_i. Summary (ε = .01, N = 1800), shown as value [r_min, r_max] {g, Δ}: 192 [501, 503] {15, 2}; 201 [529, 536] {28, 7}; 204 [539, 540] {10, 1}.
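The lookup on this slide can be sketched as a single scan: compute the target rank r, then return the first value whose rank range lies within εN of r. This is an illustrative sketch; the function name `query` and the offset 486 (total g of the tuples not shown, so that r_min(192) = 501) are assumptions, and εN = 18 assumes ε = .01 and N = 1800.

```python
def query(tuples, r, eps_n, offset=0):
    """Return the first value v_i with r - r_min <= eps_n and r_max - r <= eps_n."""
    r_min = offset
    for value, g, delta in tuples:
        r_min += g
        r_max = r_min + delta
        if r - r_min <= eps_n and r_max - r <= eps_n:
            return value
    return None  # no tuple qualifies (cannot happen for a valid summary)


tuples = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
# r = 522 for the .29-quantile; eps_n = 18; 486 is the assumed hidden prefix.
answer = query(tuples, r=522, eps_n=18, offset=486)
```

Here 192 is rejected because its lower bound 501 is 21 ranks below 522, while 201 has the range [529, 536], which lies within 18 of 522.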

Overview of Summary Data Structure: If r_max(v_i) − r_min(v_{i−1}) ≤ 2εN for every i, then the summary is ε-approximate. Our goal: always maintain this property. Tuple formulation of this rule: g_i + Δ_i ≤ 2εN. Summary (ε = .01, N = 1800, 2εN = 36): 192 [501, 503] {15, 2}; 201 [529, 536] {28, 7}; 204 [539, 540] {10, 1}.
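The tuple form of the rule is easy to check mechanically. A minimal sketch, with the hypothetical helper name `satisfies_invariant` and the threshold 2εN = 36 passed in directly to avoid floating-point rounding:

```python
def satisfies_invariant(tuples, threshold):
    """True if g_i + delta_i <= threshold (that is, 2*eps*N) for every tuple."""
    return all(g + delta <= threshold for _, g, delta in tuples)


tuples = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
widths = [g + delta for _, g, delta in tuples]   # g_i + delta_i per tuple
ok = satisfies_invariant(tuples, threshold=36)
```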

Overview of Summary Data Structure: Goal: always maintain an ε-approximate summary, r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN. Insert new observations into the summary. Summary (ε = .01, N = 1800, 2εN = 36): 192 [501, 503] {15, 2}; 201 [529, 536] {28, 7}; 204 [539, 540] {10, 1}.

Overview of Summary Data Structure: Goal: always maintain an ε-approximate summary, r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN. Insert new observations into the summary. A new observation 197 arrives, with rank range [502, 536]. Summary (ε = .01, N = 1800, 2εN = 36): 192 [501, 503] {15, 2}; 201 [529, 536] {28, 7}; 204 [539, 540] {10, 1}.

Overview of Summary Data Structure: Goal: always maintain an ε-approximate summary, r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN. Insert new observations into the summary: insert the new tuple before the i-th tuple, with g_new = 1 and Δ_new = g_i + Δ_i − 1. Summary (ε = .01, N = 1801, 2εN = 36.02): 192 [501, 503] {15, 2}; 197 [502, 536] {1, 34}; 201 [530, 537] {28, 7}; 204 [540, 541] {10, 1}.
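The insertion rule on this slide can be sketched as follows. The helper name `insert` is hypothetical, and the handling of a new minimum or maximum (Δ = 0) follows the Greenwald-Khanna paper rather than this slide.

```python
from bisect import bisect_right


def insert(summary, v):
    """Insert v before the first tuple whose value exceeds it; return the new tuple."""
    values = [t[0] for t in summary]
    i = bisect_right(values, v)            # smallest i with v < v_i
    if i == 0 or i == len(summary):
        new = (v, 1, 0)                    # new minimum or maximum: rank is exact
    else:
        _, g_i, delta_i = summary[i]
        new = (v, 1, g_i + delta_i - 1)    # g_new = 1, delta_new = g_i + delta_i - 1
    summary.insert(i, new)
    return new


summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
new_tuple = insert(summary, 197)
```

Inserting 197 before 201 gives Δ_new = 28 + 7 − 1 = 34, matching the {1, 34} tuple in the example.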

Overview of Summary Data Structure: Goal: always maintain an ε-approximate summary, r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN. Insert new observations into the summary. Delete all "superfluous" entries. Summary (ε = .01, N = 1801, 2εN = 36.02): 192 [501, 503] {15, 2}; 197 [502, 536] {1, 34}; 201 [530, 537] {28, 7}; 204 [540, 541] {10, 1}.

Overview of Summary Data Structure: Goal: always maintain an ε-approximate summary, r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN. Insert new observations into the summary. Delete all "superfluous" entries. Summary (ε = .01, N = 1801, 2εN = 36.02): 192 [501, 503] {15, 2}; the {1, 34} tuple (value 197) is being deleted; 201 [530, 537] {28, 7}; 204 [540, 541] {10, 1}.

Overview of Summary Data Structure: Goal: always maintain an ε-approximate summary, r_max(v_i) − r_min(v_{i−1}) = g_i + Δ_i ≤ 2εN. Insert new observations into the summary. Delete all "superfluous" entries: g_i = g_i + g_{i−1}. Summary (ε = .01, N = 1801, 2εN = 36.02): 192 [501, 503] {15, 2}; 201 [530, 537] {29, 7}; 204 [540, 541] {10, 1}.

Overview of Summary Data Structure: Insert: g_new = 1; Δ_new = g_i + Δ_i − 1. Delete: g_i = g_i + g_{i−1}. Summary (ε = .01, N = 1801, 2εN = 36.02): 192 [501, 503] {15, 2}; 201 [530, 537] {29, 7}; 204 [540, 541] {10, 1}.
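The delete step can be sketched as a merge of a tuple into its right-hand neighbour, allowed only when the merged tuple still satisfies g + Δ ≤ 2εN. The helper name `merge_into_right` is hypothetical; the starting state assumes the 197 tuple {1, 34} has just been inserted.

```python
def merge_into_right(summary, i, threshold):
    """Delete tuple i by adding its g to tuple i+1, if the result stays within threshold."""
    _, g_left, _ = summary[i]
    v_right, g_right, delta_right = summary[i + 1]
    if g_left + g_right + delta_right > threshold:
        return False                       # merging would overfill the right tuple
    summary[i:i + 2] = [(v_right, g_left + g_right, delta_right)]
    return True


summary = [(192, 15, 2), (197, 1, 34), (201, 28, 7), (204, 10, 1)]
merged = merge_into_right(summary, 1, threshold=36.02)   # 1 + 28 + 7 = 36 <= 36.02
```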

Terminology: Full tuple: a tuple is full if g_i + Δ_i = 2εN. Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one. Capacity: the number of observations that can be counted by g_i before the tuple becomes full (= 2εN − Δ_i). The general strategy is to delete tuples with small capacity and preserve tuples with large capacity.
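These definitions translate directly into small predicates. A sketch with hypothetical helper names, using the example tuples and 2εN = 36 as the threshold:

```python
def capacity(delta, threshold):
    """Capacity as defined above: 2*eps*N - delta."""
    return threshold - delta


def is_full_pair(left, right, threshold):
    """True if deleting `left` would push `right` past the threshold."""
    _, g_left, _ = left
    _, g_right, delta_right = right
    return g_left + g_right + delta_right > threshold


tuples = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
capacities = [capacity(delta, 36) for _, _, delta in tuples]
pair_is_full = is_full_pair(tuples[0], tuples[1], 36)    # 15 + 28 + 7 = 50 > 36
```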

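Operations: Insert(v): find the smallest i such that v_{i−1} ≤ v < v_i, and insert a new tuple for v between t_{i−1} and t_i. Delete(v_i): to delete (v_i, g_i, Δ_i) from S, replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}). Compress(): from right to left, merge all mergeable pairs.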

GK Algorithm: To add the (n+1)st observation, v, to summary S(n): if n ≡ 0 (mod ⌊1/(2ε)⌋), run COMPRESS() and then INSERT(v); otherwise just INSERT(v).
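Putting the pieces together, the loop on this slide can be sketched end to end. This is a simplified illustration, not the authors' implementation: COMPRESS here merges any pair that fits under ⌊2εn⌋ and ignores the band and subtree conditions, the class and method names are hypothetical, and `Fraction` is used only to keep the arithmetic exact.

```python
from bisect import bisect_right
from fractions import Fraction
from math import ceil, floor
import random


class GKSummary:
    """Simplified Greenwald-Khanna summary; tuples are [value, g, delta]."""

    def __init__(self, eps):
        self.eps = Fraction(eps)
        self.period = max(1, floor(1 / (2 * self.eps)))  # compress every 1/(2*eps) inserts
        self.n = 0
        self.tuples = []

    def add(self, v):
        if self.n > 0 and self.n % self.period == 0:
            self.compress()
        self._insert(v)
        self.n += 1

    def _insert(self, v):
        i = bisect_right([t[0] for t in self.tuples], v)  # smallest i with v < v_i
        if i == 0 or i == len(self.tuples):
            delta = 0                                     # new min or max: exact rank
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])

    def compress(self):
        threshold = floor(2 * self.eps * self.n)
        # Scan right to left; the first (minimum) tuple is never deleted.
        for i in range(len(self.tuples) - 2, 0, -1):
            g = self.tuples[i][1]
            right = self.tuples[i + 1]
            if g + right[1] + right[2] <= threshold:
                right[1] += g                             # merge tuple i into i+1
                del self.tuples[i]

    def query(self, q):
        r = max(1, ceil(Fraction(q) * self.n))
        eps_n = self.eps * self.n
        r_min = 0
        for v, g, delta in self.tuples:
            r_min += g
            if r - r_min <= eps_n and r_min + delta - r <= eps_n:
                return v
        return self.tuples[-1][0]


rng = random.Random(0)
data = list(range(1, 1001))          # distinct values, so true rank of x is x
rng.shuffle(data)
summary = GKSummary(Fraction(1, 100))
for x in data:
    summary.add(x)
approx_median = summary.query(0.5)
```

With ε = 1/100 and N = 1000, the returned value should have a true rank within 10 of 500, while the summary keeps fewer tuples than observations.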

Tree Representation: [Figure: table of Δ-range, capacity, and band for ε = .001, N = 7,000, 2εN = 14.] Group tuples with similar capacities into bands. The first (least index) node to the right with a higher capacity band becomes the parent.
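The banding and parent rule can be sketched as below. The band boundaries are an assumption taken from the definition in the Greenwald-Khanna paper (band α holds the Δ with p − 2^α − (p mod 2^α) < Δ ≤ p − 2^(α−1) − (p mod 2^(α−1)), where p = ⌊2εN⌋, and band 0 is Δ = p); the slide's own table did not survive, and the Δ values in the usage lines are made up for illustration.

```python
def band(delta, p):
    """Band of a tuple with the given delta, for p = floor(2*eps*N)."""
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0                                   # newest tuples, smallest capacity
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1


def parents(deltas, p):
    """Parent of tuple i: first tuple to its right in a higher band (None = root)."""
    bands = [band(d, p) for d in deltas]
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result


# Hypothetical delta values with p = 14, as in the slide's setting.
example_deltas = [0, 13, 9, 14, 3]
example_bands = [band(d, 14) for d in example_deltas]
example_parents = parents(example_deltas, 14)
```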


Operation (compress): General strategy: delete tuples with small capacity and preserve tuples with large capacity. 1) Deletion cannot leave descendants unmerged: it must delete entire subtrees. 2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity. 3) Deletion cannot create an over-full tuple (i.e., one with g + Δ > ⌊2εN⌋).

Analysis: Theorem: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
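To get a feel for the size of this bound, the paper's worst-case figure of (11/(2ε)) log(2εn) tuples can be evaluated numerically. A small sketch, assuming the logarithm is base 2; the function name `gk_tuple_bound` is hypothetical.

```python
import math


def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n); assumes base-2 log."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)


# For eps = 0.01 and one million observations the bound is a few thousand tuples.
bound = gk_tuple_bound(0.01, 10**6)
```

The bound grows only logarithmically in n, so doubling the stream length adds a constant 11/(2ε) tuples to it.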

Experimental Results: Measurements: |S|; observed ε (vs. desired ε): max, avg, and for 16 representative quantiles; optimal max observed ε. Compared 3 algorithms: MRL; Preallocated (1/3 the number of stored observations as MRL); Adaptive (allocate a new quantile only when the observed error is about to exceed the desired ε).

Conclusion: Better worst-case behavior than previous algorithms. It does not require a priori knowledge of the parameter N.

Any questions?