From Counting Sketches to Equi-Depth Histograms. CS240B notes from an EDBT11 paper entitled: A Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams.

Presentation transcript:

From Counting Sketches to Equi-Depth Histograms
CS240B notes from an EDBT11 paper entitled: A Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams.
By: Hamid Mousavi and Carlo Zaniolo
Computer Science Department, UCLA (University of California, Los Angeles)

Outline
Introduction
◦ Histograms vs. Quantiles
Exponential Histogram structure
BASH Algorithm
Experimental results
Conclusion

Histograms
Equi-width: all buckets have the same width, and the number of items in each bucket is reported.
Equi-depth (a.k.a. equi-height or equi-probable): all buckets must contain the same number of items, and we must specify the boundaries between buckets to achieve that.
Computing equi-depth histograms is computationally very challenging.
No previous work addressed equi-depth histograms on windows; previous work only considered the related problem of quantiles.

Quantiles
A problem akin to equi-depth histograms:
◦ Query: find the item that occupies a given position in a sorted list of N items.
◦ Thus, given a ϕ, 0 ≤ ϕ ≤ 1, that describes the scaled rank of an item in the list, the quantile algorithm must return the ⌈ϕN⌉-th item in the list.
Quantile algorithms guarantee an ε-approximate answer for any given ϕ:
◦ [Greenwald et al. 2001] proposed an algorithm called GK for the whole history of the stream.
◦ [Arasu et al. 2004] used GK to solve the problem on sliding windows. We refer to this algorithm as AM.
◦ [Lin et al. 2004] also used GK for the sliding-window case, but with higher space usage.

Quantiles as Equi-depth Histograms
To compute the B - 1 boundaries of any equi-depth histogram, we could employ a quantile structure and report the ϕ-quantiles for ϕ = 1/B, ϕ = 2/B, …
However, on data streams:
◦ Quantile algorithms over sliding windows are too slow, since they derive more information than is needed to derive the boundaries.
◦ Quantile algorithms focus on minimizing the rank error, while in equi-depth histograms we want to minimize the bucket size error.
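The reduction from histogram boundaries to quantiles can be sketched on a static, fully sorted list. This is only an exact, offline illustration of the ⌈ϕN⌉ rule; the function names `quantile` and `equi_depth_boundaries` are hypothetical and nothing here handles streams or sliding windows.

```python
import math


def quantile(sorted_items, phi):
    """Return the ceil(phi * N)-th item (1-based) of an already sorted list."""
    n = len(sorted_items)
    rank = max(1, math.ceil(phi * n))  # clamp so that phi = 0 maps to the first item
    return sorted_items[rank - 1]


def equi_depth_boundaries(items, num_buckets):
    """Report the B - 1 boundaries as the phi-quantiles for phi = 1/B, 2/B, ..."""
    data = sorted(items)
    return [quantile(data, i / num_buckets) for i in range(1, num_buckets)]


if __name__ == "__main__":
    print(equi_depth_boundaries(range(1, 21), 4))
```

Sorting the whole window on every query is exactly the cost that streaming quantile summaries and BASH try to avoid.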

The BASH Algorithm: Goals and Contributions
BASH constructs equi-depth histograms
◦ For data streams over sliding windows
◦ With a faster algorithm
◦ Producing more accurate results
◦ Using a reasonable amount of memory
Histograms of this kind have many applications, e.g.:
◦ Query optimization
◦ Approximate query answering
◦ Distribution fitting
◦ Parallel database partitioning
BASH builds on the EH Sketch [Datar, Gionis, Indyk, Motwani: SIAM J. Comput. 2002]

The EH Sketch
Counting the number of 1's in a 0-1 stream
is trivial for the whole data stream, but
over a sliding window we must either memorize the whole window to detect expiring items, or
use approximation to reduce memory usage.
◦ EH sketches require O((1/δ) log(W)) memory for a δ-approximation on a window of size W, i.e., the error is guaranteed to be less than δ.
◦ The sketch contains boxes of size 2^0, 2^1, …, 2^j.
◦ There are at most K/2 + 1 (or K/2 + 2) boxes of each size, where K = 1/δ;
◦ otherwise the boxes are merged into a larger box.

Adding 1, 0, 1 to an EH Sketch (k = 2; k/2 + 2 = 3)
[Figure: the boxes of the sketch, each labeled with its count and time-stamp]
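The box-merging rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the class name `EHSketch`, the choice to trigger a merge once a size has k/2 + 2 boxes, and the "total minus half of the oldest box" estimate are assumptions taken from the usual description of exponential histograms.

```python
from collections import deque


class EHSketch:
    """Approximate count of 1's among the last `window` bits of a 0-1 stream."""

    def __init__(self, window, k):
        self.window = window
        # Assumption: a merge is triggered once a size has k/2 + 2 boxes.
        self.merge_at = k // 2 + 2
        self.boxes = deque()  # (time-stamp of newest 1, size), newest box first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop boxes whose newest 1 has slid out of the window.
        while self.boxes and self.boxes[-1][0] <= self.time - self.window:
            self.boxes.pop()
        if bit != 1:
            return
        self.boxes.appendleft((self.time, 1))
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.boxes) if s == size]
            if len(same) < self.merge_at:
                break
            newer, older = same[-2], same[-1]  # the two oldest boxes of this size
            stamp = self.boxes[newer][0]       # merged box keeps the newer stamp
            del self.boxes[older]
            self.boxes[newer] = (stamp, 2 * size)
            size *= 2                          # the merge may cascade upward

    def sizes(self):
        return [s for _, s in self.boxes]

    def estimate(self):
        if not self.boxes:
            return 0
        # The oldest box may be partly expired, so count only half of it.
        return sum(self.sizes()) - self.boxes[-1][1] // 2


if __name__ == "__main__":
    sketch = EHSketch(window=100, k=2)
    for _ in range(10):
        sketch.add(1)
    print(sketch.sizes(), sketch.estimate())
```

Only the oldest box is uncertain, which is why the number of boxes, rather than the window size, governs the memory footprint.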

BAr-Splitting Histograms (BASH)
Instead of counting the 1's in a 0-1 stream, count the number of values in the interval [B_i, B_{i+1}).
For a histogram with B buckets:
◦ For accuracy we need p×B intervals
◦ p is the expansion factor, ~ 6 or 7
◦ An interval will be called a bar.

General Idea: the number of items in each bar is approximately W/(p×B).
[Figure: an imaginary ordered list of items, divided into bars by the boundaries B_min = B_0, B_1, B_2, B_3, B_4, B_5, …, B_max]

Initialization Phase
BASH starts with an empty bar.
As items arrive, bars grow up to size Coef × W × 1/S_m, where:
◦ W is the window size (i.e., number of items)
◦ S_m is the number of bars: ≤ p×B
◦ Coef ≅ 1.7

Inserting a New Item N
Find the i-th bar, where B_i ≤ N < B_{i+1}.
Insert N, provided that the size of the bar is ≤ Coef × W/(B × p).
[Figure: bars with boundaries B_0 = min, B_1, …, B_i, B_{i+1}, B_{i+2}, …, B_max; the new item N falls between B_i and B_{i+1}]

Inserting a New Item in the i-th Bar
When the size of the i-th bar is larger than Coef × W/(B × p), we need to split it.
But if we already have B × p bars, then we must first make room for the new bar by merging two existing bars.
[Figure: bars with boundaries B_0 = min, B_1, …, B_i, B_{i+1}, …, B_max]
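The insert-then-split step can be sketched as follows. This is a simplified illustration under loud assumptions: plain counters replace the per-bar EH sketches (so nothing ever expires), a split cuts the interval at its midpoint and halves the count, and the merge step that makes room when all p × B bars are in use is omitted. The class name `BarList` and its fields are hypothetical.

```python
import bisect


class BarList:
    def __init__(self, window, num_buckets, p=7, coef=1.7):
        self.max_bars = p * num_buckets                # at most p × B bars
        self.max_size = coef * window / self.max_bars  # Coef × W / (B × p)
        self.bounds = []  # bar i covers [bounds[i], bounds[i + 1])
        self.counts = []  # plain counters stand in for the per-bar EH sketches

    def insert(self, value):
        if not self.counts:              # the first item opens the first bar
            self.bounds = [value, value]
            self.counts = [1]
            return
        if value < self.bounds[0]:       # a new minimum stretches the first bar
            self.bounds[0] = value
            i = 0
        elif value >= self.bounds[-1]:   # a new maximum stretches the last bar
            self.bounds[-1] = value
            i = len(self.counts) - 1
        else:                            # B_i <= value < B_{i+1}
            i = bisect.bisect_right(self.bounds, value) - 1
        self.counts[i] += 1
        if self.counts[i] > self.max_size and len(self.counts) < self.max_bars:
            self._split(i)
        # When all p × B bars are in use, BASH first merges two bars to make
        # room; that step is omitted in this sketch.

    def _split(self, i):
        # Assumption: split at the interval midpoint and halve the count.
        mid = (self.bounds[i] + self.bounds[i + 1]) / 2
        left = self.counts[i] // 2
        self.bounds.insert(i + 1, mid)
        self.counts[i:i + 1] = [left, self.counts[i] - left]


if __name__ == "__main__":
    bars = BarList(window=40, num_buckets=2, p=2, coef=1.0)
    for v in range(21):
        bars.insert(v)
    print(bars.bounds, bars.counts)
```

No minimum or maximum has to be known in advance, because the first and last bars stretch as new extremes arrive.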

Merging Two Bars
We only merge when we need to make some room.
We never add to a blocked bar. We only remove old blocks until it is empty.
[Figure: bars with boundaries B_0 = min, B_1, …, B_i, B_{i+1}, …, B_max; one bar is labeled "Blocked Bar"]

Splitting a Bar
[Figure: bars with boundaries B_0 = min, B_1, B_2, B_3, …, B_{i-1}, B_i, B_{i+1}, …, B_max]

Divide an EH Structure into Two
It's easy to show that the resulting structures are EHs as well.

Alternative Merge Function (BASH-AL)
[Figure: the EH boxes for each of the two intervals, and the boxes for the new interval]
Now, the original EH algorithm can be executed to merge extra boxes.

Computing Final Boundaries
[Figure: the bar boundaries B_0 = min, B_1, B_2, B_3, …, B_max and the reported bucket boundaries R_1, R_2, R_3, …, R_{B-1}]
We assume that data items are uniformly distributed in each bar.
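Under the uniformity assumption, each reported boundary can be placed by linear interpolation inside the bar where the running count crosses j × (total / B). The sketch below illustrates that idea only; the function name `final_boundaries` and its list-based inputs are hypothetical, and the counts are taken as exact numbers rather than EH estimates.

```python
def final_boundaries(bounds, counts, num_buckets):
    """Interpolate B - 1 bucket boundaries from bar boundaries and bar counts.

    bounds has one more entry than counts: bar i covers [bounds[i], bounds[i + 1]).
    """
    target = sum(counts) / num_buckets  # ideal number of items per bucket
    result = []
    cumulative = 0.0                    # items in the bars before the current one
    bar = 0
    for j in range(1, num_buckets):
        need = j * target               # rank at which boundary R_j should sit
        while bar < len(counts) - 1 and cumulative + counts[bar] < need:
            cumulative += counts[bar]
            bar += 1
        lo, hi = bounds[bar], bounds[bar + 1]
        # Items are assumed to be spread uniformly across the bar.
        frac = (need - cumulative) / counts[bar] if counts[bar] else 0.0
        result.append(lo + frac * (hi - lo))
    return result


if __name__ == "__main__":
    print(final_boundaries([0, 10, 20, 40], [10, 10, 20], 4))
```

Using p × B bars instead of B keeps each interpolation interval short, so the uniformity assumption affects only a small share of a bucket.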

Error definitions: ε-approximation
An equi-depth histogram on a window of size W is a:
◦ size-based expected ε-approximate summary when the expected error on the reported number of items in each bucket is bounded by ε × W/B;
◦ rank-based expected ε-approximate summary if the expected error of the rank of every reported boundary is bounded by ε × W;
◦ boundary-based expected ε-approximate summary if the expected value error for every reported boundary is bounded by ε × S, where S is the difference between the max and min values in W.

Theoretical Results
We proved that BASH-BL provides a size-based expected ε-approximate equi-depth summary.
Average time complexity:
◦ O(log(B×p) + (B×p)/S) per single entry
◦ S ≥ 1 is the size of the window slide
Memory usage:
◦ Bounded by O(log(ε²W/B) × B/ε²)
◦ In practice, smaller than that.

Experimental Results
We have compared our algorithms with one of the best existing ones, called the AM algorithm.
Both BASH-BL and BASH-AL:
1. are at least 4 times faster than AM,
2. provide more accurate results,
3. while using approximately the same amount of memory.

[Figure: Running Time (for ε = 0.01, W = 100K, B = 20)]

[Figure: Memory Usage]

[Figure: Rank Error]

[Figure: Size Error]

[Figure: Boundary Error]

[Figure: Effect of changing window size on the running time (DS13)]

[Figure: Effect of changing window size on the size error (DS13)]

[Figure: Boundary error for DS15 (Mix) data set]

[Figure: Boundary error for the extended S&P500 dataset]

Conclusion and Future Work
The BAr-Splitting Histograms (BASH) algorithm computes size-based expected ε-approximate equi-depth histograms. Moreover, it:
◦ does not need to know the min and max of the values,
◦ is more than 4 times faster than previous approaches,
◦ typically requires a smaller memory footprint,
◦ provides more accurate results.
Future work:
◦ Reducing memory even further.
◦ Other types of histograms: Biased Histograms, Compressed Histograms: MaxDiff Histograms, V-Optimal Histograms

QUESTIONS?
Thanks for listening

References:
A. Arasu and G. S. Manku. Approximate counts and quantiles over sliding windows. In PODS, pages 286–296.
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794–1813.
M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD Conference, pages 58–66.
X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously maintaining quantile summaries of the most recent n elements over a data stream. In ICDE, pages 362–374.

Fast computation of approximate biased histograms on sliding windows over data streams
By: Hamid Mousavi and Carlo Zaniolo
SSDBM 2013 (best paper award)
Hamid Mousavi, UCLA, CSD, Winter 2014

Biased Synopses for a Better Estimation at Tails
[Figure: Probability of precipitation amounts*]
* The data is taken from …

Biased Histograms
[Figure: buckets with boundaries at the percentiles ϕ (90%), ϕ^2 (81%), ϕ^3 (73%), ϕ^4 (65%), ϕ^5 (59%), ϕ^6 (53%), ϕ^7 (48%)]
We exponentially decrease the bucket size, by a factor of ϕ, as we approach the biased point.
For the rest of the presentation, ϕ is called the biased factor.
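The percentile positions in the figure follow from repeated multiplication by the biased factor. A minimal sketch, assuming ϕ = 0.9 as in the figure; the function name `biased_percentiles` is hypothetical, and the slide's whole-number labels appear to be these values rounded or truncated to integers.

```python
def biased_percentiles(phi, num_boundaries):
    """Percentile (in %) of the i-th boundary, i = 1..num_boundaries: 100 * phi^i."""
    return [round(100 * phi ** i, 1) for i in range(1, num_boundaries + 1)]


if __name__ == "__main__":
    print(biased_percentiles(0.9, 7))
```

Each successive bucket therefore covers ϕ times the share of the window covered by the previous one.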

Biased Histograms: many critical applications
Network performance monitoring systems need to watch the round-trip time (RTT) distribution, with a biased interest in the tail of the RTTs, to detect suspicious or malicious behavior.
Efficiently partitioning large datasets in which data items in the tail of the distribution are more costly to handle (e.g., Web Graph).

Our Contributions
An accurate and efficient algorithm to maintain ε-approximate biased histograms on sliding windows over data streams, which uses biased sampling techniques to adjust to the memory and CPU requirements.
Our technique is called Bar Splitting Biased Histogram, or BSBH.

Conclusion
We formalized the concept of approximate Biased Histograms.
We proposed a new algorithm for generating approximate Biased Histograms which:
◦ works efficiently for data streams with sliding windows (no previous work for that),
◦ outperforms previous approaches for the entire data streams, and
◦ adapts to memory and CPU requirements by exploiting biased sampling.
We proved that BSBH is able to construct ε-approximate biased histogram for the case of having to concept shifts.

Rank Error vs. Size Error
Algorithms A1 and A2 construct two 5-bucket equi-depth histograms as follows:

Algorithm   Bucket sizes        Boundary ranks
A1          5, 10, 10, 10, 15   5, 15, 25, 35, 50
A2          5, 20, 1, 19, 5     5, 25, 26, 45, 50
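The difference between the two error measures can be checked by hand on this example. The sketch below assumes a window of W = 50 items (the sum of the bucket sizes) and B = 5, so the ideal bucket size is 10 and the ideal boundary ranks are 10, 20, 30, 40, 50; the helper names are hypothetical.

```python
def size_errors(sizes, window, num_buckets):
    """Absolute deviation of each bucket size from the ideal W / B."""
    ideal = window / num_buckets
    return [abs(s - ideal) for s in sizes]


def rank_errors(ranks, window, num_buckets):
    """Absolute deviation of each boundary rank from its ideal j * W / B."""
    ideal = window / num_buckets
    return [abs(r - (j + 1) * ideal) for j, r in enumerate(ranks)]


if __name__ == "__main__":
    W, B = 50, 5
    for name, sizes, ranks in [
        ("A1", [5, 10, 10, 10, 15], [5, 15, 25, 35, 50]),
        ("A2", [5, 20, 1, 19, 5], [5, 25, 26, 45, 50]),
    ]:
        print(name, max(rank_errors(ranks, W, B)), max(size_errors(sizes, W, B)))
```

Both histograms have a maximum rank error of 5, yet the maximum size error of A2 is 10 against 5 for A1, and its mean size error is 7.6 against 2. A rank-error guarantee alone therefore does not ensure well-balanced buckets.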