Estimating Join-Distinct Aggregates over Update Streams Minos Garofalakis Bell Labs, Lucent Technologies (Joint work with Sumit Ganguly, Amit Kumar, Rajeev.

Slides:



Advertisements
Similar presentations
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.
Advertisements

The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.
Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Fast Algorithms For Hierarchical Range Histogram Constructions
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Mining Data Streams.
1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
Distributed Set-Expression Cardinality Estimation Abhinandan Das (Cornell U.) Sumit Ganguly (I.I.T. Kanpur) Minos Garofalakis (Bell Labs.) Rajeev Rastogi.
Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)
Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
1 Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams Robert Schweller Ashish Gupta Elliot Parsons Yan Chen Computer.
Algorithms for massive data sets Lecture 2 (Mar 14, 2004) Yossi Matias & Ely Porat (partially based on various presentations & notes)
Polytechnic University,ECE Department1 Detection of “Hot Spots” Paper Title : Joint Data Streaming and Sampling Techniques for Detection of Super Sources.
Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Computer Science Spatio-Temporal Aggregation Using Sketches Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias Department of Computer.
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
Data Stream Mining and Querying
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Data Stream Processing (Part I) Alon,, Matias, Szegedy. “The space complexity of approximating the frequency moments”, ACM STOC’1996. Alon, Gibbons, Matias,
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.
Time-Decaying Sketches for Sensor Data Aggregation Graham Cormode AT&T Labs, Research Srikanta Tirthapura Dept. of Electrical and Computer Engineering.
Proof Sketches: Verifiable In-Network Aggregation Minos Garofalakis Yahoo! Research, UC Berkeley, Intel Research Berkeley
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Approximation Techniques for Data Management Systems “We are drowning in data but starved for knowledge” John Naisbitt CS 186 Fall 2005.
Statistic estimation over data stream Slides modified from Minos Garofalakis ( yahoo! research) and S. Muthukrishnan (Rutgers University)
Data Stream Processing (Part IV)
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
1 Administrivia  List of potential projects will be out by the end of the week  If you have specific project ideas, catch me during office hours (right.
Data Stream Processing (Part III) Gibbons. “Distinct sampling for highly accurate answers to distinct values queries and event reports”, VLDB’2001. Ganguly,
Fast Approximate Wavelet Tracking on Streams Graham Cormode Minos Garofalakis Dimitris Sacharidis
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Graham Cormode S. Muthukrishnan
Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Data Stream Algorithms Intro, Sampling, Entropy
Sketching Techniques for Massive Data Streams Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies.
Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Jennifer Rexford Princeton University MW 11:00am-12:20pm Measurement COS 597E: Software Defined Networking.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
Streaming Algorithms for Robust, Real-Time Detection of DDoS Attacks S. Ganguly M. Garofalakis R. Rastogi K.Sabnani Indian Inst. Of Tech. India Yahoo!
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
SCREAM: Sketch Resource Allocation for Software-defined Measurement Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat (CoNEXT’15)
Sketch-Based Multi-Query Processing over Data Streams Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Mining Data Streams (Part 1)
A paper on Join Synopses for Approximate Query Answering
Finding Frequent Items in Data Streams
Streaming & sampling.
Approximate Frequency Counts over Data Streams
Range-Efficient Computation of F0 over Massive Data Streams
Introduction to Stream Computing and Reservoir Sampling
Lu Tang , Qun Huang, Patrick P. C. Lee
Presentation transcript:

Estimating Join-Distinct Aggregates over Update Streams Minos Garofalakis Bell Labs, Lucent Technologies (Joint work with Sumit Ganguly, Amit Kumar, Rajeev Rastogi)

2 Motivation: Massive Network-Data Streams Broadband Internet Access PSTN DSL/Cable Networks Enterprise Networks Voice over IP FR, ATM, IP VPN Network Operations Center (NOC) SNMP/RMON, NetFlow records BGP Peer  SNMP/RMON/NetFlow data records arrive 24x7 from different parts of the network  Truly massive streams arriving at rapid rates –AT&T collects GigaBytes of NetFlow data each day!  Typically shipped to a back-end data warehouse for off-line analysis Source Destination Duration Bytes Protocol K http K http K http K http K http K ftp K ftp K ftp Example NetFlow IP Session Data Converged IP/MPLS Network

3 Real-Time Data-Stream Querying  Need ability to process/analyze network-data streams in real-time –As records stream in: look at records only once in arrival order! –Within resource (CPU, memory) limitations of the NOC –Different classes of analysis queries: top-k, quantiles, joins, …  Our focus: Join-Distinct (JD) aggregate queries –Estimating cardinality of duplicate-eliminating projection over a join  Critical to important NM tasks –Denial-of-Service attacks, SLA violations, real-time traffic engineering,… DBMS (Oracle, DB2) Back-end Data Warehouse Off-line analysis – Data access is slow, expensive Converged IP/MPLS Network PSTN Enterprise Networks Network Operations Center (NOC) BGP Peer R1 R2 SELECT COUNT DISTINCT (R1.source,R1.dest) FROM R1(source,dest), R2(source,dest) WHERE R1.source = R2.source How many distinct (source,dest) pairs are seen by R1 where “source” is also seen by R2?

4 Talk Outline  Introduction & Motivation  Data Stream Computation Model  Key Prior Work –FM sketches for distinct counting –2-level hash sketches for set-expression cardinalities  Our Solution: JD-Sketch Synopses –The basic structure –JD-sketch composition algorithm & JD estimator –Extensions  Experimental Results  Conclusions

5 Data - Stream Processing Model  Approximate answers often suffice, e.g., trend analysis, anomaly detection –Exact solution requires linear space (SET-DISJOINTNESS)  Requirements for stream synopses –Single Pass: Each record is examined at most once, in (fixed) arrival order –Small Space: Log or polylog in data stream size –Real-time: Per-record processing time (to maintain synopses) must be low –Delete-Proof: Can handle record deletions as well as insertions –Composable: Built in a distributed fashion and combined later Stream Processing Engine Approximate Answer with Error Guarantees “Within 2% of exact answer with high probability” Data Stream R(A,B) Query Q = (GigaBytes) (KiloBytes) Data Stream S(B,C) R(A,B) synopsis S(B,C) synopsis Memory

6 Existing Synopses for Relational Streams?  Conventional data summaries fall short –Samples (e.g., using Reservoir Sampling) Bad for joins and DISTINCT counting, cannot handle deletions –Multi-d histograms/wavelets Construction requires multiple passes, not useful for DISTINCT clauses  Combine existing stream-sketching solutions? –Hash (aka FM) sketches for distinct-value counting –AMS sketches for join-size estimation –Fundamentally different : Hashing vs. Random linear projections Effective combination seems difficult  Our Solution: JD-Sketch stream synopses –Novel, hash-based, log-sized stream summaries –Built independently over R, S streams, then composed to give JD estimate –Strong probabilistic accuracy guarantees

7  Problem: Estimate the number of distinct items in a stream of values from [0,…, M-1]  Assume a hash function h(x) that maps incoming values x in [0,…, M-1] uniformly across [0,…, 2^L-1], where L = O(logM)  Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y –A value x is mapped to lsb(h(x))  Maintain FM Sketch = BITMAP array of L bits, initialized to 0 –For each incoming value x, set BITMAP[ lsb(h(x)) ] = 1 Hash (aka FM) Sketches for Distinct Value Counting [FM85] x = 5 h(x) = lsb(h(x)) = BITMAP Data stream: Number of distinct values: 5

8 Hash (aka FM) Sketches for Distinct Value Counting [FM85]  By uniformity through h(x): Prob[ BITMAP[k]=1 ] = Prob[ ] = –Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1],...  Let R = position of rightmost zero in BITMAP –Use as indicator of log(d) –Estimate d = –Average several iid instances (different hash functions) to reduce estimator variance fringe of 0/1s around log(d) BITMAP position << log(d) position >> log(d) 0 L-1

9  Estimate cardinality of general set expressions over streams of updates –E.g., number of distinct (source,dest) pairs seen at both R1 and R2? |R1 R2|  2-Level Hash-Sketch (2LHS) stream synopsis: Generalizes FM sketch –First level: buckets with exponentially-decreasing probabilities (using lsb(h(x)), as in FM) –Second level: Count-signature array (logM+1 counters) One “total count” for elements in first-level bucket logM “bit-location counts” for 1-bits of incoming elements 2-Level Hash Sketches for Set Expression Cardinalities [GGR03] count7 count6 count5 count4 count3 count2count1 count0 TotCount insert(17) lsb(h(17)) 17 =

10  Build several independent 2LHS, fix a level l, and look for singleton first-level buckets at that level l  Singleton buckets and singleton element (in the bucket) are easily identified using the count signature  Singletons discovered form a distinct-value sample from the union of the streams –Frequency-independent, each value sampled with probability  Determine the fraction of “witnesses” for the set expression E in the sample, and scale-up to find the estimate for |E| Processing Set Expressions over Update Streams: Key Ideas 0 Total= Singleton element = 1010 = 10 2 Singleton bucket count signature level l

11  First level of hashing (hash fn+lsb) on the projected stream attribute  Second level of hashing (collection of independent 2LHS) on the join stream attribute  Maintenance: straightforward (based on 2LHS) –Composable, delete-proof, … The JD-Sketch Synopsis: Basic Structure Q = | (R(A,B) S(B,C))| A,C JD-sketch for R(A,B) lsb(h (a)) A

12  Input: Pair of (independently-built) parallel JD-sketches on the R(A,B) and S(B,C) streams –Same hash functions for corresponding 2LHS pairs  Output: FM-like summary (bitmap) for estimating the number of distinct joining (A,C) pairs  Key Technical Challenges –Want only (A,C) pairs that join to make it to our bitmap Idea: Use 2LHS in the A- and C-buckets to determine (approximately) if the corresponding B-multisets intersect –A- and C-values are observed independently and in arbitrary order in their respective streams Cannot directly hash arriving (A,C) pairs to a bitmap (traditional FM) -- all that we have are the JD-sketches for R, S! Idea: Employ novel, composable hash functions h (), h (), and a sketch-composition algorithm that guarantees FM-like properties Our JD Estimator: Composing JD-Sketch Synopses A C

13  Theorem: Using novel, composable linear hash functions, the above composition algorithm guarantees that –(A,C)-pairs map to final bitmap levels with exponentially-decreasing probabilities ( ) –(A,C)-pair mappings are pairwise-independent  Both facts are crucial for our analysis… Our JD Estimator: Composing JD-Sketch Synopses R(A,B) JD-sketch S(B,C) JD-sketch k m |BValueIntersect(k,m)|>=1 ? min{k,m} 1 Final, composed “FM-like” bitmap for joining (A,C) pairs

14  Build and maintain s2 independent, parallel JD-sketch pairs over the R(A,B) and S(B,C) streams  At estimation time –Compose each parallel JD-sketch pair, to obtain s2 “FM-like” bitmaps for joining (A,C) pairs –Find a level l in the composed bitmaps s.t. the fraction f of 1-bits lies in a certain range -- use f to estimate jd x Prob[level=l] Return jd f x Our JD Estimator: Estimation Algorithm & Analysis level l count(1’s) s2 jd

15  Theorem: Our JD estimator returns an ( )-estimate of JD cardinality using JD-sketches with a total space requirement of –U/T |B-value neighborhood|/ no. of joining B-values for randomly-chosen (A,C) pairs JDs with low “support” are harder to estimate  Lower bound based on information-theoretic arguments and Yao’s lemma –Our space requirements are within constant and log factors of best possible Our JD Estimator: Estimation Algorithm & Analysis

16  Other forms of JD-cardinality queries are easy to handle with JD-sketches – for instance, –One-sided (semi)joins (e.g., ) –“Full-projection” joins (e.g., ) –Just choose the right stream attributes to hash on at the two levels of the JD-sketch  Other JD-aggregates – e.g., estimating predicate selectivities over a JD operation –Key observation: Can use the JD-sketch to obtain a distinct-value sample of the JD result  For cases where |B| is small, we propose a different, -space JD synopsis and estimator –Based on simpler FM sketches built with composable hash functions –Conceptually simpler & easier to analyze, BUT requires at least linear space! Extensions | (R(A,B) S(B,C))| A,B | (R(A,B) S(B,C))| A,B,C

17 Experimental Results: JD-Sketches on Random-Graph Data

18  First space-efficient algorithmic techniques for estimating JD aggregates in the streaming model  Novel, hash-based sketch synopses –Log-sized, delete-proof (general update streams) –Independently built over individual streams –Effectively composed at estimation time to provide approximate answers with strong probabilistic accuracy guarantees  Verified effectiveness through preliminary experiments  One key technical idea: Composable Hash Functions –Build hash-based sketches on individual attributes that can be composed into a sketch for attribute combinations –Powerful idea that could have applications in other streaming problems… Conclusions

19 Thank you!

20 Experimental Results: Linear-Space JD-Estimator on Random-Graph Data