Sketch-based Querying of Distributed Sliding-window Data Streams

Odysseas Papapetrou, Minos Garofalakis, Antonios Deligiannakis
SoftNet laboratory, Technical University of Crete, Greece

Streams and sliding windows
Querying of distributed sliding-window data streams:
  Distributed: many nodes/peers, many streams, aggregate statistics.
    Cannot afford to centralize all data.
  Sliding windows: only interested in recent data.
    Arrival-based model: account for the last X items.
    Time-based model: account for the items arriving in the last X minutes.
  Data streams: high-dimensional.
    Maintain occurrences of IP addresses.
    Maintain term frequencies in textual streams (e.g., emails).
  Small space/time.

Motivation example: Monitoring network packet traffic
Global statistics: monitor the distribution of packet traffic over IP addresses.
Challenge 1: Local statistics
  Compactly and efficiently maintain the IP address frequencies.
  Sliding window: use only recent packets, e.g., those of the last hour.
  Queries with multiple sliding window lengths!
Challenge 2: How to aggregate local statistics to get the global statistics.
Global statistics (over nodes n1, n2, ..., nj):
  ip          freq.
  10.0.3.4    121
  11.2.1.5    92
  20.3.5.6    281
  145.4.5.3   …
Local statistics:
  ip          freq.
  10.0.3.4    12
  20.3.5.6    120
  111.1.2.3   2
  121.2.1.1   11
  145.4.5.3   18
  …

Solution desiderata
Need a method/data structure to maintain the (local) stream statistics:
  Ability to handle sliding windows of arbitrary length.
  Fast: up to 10 million network packets per second.
  Small memory footprint: routers have only MB of memory.
  Network-efficient: local statistics are exchanged over the network.
  Composable: local statistics are aggregated to derive global statistics.
Our direction:
  Trade off statistics accuracy for efficiency (memory, network).
  Sketches: lossy summarizations of data streams.

Count-min sketches [Cormode, Muthukrishnan ‘05]
Generic sketch for maintaining frequencies, frequency moments, etc.
An array of w × d counters (w counters per row, d hash functions).
Each row i is associated with a hash function hi with range [1, w].
Example stream: x, 10z, y, x, 20y, 3k, … (x, y, z, … can correspond to IP addresses).
Figure: Add x increments (+1) one counter in each row, at columns h1(x) = 7, h2(x) = 1, h3(x) = 4, h4(x) = 6.

Count-min sketches
Estimating the frequency (point queries):
  Estimates can overestimate the true frequency because of hash collisions.
  Error is relative to the stream size.
Also enables inner join and self join queries!
Figure: example of Query x on an array of w counters × d hash functions.
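
The update and point-query steps from the two slides above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the salted BLAKE2 hashing, and the 0-based column range (the slides use [1, w]) are assumptions made for the example. A point query returns the minimum of the d counters that the key hashes to, which is why collisions can only inflate the estimate.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch: d rows of w counters (names are hypothetical)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        # One hash function per row, obtained here by salting BLAKE2 with the
        # row index (an assumption; any pairwise-independent family would do).
        digest = hashlib.blake2b(key.encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width  # 0-based column

    def add(self, key, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def estimate(self, key):
        # Collisions only add to a counter, so the minimum is the tightest bound.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


# The example stream from the slide: x, 10z, y, x, 20y, 3k
sketch = CountMinSketch(width=8, depth=4)
for key, count in [("x", 1), ("z", 10), ("y", 1), ("x", 1), ("y", 20), ("k", 3)]:
    sketch.add(key, count)
print(sketch.estimate("x"))  # never below the true frequency of x, which is 2
```

The estimate is never below the true frequency and never above the total stream size, which matches the slide's remark that the error is relative to the stream size.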

Sliding windows
But… sketches do not support sliding windows.
Several sliding window structures have been proposed:
  Exponential histograms, deterministic waves, randomized waves, ...
  Only simple statistics, e.g., count the number of one-bits over sliding windows.
This work: combine count-min sketches with sliding window structures.
Figure: a bit stream over time (100101101110101010111…0101101010101010) with the window to monitor marked.

Exponential histograms [Datar et al. ‘02]
Exponential histograms (and deterministic waves), key idea:
  Break the sliding window range into non-overlapping buckets of exponentially increasing sizes.
  Use these buckets for maintaining and estimating the aggregates.
  E.g., time 1 to 27: 8 one-bits arrived; time 27 to 35: 4 one-bits; …
Query execution: sum only the buckets in the query range, counting only half of the weight of the last (oldest) bucket.
Bucket information: ending time, number of one-bits.
Figure: buckets b1 to b5 with boundaries at times 1, 27, 35, 42, 47, 51 and bucket sizes 8, 4, 2, 1.
Required memory:
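
The bucket maintenance and the query rule above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure from the paper: the class and method names are hypothetical, timestamps are assumed to arrive in non-decreasing order, and the per-size bucket limit of ceil(k/2) + 1 with k = ceil(1/ε) follows the usual description of exponential histograms.

```python
import math


class ExponentialHistogram:
    """Illustrative exponential histogram counting one-bits in a time window."""

    def __init__(self, window, eps):
        self.window = window
        k = math.ceil(1 / eps)
        # Assumed merge threshold: at most ceil(k/2) + 1 buckets per size.
        self.max_same = math.ceil(k / 2) + 1
        self.buckets = []  # [end_time, size] pairs, oldest first

    def _expire(self, now):
        # Drop buckets whose most recent one-bit has left the window.
        while self.buckets and self.buckets[0][0] <= now - self.window:
            self.buckets.pop(0)

    def add(self, now):
        """Record a single one-bit arriving at time `now`."""
        self._expire(now)
        self.buckets.append([now, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_same:
                break
            oldest, second = same[0], same[1]
            # Merge the two oldest buckets of this size; the merged bucket
            # keeps the newer ending time and doubles in size.
            self.buckets[second][1] = 2 * size
            del self.buckets[oldest]
            size *= 2

    def estimate(self, now, query_range=None):
        """Estimate the one-bits in (now - query_range, now]; range <= window."""
        r = self.window if query_range is None else query_range
        in_range = [b for b in self.buckets if b[0] > now - r]
        if not in_range:
            return 0.0
        total = sum(size for _, size in in_range)
        # Only part of the oldest bucket may lie inside the range: count half.
        return total - in_range[0][1] / 2


eh = ExponentialHistogram(window=100, eps=0.1)
for t in range(1, 1001):
    eh.add(t)               # one one-bit per time unit
print(eh.estimate(1000))    # close to the true count of 100
```

Because every bucket keeps only its ending time and size, the same structure can answer queries for any range up to the window length, which is what the earlier slides call queries with multiple sliding window lengths.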

ECM-sketches
Two distinct functionalities:
  Sketches: summarize distributions, no sliding window functionality.
  Sliding window data structures: only simple statistics.
Our contributions: ECM-sketches
  Combine count-min sketches with sliding windows.
  Compact data stream summaries over sliding windows.
  Probabilistic guarantees for frequency and self join/inner product queries.

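ECM-sketches
An array of w counters × d hash functions, where the counters are sliding windows:
  Exponential histograms
  Deterministic waves
  Randomized waves
  ...
Updated and queried as with standard count-min sketches.
Figure: each counter is a sliding window structure, e.g., an exponential histogram with buckets b1 to b5, boundaries at times 1, 27, 35, 42, 47, 51 and bucket sizes 8, 4, 2, 1.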

ECM-sketches
Combine count-min sketches with sliding windows.
Example stream: (t1, z), (t3, 6x), (t5, y), ...
  Add (t1, z): record (t1, +1) in one sliding-window counter per row, at columns h1(z) = 5, h2(z) = 2, h3(z) = 8, h4(z) = 6.
  Query (t2, z): performed over the same array of w counters × d hash functions.
Error comes from both hash collisions and the estimation error of the sliding window counters.
Given the desired ε, the algorithm chooses the optimal configuration (d, w, sliding window).
Total size depends on the sliding window structure (detailed analysis in the paper).
Challenge 1: maintaining data stream statistics over sliding windows.
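
A minimal sketch of this combination is shown below. It is an illustration only: all names are hypothetical, timestamps are assumed to be non-decreasing, and an exact timestamp queue stands in for the exponential histogram (or wave) that a real ECM-sketch would place in each cell, so the only error left in this toy version comes from hash collisions.

```python
import hashlib
from collections import deque


class ExactWindowCounter:
    """Stand-in for an exponential histogram: keeps every timestamp exactly."""

    def __init__(self, window):
        self.window = window
        self.times = deque()

    def add(self, now, count=1):
        self.times.extend([now] * count)
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()  # forget arrivals that left the window

    def estimate(self, now, query_range):
        return sum(1 for t in self.times if now - query_range < t <= now)


class ECMSketch:
    """Count-min array whose cells are sliding-window counters (illustrative)."""

    def __init__(self, width, depth, window):
        self.width = width
        self.rows = [[ExactWindowCounter(window) for _ in range(width)]
                     for _ in range(depth)]

    def _column(self, row, key):
        # Assumed hashing scheme: BLAKE2 salted with the row index.
        digest = hashlib.blake2b(key.encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, now, key, count=1):
        # Same update path as count-min, but each cell records the timestamp.
        for row, counters in enumerate(self.rows):
            counters[self._column(row, key)].add(now, count)

    def estimate(self, now, key, query_range):
        # Same query path as count-min: minimum over the d cells of the key.
        return min(counters[self._column(row, key)].estimate(now, query_range)
                   for row, counters in enumerate(self.rows))


# The slide's stream (t1, z), (t3, 6x), (t5, y), taking t1 = 1, t3 = 3, t5 = 5.
ecm = ECMSketch(width=16, depth=4, window=100)
ecm.add(1, "z")
ecm.add(3, "x", 6)
ecm.add(5, "y")
print(ecm.estimate(5, "x", query_range=10))  # at least 6, the true count of x
```

Swapping `ExactWindowCounter` for an approximate sliding window structure adds the second error source named on the slide, the estimation error of the counters themselves.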

Aggregating ECM-sketches
Order-preserving aggregation:
  Stream 1: (1, A), (2, B), (10, C), (11, A), (17, D), (18, B), …
  Stream 2: (3, B), (6, A), (13, A), (14, A), (22, D), (27, B), …
  Aggregate: (1, A), (2, B), (3, B), (6, A), (10, C), (11, A), (13, A), (14, A), …
Composition of ECM-sketches: compose the corresponding counters.
  Requires composition of sliding windows!
Randomized sliding window structures:
  Trivial lossless aggregation, but very expensive (computation, memory, network).
Deterministic sliding window structures:
  More compact and efficient, but do not trivially support aggregation.
Figure: nodes n1, n2, ..., nj whose sketches are combined pairwise (+).
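
As a small worked example of what order-preserving aggregation means for the raw streams, the two streams from the slide can be merged by timestamp. The snippet uses Python's standard `heapq.merge`; the variable names are illustrative, and the trailing "…" of each stream is simply omitted.

```python
import heapq

# (timestamp, item) pairs as listed on the slide.
stream_1 = [(1, "A"), (2, "B"), (10, "C"), (11, "A"), (17, "D"), (18, "B")]
stream_2 = [(3, "B"), (6, "A"), (13, "A"), (14, "A"), (22, "D"), (27, "B")]

# Each input is already sorted by time, so a streaming merge keeps the order.
aggregate = list(heapq.merge(stream_1, stream_2, key=lambda event: event[0]))
print(aggregate[:8])
```

An aggregated ECM-sketch is meant to summarize this merged stream without any node ever shipping its raw events.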

Aggregation for deterministic sliding window structures
Key idea: use the sliding window buckets as logs to ‘re-play the streams’.
E.g., generate an aggregate exponential histogram as follows:
  For each bucket of size b, generate two events:
    b/2 one-bits arrive at the starting time of the bucket.
    b/2 one-bits arrive at the ending time of the bucket.
  Sort the events based on time.
  Construct a new exponential histogram with these events.
If each of the EHs has error ε, then the aggregated EH has error ≈2ε (worst-case analytic prediction, tight).
  Proof in the paper.
  The result holds for any number of exponential histograms composed.
Figure: two exponential histograms, each with buckets b1 to b5 and bucket sizes 8, 4, 2, 1; bucket boundaries at times 1, 27, 35, 42, 47, 51 and at times 1, 12, 22, 28, 31, 33.
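
The replay step can be sketched as follows. The function names and the (start, end, size) bucket representation are hypothetical, the example buckets are made up for illustration, and the treatment of size-1 buckets (a single event at the ending time, since b/2 is not an integer) is an assumption rather than something stated on the slide. The sketch stops at the sorted event log; feeding that log into a fresh exponential histogram is the final step described above.

```python
def replay_events(buckets):
    """Turn (start_time, end_time, size) buckets into (time, count) events."""
    events = []
    for start, end, size in buckets:
        if size == 1:
            # Assumption: a single one-bit is placed at the bucket's end time.
            events.append((end, 1))
        else:
            events.append((start, size // 2))       # half at the start
            events.append((end, size - size // 2))  # the rest at the end
    return events


def merged_event_log(*histograms):
    """Replay several histograms and sort all events by time."""
    events = [e for buckets in histograms for e in replay_events(buckets)]
    events.sort(key=lambda event: event[0])
    return events


# Two hypothetical histograms, each given as a list of buckets.
eh_a = [(1, 27, 8), (27, 35, 4)]
eh_b = [(1, 12, 8)]
log = merged_event_log(eh_a, eh_b)
print(log)  # → [(1, 4), (1, 4), (12, 4), (27, 4), (27, 2), (35, 2)]
```

The replay preserves the total number of one-bits; what it loses is the exact position of each bit inside its bucket, which is where the additional error in the aggregate comes from.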

Aggregating ECM-sketches
Given sketches A, B, ..., the aggregated sketch represents the order-preserving aggregation of all streams.
Challenge 2: aggregation of local statistics to get global statistics.
Figure: sketches A, B, C, … are combined (+) into a single aggregated sketch.

Experimental evaluation
ECM-sketches based on exponential histograms, deterministic waves, and randomized waves.
ε in [0.05, 0.25].
Centralized setting: evaluate individual ECM-sketches.
Distributed setting: nodes organized in a binary tree, aggregated ECM-sketches.
Dataset: World-cup ’98, approx. 1.1 billion HTTP requests (key: URL).
Queries: point queries (URL frequency) and self-join queries.
Observed error is relative to the stream size, as in conventional count-min sketches.
Sliding window of 1 million seconds (~11.5 days).
More results in the paper.

Estimation accuracy of ECM-sketches
ECM-sketches with exponential histograms:
  More efficient and more compact than deterministic waves.
  At least two orders of magnitude smaller compared to randomized waves.

Accuracy of aggregated ECM-sketches
ECM-sketches with randomized waves: error-free aggregation, high space complexity.
ECM-sketches based on deterministic sliding windows: error smaller than the worst-case analytic prediction.

Conclusions
ECM-sketches:
  The first data structure to enable sliding window statistics over high-dimensional streams.
  Enables composition with controllable error bounds.
Future work:
  ECM-sketches to continuously monitor functions over distributed data.
  Geometric method [Sharfman ‘06].

Thank you for your attention…
http://www.softnet.tuc.gr
http://www.lift-eu.org