Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.

Slides:



Advertisements
Similar presentations
The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.
Advertisements

Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO Based on a paper in STOC, 2012.
Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO.
The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Lecture 1: Logical, Physical & Casual Time (Part 2) Anish Arora CSE 763.
Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.
CS3771 Today: deadlock detection and election algorithms  Previous class Event ordering in distributed systems Various approaches for Mutual Exclusion.
Summarizing Distributed Data Ke Yi HKUST += ?. Small summaries for BIG data  Allow approximate computation with guarantees and small space – save space,
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Review: Routing algorithms Distance Vector algorithm. –What information is maintained in each router? –How to distribute the global network information?
Maintaining Sliding Widow Skylines on Data Streams.
ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
1 A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch Rensselaer Polytechnic Institute Srikanta Tirthapura.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
Efficient Constraint Monitoring Using Adaptive Thresholds Srinivas Kashyap, IBM T. J. Watson Research Center Jeyashankar Ramamirtham, Netcore Solutions.
Mobile and Wireless Computing Institute for Computer Science, University of Freiburg Western Australian Interactive Virtual Environments Centre (IVEC)
Distributed Set-Expression Cardinality Estimation Abhinandan Das (Cornell U.) Sumit Ganguly (I.I.T. Kanpur) Minos Garofalakis (Bell Labs.) Rajeev Rastogi.
Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)
Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.
1 Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams Robert Schweller Ashish Gupta Elliot Parsons Yan Chen Computer.
Computing Diameter in the Streaming and Sliding-Window Models J. Feigenbaum, S. Kannan, J. Zhang.
1 Distributed Streams Algorithms for Sliding Windows Phillip B. Gibbons, Srikanta Tirthapura.
Communication-Efficient Distributed Monitoring of Thresholded Counts Ram Keralapura, UC-Davis Graham Cormode, Bell Labs Jai Ramamirtham, Bell Labs.
Time-Decaying Sketches for Sensor Data Aggregation Graham Cormode AT&T Labs, Research Srikanta Tirthapura Dept. of Electrical and Computer Engineering.
A survey on stream data mining
Cumulative Violation For any window size  t  Communication-Efficient Tracking for Distributed Cumulative Triggers Ling Huang* Minos Garofalakis.
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Detecting SYN-Flooding Attacks Aaron Beach CS 395 Network Secu rity Spring 2004.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding Yan-Nei Law* and Carlo Zaniolo Computer Science Dept. UCLA * Bioinformatics.
Energy-efficient Self-adapting Online Linear Forecasting for Wireless Sensor Network Applications Jai-Jin Lim and Kang G. Shin Real-Time Computing Laboratory,
CS 580S Sensor Networks and Systems Professor Kyoung Don Kang Lecture 7 February 13, 2006.
Error Checking continued. Network Layers in Action Each layer in the OSI Model will add header information that pertains to that specific protocol. On.
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
An adaptive framework of multiple schemes for event and query distribution in wireless sensor networks Vincent Tam, Keng-Teck Ma, and King-Shan Lui IEEE.
Distributed Constraint Optimization Michal Jakob Agent Technology Center, Dept. of Computer Science and Engineering, FEE, Czech Technical University A4M33MAS.
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
© 2009 IBM Corporation 1 Improving Consolidation of Virtual Machines with Risk-aware Bandwidth Oversubscription in Compute Clouds Amir Epstein Joint work.
Department of Computer Science City University of Hong Kong Department of Computer Science City University of Hong Kong 1 A Statistics-Based Sensor Selection.
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
Intel Research Sketching Streams through the Net: Distributed Approximate Query Tracking Graham Cormode Bell Laboratories Minos Garofalakis.
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
1 LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams Qun Huang and Patrick P. C. Lee The Chinese.
Energy-Efficient Monitoring of Extreme Values in Sensor Networks Loo, Kin Kong 10 May, 2007.
Getting the Most out of Your Sample Edith Cohen Haim Kaplan Tel Aviv University.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Communication Systems 3.1) Characteristics of a Communication System.
Agenda Fail Stop Processors –Problem Definition –Implementation with reliable stable storage –Implementation without reliable stable storage Failure Detection.
McGraw-Hill©The McGraw-Hill Companies, Inc., 2004 Connecting Devices CORPORATE INSTITUTE OF SCIENCE & TECHNOLOGY, BHOPAL Department of Electronics and.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
RMON 1. RMON is a set of standardized MIB variables that monitor networks. Even if RMON initially referred to only the RMON MIB, the term RMON now is.
@ Carnegie Mellon Databases 1 Finding Frequent Items in Distributed Data Streams Amit Manjhi V. Shkapenyuk, K. Dhamdhere, C. Olston Carnegie Mellon University.
June 16, 2004 PODS 1 Approximate Counts and Quantiles over Sliding Windows Arvind Arasu, Gurmeet Singh Manku Stanford University.
1 Chapter 11 Global Properties (Distributed Termination)
SCREAM: Sketch Resource Allocation for Software-defined Measurement Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat (CoNEXT’15)
SketchVisor: Robust Network Measurement for Software Packet Processing
Mining Data Streams (Part 1)
The Stream Model Sliding Windows Counting 1’s
Streaming & sampling.
Network Administration CNET-443
ECE 544 Protocol Design Project 2016
Rank Aggregation.
Range-Efficient Computation of F0 over Massive Data Streams
Network-Wide Routing Oblivious Heavy Hitters
Heavy Hitters in Streams and Sliding Windows
By: Ran Ben Basat, Technion, Israel
Lu Tang , Qun Huang, Patrick P. C. Lee
Maintaining Stream Statistics over Sliding Windows
Presentation transcript:

Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation Lap-Kei Lee Max Planck Institute for Informatics MotivationStream StatisticsBasic Counting Suppose a set of routers are monitoring a network. To allow an early detection of a Distributed Denial-of-service (DDoS) attack, we need to answer at any time: In the IP packets received by all the routers over the last hour, does there exist any frequent destination address? The problem is to minimize the communication overhead to maintain such statistics. We study algorithms that enable the root to answer the following classical ε-approximate queries, where 0 < ε ≤ 1 is a user-specified error bound. Let c be the total count of all items in the stream. Basic Counting: Estimate the total count c with absolute error εc. Frequent Items: Let 0 <φ ≤ 1 be a user-specified threshold. The query asks for a set of items, which contains  all frequent items appearing at least φ c times.  some items appearing at least (φ - ε) c times. It is well-known that to return frequent items, it suffices to answer Approximate Counting: Estimate the count of any item with absolute error εc. Quantiles: Given any 0 <φ ≤ 1, in the sorted order of all items in the stream, return an item with rank in [(φ - ε) c, (φ + ε) c]. Stream statistics can be computed over  whole data stream  a sliding window of recent items: all items with timestamps in the most recent W time units, where W is the window size. Sliding window is more difficult to handle as items will expire. Algorithm. Let λ=ε/9. Each remote site: keeps an λ-approximate local count. Let c cur and c old be current estimate and the previously sent estimate. At any time, it updates the root about c cur if the following event occurs  Up event: c cur - c old > 4λ c old  Down event: c old - c cur > 4λ c old The root: upon a query, returns the sum of all the k estimates received. Analysis techniques. In a remote site, let n be the number of items arriving or expiring in the current sliding window [t 1,t 2 ]. If items do not expire, only Up’s occur. When an Up occurs, n increases by a fraction of (1+Θ(ε)), so no. of Up’s is. Expiry of items destroys the monotonic property of n. Our idea is to define a characteristic set of items as a new measure of progress.  Up event: The set is the items arriving from t 1 up to the time the Up occurs. When an Up occurs, the size of this set increases by a fraction of (1+Θ(ε)), and thus the no. of Up’s is.  Down event: The set is the items expiring from t 1 up to the time the Down occurs. Similarly, the no. of Down’s is. The total no. of events is To reduce message size, we restrict the estimates to a predefined set, giving the required upper bound. Distributed Data Stream Model We have k ≥ 1 remote sites and a single root (or coordinator) distant apart.  Each remote site is monitoring a stream of items, where each item contains an item label and a timestamp. As the stream contains a massive volume of data items, the remote site cannot store the whole stream for processing, and hence each remote site can only maintain an approximation of some stream statistics on its own stream.  The root is required to compute the (approximate) global statistics on the union of the k streams.  Only communication between the root and each remote site is allowed; remote sites cannot communicate with each other. This restriction is practical; e.g., in a sensor network, the sensors are cheap devices with limited processing power and memory, and they cannot communicate with each other. Algorithms in this model are communication protocols for the root and remote sites. They can be classified into two types:  Two-way algorithms: bi-directional communication between the root and remote sites are allowed.  One-way algorithms: only the remote sites are allowed to communicate with the root and the root cannot send message to any remote site. One-way algorithms are usually simple and thus easy to implement, as each remote site has only local information about its own stream; the best the remote site can do is to update the root if its local statistics deviate too much from the one it previously sent to the root. Two-way communication allows more sophisticated algorithm. Thus, it is believed that two-way algorithms are more communication-efficient than one-way algorithms. Approximate Counting Previous and Our Results Algorithm. Let λ=ε/11. Each remote site: keeps an λ-approximate item count c cur (j) for each item j, and an λ/6-approximate total count c cur. At any time, for each item j, it updates c cur (j) if the following event occurs  Up event: c cur (j) - c old (j) > 9λ c cur  Down event: c old (j) - c cur (j) > 9λ c cur The root: upon a query on item j, returns the total estimate of j received. Analysis techniques. Up’s can be caused by a big increase of item j, or a small increase of j but significant drop of c cur, which requires different characteristic set analysis. Furthermore, the communication cost would depend on the no. of possible items. We observe that if c cur (j) < 3λ c cur, we can treat c cur (j) = 0 and stop updating j until c cur (j) ≥ 3λ. This keeps ‘active’ items at any time, leading to the required upper bound. All previous work focuses on the whole data stream setting. Let N be the number of items in all the k streams.  Basic Counting: (one-way) [2]  Frequent Items / Approximate Counting: (one-way) [1]; (two-way) [4]  Quantiles: (one-way) [1]; (two-way) [4] Our results [3] extend the above study to the sliding window setting. Let N be the number of items in all the k streams that arrive or expire within the current sliding window of W time units. We show upper bounds of one-way algorithms and lower bounds for any two-way algorithms.  Basic Counting: (this result also improves that in [2])  Frequent Items / Approximate Counting:  Quantiles: Our upper bounds match or nearly match the lower bounds, which reveals that for these statistics, two-way algorithms could not be much better than one-way algorithms in the worst case. References [1]Cormode, Garofalakis, Muthukrishnan, Rastogi. Holistic aggregates in a networked world: distributed tracking of approximate quantiles. SIGMOD [2]Keralapura, Cormode, Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. SIGMOD [3]Chan, Lam, Lee, Ting. Continuous monitoring of distributed data streams over a time-based sliding window. STACS [4]Yi, Zhang. Optimal tracking of distributed heavy hitters and quantiles. PODS A Time 1 B Time 2 A Time 5 C Time 8 D Time 9 E Time 9