SEBD Tutorial, June 2006. Monitoring Distributed Streams. Joint work with Tsachi Scharfman and Daniel Keren.

Sources
- A Geometric Approach to Monitoring Distributed Data Streams, SIGMOD 06 (Honorable Mention)
- Aggregate Threshold Queries in Sensor Networks, submitted to SENSYS 06
- Monitoring Many Features in Distributed Data Streams, in preparation for ICDM 06

Problem Definition
- A set of distributed data streams
  - Mirrored web site
  - Distributed spam filtering system
  - A sensor network
- A data vector is collected from each stream
  - Stream is infinite
  - Sliding/jumping windows
- Given: a function over the average of the data vectors
- Given: a predetermined threshold
- Question: did the function value cross the threshold?

Example 1: Web Page Frequency Counts
- Mirrored web site
- Each mirror maintains the frequency with which each page was accessed in the last 5 min.
- We would like to constantly maintain a list of the most frequently accessed web pages (as defined by a threshold)

Example 2: Air Quality Monitoring
- Sensors monitor the concentration of air pollutants.
- Each sensor holds a data vector comprising the measured concentrations of various pollutants (CO2, SO2, O3, etc.).
- A function on the average data vector determines the Air Quality Index (AQI)
- Alert in case the AQI exceeds a given threshold.

Example 3: Variance Alert
- Sensors monitor the temperature in a server room (machine room, conference room, etc.)
- Ensure uniform temperature: monitor the variance of the readings
- Alert in case the variance exceeds a threshold
- Temperature readings by n sensors: $x_1, \dots, x_n$
- Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$
- The average data vector is $v = \frac{1}{n}\sum_{i=1}^{n} v_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i\right)^T$
- Var(all sensors) = $\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$, i.e. the first component of $v$ minus the square of the second
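The variance example can be checked numerically. The following is a minimal sketch, assuming four hypothetical temperature readings; the function names are illustrative and not taken from the slides. It shows that the variance is a (non-linear) function of the average data vector alone.

```python
from statistics import pvariance

def local_vector(x):
    """Data vector v_i = (x_i^2, x_i) held by a single sensor."""
    return (x * x, x)

def average_vector(vectors):
    """Component-wise average of the sensors' data vectors."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)

def variance_from_average(v):
    """Var = mean(x^2) - mean(x)^2, computed from the average vector only."""
    return v[0] - v[1] ** 2

# Hypothetical readings, chosen only for illustration.
readings = [20.0, 21.5, 19.0, 22.5]
v = average_vector([local_vector(x) for x in readings])
var = variance_from_average(v)

# Compare against the population variance computed directly from the readings.
print(var, pvariance(readings))
```

The two printed values agree up to floating-point rounding, which is why monitoring the average vector is enough to monitor the variance.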

Example 4 (running example): Distributed Feature Selection
- A distributed spam mail filtering system.
- A mail server receives a stream of positive and negative examples.
- Select a set of features (words) to be used in order to build a spam classifier.
- A feature is good if its information gain is above a threshold.

Distributed Calculation of Information Gain
- Each server maintains a contingency table for each feature.
- We would like to determine, for each feature, whether the information gain on the average contingency table is above the threshold.
[Table: contingency table C_{i,j} with rows f, ¬f and columns Spam, ¬Spam]

Distributed Calculation of Information Gain – continued
- Note that the information gain on the average contingency table cannot be derived from the information gain on each individual contingency table!
[Figure: two example contingency tables C_1 and C_2, with IG(C_1) = 1 and IG(C_2) = 1]
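A small numeric illustration of this point follows. The two tables below are hypothetical (the slide's actual entries did not survive); they were chosen so that each has an information gain of 1 bit, as on the slide. Information gain is computed here as the mutual information between feature and class on a joint-probability table, which is one standard definition and an assumption about what the slides use.

```python
from math import log2

def information_gain(table):
    """Mutual information (bits) between feature and class for a
    2x2 joint-probability table [[P(f,Spam), P(f,~Spam)],
                                 [P(~f,Spam), P(~f,~Spam)]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    ig = 0.0
    for i, row in enumerate(table):
        for j, p in enumerate(row):
            if p > 0:  # 0 * log(0) is taken as 0
                ig += p * log2(p / (row_totals[i] * col_totals[j]))
    return ig

def average_table(a, b):
    """Entry-wise average of two tables."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical tables: in C1 the feature always co-occurs with spam,
# in C2 it never does.
C1 = [[0.5, 0.0], [0.0, 0.5]]
C2 = [[0.0, 0.5], [0.5, 0.0]]
avg = average_table(C1, C2)

print(information_gain(C1), information_gain(C2), information_gain(avg))
```

Both local tables are maximally informative, yet their average is the uniform table, whose information gain is zero. No function of the two local gains alone could reveal this.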

Previous Work
- Focused on linear functions (e.g., sum, average):
  - M. Dilman and D. Raz. Efficient reactive monitoring. In INFOCOM, pages 1012–1019.
- Previous solutions for arbitrary functions included only naive algorithms
  - All data is moved to a central place
  - Communication overhead
  - CPU overhead
  - Power overhead
  - Privacy issues

Novel Geometric Approach
- Geometric interpretation:
  - Each node holds a statistics vector
  - Coloring the vector space
    - Grey: function > threshold
    - White: function <= threshold
- Goal: determine the color of the global data vector (average).

Geometric Approach – Bounding the Convex Hull
- Observation: the average is in the convex hull of the drift vectors
- If the convex hull is monochromatic, then the average has the same color

Drift Vectors
- Rather than bounding the convex hull of the statistics vectors:
  - Periodically calculate an estimate vector: the current global value
  - Each node maintains a drift vector: the change in its local statistics vector since the last time an estimate vector was calculated (in relation to the estimate vector)
  - The global statistics vector is the average of the drift vectors

Distributively Bounding the Convex Hull
- A reference point is known to all nodes
- Each node constructs a ball
- Theorem: the convex hull is bounded by the union of the balls
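The bounding theorem can be spot-checked numerically. The sketch below assumes that each node's ball has the segment from the reference point to that node's drift vector as its diameter; the reference point, the drift vectors, and all function names are hypothetical. It samples random points of the convex hull and verifies that each falls inside at least one ball. This is an empirical check on one example, not a proof.

```python
import random

def ball(reference, drift):
    """Ball whose diameter is the segment from the reference point to the drift vector."""
    center = tuple((a + b) / 2 for a, b in zip(reference, drift))
    radius = sum((a - b) ** 2 for a, b in zip(reference, drift)) ** 0.5 / 2
    return center, radius

def in_ball(point, b, eps=1e-9):
    """True if the point lies in the closed ball (with a small float tolerance)."""
    center, radius = b
    dist = sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
    return dist <= radius + eps

def random_hull_point(points, rng):
    """Random convex combination of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return tuple(sum(w * p[k] for w, p in zip(weights, points)) / total
                 for k in range(dim))

rng = random.Random(0)                      # fixed seed for repeatability
reference = (0.0, 0.0)                      # hypothetical reference point
drifts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
balls = [ball(reference, u) for u in drifts]

samples = [random_hull_point(drifts, rng) for _ in range(1000)]
covered = all(any(in_ball(p, b) for b in balls) for p in samples)
print(covered)
```

Each node can build its own ball from purely local information plus the shared reference point, which is what makes the bound checkable without communication.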

Basic Algorithm
- An initial estimate vector is calculated
- Each node checks the color of its drift ball
- The segment from the estimate vector to the drift vector is the diameter of the drift ball
- If any ball is not monochromatic, synchronize the nodes
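One round of the basic algorithm can be sketched as follows. To keep the monochromaticity test in closed form, the sketch monitors the hypothetical function f(v) = ||v||^2 against a hypothetical threshold; neither is taken from the slides. Synchronization is modeled simply as recomputing the estimate vector from all current local vectors, and the names `monitor_step` and `ball_is_monochromatic` are illustrative.

```python
from math import hypot

THRESHOLD = 4.0  # hypothetical threshold on f(v) = ||v||^2

def ball_is_monochromatic(e, u, threshold=THRESHOLD):
    """True if f(v) = ||v||^2 stays on one side of the threshold
    over the ball whose diameter is the segment from e to u."""
    cx, cy = (e[0] + u[0]) / 2, (e[1] + u[1]) / 2
    r = hypot(e[0] - u[0], e[1] - u[1]) / 2
    d = hypot(cx, cy)                      # distance of the center from the origin
    f_max = (d + r) ** 2                   # farthest point of the ball from the origin
    f_min = max(0.0, d - r) ** 2           # closest point of the ball to the origin
    return f_max <= threshold or f_min > threshold   # all white or all grey

def average(vectors):
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)

def monitor_step(e, synced, current):
    """One round: each node forms its drift vector and checks its ball.
    Returns (estimate, snapshot of local vectors at last sync, synced_now)."""
    drifts = [(e[0] + c[0] - s[0], e[1] + c[1] - s[1])
              for s, c in zip(synced, current)]
    if all(ball_is_monochromatic(e, u) for u in drifts):
        return e, synced, False            # no communication needed
    return average(current), list(current), True

synced = [(1.0, 0.0), (0.0, 1.0)]          # local vectors at the last sync
e = average(synced)                        # estimate vector (0.5, 0.5)

quiet = monitor_step(e, synced, [(1.1, 0.0), (0.0, 1.1)])
noisy = monitor_step(e, synced, [(5.0, 0.0), (0.0, 1.0)])
print(quiet[2], noisy[2], noisy[0])
```

In the first round the drift balls stay well inside the white region, so no messages are sent. In the second, one node's ball straddles the threshold surface and the nodes synchronize to a new estimate vector.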

Reuters Corpus (RCV1-v2)
- 800,000+ news stories
- Aug ... to Aug ...
- Corporate/Industrial tagging simulates spam
- n=10

Trade-off: Accuracy vs. Performance
- Inefficiency arises when the value of the function on the average is close to the threshold
- Performance can be enhanced at the cost of a less accurate result:
  - Set an error margin around the threshold value

Scalability
- The number of messages per node is constant.

Balancing
- Globally calculating the average is costly
- It is often possible to average only some of the data vectors.

Computational Complexity of Calculating Distance from Zero Surface
- Closed-form solutions (variance alert)
- Numerical methods
- Offline computations and caching

Performance Analysis

Performance Analysis (continued)

Performance Analysis (continued)

Upper Bounds on Probability of Constraint Violation

Tiered Sensor Networks
- Network composed of two types of sensors: Macro-Nodes and Motes
- Motes:
  - Simple, inexpensive sensing units
  - Based on 8-bit processors
- Macro-Nodes:
  - Less resource constrained
  - Based on 32-bit processors; support more advanced OS and development tools

Monitoring Sensor Networks (1)
- A spanning tree is constructed over the connectivity graph
- The initial measurement vector is aggregated over the tree and flooded to all Motes
- Motes use the aggregated vector as the estimate vector
- An attempt is made to balance constraint violations within the cluster (intra-cluster balancing):
  - The Cluster Head iteratively selects motes and requests their drift vectors
  - Balancing succeeds if the average of the drift vectors collected from the motes creates a monochromatic ball with the estimate vector
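The intra-cluster balancing step can be sketched as below. The slides do not say how the Cluster Head chooses which mote to query next, so this sketch simply walks the motes in list order. As in any closed-form illustration, the monitored function f(v) = ||v||^2, the threshold, and all names are assumptions made for the example.

```python
from math import hypot

def ball_is_monochromatic(e, u, threshold):
    """Closed-form check for the hypothetical function f(v) = ||v||^2 over
    the ball whose diameter is the segment from e to u."""
    cx, cy = (e[0] + u[0]) / 2, (e[1] + u[1]) / 2
    r = hypot(e[0] - u[0], e[1] - u[1]) / 2
    d = hypot(cx, cy)
    return (d + r) ** 2 <= threshold or max(0.0, d - r) ** 2 > threshold

def try_balance(e, violator_drift, other_drifts, threshold):
    """Collect drift vectors one mote at a time until their average forms a
    monochromatic ball with the estimate vector e.
    Returns the number of drift vectors averaged, or None if balancing fails."""
    collected = [violator_drift]
    for u in other_drifts:                 # selection order is an assumption
        collected.append(u)
        n = len(collected)
        avg = (sum(v[0] for v in collected) / n,
               sum(v[1] for v in collected) / n)
        if ball_is_monochromatic(e, avg, threshold):
            return n
    return None

e = (0.5, 0.5)                             # hypothetical estimate vector
violator = (4.5, 0.5)                      # drift vector whose ball is bichromatic

ok = try_balance(e, violator, [(-3.0, 0.5), (0.5, 0.5)], threshold=4.0)
failed = try_balance(e, violator, [(4.5, 0.5)], threshold=4.0)
print(ok, failed)
```

In the first call a single opposing drift vector cancels the violation, so only two motes take part. In the second call the cluster cannot absorb the violation on its own, which is the case where the token-based extra-cluster balancing would take over.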

Monitoring Sensor Networks (2)
- If intra-cluster balancing fails, an attempt is made to balance the constraint violation by passing a token among the Cluster Heads (extra-cluster balancing):
  - The token consists of the average of the drift vectors held by the motes in the clusters the token has visited
  - Upon receipt of the token, the Cluster Head collects drift vectors from its motes and adds them to the token
- If extra-cluster balancing fails, the vector held by the token is flooded to the motes, which use it as the new estimate vector

Monitoring Sensor Networks (3)
- Token traversal is implemented as a depth-first search (DFS)
- Several tokens may simultaneously traverse the network, in which case they may be required to merge
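A single-token traversal can be sketched as follows. This is a simplified illustration: the Cluster Head graph, the drift vectors, and the `balanced` predicate are hypothetical; the DFS uses an explicit stack rather than modeling the token physically backtracking along links; every cluster is assumed to hold at least one mote; and token merging is not modeled.

```python
def dfs_token(graph, start, drifts_by_cluster, balanced):
    """Carry a token over the Cluster Heads in depth-first order.
    The token accumulates the running average of all drift vectors seen so far
    and stops as soon as `balanced(avg)` holds.
    Returns (visit order, token average, success flag)."""
    total, count = [0.0, 0.0], 0
    visited, seen, stack = [], set(), [start]
    avg = None
    while stack:
        head = stack.pop()
        if head in seen:
            continue
        seen.add(head)
        visited.append(head)
        for u in drifts_by_cluster[head]:      # Cluster Head collects from its motes
            total[0] += u[0]
            total[1] += u[1]
            count += 1
        avg = (total[0] / count, total[1] / count)
        if balanced(avg):
            return visited, avg, True
        stack.extend(reversed(graph[head]))    # visit neighbors in listed order
    return visited, avg, False                 # avg would become the new estimate

# Hypothetical topology and drift vectors.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
drifts = {"A": [(4.0, 0.0)], "B": [(2.0, 0.0)], "C": [(-6.0, 0.0)]}

# Stand-in for "the token average forms a monochromatic ball with the estimate".
def near_origin(v):
    return abs(v[0]) <= 1.0 and abs(v[1]) <= 1.0

order, token_avg, success = dfs_token(graph, "A", drifts, near_origin)
print(order, token_avg, success)
```

When the traversal ends without success, the returned average covers every mote in the visited clusters, matching the slide's description of flooding the token's vector as the new estimate vector.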

Data Set
- A 144x36 grid of temperature readings in the northern hemisphere
- Readings are taken every 6 hours over a period of a year
- Strong spatial and temporal correlation among the readings
- Average temperature ranges from ... to 15 degrees Centigrade

Experimental Results – Threshold

Experimental Results – Error Margin

Experimental Results – Cluster Size

Window Size

Simultaneous Features

Future Work
- Efficiently monitoring multiple objects
  - Exploiting correlations among objects
  - Monitoring top-k objects
- Improving spherical bounds
- Large-scale networks

Chi-Square

A =
          Spam    ¬Spam
    f     x1      x2
    ¬f    x3      x4
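For a table laid out as above, the chi-square statistic has a standard closed form. The sketch below uses that textbook formula with hypothetical counts; the slides do not show the formula itself, so treating it as the intended one is an assumption.

```python
def chi_square_2x2(x1, x2, x3, x4):
    """Pearson chi-square statistic for the 2x2 table
           Spam  ~Spam
      f     x1    x2
      ~f    x3    x4
    using chi2 = N (x1*x4 - x2*x3)^2 / (row and column totals multiplied)."""
    n = x1 + x2 + x3 + x4
    denom = (x1 + x2) * (x3 + x4) * (x1 + x3) * (x2 + x4)
    if denom == 0:
        return 0.0          # a row or column is empty; statistic is undefined, report 0
    return n * (x1 * x4 - x2 * x3) ** 2 / denom

# Hypothetical counts.
weak = chi_square_2x2(10, 20, 30, 40)      # feature weakly related to the class
independent = chi_square_2x2(1, 2, 2, 4)   # rows proportional: no association
print(weak, independent)
```

Like information gain, the statistic is a non-linear function of the table entries, so it fits the same monitoring framework when applied to the average table.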

Questions?

Bounding Theorem – Proof (1)

Bounding Theorem – Proof (2)

Bounding Theorem – Proof (3)

Bounding Theorem – Proof (4)

Bounding Theorem – Proof (5)