SEBD Tutorial, June 2006. Monitoring Distributed Streams. Joint work with Tsachi Scharfman and Daniel Keren.

1 Monitoring Distributed Streams
Joint work with Tsachi Scharfman and Daniel Keren

2 Sources
- A Geometric Approach to Monitoring Distributed Data Streams, SIGMOD 06 (Honorable Mention)
- Aggregate Threshold Queries in Sensor Networks, submitted to SENSYS 06
- Monitoring Many Features in Distributed Data Streams, in preparation for ICDM 06

3 Problem Definition
- A set of distributed data streams:
  - Mirrored web site
  - Distributed spam filtering system
  - A sensor network
- A data vector is collected from each stream:
  - Stream is infinite
  - Sliding/jumping windows
- Given: a function over the average of the data vectors
- Given: a predetermined threshold
- Question: did the function value cross the threshold?

4 Example 1: Web Page Frequency Counts
- Mirrored web site
- Each mirror maintains the frequency with which each page was accessed in the last 5 min.
- We would like to constantly maintain a list of the most frequently accessed web pages (as defined by a threshold)

5 Example 2: Air Quality Monitoring
- Sensors monitor the concentration of air pollutants.
- Each sensor holds a data vector comprising the measured concentrations of various pollutants (CO2, SO2, O3, etc.).
- A function on the average data vector determines the Air Quality Index (AQI).
- Alert in case the AQI exceeds a given threshold.

6 Example 3: Variance Alert
- Sensors monitor the temperature in a server room (machine room, conference room, etc.)
- Ensure uniform temperature: monitor the variance of the readings
- Alert in case the variance exceeds a threshold
- Temperature readings by n sensors: $x_1, \ldots, x_n$
- Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$
- The average data vector is $v = \frac{1}{n}\sum_{i=1}^{n} v_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i\right)^T$
- Var(all sensors) $= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$, that is, the first component of $v$ minus the square of its second component
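
The variance example can be checked numerically. The following is a minimal sketch, not code from the tutorial: the function names and the sample readings are assumptions made for illustration. It builds each sensor's vector (x_i^2, x_i), averages the vectors, and recovers the population variance from the average alone.

```python
def local_vector(x):
    """Data vector held by one sensor: (x^2, x)."""
    return (x * x, x)


def average_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return tuple(sum(v[k] for v in vectors) / n for k in range(dim))


def variance_from_average(v):
    """Population variance: mean of squares minus square of the mean."""
    return v[0] - v[1] ** 2


# Hypothetical temperature readings from three sensors.
readings = [20.0, 22.0, 24.0]
avg = average_vector([local_vector(x) for x in readings])
variance = variance_from_average(avg)
```

The point of the construction is that the variance, although non-linear in the readings, becomes a function of a plain average of per-sensor vectors, which is the form the monitoring problem requires.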

7 Example 4 (running example): Distributed Feature Selection
- A distributed spam mail filtering system.
- A mail server receives a stream of positive and negative examples.
- Select a set of features (words) to be used in order to build a spam classifier.
- A feature is good if its information gain is above a threshold.

8 Distributed Calculation of Information Gain
- Each server maintains a contingency table for each feature.
- We would like to determine, for each feature, whether the information gain on the average contingency table is above the threshold.
- Example contingency table C_{i,j} for a feature f:

          Spam   ¬Spam
     f    0.1    0.2
    ¬f    0.2    0.5

9 Distributed Calculation of Information Gain – continued
- Note that the information gain on the average contingency table cannot be derived from the information gain on each individual contingency table!
- Example:

    C1 =  0.5   0        C2 =  0     0.5
          0     0.5            0.5   0

    IG(C1) = 1, IG(C2) = 1
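
The example can be reproduced with a short calculation. This is a sketch under stated assumptions: the function names are illustrative, tables hold joint relative frequencies with rows for feature present/absent and columns for spam/not spam, and information gain is measured in bits. Both C1 and C2 have an information gain of 1, while their average, in which every entry is 0.25, has an information gain of 0.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(table):
    """IG of the class given the feature, from a joint-frequency table.

    Rows are feature values (f, not f); columns are classes (spam, not spam).
    """
    class_marginals = [sum(col) for col in zip(*table)]
    gain = entropy(class_marginals)
    for row in table:
        p_row = sum(row)
        if p_row > 0:
            gain -= p_row * entropy([c / p_row for c in row])
    return gain


def average_tables(a, b):
    """Entry-wise average of two tables of the same shape."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


C1 = [[0.5, 0.0], [0.0, 0.5]]
C2 = [[0.0, 0.5], [0.5, 0.0]]
ig_c1 = information_gain(C1)
ig_c2 = information_gain(C2)
ig_avg = information_gain(average_tables(C1, C2))
```

Two servers that each see a perfectly predictive feature can therefore jointly hold a feature with no predictive value, which is why the local information gains alone are not enough.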

10 Previous Work
- Focused on linear functions (e.g., sum, average):
  - M. Dilman and D. Raz. Efficient reactive monitoring. In INFOCOM, pages 1012–1019, 2001.
- Previous solutions for arbitrary functions included only naïve algorithms:
  - All data is moved to a central place
  - Communication overhead
  - CPU overhead
  - Power overhead
  - Privacy issues

11 Novel Geometric Approach
- Geometric interpretation:
  - Each node holds a statistics vector
  - Coloring the vector space:
    - Grey: function > threshold
    - White: function <= threshold
- Goal: determine the color of the global data vector (the average).

12 Geometric Approach – Bounding the Convex Hull
- Observation: the average is in the convex hull of the drift vectors
- If the convex hull is monochromatic, then the average has the same color

13 Drift Vectors
- Rather than bounding the convex hull of the statistics vectors:
  - Periodically calculate an estimate vector: the current global value
  - Each node maintains a drift vector: the change in the local statistics vector since the last time an estimate vector was calculated (in relation to the estimate vector)
  - The global statistics vector is the average of the drift vectors
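
The last bullet follows from a one-line calculation. The symbols below are assumptions introduced for this derivation, since the slide text names no symbols: $v_i$ is the current local statistics vector of node $i$, $v_i'$ is its value when the estimate vector was last calculated, $e$ is the estimate vector, and $u_i$ is the drift vector. The nodes are assumed to be equally weighted.

```latex
\begin{align*}
e   &= \frac{1}{n}\sum_{i=1}^{n} v_i'
      && \text{(estimate vector at the last synchronization)} \\
u_i &= e + (v_i - v_i')
      && \text{(drift vector of node } i\text{)} \\
\frac{1}{n}\sum_{i=1}^{n} u_i
    &= e + \frac{1}{n}\sum_{i=1}^{n} v_i - \frac{1}{n}\sum_{i=1}^{n} v_i'
     = \frac{1}{n}\sum_{i=1}^{n} v_i
      && \text{(current global statistics vector)}
\end{align*}
```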

14 Distributively Bounding the Convex Hull
- A reference point is known to all nodes
- Each node constructs a ball
- Theorem: the convex hull is bounded by the union of the balls
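
The theorem can be spot-checked numerically. The sketch below rests on assumptions not spelled out on this slide: the reference point is the estimate vector, each node's ball has the segment from the estimate vector to its drift vector as its diameter, and the convex hull is taken over the estimate vector together with the drift vectors. All names and the sample vectors are illustrative. Random sampling only tests the claim; it does not prove it.

```python
import math
import random


def in_ball_with_diameter(p, a, b, tol=1e-9):
    """True if p lies in the ball whose diameter is the segment from a to b."""
    center = [(x + y) / 2 for x, y in zip(a, b)]
    return math.dist(p, center) <= math.dist(a, b) / 2 + tol


def random_convex_combination(points, rng):
    """A random point in the convex hull of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)]


def hull_samples_covered(estimate, drifts, samples=2000, seed=0):
    """Check that sampled hull points each fall inside at least one node's ball."""
    rng = random.Random(seed)
    hull_points = [estimate] + drifts
    for _ in range(samples):
        p = random_convex_combination(hull_points, rng)
        if not any(in_ball_with_diameter(p, estimate, u) for u in drifts):
            return False
    return True


# Hypothetical estimate vector and drift vectors of three nodes.
covered = hull_samples_covered([0.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]])
```

Because each ball depends only on the shared reference point and one node's own drift vector, every node can test its own ball without communicating.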

15 Basic Algorithm
- An initial estimate vector is calculated
- Nodes check the color of their drift ball
  - The segment from the estimate vector to the drift vector is the diameter of the drift ball
- If any ball is not monochromatic, synchronize the nodes
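
The local check can be sketched as follows. Everything here is an assumption made for illustration rather than the tutorial's implementation: the monitored function is taken to be f(v) = ||v||^2 because its threshold surface is a sphere, so the distance from a ball's center to that surface has a simple closed form. The function names and sample vectors are hypothetical.

```python
import math


def drift_ball(estimate, drift):
    """Ball whose diameter is the segment from the estimate to the drift vector."""
    center = [(e + u) / 2 for e, u in zip(estimate, drift)]
    radius = math.dist(estimate, drift) / 2
    return center, radius


def ball_is_monochromatic(center, radius, threshold):
    """Closed-form test for the illustrative function f(v) = ||v||^2.

    The surface f(v) = threshold is a sphere of radius sqrt(threshold) around
    the origin, so its distance from the center is | ||center|| - sqrt(threshold) |.
    The ball is monochromatic when it does not touch that surface.
    """
    distance_to_surface = abs(math.hypot(*center) - math.sqrt(threshold))
    return distance_to_surface > radius


def nodes_needing_sync(estimate, drifts, threshold):
    """Indices of nodes whose drift ball is not monochromatic."""
    return [i for i, u in enumerate(drifts)
            if not ball_is_monochromatic(*drift_ball(estimate, u), threshold)]


# Hypothetical state: two nodes, threshold 4.0 (sphere of radius 2).
violators = nodes_needing_sync([1.0, 1.0], [[1.2, 0.8], [2.5, 1.0]], 4.0)
```

In this sample the first node's small drift keeps its ball well inside the sphere, while the second node's ball crosses the surface, so only the second node would trigger a synchronization.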

16 Reuters Corpus (RCV1-v2)
- 800,000+ news stories
- Aug 20 1996 to Aug 19 1997
- Corporate/Industrial tagging simulates spam
- n=10

17 Trade-off: Accuracy vs. Performance
- Inefficiency arises when the value of the function on the average is close to the threshold
- Performance can be enhanced at the cost of a less accurate result:
  - Set an error margin around the threshold value

18 Scalability
- The number of messages per node is constant.

19 Balancing
- Globally calculating the average is costly
- It is often possible to average only some of the data vectors.

20 Computational Complexity of Calculating Distance from Zero Surface
- Closed form solutions (Variance alert)
- Numerical methods
- Offline computations and caching

21 Performance Analysis

22 Performance Analysis (continued)

23 Performance Analysis (continued)

24 Upper Bounds on Probability of Constraint Violation

25 Tiered Sensor Networks
- Network comprised of two types of sensors, Macro-Nodes and Motes
- Motes:
  - Simple, inexpensive sensing units
  - Based on 8-bit processors
- Macro-Nodes:
  - Less resource constrained
  - Based on 32-bit processors. Support more advanced OS and development tools

26 Monitoring Sensor Networks (1)
- A spanning tree is constructed over the connectivity graph
- The initial measurement vector is aggregated over the tree and flooded to all Motes
- Motes use the aggregated vector as the estimate vector
- An attempt is made to balance constraint violations within the cluster (intra-cluster balancing):
  - The Cluster Head iteratively selects motes and requests their drift vectors
  - Balancing succeeds if the average of the drift vectors collected from the motes creates a monochromatic ball with the estimate vector

27 Monitoring Sensor Networks (2)
- If intra-cluster balancing fails, an attempt is made to balance the constraint violation by passing a token among the Cluster Heads (extra-cluster balancing):
  - The token consists of the average of the drift vectors held by the motes in the clusters the token has visited
  - Upon receipt of the token, the Cluster Head collects drift vectors from its motes and adds them to the token
- If extra-cluster balancing fails, the vector held by the token is flooded to the motes, which use it as the new estimate vector
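
The token mechanics can be sketched as a running average that grows as clusters are visited. This is an illustration under assumptions, not the protocol's actual implementation: the `Token` class, the `pass_token` function, the visiting order, and the one-dimensional monochromaticity test in the example are all hypothetical, and the color test is supplied by the caller.

```python
class Token:
    """Carries the running average of the drift vectors collected so far."""

    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def add(self, drift_vectors):
        """Fold one cluster's drift vectors into the running sum."""
        for u in drift_vectors:
            self.total = [t + x for t, x in zip(self.total, u)]
            self.count += 1

    def average(self):
        if self.count == 0:
            raise ValueError("token has not collected any drift vectors")
        return [t / self.count for t in self.total]


def pass_token(estimate, clusters, is_monochromatic):
    """Visit clusters in order until the averaged drift vector balances.

    Returns (balanced, clusters_visited, averaged_vector). If balancing fails,
    the averaged vector covers every cluster and can serve as the new estimate.
    """
    token = Token(len(estimate))
    for visited, cluster_drifts in enumerate(clusters, start=1):
        token.add(cluster_drifts)
        if is_monochromatic(estimate, token.average()):
            return True, visited, token.average()
    return False, len(clusters), token.average()


# Toy 1-D case: f(v) = v with threshold 10, so the ball between the estimate
# and the averaged vector is monochromatic when both lie on the same side.
same_side = lambda e, u: (e[0] > 10.0) == (u[0] > 10.0)
result = pass_token([8.0], [[[13.0], [12.0]], [[5.0], [6.0]], [[8.0]]], same_side)
```

Here the first cluster alone averages above the threshold, but folding in the second cluster pulls the average back below it, so the token stops after two clusters without a global synchronization.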

28 Monitoring Sensor Networks (3)
- Token traversal is implemented as a depth-first search (DFS)
- Several tokens may simultaneously traverse the network, in which case they may be required to merge

29 Data Set
- 144x36 data points of temperature readings in the northern hemisphere
- Readings are taken every 6h for a period of a year
- Strong spatial and temporal correlation among data readings
- Average temperature ranges from -3.15 to 15 degrees Centigrade

30 Experimental Results – Threshold

31 Experimental Results – Error Margin

32 Experimental Results – Cluster Size

33 Window Size

34 Simultaneous Features

35 Future Work
- Efficiently monitoring multiple objects
  - Exploiting correlations among objects
  - Monitoring top-k objects
- Improving spherical bounds
- Large scale networks

36 Chi-Square
- Contingency table A for a feature f:

          Spam   ¬Spam
     f    x1     x2
    ¬f    x3     x4

37 Questions?

38 Bounding Theorem – Proof (1)

39 Bounding Theorem – Proof (2)

40 Bounding Theorem – Proof (3)

41 Bounding Theorem – Proof (4)

42 Bounding Theorem – Proof (5)

