
1 Probabilistic Aggregation in Distributed Networks
Ling Huang, Ben Zhao, Anthony Joseph, and John Kubiatowicz
{hling, ravenben, adj, kubitron}@eecs.berkeley.edu
June 2004

2 Outline
- Background
- Motivation
- Statistical properties of real-life data streams
- Problems with existing approaches
- Our approach: reduce communication overhead, recover from loss
- Evaluation
- Conclusion and future work

3 Background
Aggregate functions: MIN, MAX, AVG, COUNT, etc.
In-network hierarchical processing:
- Query propagation
- Tree construction
- Aggregates computed epoch by epoch
Addressing fault tolerance:
- Multi-root
- Multi-tree
- Reliable transmission
[Figure: example aggregation tree over nodes A-E answering a COUNT query]
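To make the in-network COUNT concrete, here is a minimal sketch (not the authors' code; the Node class and the tree shape are illustrative) of tree-based aggregation in which each node forwards a single partial count to its parent per epoch:

```python
class Node:
    """One node in the aggregation tree."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def count(self):
        # Each node adds itself to the partial counts reported by its
        # children, so only one value travels up each link per epoch.
        return 1 + sum(child.count() for child in self.children)

# Example tree with five nodes, as in the slide's COUNT example.
root = Node("A", [Node("B"), Node("C", [Node("D"), Node("E")])])
print(root.count())  # -> 5
```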

4 Motivation
Data aggregation is an important function for many network infrastructures:
- Sensor networks
- P2P networks
- Network monitoring and intrusion detection systems
Exact results are not achievable in the face of loss and faults, and adding fault tolerance is costly. Accurate approximation with low communication overhead is therefore crucial, but difficult to achieve.

5 Observation: Comparison of Data Streams
[Figure: three real-world data traces and a random trace]

6 Statistical Properties of Data Streams
[Figure: density estimation of the relative increment]
Real data streams exhibit temporal correlation, which we can leverage to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
The relative increment is defined as:
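The formula itself did not survive the transcript; a standard definition of the relative increment of a stream x_1, x_2, ..., assumed here, is

    r_t = (x_t - x_{t-1}) / x_{t-1}

that is, the change between consecutive epochs normalized by the previous value.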

7 Problems with Existing Approaches
Few approaches exploit temporal properties or are designed to handle data loss:
- TAG uses a simple last-value algorithm for data loss recovery
- Multi-root/multi-tree schemes make things worse by consuming more resources
Fragile for large process groups:
- Require all relevant nodes to participate
Difficult to trade accuracy for communication overhead:
- Applications need this tradeoff: they only need an approximation, but want to minimize resource consumption
- The adaptive-filtering solution proposed by Olston et al. is centralized

8 Our Approach
Probabilistic data aggregation: a scalable and robust approach.
- Exploit and leverage the statistical properties of data streams in the temporal domain
- Apply statistical algorithms to data aggregation
- Develop a protocol that handles loss and failures as an essential part of normal operation
Nodes participate in aggregation and communication according to a statistical sampling algorithm. In the absence of data, values are estimated using time-series algorithms. The protocol differentiates between voluntary and involuntary loss.

9 Reducing Communication Overhead
Trade off accuracy against resource consumption:
- Allow selective participation of nodes while maintaining aggregate accuracy
- A node participates in the operation with a certain probability, which is the design parameter of the algorithm
Sampling strategies (a minimal sketch follows):
- Uniform sampling: all nodes use the identical sampling rate
- Subtree-size based sampling: a node's sampling rate is proportional to the size of its subtree
- Variance based sampling: a node only reports a new value if it deviates from its last reported value by more than a threshold percentage
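A minimal sketch of the three participation rules, with hypothetical function and parameter names; each function returns True when a node should report in the current epoch:

```python
import random

def uniform_sampling(rate):
    # Every node reports with the same probability.
    return random.random() < rate

def subtree_size_sampling(subtree_size, scale):
    # Reporting probability is proportional to the node's subtree size,
    # capped at 1.
    return random.random() < min(1.0, scale * subtree_size)

def variance_based_sampling(new_value, last_reported, threshold):
    # Report only if the new value deviates from the last reported value
    # by more than a threshold fraction.
    if last_reported == 0:
        return True
    return abs(new_value - last_reported) / abs(last_reported) > threshold
```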

10 Performance of Sampling Algorithms
- As fewer nodes participate, overall accuracy decreases for all algorithms.
- Uniform sampling performs worst.
- Variance based sampling is most accurate.
[Figures: MAX operation and AVG operation]

11 Observation: Long-Term Pattern in Data
Data source: bandwidth measurements for the CUDI network interface on an Abilene router, 5-minute averages.
[Figures: daily patterns in a weekly data stream; long-term trend]

12 Two-Level Representation of Data
The data stream can be decomposed into two layers: the long-term trend (pattern), which changes slowly, and the residual, which has high frequency but low amplitude.
[Figure: Monday data with its long-term trend]
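In the notation assumed here (the slides give only the verbal description), the decomposition is

    x(t) = s(t) + r(t)

where s(t) is the slowly varying long-term trend and r(t) is the high-frequency, low-amplitude residual.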

13 Recovering From Loss
Traditional approaches:
- Use the last seen value as an approximation for the current epoch
- Linear prediction
Two-level data representation and prediction:
- Long-term trend: B-spline estimation
- High-frequency residual: ARMA modeling
- ARMA stands for the AutoRegressive Moving Average model, a standard time-series technique for modeling chaotic data streams
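For reference, the two traditional baselines amount to the following (a minimal sketch; `history` is the list of values received so far, and the names are illustrative):

```python
def last_value(history):
    # Reuse the most recently seen value as the estimate for this epoch.
    return history[-1]

def linear_prediction(history):
    # Extrapolate one step ahead from the last two observations.
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])
```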

14 Two-Level Data Prediction
B-spline modeling for the long-term trend:
- Piecewise-continuous, low-degree B-splines can represent complex shapes
- Least-squares B-spline regression for the two-level decomposition
- B-spline extension for future forecasting
ARMA forecasting for the transient oscillation:
- System identification to determine the order of the model
- Parameter estimation by an optimization algorithm
- Low-complexity recursive equation for future forecasting
Statistical properties are used for the calibration of prediction results.
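A minimal sketch of the two-level idea using off-the-shelf libraries (scipy for a least-squares smoothing B-spline, statsmodels for the ARMA fit); the smoothing factor and ARMA order are assumptions chosen for illustration, not the authors' settings:

```python
import numpy as np
from scipy.interpolate import splrep, splev
from statsmodels.tsa.arima.model import ARIMA

def two_level_forecast(t, x, t_next):
    # Level 1: a least-squares smoothing B-spline captures the long-term trend.
    spline = splrep(t, x, s=len(x))      # smoothing factor chosen for illustration
    trend = splev(t, spline)
    trend_next = splev(t_next, spline)   # extrapolated trend value

    # Level 2: an ARMA model fit on the high-frequency residual.
    residual = x - trend
    arma = ARIMA(residual, order=(2, 0, 1)).fit()  # order chosen for illustration
    resid_next = arma.forecast(steps=1)[0]

    return trend_next + resid_next

# Example: a slow trend plus noise, forecast one step past the data.
t = np.arange(100.0)
x = 10 + 0.05 * t + np.random.normal(0, 0.5, size=100)
print(two_level_forecast(t, x, 100.0))
```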

15 Performance of Prediction Algorithms
[Figure: performance of prediction algorithms for the MAX operation in a lossless environment]

16 Performance of Prediction Algorithms
Performance of prediction algorithms in lossy environments. The average loss rate of the network is 20%; the ratio of loss rates between wide-area links and local links is 3:1.

17 Summary of Results
- All prediction algorithms are effective in improving the accuracy of aggregation results.
- The two-level prediction approach performs best in all situations:
  - It achieves more than 90% accuracy even with per-node non-participation rates of up to 60%
  - It remains effective even in high-loss environments

18 Conclusion and Future Work
We apply statistical algorithms to data aggregation systems:
- Quantify the statistical properties of real-world measurement data
- Propose the concept of probabilistic participation of nodes
- Propose a multi-level prediction mechanism to recover from sampling and data loss
Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation.
Future work:
- Develop online algorithms and explore the tradeoff between prediction accuracy and computation/storage cost
- Build a real system for applications in network health monitoring, traffic measurement, and router statistics aggregation
- Real-system implementation and deployment

19 The Danger of Prediction
[Figures: prediction without statistical calibration vs. prediction with statistical calibration]

