Probabilistic Aggregation in Distributed Networks
Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz
{hling, ravenben, adj,
June, 2004
Outline
  Background
  Motivation
    Statistical properties of real-life data streams
    Problems of existing approaches
  Our approach
    Reduce communication overhead
    Recover from loss
  Evaluation
  Conclusion and future work
Background
  Aggregate functions
    MIN, MAX, AVG, COUNT, etc.
  In-network hierarchical processing
    Query propagation
    Tree construction
    Aggregates computed epoch by epoch
  Addressing fault tolerance
    Multi-root
    Multi-tree
    Reliable transmission
  [Figure: aggregation tree over nodes A, B, C, D, E answering a COUNT query]
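As a minimal sketch of in-network hierarchical processing, each node can combine its own value with the partial aggregates received from its children and pass a single result upward. The tree edges, the readings, and all function names below are assumptions made for illustration; only the node names A to E come from the slide.

```python
from typing import Callable, Dict, List

# Hypothetical aggregation tree (parent -> children). The edges are assumed.
TREE: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": [],
    "D": [],
    "E": [],
}

def aggregate(node: str, tree: Dict[str, List[str]],
              local: Callable, combine: Callable):
    """Combine a node's local value with the partial aggregates of its children."""
    result = local(node)
    for child in tree.get(node, []):
        result = combine(result, aggregate(child, tree, local, combine))
    return result

def count(tree, root="A"):
    # Every node contributes 1; partial counts are summed on the way up.
    return aggregate(root, tree, lambda n: 1, lambda a, b: a + b)

def maximum(tree, readings, root="A"):
    # Each node forwards the larger of its own reading and its children's maxima.
    return aggregate(root, tree, lambda n: readings[n], max)

def average(tree, readings, root="A"):
    # AVG is carried upward as a (sum, count) pair and divided only at the root.
    total, n = aggregate(
        root, tree,
        lambda x: (readings[x], 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
    )
    return total / n
```

In one epoch, `count(TREE)` visits all five nodes once. AVG travels as a (sum, count) pair because averages of subtrees cannot be averaged directly when the subtrees differ in size.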
Motivation
  Data aggregation is an important function for all network infrastructures
    Sensor networks
    P2P networks
    Network monitoring and intrusion detection systems
  Exact results are not achievable in the face of loss and faults
    High cost when adding fault tolerance
  Accurate approximation with low communication overhead is crucial
    But it is difficult to achieve
Observation: Comparison of Data Streams
  [Figure: three real-world data traces and a random trace]
Statistical Properties of Data Streams
  [Figure: density estimation for relative increment]
  Real data streams are temporally correlated. We can leverage this correlation to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
  Relative increment is defined as: (formula shown on the slide; not reproduced here)
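The defining formula is not reproduced in this text. The sketch below assumes one common definition of relative increment, r[t] = (x[t] - x[t-1]) / x[t-1], which may differ from the one the authors used. The function name is hypothetical.

```python
def relative_increments(stream):
    """Return (x[t] - x[t-1]) / x[t-1] for each consecutive pair (assumed definition)."""
    out = []
    for prev, cur in zip(stream, stream[1:]):
        if prev == 0:
            # The ratio is undefined when the previous sample is zero.
            raise ValueError("relative increment is undefined when the previous value is 0")
        out.append((cur - prev) / prev)
    return out
```

Under this definition a temporally correlated stream produces increments concentrated near zero, whereas a random trace spreads them widely, which is the contrast a density estimate would make visible.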
Problems in Existing Approaches
  Few approaches exploit temporal properties or are designed to handle data loss
    Simple last-value algorithm for data loss recovery in TAG
    Multi-root/multi-tree schemes make things worse by consuming more resources
  Fragile for large process groups
    Need all relevant nodes to participate
  Difficult to trade accuracy for communication overhead
    Good applications need this tradeoff
      Only need an approximation
      But must minimize resource consumption
    Centralized adaptive filtering solution proposed by Olston et al.
Our Approach
  Probabilistic data aggregation: a scalable and robust approach
    Exploit statistical properties of data streams in the temporal domain
    Apply statistical algorithms to data aggregation
  Develop a protocol that handles loss and failures as an essential part of normal operation
    Nodes participate in aggregation and communication according to a statistical sampling algorithm
    In the absence of data, estimate the value using time series algorithms
    Differentiate between voluntary and involuntary loss
Reducing Communication Overhead
  Trade off accuracy against resource consumption
    Allow selective participation of nodes while maintaining aggregate accuracy
    Each node participates in the operation with a certain probability, which is the design parameter of the algorithm
  Sampling strategies:
    Uniform sampling: all nodes use the identical sampling rate
    Subtree-size based sampling: the sampling rate of a node is proportional to the size of its subtree
    Variance based sampling: a sensor reports a new value only if it differs from its last reported value by more than a threshold percentage
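The three sampling strategies can be sketched as per-node participation decisions. This is an illustration under stated assumptions, not the authors' implementation: the function names, the normalization by the largest subtree, the cap at 1.0, and the handling of a zero last-reported value are all choices made here.

```python
import random

def uniform_participates(rate, rng):
    """Uniform sampling: every node reports with the same probability `rate`."""
    return rng.random() < rate

def subtree_rate(subtree_size, largest_subtree, base_rate):
    """Subtree-size based sampling: rate proportional to subtree size.

    Normalizing by the largest subtree and capping at 1.0 are assumptions.
    """
    return min(1.0, base_rate * subtree_size / largest_subtree)

def variance_participates(current, last_reported, threshold):
    """Variance based sampling: report only if the value moved by more than
    `threshold` (a fraction, e.g. 0.1 for 10%) relative to the last report."""
    if last_reported == 0:
        # Relative change is undefined; report any nonzero value (assumption).
        return current != 0
    return abs(current - last_reported) / abs(last_reported) > threshold

if __name__ == "__main__":
    rng = random.Random(2004)  # seeded so repeated runs make the same choices
    print(uniform_participates(0.4, rng))
    print(uniform_participates(subtree_rate(5, 10, 0.8), rng))
    print(variance_participates(120.0, 100.0, 0.1))
```

Uniform and subtree-size based sampling decide before looking at the data, while variance based sampling decides from the data itself, so a node that stays silent under it implicitly signals that its value has barely changed.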
Performance of Sampling Algorithms
  As fewer nodes participate, overall accuracy decreases for all algorithms.
  Uniform sampling performs worst.
  Variance based sampling is most accurate.
  [Figure panels: MAX operation; AVG operation]
Observation: Long-Term Pattern in Data
  Data source: bandwidth measurements for the CUDI network interface on an Abilene router, reported as 5-minute averages.
  [Figure panels: daily patterns in a weekly data stream; long-term trend]
Two-Level Representation of Data
  The data stream can be decomposed into two layers:
    the long-term trend (pattern), which changes slowly;
    the residual, which has high frequency but low amplitude.
  [Figure: Monday data with its long-term trend]
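The two-layer split can be illustrated with a short sketch. The deck fits the trend with B-splines; to stay dependency-free, this example substitutes a centered moving average as the trend estimator, so it shows the idea of the decomposition rather than the authors' method. Function names and the window size are assumptions.

```python
def moving_average_trend(stream, half_window):
    """Slowly varying trend: mean over a window centered on each sample.

    The window shrinks at the edges so the output has the input's length.
    A moving average stands in here for the B-spline fit used in the deck.
    """
    n = len(stream)
    trend = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        trend.append(sum(stream[lo:hi]) / (hi - lo))
    return trend

def decompose(stream, half_window=3):
    """Split a stream into (trend, residual) with stream = trend + residual."""
    trend = moving_average_trend(stream, half_window)
    residual = [x - t for x, t in zip(stream, trend)]
    return trend, residual
```

By construction the two layers add back to the original stream, so each can be modeled and predicted separately and the predictions summed.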
Recovering From Loss
  Traditional approaches
    Last seen data as the approximation for the current epoch
    Linear prediction
  Two-level data representation and prediction
    Long-term trend: B-spline estimation
    High-frequency residual: ARMA modeling
  ARMA stands for AutoRegressive Moving Average, a standard time series technique for modeling chaotic data streams
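The two traditional approaches can be sketched in a few lines. "Linear prediction" is interpreted here as extrapolating the straight line through the last two received samples, which is an assumption; the slide does not say how many past samples are used. Function names are hypothetical.

```python
def last_value(history):
    """Last-value recovery: reuse the most recent sample for the missing epoch."""
    return history[-1]

def linear_prediction(history):
    """Extrapolate the line through the last two samples one epoch ahead.

    Falls back to the last value when only one sample is available.
    """
    if len(history) < 2:
        return history[-1]
    return 2 * history[-1] - history[-2]
```

For a history of `[1, 2, 4]`, last-value recovery yields 4 and the two-point extrapolation yields 6. Neither uses anything older than two epochs, which is the limitation the two-level approach addresses.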
Two-Level Data Prediction
  B-spline modeling for the long-term trend
    Piecewise continuous, low-degree B-splines can represent complex shapes
    Least-squares B-spline regression for two-level decomposition
    B-spline extension for future forecasting
  ARMA forecasting for transient oscillation
    System identification to determine the order of the model
    Parameter estimation by an optimization algorithm
    Low-complexity recursive equation for future forecasting
  Statistical properties for the calibration of prediction results
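The recursive forecasting step can be illustrated with the simplest member of the ARMA family. The sketch below swaps the full ARMA model for a first-order autoregressive model, AR(1), fitted by least squares, and takes the trend forecast as a given input instead of extending a B-spline. The model order, the estimator, and every name here are assumptions for illustration.

```python
def fit_ar1(residual):
    """Least-squares estimate of phi in r[t] = phi * r[t-1] + noise."""
    num = sum(cur * prev for cur, prev in zip(residual[1:], residual[:-1]))
    den = sum(prev * prev for prev in residual[:-1])
    return num / den if den else 0.0

def forecast_ar1(last_residual, phi, steps):
    """Recursive forecast: each step is phi times the previous forecast."""
    out = []
    value = last_residual
    for _ in range(steps):
        value = phi * value
        out.append(value)
    return out

def two_level_forecast(trend_forecast, residual, steps):
    """Add the forecast residual to an externally supplied trend forecast."""
    phi = fit_ar1(residual)
    residual_forecast = forecast_ar1(residual[-1], phi, steps)
    return [t + r for t, r in zip(trend_forecast, residual_forecast)]
```

With residuals `[8.0, 4.0, 2.0, 1.0]` the fitted coefficient is 0.5, so the next two residual forecasts are 0.5 and 0.25. Each forecast costs one multiplication per step, which is the sense in which the recursion is low in complexity.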
Performance of Prediction Algorithms
  [Figure: performance of prediction algorithms for the MAX operation in a lossless environment]
Performance of Prediction Algorithms
  Performance of prediction algorithms in lossy environments.
  The average loss rate of the network is 20%.
  The ratio of loss rates between wide-area links and local links is 3:1.
Summary of Results
  All prediction algorithms are effective in improving the accuracy of aggregation results
  The two-level prediction approach performs best in all situations
    Achieves more than 90% accuracy even when each node's nonparticipation rate is as high as 60%
    Is effective even in a high-loss environment
Conclusion and Future Work
  Apply statistical algorithms to data aggregation systems
    Quantify the statistical properties of real-world measurement data
    Propose the concept of probabilistic participation of nodes
    Propose a multi-level prediction mechanism to recover from sampling and data loss
  Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation
  Future work
    Develop an online algorithm and exploit the tradeoff between prediction accuracy and computation and storage cost
    Build a real system for applications in network health monitoring, traffic measurement and router statistics aggregation
    Real system implementation and deployment
The Danger of Prediction
  [Figure panels: prediction without statistical calibration; prediction with statistical calibration]