Published by Kathlyn Parsons. Modified over 6 years ago.
1
Time Sensitive Computation of Aggregate Functions over Distributed Imprecise Data
Qi Han, Matthew Ba Nguyen, Sandy Irani, Nalini Venkatasubramanian
Distributed Systems Middleware Group
School of Information and Computer Science
University of California-Irvine
2
Motivation
- Real-time applications often make decisions based on timely results aggregated over distributed data
- Example: real-time fault localization and problem diagnosis
  - How many nodes are overloaded? (count)
  - What is the total latency along the path N1 → N2 → … → Nk? (sum)
  - What is the bottleneck (minimum link bandwidth) along the path N1 → N2 → … → Nk? (min)
Recent advances in communication, mobile computing and embedded systems have enabled a variety of real-time distributed applications, such as stock information exchange and environmental sensing. These applications often make decisions based on timely results aggregated over distributed data from varying sources. For example, as distributed systems and networks continue to grow in size and complexity, real-time fault localization becomes more challenging. It is crucial to build an active real-time information gathering system that can ask the right question at the right time.
3
Challenges
- Continuous stream of fast-changing source data
- Diverse user requirements in terms of data accuracy and service timeliness
- Effective utilization of underlying computation, communication and storage resources
- Competing goals of timeliness, accuracy and cost-effectiveness
Providing a real-time information architecture poses several challenges. First, information sources provide a continuous stream of data that can vary dynamically over time, and the information may need to be captured and stored rapidly and accurately. Second, users requiring access to this data have diverse requirements in terms of data accuracy and service timeliness. Third, the collection should be done in an unobtrusive manner. We are therefore faced with the competing goals of timeliness, accuracy and cost. Fortunately, many applications are willing to tolerate information imprecision and bounded delivery latency. We would like to exploit these accuracy and latency margins to ensure that most applications receive information at the desired level of accuracy and timeliness while minimizing resource consumption.
4
This paper…
Previous research
- Tradeoff between accuracy and cost: storing data in ranges
- Tradeoffs between accuracy, cost and timeliness (single data item) – RTSS’2003
This paper addresses
- Tradeoffs between accuracy, cost and timeliness for aggregate functions (count, sum, min)
- Probing part of the sources might be sufficient
- How the server selects an appropriate subset of sources to probe so that the overall probing cost is minimized without violating the accuracy and timeliness requirements of aggregate queries
Given the data-intensive nature of the system, a database is a must for the system to function efficiently. Range-based approach… Directly applying these approaches to distributed real-time applications will not be effective, since queries have not only accuracy constraints but also time constraints, which specify the latest time by which the results of an aggregate query are expected to be available. This paper complements previous work in RTSS 2003 by addressing how the server… Specifically,
5
Problem Characterization
[Diagram: source 1, source 2, …, source i, …, source n connected to a server that maintains a database.]
- Source si: exact value Ei, range [Li,Ui] (shown for s1, s2, …, si, …, sn)
- Link from si to the server: ci/di (cost and latency of probing si)
- Database at the server: [L1,U1], [L2,U2], …, [Ln,Un]
- Queries: f(s1,s2,…,si,…,sn), A, D
  - A: accuracy constraint
  - D: time constraint
  - Example: min(s1,s2,…,sn), 3, 1 minute
System model:
- T_f varies (depending on whether probing is parallel or sequential)
Source model:
- After probing, l=u=e
- Cost and latency in probing each source vary
Query model:
- A: the actual value differs from the returned answer by at most A
6
Time-sensitive Computation of Aggregate Functions
- Compute the function based on stored values in the database
  - If the answer meets the accuracy constraint: done
  - Otherwise: select a probing set
- Only consider S_D: the subset of sources whose di ≤ D
- Batch selection: the entire set of sources to probe is selected before the probes actually occur
- Iterative selection: sources are probed one at a time
To minimize the probing cost under the time and accuracy constraints of user queries, we first compute the function based on the approximations stored in the database. If the answer does not satisfy the accuracy constraint of the user request, we decide on a set of sources to probe for exact values in order to improve the answer precision. Two basic approaches to probing set selection can be applied. In batch selection, the precision constraint must be guaranteed for any possible precise values of the sources in the probing set. Iterative selection is an online approach: the function is re-evaluated after each source is probed, and probing stops when the answer is precise enough. The answer is gradually refined over time. In this case, the goal is to shrink the answer interval as fast as possible.
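The first steps above (answer from the stored ranges, check the accuracy constraint, then restrict probing to S_D) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `Source`, `plan_query`, `sum_interval` and the `interval_fn` parameter are hypothetical names, and the accuracy test assumes the constraint is met when the width of the answer interval is at most A.

```python
from typing import NamedTuple


class Source(NamedTuple):
    """Hypothetical record for one source as stored at the server."""
    name: str
    low: float      # stored lower bound L_i
    high: float     # stored upper bound U_i
    cost: float     # probing cost c_i
    latency: float  # probing latency d_i


def sum_interval(sources):
    """Answer interval for sum, computed from the stored ranges only."""
    return (sum(s.low for s in sources), sum(s.high for s in sources))


def plan_query(sources, interval_fn, accuracy, deadline):
    """Return the stored answer interval and the candidate probing set.

    Assumption: the accuracy constraint A is met when the interval
    width is at most A. If it is met, nothing needs to be probed.
    Otherwise only S_D (sources with d_i <= D) is eligible for probing.
    """
    low, high = interval_fn(sources)
    if high - low <= accuracy:
        return (low, high), []
    s_d = [s for s in sources if s.latency <= deadline]
    return (low, high), s_d
```

A batch or iterative policy would then choose which members of the returned candidate list to probe.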
7
Batch Selection of Source Probing Set For Count (Batch_COUNT)
Problem: calculate the number of source values that fall inside the range r=[l,u]: fcount = |{si | si ∈ [l,u]}|
[Figure: sources s1 to s6 shown against the range r=[l,u], grouped as Inside, Outside and Uncertain.]
We can divide the sources into three sets: Inside, Outside and Uncertain (U).
Algorithm:
- If |U| > A: we must probe |U|-A sources to determine the function within the desired accuracy
  - If |U ∩ S_D| ≥ |U|-A: order the sources in U ∩ S_D according to increasing cost and probe the first |U|-A sources in this ordering
  - Otherwise: cannot determine the function to within the desired accuracy
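The count algorithm above can be sketched in Python. This is a sketch under assumptions rather than the authors' code: `Source`, `classify` and `batch_count` are hypothetical names, the classification rule (a stored range fully inside r is Inside, a range disjoint from r is Outside, anything else is Uncertain) is an assumed reading of the three sets, and A is taken to be a non-negative integer.

```python
from typing import NamedTuple, Optional, List


class Source(NamedTuple):
    """Hypothetical record for one source as stored at the server."""
    name: str
    low: float      # stored lower bound L_i
    high: float     # stored upper bound U_i
    cost: float     # probing cost c_i
    latency: float  # probing latency d_i


def classify(src, l, u):
    """Assumed rule for placing a stored range relative to r = [l, u]."""
    if l <= src.low and src.high <= u:
        return "inside"
    if src.high < l or src.low > u:
        return "outside"
    return "uncertain"


def batch_count(sources, l, u, accuracy, deadline) -> Optional[List[str]]:
    """Names of the sources to probe, or None if A cannot be met by D."""
    uncertain = [s for s in sources if classify(s, l, u) == "uncertain"]
    need = len(uncertain) - accuracy  # |U| - A probes are required
    if need <= 0:
        return []  # stored ranges already meet the accuracy constraint
    # U intersected with S_D: uncertain sources that can answer in time.
    eligible = [s for s in uncertain if s.latency <= deadline]
    if len(eligible) < need:
        return None
    eligible.sort(key=lambda s: s.cost)  # cheapest probes first
    return [s.name for s in eligible[:need]]
```

Sorting by cost is sufficient here because every probe of an uncertain source reduces the uncertainty by exactly one, so the cheapest |U|-A eligible sources minimize the total cost.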
8
Batch Selection of Source Probing Set for Sum (BATCH_SUM)
If we compute the function based on the stored intervals in the database, then the smallest possible sum occurs when all values are the lower bounds, and the largest possible sum occurs when all values are the upper bounds.
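The bounds described above can be computed directly from the stored intervals. The sketch below uses a hypothetical helper `sum_interval` over plain `(lower, upper)` pairs, and illustrates that replacing one range with its probed exact value shrinks the width of the answer by that range's width.

```python
def sum_interval(ranges):
    """Smallest and largest possible sum over (lower, upper) pairs."""
    low = sum(l for l, _ in ranges)
    high = sum(u for _, u in ranges)
    return low, high


# Stored ranges for three sources (illustrative values only).
stored = [(1, 3), (2, 6), (0, 1)]
low, high = sum_interval(stored)

# Suppose the second source is probed and returns the exact value 4,
# so its range collapses to a single point.
after_probe = [(1, 3), (4, 4), (0, 1)]
low2, high2 = sum_interval(after_probe)

# The uncertainty drops by the width of the probed range (6 - 2).
reduction = (high - low) - (high2 - low2)
```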
9
Batch Selection of Source Probing Set for Min (BATCH_MIN)
Example:
10
Issues in Computing min
In the case of computing count or sum:
- We know in advance exactly the benefit of probing any particular source, and thus can decide in advance which sources to probe
  - count: probing any source decreases the uncertainty by 1
  - sum: probing source si decreases the uncertainty by ui-li
In the case of computing min:
- The number of probes required may vary depending on the values of the sources
- Example: s1=[0,5], s2=[1,6], A=1
  - If we probe s1 and e1=2, then min=[1,2], done
  - If we probe s1 and e1=5, then min=[1,5], so s2 must be probed
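The example above can be checked with a few lines of Python. The helper `min_interval` is a hypothetical name: it bounds the minimum by the smallest lower bound and the smallest upper bound, and the check against A assumes the constraint is met when the interval width is at most A, which matches the "done" case in the example.

```python
def min_interval(ranges):
    """Interval containing the minimum of values known only by range."""
    low = min(l for l, _ in ranges)
    high = min(u for _, u in ranges)
    return low, high


A = 1
s2 = (1, 6)

# Case 1: probing s1 returns e1 = 2, so s1 collapses to [2, 2].
case1 = min_interval([(2, 2), s2])
case1_done = case1[1] - case1[0] <= A

# Case 2: probing s1 returns e1 = 5, so s1 collapses to [5, 5].
case2 = min_interval([(5, 5), s2])
case2_done = case2[1] - case2[0] <= A
```

In the first case the answer interval is narrow enough after one probe. In the second it is not, so s2 must also be probed, which is why the probing set for min cannot be fixed in advance the way it can for count or sum.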
11
Iterative Selection of Source Probing Set for Min
BATCH_MIN is a worst case analysis which assumes that the values returned always maximize the remaining uncertainty. Now we consider an average-case approach in which we assume that the value of each source is distributed uniformly over its range.
12
Performance Evaluation
Baseline policies compared with:
- GREEDY: probe all
- LAZY: probe none
- RANDOM
Performance metrics:
- Cost
- Accuracy ratio (a/A): measures how closely the answer interval a matches the accuracy constraint A
- Latency ratio (d/D): measures how closely the time d spent answering the query matches the time constraint D
- Accuracy satisfaction ratio: the percentage of queries with their accuracy constraints met
- Deadline satisfaction ratio: the percentage of queries with their time constraints met
When a/A<=1, the accuracy requirement is met; the smaller the ratio, the more accurate the answer.
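The ratio metrics above are simple to compute. The sketch below uses hypothetical helper names and made-up sample values (not results from the paper). It assumes a deadline is met when d/D<=1, by analogy with the stated accuracy rule a/A<=1.

```python
def ratio(measured, constraint):
    """a/A for accuracy or d/D for latency."""
    return measured / constraint


def satisfaction_ratio(ratios):
    """Percentage of queries whose ratio is at most 1."""
    met = sum(1 for r in ratios if r <= 1)
    return 100.0 * met / len(ratios)


# Illustrative (answer width a, accuracy constraint A) pairs only.
queries = [(1.0, 2.0), (3.0, 3.0), (4.0, 2.0), (0.5, 2.0)]
accuracy_ratios = [ratio(a, A) for a, A in queries]
accuracy_satisfaction = satisfaction_ratio(accuracy_ratios)
```

The same `satisfaction_ratio` helper applies to latency ratios to obtain the deadline satisfaction ratio.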
13
Basic Performance Results for Computing Count
Not surprisingly, GREEDY achieves the best answer accuracy at the price of the highest cost and latency. In contrast, LAZY provides the coarsest answer instantly by not probing any sources. BATCH exhibits answer accuracy and latency similar to RANDOM, with slightly lower probing cost. However, more queries meet their deadlines with BATCH. This is because BATCH gives higher priority to time constraints than to accuracy constraints, i.e., the best possible answer (in terms of accuracy) is provided only if the deadline is met.
14
Basic Performance Results for Computing Min
Compared with RANDOM, BATCH has a much higher deadline satisfaction ratio, since it does not probe sources whose probing latency is higher than D. Iterative selection probes sources sequentially, so the latency of a query is the sum of the individual probing latencies. Given the same D, fewer sources can be probed than with BATCH, which leads to a higher accuracy ratio (a less accurate result) and a lower accuracy satisfaction ratio.
15
Performance of Computing Count under varying accuracy constraints
When A is small, probing only the sources in S_D cannot provide a sufficiently accurate answer, which is why the first part of the curve is horizontal. As A increases, fewer probes are needed to provide an accurate answer.
16
Performance of Computing Count under varying time constraints
The turning points in the probing-cost curve match the stages of the algorithm. We only probe S_D. A smaller deadline leads to a smaller S_D, so the probing cost increases as the deadline increases. At the same time, the accuracy ratio decreases but is still larger than 1. When the deadline reaches a point where we can select a subset of S_D to probe, the probing cost decreases, since we select the sources with smaller probing costs. When the deadline increases further, no more improvement is obtained, since the same subset of sources is probed to meet the accuracy and time constraints.
17
Conclusions
- The worst-case analysis (batch selection algorithms) provides a bound on the cost to satisfy queries regardless of the exact values of the sources
- More sophisticated models, such as a Gaussian distribution, can be used to capture the change of source values
- It is also interesting to conduct competitive analysis of algorithms for answering aggregate queries
Worst-case analysis assumes that the values returned always maximize the remaining uncertainty.