Efficient Computation of Temporal Aggregates with Range Predicates D. Zhang *, A. Markowetz **, V. J. Tsotras *, D. Gunopulos * and B. Seeger ** * University of California, Riverside ** Philipps Universität Marburg, Germany
Outline Introduction & Motivation Problem Decomposition The MVSB-tree Performance Results Conclusions
Introduction & Motivation Consider a collection of temporal records. Each record has a key k, a value v, and a time interval [t_1, t_2]. E.g.: employees and their salaries over time. Temporal aggregation: aggregate values over time. Focus on SUM/COUNT/AVG.
Previous Work ‘Given time t, aggregate over all records that contain t’ [Tum92, KS95, YK97, GHR+, MLI00]. E.g. the sum at t_2 is 13. ‘Given interval [t_1, t_2], aggregate over all records that intersect [t_1, t_2]’ (SB-tree [YW01]). E.g. the sum over [t_1, t_2] is 28.
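The two query types above can be illustrated with a brute-force sketch. The record layout (key, value, start, end), the half-open interval convention [start, end), and the sample data are assumptions made for illustration; they are not taken from the slides' figure.

```python
from typing import NamedTuple


class Record(NamedTuple):
    key: int
    value: int
    start: int  # inclusive start of the time interval (assumed convention)
    end: int    # exclusive end of the time interval (assumed convention)


def instantaneous_sum(records, t):
    """SUM over all records whose interval contains time t."""
    return sum(r.value for r in records if r.start <= t < r.end)


def cumulative_sum(records, t1, t2):
    """SUM over all records whose interval intersects [t1, t2]."""
    return sum(r.value for r in records if r.start <= t2 and r.end > t1)


# Hypothetical sample data.
records = [
    Record(key=1, value=5, start=0, end=10),
    Record(key=2, value=7, start=3, end=6),
    Record(key=3, value=4, start=12, end=20),
]

at_4 = instantaneous_sum(records, 4)         # records 1 and 2 are alive at t=4
over_5_15 = cumulative_sum(records, 5, 15)   # all three intersect [5, 15]
```

Both functions scan every record, which is the cost the indexing approaches cited above are designed to avoid.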
Range-Temporal Aggregation (RTA) ‘Aggregate over all records intersecting interval [t_1, t_2] with keys in range [k_1, k_2]’. E.g. the RTA-sum over [k_1, k_2] x [t_1, t_2] is 19. Example query: find the AVG salary over the past ten years of all employees whose last names start with ‘B’.
Previous approaches would need a separate index for each possible key range (inefficient). Alternative: index the records and run a selection query, ‘find all records intersecting [k_1, k_2] x [t_1, t_2]’; query time is O(n). Our solution: O(log_b n).
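A minimal sketch of the straightforward O(n) approach: select every record that falls in the query rectangle, then aggregate. The tuple layout (key, value, start, end), the half-open intervals [start, end), and the sample data are assumptions for illustration; a real system would run the selection through a temporal index rather than a list scan, but the work is still proportional to the number of qualifying records.

```python
def rta_scan(records, k1, k2, t1, t2):
    """Return (SUM, COUNT, AVG) over records with key in [k1, k2]
    whose interval [start, end) intersects [t1, t2]."""
    total = 0
    count = 0
    for key, value, start, end in records:
        in_key_range = k1 <= key <= k2
        intersects = start <= t2 and end > t1
        if in_key_range and intersects:
            total += value
            count += 1
    avg = total / count if count else None
    return total, count, avg


# Hypothetical sample data: (key, value, start, end).
records = [
    (1, 5, 0, 10),
    (2, 7, 3, 6),
    (5, 4, 12, 20),
    (9, 3, 0, 2),
]

result = rta_scan(records, 1, 2, 5, 15)
```

AVG falls out of SUM and COUNT, which is why the talk can treat the three aggregates together.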
Problem Decomposition Decompose RTA into LKST and LKLT queries. LKST query: given k, t, aggregate over all records with keys less than k and intervals containing t. E.g. LKST(k_2, t_2) = 11.
LKLT query: given k, t, aggregate over all records with keys less than k and intervals ending before t. E.g. LKLT(k_2, t_2) = 20.
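The two definitions can be written down directly as brute-force scans. This is only a sketch of what the queries return, not of how the index answers them. It assumes records are tuples (key, value, start, end) with half-open intervals [start, end), so "ending before t" is read as end <= t (the record is no longer alive at t).

```python
def lkst(records, k, t):
    """Less-Key, Single-Time: SUM over records with key < k whose interval contains t."""
    return sum(v for key, v, start, end in records if key < k and start <= t < end)


def lklt(records, k, t):
    """Less-Key, Less-Time: SUM over records with key < k whose interval ended by t."""
    return sum(v for key, v, start, end in records if key < k and end <= t)


# Hypothetical sample data: (key, value, start, end).
records = [
    (1, 5, 0, 10),
    (2, 7, 3, 6),
    (5, 4, 12, 20),
    (9, 3, 0, 2),
]

alive = lkst(records, 5, 4)    # keys 1 and 2 are alive at t=4
ended = lklt(records, 10, 6)   # keys 2 and 9 have ended by t=6
```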
RTA([k_1, k_2] x [t_1, t_2]) = LKST(k_2, t_2) - LKST(k_1, t_2) + LKLT(k_2, t_2) - LKLT(k_1, t_2) - LKLT(k_2, t_1) + LKLT(k_1, t_1). The RTA query is thus decomposed into LKST and LKLT queries.
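The decomposition can be checked numerically against a direct scan. The sketch below assumes half-open conventions throughout so that the formula holds verbatim: intervals are [start, end), "ending before t" means end <= t, and the key range is treated as [k_1, k_2) because LKST and LKLT use "keys less than k". The random data generator and helper names are hypothetical.

```python
import random


def lkst(records, k, t):
    return sum(v for key, v, s, e in records if key < k and s <= t < e)


def lklt(records, k, t):
    return sum(v for key, v, s, e in records if key < k and e <= t)


def rta_scan(records, k1, k2, t1, t2):
    """Direct definition: keys in [k1, k2), interval intersecting [t1, t2]."""
    return sum(v for key, v, s, e in records
               if k1 <= key < k2 and s <= t2 and e > t1)


def rta_decomposed(records, k1, k2, t1, t2):
    """The six-term formula from the slide."""
    return (lkst(records, k2, t2) - lkst(records, k1, t2)
            + lklt(records, k2, t2) - lklt(records, k1, t2)
            - lklt(records, k2, t1) + lklt(records, k1, t1))


def random_records(n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        start = rng.randint(0, 50)
        end = start + rng.randint(1, 20)  # non-empty interval
        out.append((rng.randint(0, 30), rng.randint(1, 9), start, end))
    return out


def check(trials=1000, seed=1):
    """Compare the formula with the direct scan on random query rectangles."""
    rng = random.Random(seed)
    records = random_records(200)
    for _ in range(trials):
        k1, k2 = sorted(rng.randint(0, 31) for _ in range(2))
        t1, t2 = sorted(rng.randint(0, 75) for _ in range(2))
        if rta_decomposed(records, k1, k2, t1, t2) != rta_scan(records, k1, k2, t1, t2):
            return False
    return True
```

The reasoning behind the terms: a record intersecting [t_1, t_2] is either alive at t_2 (the two LKST terms) or ended by t_2 but not by t_1 (the four LKLT terms); subtracting the k_1 version of each term restricts the result to the key range.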
Index Design Both LKST and LKLT are point queries: ‘given k, t, return a value’. An index for LKST and LKLT should: store points in key-time space; maintain a value for each point; support point queries.
Model Assume updates arrive in increasing time order (transaction-time model). A record: at t_1, inserted as: [figure]; at t_2, updated as: [figure].
The LKST index The effect of inserting record (k, [t_1, t_2], v): at t_1: [figure]; at t_2: [figure].
The LKLT index The effect of inserting record (k, [t_1, t_2], v): at t_1: no update; at t_2: [figure].
Update Operation Common update operation for both: insert (k, t):v. That is: add v to all points in [k, t] x [k_max, t_max], i.e. the rectangle whose corners are (k, t) and (k_max, t_max). Conclusion: an index supporting point queries and the above update can be used for both LKLT and LKST.
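A naive in-memory stand-in shows the semantics of the common update and the point query. The class and function names are hypothetical, and it does a linear scan where the MVSB-tree would follow one root-to-leaf path. The mapping from a record to updates in `add_record` is an assumption derived from the LKST and LKLT definitions, since the slide figures are not available: +v at (k, t_1) and -v at (k, t_2) for LKST, and +v at (k, t_2) for LKLT. Because the update covers the point (k, t) itself, a query at (k, t) here includes key k.

```python
class DominanceSumIndex:
    """Naive stand-in: insert (k, t):v adds v to every point (k', t')
    with k' >= k and t' >= t; point_query reads the value at one point."""

    def __init__(self):
        self.updates = []

    def insert(self, k, t, v):
        self.updates.append((k, t, v))

    def point_query(self, k, t):
        # A point receives v from every update at or below it in both dimensions.
        return sum(v for uk, ut, v in self.updates if uk <= k and ut <= t)


def add_record(lkst_index, lklt_index, key, value, start, end):
    """Assumed mapping of one record (key, [start, end), value) to updates.
    In the transaction-time model the end arrives later as a separate update;
    both are applied together here for brevity."""
    lkst_index.insert(key, start, value)   # record becomes alive at start
    lkst_index.insert(key, end, -value)    # record stops being alive at end
    lklt_index.insert(key, end, value)     # record counts as ended from end on


lkst_index = DominanceSumIndex()
lklt_index = DominanceSumIndex()
for key, value, start, end in [(1, 5, 0, 10), (2, 7, 3, 6)]:  # hypothetical data
    add_record(lkst_index, lklt_index, key, value, start, end)
```

With this data, `lkst_index.point_query(2, 4)` sees both records alive, while `lkst_index.point_query(2, 7)` sees only the first, and `lklt_index.point_query(2, 7)` reports the second as ended.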
The MVSB-tree A partially persistent SB-tree. It inherits features from both the SB-tree [YW01] and the MVBT [BGO+96].
Insertion To handle overflow, copy records with end = t_max to a new page. Strong overflow: limit the number of records in a new page. Figure labels: root 1: [1, 4); root 2: [4, t_max).
Point Query (k, t) Follows a single path: the nodes containing (k, t). Aggregates the values found along this path. E.g.: PointQuery(23, 7) = 5 + 2 = 7.
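The single-path traversal can be sketched on a toy tree. The node layout below is invented for illustration and is not the figure from the slide: each entry covers a rectangle [k_lo, k_hi) x [t_lo, t_hi) in key-time space, stores a partial value, and may point to a child node. The entry values are chosen so that the query at (23, 7) adds 5 and 2, as in the slide's example.

```python
from dataclasses import dataclass
from typing import List, Optional

T_MAX = float("inf")  # stands in for t_max


@dataclass
class Entry:
    k_lo: int                 # key range [k_lo, k_hi)
    k_hi: int
    t_lo: float               # time range [t_lo, t_hi)
    t_hi: float
    value: int                # partial aggregate stored at this entry
    child: Optional["Node"] = None


@dataclass
class Node:
    entries: List[Entry]


def point_query(root, k, t):
    """Follow the one path of entries containing (k, t), summing their values."""
    total = 0
    node = root
    while node is not None:
        entry = next((e for e in node.entries
                      if e.k_lo <= k < e.k_hi and e.t_lo <= t < e.t_hi), None)
        if entry is None:
            break
        total += entry.value
        node = entry.child
    return total


# Hypothetical two-level tree valid for times in [4, t_max).
leaf = Node([Entry(20, 30, 4, T_MAX, 2), Entry(30, 40, 4, T_MAX, 1)])
root = Node([Entry(0, 20, 4, T_MAX, 3), Entry(20, 40, 4, T_MAX, 5, leaf)])

answer = point_query(root, 23, 7)
```

Because exactly one entry per level contains the query point, the cost is proportional to the height of the tree rather than to the number of records.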
Efficiency Theorem: with two MVSB-tree indices, we achieve: RTA query: O(log_b n); Update: O(log_b K); Space: O( * log_b K). Here n = number of updates; K = number of different keys; b = page capacity (in records).
Performance Results Platform: Sun Enterprise 250 Server; two 300 MHz UltraSPARC-II processors; Solaris 2.8; GNU C++. Datasets: created using the TimeIT [KS98] software and transformed to add record keys. Each dataset has a million records (10k unique keys; on average 100 intervals per key). Compared against the straightforward approach using the MVBT [BGO+96] as the temporal index.
Index Sizes [figure]
Query Speedup Query time is averaged over 100 queries of the same query rectangle size.
Conclusions We addressed the range-temporal aggregation (RTA) problem. New index structure (MVSB-tree) for incrementally maintaining and efficiently computing RTAs. Query time reduced from O(n) to O(log_b n) with small space overhead. Open problems: MIN/MAX range-temporal aggregation; valid-time environment; multi-dimensional aggregation over objects with extents.