TIME 2002, Manchester, UK
Index Based Processing of Semi-Restrictive Temporal Joins
Donghui Zhang, Vassilis J. Tsotras
University of California, Riverside

Contents
- Background
- Join problem definition
- Straightforward approaches
- Proposed join algorithms
- Performance study
- Conclusions

Background
- Temporal record: a (key, time interval) pair plus some attributes.
- TE-Join: two records qualify for the join if their time intervals intersect and their keys are equal.

Background
- Our earlier work [ICDE02] solved a general TE-Join (GTE-Join), in which only a portion of each relation is joined.
- Each portion is selected via a range-interval selection: record keys must lie in range r and time intervals must intersect interval i.
- This is interesting because (1) temporal relations are large; (2) TE-Join is the special case where r and i are both (-∞, +∞).

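The selection and join conditions above can be sketched as a nested-loop reference implementation. This is only an illustration, not the indexed algorithm from [ICDE02]: the tuple layout (key, start, end), the half-open interval convention, the closed key range, and the names selected, gte_join and EVERYTHING are all assumptions made for the example.

```python
import math

# Assumed record layout: (key, start, end), time interval half-open [start, end).

def selected(rec, r, i):
    """Range-interval selection: key inside r, interval intersecting i."""
    key, start, end = rec
    return r[0] <= key <= r[1] and start < i[1] and i[0] < end

def gte_join(xs, ys, r, i):
    """Join the selected portions on equal keys and intersecting intervals."""
    xs_sel = [x for x in xs if selected(x, r, i)]
    ys_sel = [y for y in ys if selected(y, r, i)]
    return [(x, y) for x in xs_sel for y in ys_sel
            if x[0] == y[0] and x[1] < y[2] and y[1] < x[2]]

# With r and i both unbounded, the GTE-Join degenerates to the plain TE-Join.
EVERYTHING = (-math.inf, math.inf)
```

Calling gte_join(xs, ys, EVERYTHING, EVERYTHING) therefore behaves as a TE-Join over the whole relations.
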
Problem Definition
- Semi-restrictive joins: records join if their keys are equal (GE-Join) or if their intervals intersect (GT-Join); the join condition tests only one of the two, not both.
- GE-Join: select a subset from X and a subset from Y, and join records from the subsets if their keys are equal.
- GT-Join: select a subset from X and a subset from Y, and join records from the subsets if their intervals intersect.

Problem Definition
- GT-Join example: find employees whose last names start with ‘B’ and who co-worked during 1995 with the employees whose last names start with ‘S’.
- GE-Join example: find the 1998 IBM employees who were UC Riverside students in 1995.

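The two example queries can be mimicked with a small select-then-join sketch. It is a naive reference only (no index is used), and the data, the tuple layout (key, start, end) with half-open year intervals, and the function names select, gt_join and ge_join are assumptions made for illustration.

```python
# Assumed record layout: (key, start, end), time interval half-open [start, end).

def select(records, interval, key_range=None):
    """Keep records whose interval intersects `interval` and, if given,
    whose key lies in the half-open range key_range = [lo, hi)."""
    t_lo, t_hi = interval
    out = []
    for rec in records:
        key, start, end = rec
        in_keys = key_range is None or key_range[0] <= key < key_range[1]
        if in_keys and start < t_hi and t_lo < end:
            out.append(rec)
    return out

def gt_join(xs, ys):
    """GT-Join condition: time intervals intersect (keys are ignored)."""
    return [(x, y) for x in xs for y in ys if x[1] < y[2] and y[1] < x[2]]

def ge_join(xs, ys):
    """GE-Join condition: keys are equal (intervals are ignored)."""
    return [(x, y) for x in xs for y in ys if x[0] == y[0]]

# GT-Join example: 'B' employees who co-worked during 1995 with 'S' employees.
employees = [("Baker", 1990, 1996), ("Brown", 1996, 2000),
             ("Smith", 1994, 1999), ("Stone", 1980, 1990)]
b_1995 = select(employees, (1995, 1996), ("B", "C"))
s_1995 = select(employees, (1995, 1996), ("S", "T"))
coworkers = gt_join(b_1995, s_1995)

# GE-Join example: 1998 IBM employees who were UC Riverside students in 1995.
ibm = [("alice", 1997, 2001), ("bob", 1990, 1997)]
ucr = [("alice", 1993, 1996), ("carol", 1994, 1996)]
alumni = ge_join(select(ibm, (1998, 1999)), select(ucr, (1995, 1996)))
```

Note that the two sides of each join use different selections, which is why each relation is filtered separately before the join condition is applied.
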
GT-Join Solutions...

Straightforward Solutions for GT-Join
1. Unsynchronized join.
2. Synchronized join using B+-trees.
3. Synchronized join using R-trees.

Straightforward Solutions for GT-Join
1. Unsynchronized join: separate the selection and join phases. Not efficient because:
- the intermediate result that must be stored can be large;
- selection in one relation ignores the data distribution of the other relation.

Straightforward Solutions for GT-Join
2. Synchronized using B+-trees. Not efficient:
- If clustered on start: a record y needs to be checked against every record whose start is before the end of y.
- Clustering on end is similar.

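A small sketch of why clustering on start is wasteful. A sorted Python list stands in for the leaf level of a B+-tree on start time; that substitution, the tuple layout (key, start, end) with half-open intervals, and the function name probe_clustered_on_start are assumptions for illustration.

```python
import bisect

# Assumed record layout: (key, start, end), time interval half-open [start, end).

def probe_clustered_on_start(xs_sorted, starts, y):
    """Return (inspected, matches) for one probe record y.

    The ordering on start only bounds the scan from above: every x with
    x.start < y.end has to be inspected, however early it ended.
    """
    hi = bisect.bisect_left(starts, y[2])      # first x with start >= y.end
    inspected = xs_sorted[:hi]
    matches = [x for x in inspected if y[1] < x[2]]
    return inspected, matches

# 100 short records, sorted by start, each alive for one time unit.
xs = [(s, s, s + 1) for s in range(100)]
starts = [x[1] for x in xs]

inspected, matches = probe_clustered_on_start(xs, starts, (0, 98, 99))
```

In this example a probe near the end of the timeline inspects 99 records to find a single match, which is the behavior the slide describes.
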
Straightforward Solutions for GT-Join
3. Synchronized using R-trees.
- Store each record as a two-dimensional interval in the R-tree.
- Use existing R-tree join algorithms [BKS93, HJR97].
- Modifications: (1) integrate the selection condition; (2) join index records as long as they intersect in the time dimension, ignoring the key dimension.
- However, not efficient, since R-trees do not handle long intervals well.

Our Solutions
- Synchronized join using temporal indices.
- Multi-version B+-tree (MVBT) [BGO+96]: asymptotically optimal space, update, and query.
- We propose three synchronized, MVBT-based join algorithms (they apply to other temporal indices as well).

Review of MVBT
- A “forest” of trees: different trees may overlap.
- Root nodes correspond to contiguous, non-intersecting time intervals.
- A record may be stored in multiple pages.
- Efficient range-interval selection algorithms.

Top-down GT-Join
- Idea: for each pair of trees, one from each MVBT forest, perform a synchronized tree traversal (STT).
- STT for two trees: initially, join the root nodes; to join two nodes, join their children; eventually, join the elements in leaf pages.
- Note that special care is needed to avoid duplicates, since a record has multiple copies.

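The traversal can be sketched on a deliberately simplified tree. This is not an MVBT: each node here only carries the time span of what lies below it, every record is stored once (so the duplicate handling mentioned above is not needed and is not shown), and the selection condition is omitted. The class Node and the functions overlap and stt are hypothetical names for this example.

```python
# Assumed record layout: (key, start, end), time interval half-open [start, end).

class Node:
    """Tree node covering the time span [start, end) of everything below it."""
    def __init__(self, children=(), records=()):
        self.children = list(children)   # inner node: child nodes
        self.records = list(records)     # leaf page: record tuples
        spans = [(c.start, c.end) for c in self.children]
        spans += [(s, e) for _, s, e in self.records]
        self.start = min(s for s, _ in spans)
        self.end = max(e for _, e in spans)

def overlap(s1, e1, s2, e2):
    return s1 < e2 and s2 < e1

def stt(a, b, out):
    """Depth-first synchronized traversal joining on time intersection."""
    if not overlap(a.start, a.end, b.start, b.end):
        return                           # prune: nothing below can join
    if not a.children and not b.children:
        for x in a.records:              # both are leaf pages: join elements
            for y in b.records:
                if overlap(x[1], x[2], y[1], y[2]):
                    out.append((x, y))
    elif a.children and b.children:
        for ca in a.children:            # join two nodes by joining children
            for cb in b.children:
                stt(ca, cb, out)
    elif a.children:
        for ca in a.children:            # trees of unequal height
            stt(ca, b, out)
    else:
        for cb in b.children:
            stt(a, cb, out)

x_tree = Node(children=[Node(records=[(1, 0, 5), (2, 3, 8)]),
                        Node(records=[(3, 20, 30)])])
y_tree = Node(children=[Node(records=[(7, 4, 6)]),
                        Node(records=[(8, 25, 26), (9, 40, 50)])])
pairs = []
stt(x_tree, y_tree, pairs)
```

The pruning test at the top of stt is what makes the traversal synchronized: a pair of subtrees is descended only if their time spans can still produce a result.
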
Link-based GT-Join
- In each leaf page, store a pointer to its predecessor.
- For GT-Join: find pairs of data pages that (1) intersect the right border of the query rectangle and (2) intersect each other in the time dimension; keep such pairs in a priority queue; sweep left synchronously.

Plane Sweep GT-Join
- Similar to link-based.
- Maintain two priority queues, one for each MVBT.
- At each step, access the leaf page with the largest end time and add its records to the buffer.
- When adding records to the buffer, join them with the buffered records from the other MVBT.
- Discard records that can no longer contribute to the join.

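The sweep can be sketched at record granularity. This is a simplification under stated assumptions: the real algorithm works on MVBT leaf pages and honors the selection condition, whereas here the two priority queues hold individual records, intervals are half-open and non-empty, and the function name plane_sweep_gt_join is made up for the example.

```python
import heapq

# Assumed record layout: (key, start, end), half-open [start, end), start < end.

def plane_sweep_gt_join(xs, ys):
    """Sweep from the largest end time to the left, joining on intersection."""
    # heapq is a min-heap, so end times are negated to pop the largest first.
    qx = [(-e, s, k) for k, s, e in xs]
    qy = [(-e, s, k) for k, s, e in ys]
    heapq.heapify(qx)
    heapq.heapify(qy)
    buf_x, buf_y, out = [], [], []
    while qx or qy:
        take_x = bool(qx) and (not qy or qx[0][0] <= qy[0][0])
        neg_e, s, k = heapq.heappop(qx if take_x else qy)
        rec = (k, s, -neg_e)
        own, other = (buf_x, buf_y) if take_x else (buf_y, buf_x)
        # Discard buffered records starting at or after the sweep position:
        # every later record ends no later than rec, so they cannot join.
        other[:] = [o for o in other if o[1] < rec[2]]
        # Each survivor ends at or after rec.end and starts before it,
        # so it is guaranteed to intersect rec.
        for o in other:
            out.append((rec, o) if take_x else (o, rec))
        own.append(rec)
    return out

xs = [(1, 0, 5), (2, 3, 8), (3, 20, 30)]
ys = [(7, 4, 6), (8, 25, 26), (9, 40, 50)]
result = plane_sweep_gt_join(xs, ys)
```

Because records arrive in non-increasing order of end time, the buffers never need a per-pair intersection test; pruning by start time is enough.
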
GE-Join Solutions...

GE-Join Solutions...
Similarly, we have:
- unsynchronized
- synchronized using B+-trees
- synchronized using R-trees
- top-down using MVBT
- link-based using MVBT
Note: some of them, especially the link-based algorithm, are quite different because the join condition is different.

Implemented Algorithms
Common to both GT-Join and GE-Join:

Notation     Meaning
mvbt_df      Synchronized MVBT, depth-first
mvbt_bf      Synchronized MVBT, breadth-first
mvbt_link    Synchronized MVBT, link-based
r*_df        Synchronized R*-tree, depth-first
r*_bf        Synchronized R*-tree, breadth-first

Implemented Algorithms
Specific to GT-Join:

Notation     Meaning
mvbt_ps      Synchronized MVBT, plane-sweep
spj          Spatially partitioned join [LOT94]

Specific to GE-Join:

Notation     Meaning
b+           Synchronized B+-tree, index on key
mvbt_sm      Unsynchronized, sort-merge after selection

Experimental Setup
- Implemented in GNU C++.
- Sun Enterprise 250 Server with two UltraSPARC-II processors, running Solaris 2.8.
- Page size = 8KB. Buffer size = 10MB; LRU buffer.
- Each data set: 10 million records.
- R/I ratio: length of the query key range divided by length of the query time interval. It describes the shape of the query rectangle.

GT-Join Performance
R/I ratio = 10.

GT-Join Performance
R/I ratio = 0.1.

GE-Join Performance
R/I ratio = 10.

GE-Join Performance
R/I ratio = 0.1.

Conclusions
- We addressed index-based GT-Join and GE-Join.
- Joins using traditional indices (B+-tree, R-tree) are not efficient.
- We proposed various synchronized approaches based on temporal indices (MVBT).
- Experiments:
  – for GT-Join, link-based and plane-sweep are the best;
  – for GE-Join, link-based and sort-merge are the best;
  – overall, link-based is the best: multi-fold improvement over B+-tree/R-tree joins.