Closest-Point-of-Approach Join for Moving Object Histories
Subramanian Arumugam, Christopher Jermaine
Department of Computer Science, University of Florida
22nd International Conference on Data Engineering (ICDE 2006)
CPA-Join Is Useful for Analysis of Spatiotemporal Data
Relations: commercial airliners R, single-engine planes S
"Find all commercial airliners that approached within 1000 vertical feet and 0.5 miles of a single-engine plane in the BOS/JFK/EWR/LGA corridor C in the first three months of last year"
    SELECT DISTINCT (r, s)
    FROM R AS r, S AS s, TIME t
    WHERE dist(r, s, t) < 0.5
      AND (r(t).altd - s(t).altd) ≥ -1000
      AND (r(t).altd - s(t).altd) ≤ 1000
      AND s(t) ∈ C AND r(t) ∈ C
      AND t ≥ 'JAN ' AND t ≤ 'MAR '
Challenges
- 3-dimensional space + time
- Large number of objects
- Massive amount of data
CPA Illustration for Straight-Line Trajectories
CPA (closest point of approach): the position at which two dynamically moving objects attain their closest possible distance
[Figure: straight-line trajectories of Object p and Object q]
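For straight-line motion the CPA has a closed form, which a short sketch can make concrete. The function below is an illustration, not the authors' code: the name `cpa`, the argument layout (start positions plus constant velocities), and the clamping of the result to the shared time interval are all assumptions made for this example.

```python
import math

def cpa(p0, vp, q0, vq, t_lo, t_hi):
    """Closest point of approach of two points moving linearly on [t_lo, t_hi].

    p0, q0: positions at time t_lo; vp, vq: constant velocities (assumed inputs).
    Returns (time of closest approach, distance at that time).
    """
    w = [a - b for a, b in zip(p0, q0)]    # relative position at t_lo
    dv = [a - b for a, b in zip(vp, vq)]   # relative velocity
    dv2 = sum(c * c for c in dv)
    if dv2 == 0.0:
        tau = 0.0                          # same velocity: distance never changes
    else:
        # Minimizer of |w + tau * dv|^2 over all tau
        tau = -sum(a * b for a, b in zip(w, dv)) / dv2
    # Clamp to the interval in which both segments are valid
    tau = min(max(tau, 0.0), t_hi - t_lo)
    dist = math.sqrt(sum((a + tau * b) ** 2 for a, b in zip(w, dv)))
    return t_lo + tau, dist
```

The squared distance is a quadratic in time, so its minimum over a closed interval is either the unconstrained minimizer or one of the interval's endpoints; clamping covers both cases.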
Moving Object Trajectories
[Figure: sampled positions of Object P and Object Q plotted over x, y, and time, with a polyline approximation of each trajectory and the CPA distance marked]
Simple CPA-Join Procedure
Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that CPA-distance(p, q) ≤ d
    CPA (Object P, Object Q, distance d)
        List result = {};
        for each pair of segments (p ∈ P, q ∈ Q)
            if CPA_distance(p, q) ≤ d
                result += (p, q);
        return result;
- Only segments whose time intervals overlap need to be compared
- This suggests a plane sweep
CPA-Join using Simple Plane Sweep
- First, sort the segments in P and Q along the time dimension (external sort)
- While there is still some unprocessed data:
  - Read in enough segments from P and Q to fill the main memory buffer
  - Next, sweep a vertical line along the time dimension
  - Maintain a sweepline data structure that keeps track of all active segments that intersect the sweep line
  - As the sweep line progresses, the sweepline data structure is updated with insertions (new segments that became active) and deletions (segments whose time period has expired)
  - During updates to the sweepline structure, an all-pairs comparison among the active segments returns the valid results
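The sweep over the time dimension can be sketched in a few lines for the in-memory part. This is a minimal illustration under stated assumptions: segments are tuples `(t_start, t_end, payload)`, the function name `sweep_join` and the `match` callback (standing in for the CPA-distance test) are hypothetical, and the active sets are plain lists rather than a tuned sweepline structure.

```python
def sweep_join(P, Q, match):
    """Plane sweep along time over two in-memory lists of segments.

    P, Q: lists of (t_start, t_end, payload) tuples (assumed layout).
    match(p, q): caller-supplied predicate, e.g. a CPA-distance test.
    Returns (p, q) pairs whose time intervals overlap and that satisfy match.
    """
    # One event per segment start; side 0 = relation P, side 1 = relation Q
    events = [(s[0], 0, s) for s in P] + [(s[0], 1, s) for s in Q]
    events.sort(key=lambda e: (e[0], e[1]))
    active = ([], [])   # active segments of P and of Q
    result = []
    for t, side, seg in events:
        # Deletions: drop segments whose time period expired before t
        for k in (0, 1):
            active[k][:] = [a for a in active[k] if a[1] >= t]
        # Insertion: compare the new segment with active segments of the other relation
        for other in active[1 - side]:
            p, q = (seg, other) if side == 0 else (other, seg)
            if match(p, q):
                result.append((p, q))
        active[side].append(seg)
    return result
```

Because a segment is only compared against segments that are still active when it starts, pairs with disjoint time intervals are never tested.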
CPA-Join using Plane Sweep
- The sweep line has to pause at every new sample point encountered
- Processing a multi-gigabyte dataset can take a long time
[Figure labels: memory, disk]
Group segments using a bounding box approximation
- In the best case, just 1 comparison is needed
[Figure labels: memory, disk]
Algorithm: Layered Plane Sweep
While there is still some unprocessed data on disk:
- Read in data from relations P and Q to fill the buffer
- Construct an MBR for the trajectory of every object in the buffer
- Sort the MBRs along one of the spatial dimensions and plane-sweep along it to identify qualifying MBR pairs
- Expand the qualifying MBRs to obtain the individual segments
- Sort the segments along the time dimension and plane-sweep along time to obtain the actual results
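The MBR layer of the algorithm can be sketched as a filter that runs before any segment-level work. The code below is an assumption-laden illustration: trajectories are lists of `(x, y, t)` samples keyed by object id, the helper names `mbr`, `boxes_within`, and `candidate_pairs` are hypothetical, and the "sweep" is simplified to a sorted scan along x with early termination rather than a full sweepline structure.

```python
def mbr(points):
    """Bounding box of a trajectory given as (x, y, t) samples."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi

def boxes_within(a, b, d):
    """Conservative test: the boxes overlap in time and are within d on each spatial axis."""
    (alo, ahi), (blo, bhi) = a, b
    for k in range(2):                       # x and y
        if alo[k] - d > bhi[k] or blo[k] - d > ahi[k]:
            return False
    return alo[2] <= bhi[2] and blo[2] <= ahi[2]   # time overlap

def candidate_pairs(P, Q, d):
    """Return (p_id, q_id) pairs whose MBRs qualify for segment-level comparison.

    P, Q: dicts mapping object id -> list of (x, y, t) samples (assumed layout).
    """
    p_boxes = sorted(((mbr(tr), oid) for oid, tr in P.items()), key=lambda e: e[0][0][0])
    q_boxes = sorted(((mbr(tr), oid) for oid, tr in Q.items()), key=lambda e: e[0][0][0])
    out = []
    for pb, pid in p_boxes:
        for qb, qid in q_boxes:
            if qb[0][0] > pb[1][0] + d:
                break                        # every later q box starts even further right
            if boxes_within(pb, qb, d):
                out.append((pid, qid))
    return out
```

Only the pairs returned here would be expanded into individual segments and swept along time. The per-axis test never rejects a pair that could be within Euclidean distance d, so the filter is safe but not tight.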
Layered Plane-Sweep Example
But one size doesn't fit all!
A Note on Indexing
- Indexes can be used to do CPA-Join
- But (almost) all indexes use an MBR approximation
- And MBRs impose predefined granularities
[Figure: objects p and q plotted on x, y, z axes]
Layered Plane Sweep: What Is the Problem?
- Layered plane sweep always processes the entire fraction of data held in the memory buffers
- When objects interact heavily, such an approach may lead to no pruning at all
- In the best case, just one comparison is needed
- Though less of the buffer is processed initially, overall efficiency can be better
- Efficiency of the layered technique is tied not to the amount of data processed, but to choosing a granularity that minimizes the number of distance computations
Cost to Process Data in Memory Buffer
- Cost can be approximated as a function of the number of distance computations (which dominate execution time):
  $\mathrm{cost} = n_{seg} + n_{MBR}$
  where $n_{seg}$ is the number of segment-level comparisons and $n_{MBR}$ is the number of bounding-box comparisons
- In general, the cost for a fraction α (0 ≤ α ≤ 1) of the buffer is:
  $\mathrm{cost} = (n_{seg} + n_{MBR}) \cdot (1/\alpha)$
What we have
- Layered Plane Sweep processes a large fraction (α is large)
  - good when there is light interaction
  - bad when there is heavy interaction
- Simple Plane Sweep processes a tiny fraction (α is small)
  - good when there is heavy interaction
  - bad when there is light interaction
What we want
- An adaptive algorithm that processes a fraction that maximizes performance (α varies)
  - Tunes to the characteristics of the underlying data
  - Provides superior performance under all scenarios
Algorithm: Adaptive Plane Sweep
While there is still some unprocessed data on disk:
- Read in data from relations P and Q to fill the buffer
- Choose a fraction α of the data that maximizes performance
- Process the chosen fraction of data using Layered Plane Sweep
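The "choose a fraction" step can be sketched as a minimum search over candidate fractions, using the normalized cost (n_seg + n_MBR) * (1/α) from the cost slide. The function name `choose_fraction` and the `estimate_counts` callback are hypothetical; how the counts are obtained (for example by sampling) is left to the caller.

```python
def choose_fraction(fractions, estimate_counts):
    """Pick the buffer fraction with the lowest normalized cost.

    fractions: candidate values of alpha in (0, 1].
    estimate_counts(alpha): caller-supplied function returning (n_seg, n_mbr),
    the estimated comparison counts for that fraction (an assumed interface).
    Returns (best_alpha, best_cost), or None if no fractions are given.
    """
    best = None
    for alpha in fractions:
        n_seg, n_mbr = estimate_counts(alpha)
        cost = (n_seg + n_mbr) * (1.0 / alpha)   # cost normalized to the whole buffer
        if best is None or cost < best[1]:
            best = (alpha, cost)
    return best
```

Dividing by α is what makes fractions comparable: a small fraction may need few comparisons, but the buffer then has to be worked through in more rounds.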
Goal: Choose a fraction of data that maximizes performance
"Evaluate increasing buffer fractions from 0 to 1 and choose the fraction with the minimum cost"
- How many fractions should we consider?
- How do we estimate the cost for a given fraction α?
How to Estimate Cost for a Given Fraction α
- The exact cost is known only after the fact!
- To know the cost associated with a given α, we need to actually execute the join (layered plane sweep) at that granularity
- Estimate the cost using a simple online sampling algorithm [HH97]
How to Estimate Cost for a Given Fraction α (Contd.)
Cost estimation through sampling
Given: relations P and Q, and α
- Consider segments within the fraction α
- Construct MBRs for the objects in P
- Until the estimate of cost is accurate to within +/- 10%:
  – Pick an object q1 at random from Q and construct an MBR for its trajectory
  – Join q1 with all objects in P
  – Compute n_MBR,q1 and n_seg,q1
  – Estimate the cost
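The sampling loop can be sketched as follows. This is a rough illustration, not the [HH97] estimator: the stopping rule (a normal-approximation confidence interval with z = 1.96), the `min_samples` guard, and the `join_one` callback are all assumptions introduced for this example.

```python
import math
import random

def estimate_cost(p_objects, q_objects, join_one, rel_err=0.10,
                  z=1.96, min_samples=5, seed=0):
    """Estimate the total comparison count by sampling objects from Q.

    join_one(q, p_objects): caller-supplied function returning the number of
    comparisons (n_MBR + n_seg) incurred when q is joined with all of P.
    Returns (estimated total over all of Q, number of samples used).
    """
    order = list(q_objects)
    random.Random(seed).shuffle(order)       # sample without replacement
    n, mean, m2 = 0, 0.0, 0.0                # running mean and sum of squared deviations
    for q in order:
        x = join_one(q, p_objects)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples and mean > 0:
            # Half-width of the confidence interval for the mean (assumed stopping rule)
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width <= rel_err * mean:
                break
    return mean * len(order), n
```

If the per-object counts vary little, the loop stops after a handful of samples; if they never stabilize, it falls back to scanning all of Q, in which case the returned total is exact.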
How Many Fractions to Consider?
"Evaluate increasing buffer fractions from 0 to 1 and choose the fraction with the minimum cost"
- Computing the cost for all α is not practical
- It would offset any benefit that we gain from the adaptive technique
- We need a strategy to limit the number of fractions that we process
How Many Fractions to Consider?
- The α vs. cost graph is not linear; it exhibits convexity
- The convex region represents the candidate region with the minimum cost
- We can get away with evaluating the cost for a small number k of fractions α
[Figure: cost (millions) vs. fraction considered]
How to Choose the k Fractions?
k = 10; t_start = 32; t_end = 53
    Fraction       Time range   Cost
    α_1  = 0.11    [ ]          90
    α_2  = 0.14    [ ]          71
    α_3  = 0.18    [ ]          52
    α_4  = 0.23    [ ]          37
    α_5  = 0.30    [ ]          31
    α_6  = 0.38    [ ]          35
    α_7  = 0.48    [ ]          41
    α_8  = 0.61    [ ]          52
    α_9  = 0.78    [ ]          59
    α_10 = 1.0     [ ]          71
[Slide annotation on the table: acceptable candidates]
$r = t_{end} - t_{start}$
$\alpha_1 = r^{1/k} / r$
$\alpha_i = (r \cdot \alpha_1)^i / r$
- The fraction chosen can be fine-tuned through recursive calls
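The formulas for the k fractions translate directly into code. The sketch below only evaluates α_1 = r^(1/k) / r and α_i = (r · α_1)^i / r for a given time range; the function name `candidate_fractions` is hypothetical, and the example makes no attempt to reproduce the values in the slide's table.

```python
def candidate_fractions(t_start, t_end, k):
    """Return the k geometrically spaced fractions alpha_1 .. alpha_k.

    Implements r = t_end - t_start, alpha_1 = r**(1/k) / r and
    alpha_i = (r * alpha_1)**i / r. Assumes r > 0 and k >= 1.
    """
    r = t_end - t_start
    alpha_1 = r ** (1.0 / k) / r
    return [(r * alpha_1) ** i / r for i in range(1, k + 1)]
```

Since r · α_1 = r^(1/k), consecutive fractions differ by a constant factor and the last one is always 1, so small fractions are sampled more densely than large ones.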
Putting It All Together
Input: relations R, S; distance d; parameter k
- Fill Buffer: read from relations R and S
- Optimizer: evaluate k fractions, choose best
- Layered Plane Sweep: process join on best fraction
- More data? If so, return to Fill Buffer
Benchmarking
- Code: implemented and tested the various alternatives in C/C++
  – R-trees, simple sweep, layered sweep, adaptive sweep with various parameter settings
- Workload: 2 relations, 100,000 objects (50 GB)
  – Physics-based simulation data set
  – Synthetic data set
- Hardware: Linux, 2.4 GHz Pentium Xeon, 1 GB main memory, 2 IDE drives (15,000 rpm)
- Setup: 64 KB page size, buffer size 10,000 pages
Collision Data Set
- 100,000 objects; collision occurs during time range [ ]
[Figure: snapshot at timetick 1500]
Results: Execution Time for Different Strategies
[Figure: execution time (seconds) vs. % of join completed; series: R-tree, simple sweep, layered sweep, adaptive sweep (K=20, K=10, K=5)]
Buffer Choices Made by the Optimizer
[Figure: fraction of buffer chosen vs. virtual time line in the data set]
Discussion
- R-trees couldn't do enough pruning to make a difference
- Simple plane-sweep works well when there is heavy interaction among objects
- Layered plane-sweep works well when there is light interaction
- The adaptive version transitions smoothly between these extremes
- The recursive call to fine-tune the candidate region doesn't seem to help much
Conclusion
- CPA-Join for spatiotemporal relations
- Proposed a novel adaptive join algorithm for moving object histories, based on an extension of the plane sweep
- Many practical applications
Questions? Thank You! Subramanian