Performance Comparison of xBR-trees and R*-trees for Single Dataset Spatial Queries Performance Comparison of xBR-trees and R*-trees for Single Dataset Spatial Queries George Roumelis (Master in Information Systems, Open University of Cyprus, Cyprus) Michael Vassilakopoulos (*) (Dept. of Computer Science and Biomedical Informatics, University of Central Greece, Greece) Antonio Corral (Department of Languages and Computing, University of Almeria, Spain) * speaker Fifteenth East-European Conference on Advances in Databases and Information Systems September 19th–23rd, 2011, Vienna, Austria
Outline Motivation Contribution and Background R-tree and R*-tree XBR-tree, Internal Nodes and Leaf Nodes Single Dataset query processing on XBR-trees Experimental Results: Settings,Tree Building, Single Dataset Queries Conclusions and Future Work 2
Motivation §In applications, a variety of Spatial Queries arise, l involving one spatial dataset, like Point Location, Window, Distance Range, Nearest Neighbor Queries l involving two spatial datasets, like Distance Join, Closest Pair, All-Nearest Neighbor Queries §Usually, there is no overall performance winner §Many researchers have compared different Access Methods, regarding their I/O and execution time performance, for a variety a Spatial Queries §In this work, we compare the popular R*-tree and the External Balanced Regular (xBR) tree for Single Dataset Spatial Queries 3
Contribution We implement the xBR-tree and present conclusions arising from the (real data based) experimental comparison of xBR-trees and R*-trees regarding I/O Performance and Execution Time for §Tree building §Point Location Queries (PLQs) §Window Queries (WQs) §Distance Range Queries (DRQs) §K-Nearest Neighbor Queries (K-NNQs) §Constrained K-Nearest Neighbor Queries (CK-NNQs) 4
Background (1) §Given an index I and a query point q, the PLQ returns true if q belongs to I and false otherwise §Given an index I and a query rectangle r, the result of the WQ is the set of all points in I that are completely inside r §Given an index I, a query point q and a distance threshold delta >= 0, the DRQ returns all points of I that are within the delta distance from q (according to a distance function) 5
Background (2) §Given an index I, a query point q, and a value K > 0, the K-NNQ returns K points of I which are closest to q (according to a distance function) §Given an index I, a query point q, a value K > 0 and a distance threshold delta >= 0, the CK-NNQ returns K closest points of I which are within delta distance from q (according to a distance function) 6
R10 R11 R12 R1 R2 R3R4 R5 R6R7 R8 R9 Nodes that contain points R-Tree Nested MBRs can be organized as a tree (R-tree) R1 R2 R5 R3 R7 R9 R6 R4 R8 R10R11 R12 7 Clusters of spatial objects can be recursively grouped into Minimum Bounding Rectangles – MBRs
R*-Tree 8 The R*-tree is the most popular R-tree variation It added two major enhancements to the original R- tree, when a node overflow is caused l First, rather than just considering the area, the node-splitting algorithm in R*-trees also minimizes the perimeter and overlap enlargement of the MBRs l Second, an overflowed node is not split immediately, but a portion of entries of the node is reinserted from the top of the R*-tree (forced reinsertion)
xBR-trees can be defined for various dimensions For the ease of exposition, we assume 2 dimensions For 2 dimensions the hierarchical decomposition of space is that of Quadtrees. The space indexed by an xBR-tree is a square, expressed in a coordinate system of real numbers The nodes of xBR-trees are disk pages and are distinguished in two kinds: leaves, which store the actual multidimensional points and internal nodes, which provide a multiway indexing mechanism for these data xBR-Tree 9
Each node consists of a non-predefined number of entries of the form An address is formed by directional digits (NW, NE, SW, SE), determines the region of a child node and is accompanied by the pointer to this child Shape is a flag that determines if the region of the child is a complete square (used widely in queries) REG stores the coordinates of the region referenced by address (it is more expensive to calculate it) The region of a child is the subquadrant of its address minus the subquarants of the next addresses in the same node Internal Nodes 10
The region of the root is the original space (a quadrangle) The * symbol denotes the end of an address The address of the right child is 0*, since the region of this child is the NW quadrant of the original space The address of the left child is *: its region is the whole space minus the region of the right child An xBR-tree example 11
Queries are processed in a top-down manner on the xBR-tree, like on the R*-tree During a PLQ for a specified point, the appropriate leaf is determined by descending the tree from the root (unlike the R*-tree, a single path is followed) During WQs, we examine if the subquadrant of the current internal-node entry and the query window intersect and follow the pointer to the related child We repeat until we have examined all entries of the internal node, or until the query window is completely inscribed inside the region of the entry that we examine Processing PLQs and WQs 12
The DRQ follows the same strategy as the WQ At first, the querying circle is replaced from its MBR (the calculations are faster in this way) and if the answer about the intersection of the subquadrant of the current entry and the query MBR is positive, then we follow the pointer to the related child at the next lower level If we reach a leaf with a region that intersects the query MBR, we select the points in the leaf that are inside the query circle Processing DRQs 13
The K-NNQ algorithm follows a DF tree traversal In an internal node, entries are visited according to their mindist from the query point The process is repeated recursively until the leaf level is reached, where a potential next NN is found It is possible to reach a leaf, but the next NN may exist in a neighboring region. Thus, we use a global max K- heap and insert in it every point of this leaf that is nearest to the query point than the root of the heap When the heap is full, according to a set of conditions*, the search is stopped *Roumelis et al., Nearest Neighbor Algorithms Using xBR-Trees, PCI 2011 Processing K-NNQs 14
Experimental Results We used 5 real datasets of different sizes (CSN: line- segments, NApp: points, NAcl: 9203 points, Narr: line-segments, Nard: line-segments) To create 2d point datasets from non-point datasets, we used the centroids of the line-segment MBRs Environment used: Linux machine, Intel core duo 2x2GHz processor, 3 GB of RAM, gcc Performance measurements: I/O activity (page accesses) and Execution Time An extended set of experimental results is accessible from: Experimental settings 15
Tree Construction In all cases, the xBR-tree uses less space (i.e. it is more compact) and time than the R*-tree (the R*-tree creation is slower, partially, due to the use of forced reinsertion that improves searching efficiency) The difference in creation time is enlarged as the size of node increases 16
Point Location Query The xBR-tree needs less read accesses and executes every query faster than the R*-tree The xBR-tree needs a number of disk accesses equal to its tree height, while the R*-tree needs at least this number of access and, in most cases, even more 17
Window Query (1) The xBR-tree needs more page accesses As the size of node increases the I/O difference between the two trees becomes smaller In both trees, a linear dependence of the number of accesses to the size of the node appears This is due to reduction of tree height as the size of node increases 18
Window Query (2) Query windows that were inhabited by points (non-empty windows) were used The xBR-tree becomes clearly faster (execution time) for all sizes of nodes and the I/O efficiency of the two trees is closer Main memory processing is simpler (and thus faster) for the xBR-tree 19
Distance Range Query Τhe xBR-tree needs less disk accesses and is faster than the R*-tree, in all cases and for all datasets The results were even better when the DRQs addressed only non- empty regions 20
K- Nearest Neighbor Query The xBR-tree needs more disk accesses than the R*-tree, but the difference gets smaller when the size of node increases Regarding the execution time, the xBR-tree shows improved performance, in relation to its I/O difference from the R*-tree The worse time performance of both trees is for larger node sizes This is due to the fact that as the node size increases, the trees become very wide and very short 21
Constrained K-NN Query Studying the results of several CK-NNQ experiments, we see that the xBR-tree is improved for both performance categories Depending on the dataset, for non-empty regions the CK-NNQ time performance of the xBR-tree is almost the same to, or much better than the R*-tree 22
Conclusions We performed an extensive (real data based) experimental comparison of the performance of xBR-trees and R*-trees, for spatial queries where a single index is involved The conclusions arising from this comparison show that the two structures are competitive The xBR-tree is smaller and is built faster than the R*-tree The performance of the xBR-tree is higher for PLQs and DRQs and for WQs when the query window is non-empty The R*-tree is better for K-NNQs and needs less disk access for CK-NNQs The execution time winner for CK-NNQs depends on whether the query returns result points, or not 23
Future Work To extend the xBR-tree for modeling empty regions too To use in the xBR-tree internal-node address representation that facilitates main memory computation To study the relative performance of the two structures using memory consuming BF algorithms To study the relative performance of the two trees for two dataset (join) queries To study the relative performance of the two structures in the presence of buffering 24
Thank you for your attention