
1 R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman

2 Introduction
Range queries in multiple dimensions:
– Computer Aided Design (CAD)
– Geo-data applications
Supports spatial data objects (boxes).
The index structure is dynamic.

3 R-Tree
Balanced (similar to a B+ tree).
I is an n-dimensional rectangle of the form (I_0, I_1, ..., I_{n-1}) where each I_i is a range [a,b] ⊆ [-∞, ∞].
Leaf node index entries: (I, tuple_id).
Non-leaf node entries: (I, child_ptr).
M is the maximum number of entries per node; m ≤ M/2 is the minimum number of entries per node.
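A minimal sketch of these structures (Python is used for all sketches in this transcript; names like Entry, Node, and tuple_id mirror the slide, and M = 4 is an illustrative choice, not a value from the paper):

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# An n-dimensional rectangle I = (I_0, ..., I_{n-1}): one (low, high)
# range per dimension.
Rect = List[Tuple[float, float]]

@dataclass
class Entry:
    rect: Rect                        # the entry's covering rectangle I
    child: Optional["Node"] = None    # child_ptr, set for non-leaf entries
    tuple_id: Optional[int] = None    # set for leaf entries

@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)
    is_leaf: bool = True

M = 4        # maximum entries per node (illustrative value)
m = 2        # minimum entries per node, m <= M/2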

4 Invariants
1. Every leaf (non-leaf) has between m and M records (children), except for the root.
2. The root has at least two children unless it is a leaf.
3. For each leaf (non-leaf) entry, I is the smallest rectangle that contains the data object (children).
4. All leaves appear at the same level.

5 Example (part 1)

6 Example (part 2)

7 Searching
Given a search rectangle S (sketched below):
1. Start at the root and locate all child entries whose rectangle I intersects S (via linear search within the node).
2. Search the subtrees of those children.
3. At the leaves, return the entries whose rectangles intersect S.
Searches may require inspecting several paths, so there is no good worst-case bound: in the worst case every node may be visited.
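A sketch of this search, assuming the Rect/Node/Entry structures from the earlier sketch:

def intersects(r1: Rect, r2: Rect) -> bool:
    # Rectangles intersect iff their ranges overlap in every dimension.
    return all(a1 <= b2 and a2 <= b1 for (a1, b1), (a2, b2) in zip(r1, r2))

def search(node: Node, s: Rect) -> List[int]:
    # Return tuple_ids of all leaf entries whose rectangles intersect s.
    # Several children may intersect s, so several subtrees may be
    # descended, which is why the worst case is bad.
    results: List[int] = []
    for e in node.entries:
        if intersects(e.rect, s):
            if node.is_leaf:
                results.append(e.tuple_id)
            else:
                results.extend(search(e.child, s))
    return results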

8 S = R16

9 Insertion
Insertion is done at the leaves. Where to put a new index entry E with rectangle R?
1. Start at the root.
2. Descend the tree, at each step choosing the child whose rectangle needs the least enlargement to include R; in case of a tie, choose the child with the smallest area (sketched after this list).
3. If there is room in the chosen leaf node, insert E. Otherwise split the node (to be continued...).
4. Adjust the tree (next slide).
5. If the root was split into nodes N_1 and N_2, create a new root with N_1 and N_2 as children.
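A sketch of step 2 (the leaf-choosing descent), assuming the earlier structures; area and enlarge are helper functions introduced here:

def area(r: Rect) -> float:
    result = 1.0
    for a, b in r:
        result *= b - a
    return result

def enlarge(r1: Rect, r2: Rect) -> Rect:
    # Smallest rectangle containing both r1 and r2.
    return [(min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(r1, r2)]

def choose_leaf(node: Node, r: Rect) -> Node:
    # Descend toward the child needing least enlargement to include r,
    # breaking ties by smaller area (step 2).
    while not node.is_leaf:
        best = min(node.entries,
                   key=lambda e: (area(enlarge(e.rect, r)) - area(e.rect),
                                  area(e.rect)))
        node = best.child
    return node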

10 Adjusting the tree
1. Set N = the leaf node. If there was a split, NN is the other node.
2. If N is the root, stop. Otherwise let P = N's parent and E_N its entry for N. Adjust the rectangle of E_N to tightly enclose N.
3. If NN exists, add an entry E_NN to P. E_NN points to NN and its rectangle tightly encloses NN.
4. If necessary, split P.
5. Set N = P and go to step 2.

11 Deletion
1. Find the entry to delete and remove it from the appropriate leaf L.
2. Set N = L and Q = ∅. (Q is the set of eliminated nodes.)
3. If N is the root, go to step 6. Let P be N's parent and E_N the entry that points to N. If N has fewer than m entries, delete E_N from P and add N to Q.
4. If N has at least m entries, set the rectangle of E_N to tightly enclose N.
5. Set N = P and repeat from step 3.
6. Reinsert entries from eliminated leaf nodes. Insert non-leaf entries higher up so that all leaves remain at the same level.
7. If the root has only one child, make that child the new root.

12 Why Reinsert?
Nodes could instead be merged with the sibling whose area will increase the least, or entries could be redistributed. In either case, nodes may need to be split.
Reinsertion is easier to implement.
Reinsertion incrementally refines the spatial structure of the tree.
Entries to be reinserted are likely to be in memory, because their pages were visited during the search for the entry to delete.

13 Other Operations
Update: delete the appropriate entry, modify it, and reinsert it.
Search for objects completely contained in a rectangle R.
Search for objects that contain a rectangle.
Range deletion.

14 Splitting Nodes
Problem: divide M+1 entries between two nodes so that the nodes are unlikely to be needlessly examined during a search.
Solution: minimize the total area of the two covering rectangles.
Guttman considers three algorithms: exponential (exhaustive), quadratic, and linear.

15 Splitting Nodes – Exhaustive Search Try all possible combinations. Optimal results! Bad running time!

16 Splitting Nodes – Quadratic Algorithm
1. Find the pair of entries E_1 and E_2 that maximizes area(J) - area(E_1) - area(E_2), where J is their covering rectangle.
2. Put E_1 in one group and E_2 in the other.
3. If one group has M-m+1 entries, put all remaining entries into the other group and stop. If all entries have been distributed, stop.
4. For each remaining entry E, calculate d_1 and d_2, where d_i is the minimum area increase of the covering rectangle of group i when E is added.
5. Find the entry E with maximum |d_1 - d_2| and add E to the group whose area will increase the least.
6. Repeat from step 3.
A sketch of the seed and entry selection follows below.
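A sketch of seed picking (step 1) and entry picking (steps 4-5), reusing the earlier helpers; the loop that applies steps 3 and 6 around these two functions is omitted:

from itertools import combinations

def pick_seeds(entries: List[Entry]) -> Tuple[int, int]:
    # Step 1: the pair that wastes the most area when covered together.
    def waste(i: int, j: int) -> float:
        j_rect = enlarge(entries[i].rect, entries[j].rect)
        return area(j_rect) - area(entries[i].rect) - area(entries[j].rect)
    return max(combinations(range(len(entries)), 2), key=lambda p: waste(*p))

def pick_next(remaining: List[Entry], cover1: Rect, cover2: Rect) -> Entry:
    # Steps 4-5: the entry with the strongest preference for one group.
    def d(e: Entry, cover: Rect) -> float:
        return area(enlarge(cover, e.rect)) - area(cover)
    return max(remaining, key=lambda e: abs(d(e, cover1) - d(e, cover2)))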

17 Greedy continued
The quadratic algorithm is greedy: it is quadratic in M and linear in the number of dimensions, but not optimal.

18 Splitting Nodes – Linear Algorithm
1. For each dimension, find the entry with the highest low side and the entry with the lowest high side; their separation is the distance between those two sides.
2. Normalize each separation by dividing by the width of the entire set along that dimension.
3. Put the two entries with the largest normalized separation into different groups.
4. Randomly, but evenly, divide the rest of the entries between the two groups.
The algorithm is linear in M, with almost no attempt at optimality. A seed-picking sketch follows below.
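A sketch of the seed picking (steps 1-3), reusing the earlier structures; the `or 1.0` guard against zero width is an implementation choice, not from the slide:

def linear_pick_seeds(entries: List[Entry]) -> Tuple[int, int]:
    best_sep, seeds = float("-inf"), (0, 1)
    for d in range(len(entries[0].rect)):
        # Entry with the highest low side and entry with the lowest high side.
        hi_low = max(range(len(entries)), key=lambda i: entries[i].rect[d][0])
        lo_high = min(range(len(entries)), key=lambda i: entries[i].rect[d][1])
        if hi_low == lo_high:
            continue
        width = (max(e.rect[d][1] for e in entries) -
                 min(e.rect[d][0] for e in entries)) or 1.0
        sep = (entries[hi_low].rect[d][0] -
               entries[lo_high].rect[d][1]) / width   # normalized separation
        if sep > best_sep:
            best_sep, seeds = sep, (hi_low, lo_high)
    return seeds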

19 Performance Tests
CENTRAL circuit cell (1057 rectangles).
Measure performance on the last 10% of inserts.
Searches used randomly generated rectangles that each match about 5% of the data.
Delete every 10th data item.

20 Performance

21 With linear-time splitting, inserts spend very little time doing splits.
Increasing m reduces splitting (and insertion) cost, because when one group becomes too full the rest of the entries are assigned to the other group without further computation.
As expected, most of the space is taken up by the leaves.

22 Performance

23 Deletion cost is affected by the size of m. For large m:
– More nodes become underfull.
– More reinserts take place.
– More splits become possible.
– Running time is quite bad for m = M/2.
Search is relatively insensitive to the splitting algorithm.
Smaller values of m reduce the average number of entries per node, so less time is spent searching within each node (?).

24 Space Efficiency
A stricter node-fill requirement produces a smaller index.
For very small m, the linear algorithm balances the groups.
The other algorithms tend to produce unbalanced groups, which are likely to split again, wasting more space.

25 Conclusions
The linear-time splitting algorithm is almost as good as the others.
A low node-fill requirement reduces space utilization but performs not significantly worse than stricter node-fill requirements.
The R-tree can be added to relational databases.

26 The R*-tree: An Efficient and Robust Access Method for Points and Rectangles
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger

27 Greene’s Split Algorithm
Split:
GS1: call ChooseAxis to determine the axis perpendicular to the split.
GS2: call Distribute.
ChooseAxis:
CA1: find the pair of entries E_1 and E_2 that maximizes area(J) - area(E_1) - area(E_2), where J is their covering rectangle.
CA2: for each dimension d_i, find the normalized separation n_i by dividing the distance between E_1 and E_2 by the length along d_i of the covering rectangle of all the entries.
CA3: return the dimension i for which n_i is largest.

28 Greene Split cont...
Distribute:
D1: sort entries by low value along the chosen dimension.
D2: assign the first (M+1) div 2 entries to one group and the last (M+1) div 2 entries to the other group.
D3: if (M+1) is odd, assign the remaining middle entry to the group whose covering rectangle will be enlarged by the smaller amount.
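A sketch of Distribute, assuming `axis` came from ChooseAxis and reusing the earlier area/enlarge helpers:

def greene_distribute(entries: List[Entry], axis: int):
    # D1: sort by low value along the chosen axis.
    entries = sorted(entries, key=lambda e: e.rect[axis][0])
    half = len(entries) // 2          # (M+1) div 2 for an overfull node
    group1, group2 = entries[:half], entries[-half:]
    if len(entries) % 2 == 1:         # D3: one middle entry left over
        leftover = entries[half]
        def growth(group: List[Entry]) -> float:
            c = group[0].rect
            for e in group[1:]:
                c = enlarge(c, e.rect)
            return area(enlarge(c, leftover.rect)) - area(c)
        (group1 if growth(group1) <= growth(group2) else group2).append(leftover)
    return group1, group2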

29 Introduction
R-trees use heuristics to minimize the areas of the enclosing rectangles of their nodes. Why? Why not...
– minimize overlap of rectangles?
– minimize the margin (sum of side lengths over all dimensions) of each rectangle, i.e. make it as square as possible?
– optimize storage utilization?
– all of the above?

30 Minimizing Covering Rectangle
Dead space is the area covered by a covering rectangle but not covered by the rectangles it encloses.
Minimizing dead space reduces the number of paths traversed during a search, especially when no data matches the search.

31 Minimizing Overlap Also reduces number of paths to be traversed during a search, especially when there is data that matches the search criteria.

32 Minimizing Margin
Minimizing margin produces square-like rectangles. Squares are easier to pack, so this tends to produce smaller covering rectangles at higher levels.

33 Storage Utilization
Higher storage utilization reduces the height of the tree, so searches are faster. Searches with large query rectangles benefit because there are more matches per page.

34 Problems with Guttman’s Quadratic Split
Distributing entries during a split favors the larger rectangle, since it will probably need the least enlargement to absorb an additional entry.
When one group reaches M-m+1 entries, all the rest are put into the other node, regardless of their geometry.

35 Problems with Greene Split
The “correct” dimension is not always chosen – splitting along another dimension can sometimes improve performance.
Tests show that Greene split can give slightly better results than quadratic split, but in some cases it performs much worse.

36 When Greene’s split goes bad...
(Figure: panels showing an overfull node, the Greene split, and the correct split.)

37 R*-tree - ChooseSubtree
Let E_1, ..., E_p be the rectangles of the entries of the current node. ChooseSubtree(level) finds the best place to insert a new entry at the appropriate level.
CS1: set N to the root.
CS2: if N is at the correct level, return N.
CS3: if N’s children are leaves, choose the entry whose overlap cost will increase the least. If N’s children are not leaves, choose the entry whose rectangle needs the least enlargement.
CS4: set N to the child whose entry was selected and repeat from CS2.
Ties are broken by choosing the entry whose rectangle needs the least enlargement, then the rectangle with the smallest area. An overlap-cost sketch follows below.
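A sketch of the leaf-level overlap cost in CS3, reusing the earlier helpers; overlap() is a straightforward intersection-area helper introduced here:

def overlap(r1: Rect, r2: Rect) -> float:
    # Area of the intersection of two rectangles (0 if disjoint).
    result = 1.0
    for (a1, b1), (a2, b2) in zip(r1, r2):
        result *= max(0.0, min(b1, b2) - max(a1, a2))
    return result

def overlap_enlargement(entry: Entry, siblings: List[Entry], r: Rect) -> float:
    # Increase in total overlap with the sibling entries if r joins entry.
    grown = enlarge(entry.rect, r)
    return sum(overlap(grown, s.rect) - overlap(entry.rect, s.rect)
               for s in siblings if s is not entry)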

38 ChooseSubtree analysis
The only difference from Guttman’s algorithm is computing the overlap cost for leaves. This gives slightly better insert performance.
The cost is quadratic for leaves, but a tradeoff against accuracy is possible: sort the rectangles of N in increasing order of area enlargement, compute which of the first p entries needs the smallest increase in overlap cost, and return that entry. For 2 dimensions, p=32 yields good results.
CPU cost is higher, but the number of disk accesses is decreased.
This improves retrieval for queries with small query rectangles over data composed of non-uniformly distributed small rectangles or points.

39 Optimizing Splits
For each dimension, the entries are sorted by low value and then by high value. For each sort we create d = M-2m+2 distributions; in the k-th distribution (1 ≤ k ≤ d), the first group has the first (m-1)+k entries.
We also have the following measures (R_i is the bounding rectangle of group i):
1. area-value = area[R_1] + area[R_2]
2. margin-value = margin[R_1] + margin[R_2]
3. overlap-value = area[R_1 ∩ R_2]
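A sketch of the distributions and two of the measures for one sorted order, reusing the earlier helpers (the overlap-value would use the overlap() helper from the ChooseSubtree sketch):

def cover(group: List[Entry]) -> Rect:
    c = group[0].rect
    for e in group[1:]:
        c = enlarge(c, e.rect)
    return c

def margin(r: Rect) -> float:
    return sum(b - a for a, b in r)

def distributions(sorted_entries: List[Entry], m: int):
    # The d = M-2m+2 distributions of one sort: the k-th puts the first
    # (m-1)+k entries in the first group (k = 1..d).
    M = len(sorted_entries) - 1       # sorted_entries holds M+1 entries
    for k in range(1, M - 2 * m + 3):
        yield sorted_entries[:m - 1 + k], sorted_entries[m - 1 + k:]

def margin_sum(sorted_entries: List[Entry], m: int) -> float:
    # The quantity ChooseSplitAxis (next slide) minimizes over dimensions.
    return sum(margin(cover(g1)) + margin(cover(g2))
               for g1, g2 in distributions(sorted_entries, m))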

40 Optimizing Splits
Split:
S1: call ChooseSplitAxis to find the axis perpendicular to the split.
S2: call ChooseSplitIndex to find the best distribution; use this distribution to create the two groups.
ChooseSplitAxis:
CSA1: for each dimension, compute the sum of margin-values over all distributions produced.
CSA2: return the dimension with the minimum sum.
ChooseSplitIndex:
CSI1: for the chosen split axis, choose the distribution with minimum overlap-value; break ties by choosing the distribution with minimum area-value.

41 Analyzing Splits
The split algorithm was chosen based on measured performance, not on any particular theory.
Split is O(n log n) in each dimension (from sorting the entries).
m = 40% of M yields good performance (the same value of m is also near-optimal for Guttman’s quadratic split algorithm).

42 Forced Reinsert
Splits improve the local organization of the tree. Can the improvement be made less local?
Hint: during deletion, merging underfull nodes does very little to improve tree structure, while experimental results show that delete-with-reinsert improves query performance.
Since inserts tend to happen more frequently than deletes, why not perform reinsertion during inserts?

43 R* Insert
InsertData:
ID1: call Insert with the leaf level as the parameter.
Insert(level):
I1: call ChooseSubtree(level) to find the node N (at the appropriate level) into which to place the new entry.
I2: if there is room in N, insert the new entry; otherwise call OverflowTreatment with N’s level as parameter.
I3: if OverflowTreatment caused a split, propagate OverflowTreatment up the tree as necessary.
I4: if the root was split, create a new root.
I5: adjust all covering rectangles on the insertion path.

44 R* Insert
OverflowTreatment(level):
OT1: if the level is not the root level and this is the first OverflowTreatment at this level during the insertion of one rectangle, call Split; otherwise call Reinsert with level as the parameter.
Reinsert(level):
RI1: sort the entries E_i of N in decreasing order of the distance from the center of E_i to the center of N’s bounding rectangle.
RI2: remove the first p entries of N and adjust N’s bounding rectangle.
RI3: call Insert(level) on the p removed entries in reversed sort order (close reinsert). Experimentally, a good value of p is 30% of M.
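A sketch of Reinsert, reusing the earlier helpers; `tree.insert(entry, level)` stands in for the Insert(level) routine of the previous slide and is a hypothetical entry point:

import math

def center(r: Rect) -> List[float]:
    return [(a + b) / 2.0 for a, b in r]

def reinsert(tree, node: Node, level: int, p: int) -> None:
    node_center = center(cover(node.entries))
    # RI1: sort by distance of each entry's center from the node's
    # center, farthest first.
    node.entries.sort(
        key=lambda e: math.dist(center(e.rect), node_center), reverse=True)
    removed, node.entries = node.entries[:p], node.entries[p:]  # RI2
    for e in reversed(removed):       # RI3: close reinsert, nearest first
        tree.insert(e, level)         # hypothetical Insert(level) entry point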

45 Insert Analysis
Experimentally, R* insert reduces the number of splits that have to be performed, and space utilization is increased.
Close reinsert tends to favor the original node: outer rectangles may be inserted elsewhere, making the original node’s rectangle more square.
Forced reinsert can reduce overlap between neighboring nodes.

46 Misc. R*-tree is mainly optimized for search performance. As an unexpected side-effect, insert performance is also improved. Delete algorithm remains unchanged (and untested) but should improve because it depends on search and insert.

47 Test Data
(F1) Uniform – 100,000 uniformly distributed rectangles.
(F2) Cluster – centers are distributed into 640 clusters of about 1600 objects each.
(F3) Parcel – decompose the unit square into 100,000 rectangles and increase the area of each rectangle by a factor of 2.5.
(F4) Real-Data – 120,576 rectangles from elevation lines in cartography data.
(F5) Gaussian – centers follow a 2-dimensional independent Gaussian distribution.
(F6) Mixed-Uniform – 99,000 uniformly distributed small rectangles plus 1,000 uniformly distributed large rectangles.

48 Performance
6 different data distributions, including real-life cartography data.
Rectangle intersection queries.
Point queries.
Rectangle enclosure queries.
Spatial joins (map overlay).

49 Typical Performance Data

50 Spatial Join
Test files:
– (SJ1) 1,000 random rectangles from (F3) joined with (F4).
– (SJ2) 7,500 random rectangles from (F3) joined with 7,536 rectangles from elevation lines.
– (SJ3) self-join of 20,000 random rectangles from (F3).

51 Point Access Method
The R*-tree had its biggest wins for small query rectangles. What about point data?

52 Conclusions
The R*-tree is even better than GRID in read-mostly environments with 2-D point data.
The R*-tree is robust even for bad data distributions.
The R*-tree reduces the number of splits and is more space-efficient than other R-tree variants.
The R*-tree outperforms all other R-tree variants in page I/O.

53 Problems
No test data for more than two dimensions. R-tree algorithms are linear in the number of dimensions; the R*-tree split isn’t: it is O(n log n) in each dimension.
CPU cost was not measured.
What about queries that retrieve lots of data? What if not all dimensions are specified? How does a linear scan perform?

54 When Is “Nearest Neighbor” Meaningful? Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft

55 Introduction
Nearest Neighbor (NN): given a set of data points in m-dimensional space and a query point, find the data point closest to the query point.
A common heuristic for similarity queries over images, sequences, video, and shapes is to convert the data to multidimensional points and find the NN.
But in many cases, as dimensionality increases, d(FN)/d(NN) → 1, where d(FN) is the distance to the farthest neighbor and d(NN) is the distance to the nearest neighbor.

56 When is NN Meaningful?

57 NN Errors Conversion to NN is a heuristic. NN is frequently solved approximately.

58 Instability
For a given ε > 0, an NN query is unstable if the distance from the query point to most data points is less than (1+ε) times the distance to the nearest neighbor.
In many cases, for any ε > 0, as the number of dimensions increases, the probability that a query is unstable converges to 1.

59 Instability Theorem
Setting: p is a constant, d_m is a distance metric on m dimensions, P_{m,1}, ..., P_{m,n} are n independent m-dimensional data points, Q_m is an m-dimensional query point, and E[x] is the expectation of x.
Theorem: if lim_{m→∞} Var( d_m(P_{m,1}, Q_m)^p / E[d_m(P_{m,1}, Q_m)^p] ) = 0, then for every ε > 0 the probability that d(FN) ≤ (1+ε)·d(NN) converges to 1 as m → ∞.

60 Bad Distributions
The condition of the theorem holds for independent and identically distributed (IID) data, which is commonly used for testing high-dimensional indexes. Even many distributions with correlated dimensions have this problem.
Distributions where the variance in distance contributed by each added dimension converges to 0 can also have this problem, e.g. when the i-th component comes from [0, 1/i].
Usually the degradation is very bad with just 10-20 dimensions. (A quick simulation of the IID case follows below.)
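A quick simulation (an illustration, not from the paper) of the IID uniform case; it prints d(FN)/d(NN) shrinking toward 1 as m grows:

import math, random

def fn_nn_ratio(m: int, n: int = 1000) -> float:
    # d(FN)/d(NN) for n IID uniform data points and one uniform query point.
    q = [random.random() for _ in range(m)]
    dists = [math.dist([random.random() for _ in range(m)], q)
             for _ in range(n)]
    return max(dists) / min(dists)

for m in (2, 10, 100, 1000):
    print(m, round(fn_nn_ratio(m), 2))   # the ratio sinks toward 1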

61 Meaningful Distributions
Classification applications: the data is clustered and the query point must fall in one of the clusters; a meaningful NN query returns all points in that cluster.
Implicit low dimensionality, for example when all points lie on a line or in a plane.

62 Test on Artificial Data
Uniform: IID uniform workload over (0,1).
Recursive: data point X_m = (X_1, ..., X_m):
– take independent random variables U_1, ..., U_m, where U_i comes from a uniform distribution over (0, …);
– X_1 = U_1 and, for 2 ≤ i ≤ m, X_i = U_i + X_{i-1}/2.
Two degrees of freedom (sketched below):
– let a_1, a_2, ... and b_1, b_2, ... be constants in (-1,1);
– let U_1, U_2 be independent and uniform in (0,1);
– X_i = a_i·U_1 + b_i·U_2.
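A sketch of generating the two-degrees-of-freedom workload; the constants a_i, b_i are drawn randomly here purely for illustration:

import random
from typing import List

def two_dof_point(a: List[float], b: List[float]) -> List[float]:
    # X_i = a_i*U_1 + b_i*U_2: every coordinate is driven by the same
    # two uniform variables, so the intrinsic dimensionality is 2.
    u1, u2 = random.random(), random.random()
    return [ai * u1 + bi * u2 for ai, bi in zip(a, b)]

m = 50
a = [random.uniform(-1, 1) for _ in range(m)]    # constants in (-1, 1)
b = [random.uniform(-1, 1) for _ in range(m)]
points = [two_dof_point(a, b) for _ in range(1000)]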

63 Results

64 Real Data – k Nearest Neighbor
For k=1, 15% of queries had half of all the data points within a factor of 3 of the nearest neighbor.

65 NN Processing
For IID and other distributions where NN becomes meaningless, sequential scans may easily outperform fancy high-dimensional indexing schemes (like the R*-tree).
High-dimensional indexing schemes may still be useful, but they should be evaluated on meaningful workloads.

66 Conclusions
For NN applications, check that your distribution is meaningful.
For NN processing, test algorithms on meaningful distributions.
Compare your algorithm’s performance against a linear scan.

