Spatial Databases - Indexing Spring, 2015 Ki-Joune Li
What is Indexing ?
- Indexing : a fight against TIME
- Example : suppose that you have a copy of Hamlet and want to know the name of Hamlet's father.
  - Without an index : full (sequential) scan of the book
  - With an index : direct access to the page
Some Constraints
- Modern databases
  - Very large volume : e.g. several petabytes
  - Storage on disk is inevitable
  - But disk is slow (cf. main memory) : msec. vs. nanosec.
  - Even in main-memory database systems
- What should we do ?
  - Minimize the number of disk accesses
The Objective of Indexing
[Figure: indexing maps a query condition to a disk address (block number) in the on-disk database]
Classification of Indexing
- According to the type of query and data
  - Alphanumeric
  - Image
  - Spatial
- Spatial query : e.g. What is the nearest post office to the Louvre Museum ?
  - Contains a spatial predicate
  - Processed with a spatial index
[Figure: a spatial index maps a spatial query to a disk address (block number) in the on-disk database]
Spatial Query
- Sophisticated types of spatial query
  - One-scan query
    - Region query : containment, intersection
    - K-nearest neighbor query
  - Multi-scan query : join
    - Spatial join
    - Distance join
- Spatial query processing is tightly coupled with the spatial indexing method
Spatial Processing Strategy
- Filtering and refinement strategy
  - Filtering : the index, built on simplified geometry, reduces a spatial query to a set of candidates
  - Refinement : verification against the complete geometry reduces the candidates to the result
  1. Lighter index : e.g. < 1 MB
  2. Removes unnecessary disk accesses
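The two steps above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: objects are assumed to be small point sets, the "index" is just a dictionary of MBRs, and the names `filter_step` and `refine_step` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(points: List[Point]) -> Rect:
    """Simplify a geometry to its minimum bounding rectangle."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def rects_intersect(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_step(index: Dict[str, Rect], window: Rect) -> List[str]:
    """Filtering: look only at the light index and return candidate ids."""
    return [oid for oid, box in index.items() if rects_intersect(box, window)]


def refine_step(candidates: List[str],
                database: Dict[str, List[Point]],
                window: Rect) -> List[str]:
    """Refinement: read each candidate's complete geometry and verify it."""
    xmin, ymin, xmax, ymax = window
    return [oid for oid in candidates
            if any(xmin <= x <= xmax and ymin <= y <= ymax
                   for x, y in database[oid])]


# Hypothetical data: "a" has a large MBR but no point inside the window.
database = {"a": [(0, 0), (10, 10)], "b": [(5, 1)], "c": [(20, 20)]}
index = {oid: mbr(pts) for oid, pts in database.items()}
window = (4, 0, 6, 2)

candidates = filter_step(index, window)            # "a" is a false hit
result = refine_step(candidates, database, window)
print(candidates, result)
```

Object "c" is never read from the database, which is the disk access the filtering step saves; object "a" survives filtering and is only rejected by refinement.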
Classification of Spatial Indexing Methods
- Hashing and indexing
  - Index (in the wide sense) : hashing, indexing (in the narrow sense)
- Space decomposition vs. MBR
  - Decomposition of a space : whole space
  - Bounding rectangle : only the interesting area
- Dimensionality
  - No transformation
  - To higher dimension
  - To lower dimension : linearization
Indexing vs. Hashing
- Hashing
  1. b = h(r.key)
  2. Store(r, b)
  - Block number is determined by the hashing function or mechanism
  - Only for a primary index
  - Search by a hashing function
- Indexing (in the narrow sense)
  1. b = Store(r)
  2. Insert(B, (r.key, b))
  - Block number is independent of the indexing mechanism
  - For a primary or secondary index
  - Search by a data structure called an index
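The difference between the two numbered recipes can be sketched as below. Everything here is an assumption for illustration: `h` is a toy modulo hash, `HeapFile` is a hypothetical append-only file, and a sorted list searched with `bisect` stands in for the index structure B.

```python
import bisect

NUM_BLOCKS = 4


def h(key: int) -> int:
    """Toy hashing function (an assumption for illustration)."""
    return key % NUM_BLOCKS


def hash_store(blocks, record):
    b = h(record["key"])        # 1. b = h(r.key)
    blocks[b].append(record)    # 2. Store(r, b)
    return b


def hash_search(blocks, key):
    # The block to read is recomputed from the key; no index is consulted.
    return [r for r in blocks[h(key)] if r["key"] == key]


class HeapFile:
    """Hypothetical file that places a record wherever there is room."""

    def __init__(self, bf):
        self.bf = bf            # blocking factor
        self.blocks = [[]]

    def store(self, record):
        if len(self.blocks[-1]) >= self.bf:
            self.blocks.append([])
        self.blocks[-1].append(record)
        return len(self.blocks) - 1


def index_insert(index, key, b):
    bisect.insort(index, (key, b))   # 2. Insert(B, (r.key, b))


def index_search(index, key):
    i = bisect.bisect_left(index, (key, -1))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None


records = [{"key": 42}, {"key": 7}, {"key": 19}]

hashed = [[] for _ in range(NUM_BLOCKS)]
hash_blocks = [hash_store(hashed, r) for r in records]

heap, index = HeapFile(bf=2), []
for r in records:
    index_insert(index, r["key"], heap.store(r))   # 1. b = Store(r)

print(hash_blocks, index)
```

With hashing the key fixes the block, so the same record cannot be placed by a second key; with indexing the block is chosen first and any number of indexes can point at it.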
Decomposition vs. Bounding Region
Decomposition Methods
- Grid File : an extension of hashing to 2-D
  - Variations : Fixed Grid, Grid File, Multi-Level Grid File
- Hierarchical data structures
  - KD-tree
  - Quadtree
  - skd-tree, etc.
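Fixed Grid
- The simplest method
- Minimum data for hashing
[Figure: grid with boundaries at 10, 20, 30, 40, 50; each cell is 1 disk page; a query window overlaps several cells]
1. Find the intersecting grid cells
2. Find the corresponding blocks
3. Read the objects from the blocks
4. Refinement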
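The four query steps can be sketched for point data as follows. The class name `FixedGrid`, the cell size of 10 and the in-memory dictionary standing in for disk pages are all assumptions for illustration.

```python
from collections import defaultdict


class FixedGrid:
    """Fixed grid over point data; one dictionary entry plays one disk page."""

    def __init__(self, cell_size):
        self.cell = cell_size
        self.blocks = defaultdict(list)   # (col, row) -> points in that cell

    def _cell_of(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y):
        self.blocks[self._cell_of(x, y)].append((x, y))

    def window_query(self, xmin, ymin, xmax, ymax):
        # 1. Find the intersecting grid cells.
        c0, r0 = self._cell_of(xmin, ymin)
        c1, r1 = self._cell_of(xmax, ymax)
        result, accessed = [], 0
        for c in range(c0, c1 + 1):
            for r in range(r0, r1 + 1):
                # 2.-3. Find the corresponding block and read its objects.
                accessed += 1
                for (x, y) in self.blocks.get((c, r), []):
                    # 4. Refinement: keep only points really in the window.
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        result.append((x, y))
        return result, accessed


grid = FixedGrid(cell_size=10)
for p in [(5, 5), (12, 18), (25, 25), (38, 41)]:
    grid.insert(*p)

found, accessed = grid.window_query(10, 10, 30, 30)
print(sorted(found), accessed)
```

In this run nine cells are read to return two points; the cells that hold nothing useful are the dead space discussed on the next slide.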
Problems of Fixed Grid
- Only for point objects
  - Objects with measure (non-zero extent) : duplicated storage
  - Degrades performance
- Large dead space
  - Causes unnecessary disk accesses
- Not very flexible with respect to the data distribution
Grid File
- To overcome the problems of Fixed Grid
  - Reduce dead space within a cell
  - Increase the blocking factor
- Directory

  Cell  Grid Boundary      Block#
  A     (0,0),(15,20)      Page 0
  B     (15,0),(30,20)     Page 1
  . . .
  I     (30,28),(50,40)    Page 15

[Figure: non-uniform grid with boundaries at x = 15, 30, 50 and y = 20, 28, 40, and a query window]
Blocking Factor (Bf)
- A key factor in performance
  - Bf : number of objects in a disk block
  - Determines the number of disk accesses
- How to increase Bf ?
  - Increase the block size : not always possible
  - Packing
Problems of Grid File
- Only for point objects
- Still large dead space
- Large size of the directory

  Cell  Grid Boundary      Block#
  A     (0,0),(15,20)      Page 0
  B     (15,0),(30,20)     Page 1
  . . .
  I     (30,28),(50,40)    Page 15
Hierarchical Decomposition
- To overcome the size of the directory in Grid File
  - Hierarchical structure of the directory
  - Acceleration of search
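KD-tree : Index
- Extension of the binary tree to K dimensions (K = 2 for us)
- Example : suppose Bf = 3
- Each leaf node points to a disk page
[Figure: space partitioned into pages A to E and the corresponding directory, with split lines x=20, y=20, y=10, x=30; the two branches of a node are labelled "=<" and "<"]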
KD-tree : Search
[Figure: search in the KD-tree directory of the previous example, with split lines x=20, y=20, y=10, x=30]
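A search in such a directory can be sketched as below. The tree built here is a hypothetical arrangement of the split lines x=20, y=20, y=10, x=30 and pages A to E, not a reconstruction of the figure; the convention that the low branch takes coordinates less than or equal to the split value is also an assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    page: str            # identifier of the disk page the leaf points to


@dataclass
class Split:
    axis: int            # 0 = split on x, 1 = split on y
    value: float
    low: "Node"          # side with coordinate <= value
    high: "Node"         # side with coordinate > value


Node = Union[Leaf, Split]


def find_page(node, point):
    """Point search: follow exactly one branch per level."""
    while isinstance(node, Split):
        node = node.low if point[node.axis] <= node.value else node.high
    return node.page


def pages_for_window(node, lo, hi):
    """Window search: follow every branch the window [lo, hi] reaches."""
    if isinstance(node, Leaf):
        return [node.page]
    pages = []
    if lo[node.axis] <= node.value:
        pages += pages_for_window(node.low, lo, hi)
    if hi[node.axis] > node.value:
        pages += pages_for_window(node.high, lo, hi)
    return pages


# Hypothetical directory.
root = Split(0, 20,
             low=Split(1, 20, Leaf("A"), Leaf("B")),
             high=Split(1, 10, Leaf("D"),
                        Split(0, 30, Leaf("C"), Leaf("E"))))

print(find_page(root, (25, 15)))
print(pages_for_window(root, (15, 5), (25, 8)))
```

Only the directory is touched during the descent; a disk page is read once a leaf is reached.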
Weak Points of KD-tree
- Only for point objects
- Dead space
- How to store the tree structure on disk
  - Blocking problem
  - Widely used as a main-memory index
  - Rarely used as a disk-resident index
- Unbalanced tree
  - Zipf's law (or 80/20 law) : most events are concentrated
  - Leads to a highly skewed tree
Quadtree
- Extension of KD-tree
  - KD-tree : binary split
  - Quadtree : 4-way equi-split instead
- Example : Bf = 3
- Each leaf node points to a disk page
[Figure: space recursively split into quadrants holding pages A to J, with the corresponding tree]
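The 4-way equi-split can be sketched as follows, assuming point data, a square region and Bf = 3 as in the example. The class `QuadNode` is hypothetical, and the sketch assumes distinct points (more than Bf identical points would split forever).

```python
class QuadNode:
    """Quadtree node over the square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size, bf=3):
        self.x0, self.y0, self.size, self.bf = x0, y0, size, bf
        self.points = []       # used while the node is a leaf (one page)
        self.children = None   # four equal quadrants after a split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.bf:   # page overflow
            self._split()

    def _split(self):
        h = self.size / 2
        # Children are ordered so that index = row * 2 + col.
        self.children = [QuadNode(self.x0 + dx * h, self.y0 + dy * h, h, self.bf)
                         for dy in (0, 1) for dx in (0, 1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        h = self.size / 2
        col = 1 if p[0] >= self.x0 + h else 0
        row = 1 if p[1] >= self.y0 + h else 0
        return self.children[row * 2 + col]

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(c.depth() for c in self.children)


uniform = QuadNode(0, 0, 40)
for p in [(5, 5), (25, 5), (5, 25), (25, 25)]:
    uniform.insert(p)

skewed = QuadNode(0, 0, 40)
for p in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    skewed.insert(p)

print(uniform.depth(), skewed.depth())
```

Both trees hold four points, but the clustered input forces repeated splits of the same corner, which gives a small picture of the skew problem raised on the next slide.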
Weak Points of Quadtree
- Same problems as KD-tree, in addition to the lack of flexibility
- Only for point objects
- Dead space
- How to store the tree structure on disk
  - Blocking problem
  - Widely used as a main-memory index
  - Rarely used as a disk-resident index
- Unbalanced tree
  - Zipf's law (or 80/20 law) : most events are concentrated
  - Leads to a highly skewed tree
Point Quadtree
- A simple variation of Quadtree
  - Specifies a partition point instead of an equi-split
  - More adaptive to the distribution of objects
  - Less skewed
[Figure: point quadtree with partition points (10,20), (5,25), (35,10) and leaves A to J]
Linear Quadtree : Space-Filling Curve
- A quadtree, but in another representation
- Linearization by a space-filling curve : N-order, Hilbert, column-wise
- Linearize points (or cells) by their Peano key
Linear Quadtree
- Example : N-order curve
- Computation of the Peano key : bit interleaving
  1. Binary representation of the coordinates : (x, y) = (10, 01)
  2. Bit interleaving : x = 1 0, y = 0 1, so Peano key = 1 0 0 1 = 9
[Figure: 4 x 4 grid with both axes labelled 00, 01, 10, 11]
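The bit interleaving above can be written as a short function. The name `peano_key` and the `bits` parameter are assumptions; the x bit is placed before the y bit at each level so that the slide's example (10, 01) gives 9.

```python
def peano_key(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y, most significant first, x before y."""
    key = 0
    for i in range(bits - 1, -1, -1):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


# The slide's example: x = 10, y = 01 (binary) -> 1001 = 9.
print(peano_key(0b10, 0b01, bits=2))

# Keys of a 2 x 2 grid, listed in key order, trace the N shape.
order = sorted(((x, y) for x in range(2) for y in range(2)),
               key=lambda c: peano_key(c[0], c[1], bits=1))
print(order)
```

Sorting cells by this key visits them bottom to top in the left column and then in the right column, which is the N-shaped order the curve is named after.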
MBR Methods
- MBR (Minimum Bounding Rectangle)
  - Two-dimensional geometric simplification of objects
  - Defined by two corners : (X1min, X2min) and (X1max, X2max)
  - Covers not the whole space, only the region occupied by objects
- R-tree and its variants
R-tree
- Construction of R-tree : sequence of insertions, upward split
- Leaf node points to the disk page
[Figure: 2-D objects and their MBRs A to K, with the corresponding tree]
Splitting in R-tree
- Split the MBR in the case of overflow
- Line sweeping : compare Cost-X and Cost-Y
- Cost measures
  - Area, perimeter
  - Overlapping area
[Figure: splitting line and the new MBRs]
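One way to compare Cost-X and Cost-Y is sketched below. It is a simplification rather than the split algorithm of any particular R-tree variant: entries are sorted by centre along each axis, cut in half, and the cost is taken to be the sum of the two MBR areas plus their overlapping area. The function names and the cost formula are assumptions.

```python
def union(rects):
    """MBR of a group of rectangles (xmin, ymin, xmax, ymax)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def split_cost(rects, axis):
    """Sweep along one axis (0 = x, 1 = y) and cut the sorted entries in half."""
    ordered = sorted(rects, key=lambda r: r[axis] + r[axis + 2])  # by centre
    half = len(ordered) // 2
    g1, g2 = ordered[:half], ordered[half:]
    m1, m2 = union(g1), union(g2)
    return area(m1) + area(m2) + overlap_area(m1, m2), g1, g2


def choose_split(rects):
    cost_x = split_cost(rects, 0)
    cost_y = split_cost(rects, 1)
    return ("x",) + cost_x if cost_x[0] <= cost_y[0] else ("y",) + cost_y


# Hypothetical overflowing node: two low squares and two high squares.
entries = [(0, 0, 1, 1), (0, 10, 1, 11), (2, 0, 3, 1), (2, 10, 3, 11)]
axis, cost, group1, group2 = choose_split(entries)
print(axis, cost, group1, group2)
```

Cutting along x would produce two tall, thin MBRs of area 11 each, while cutting along y produces two compact MBRs of area 3 each, so the y sweep wins.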
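R-tree : Query Processing
- Descend from the root, following only the nodes whose MBR intersects the query region W
- Entries reached in the leaf nodes are candidates
- Refinement : read the exact geometry of each candidate from the database
- Sample : http://www.dbnet.ece.ntua.gr/~mario/rtree/
[Figure: R-tree with root A, internal nodes B and C, and objects D to K; query region W selects the candidates]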
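The filtering part of this query can be sketched on a small hand-built tree. The `RNode` class, the node names and all coordinates are hypothetical; construction and splitting are left out, and the result is a candidate list that would still need refinement.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class RNode:
    name: str
    mbr: Rect
    children: List["RNode"] = field(default_factory=list)         # internal node
    entries: List[Tuple[str, Rect]] = field(default_factory=list)  # leaf node


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node: RNode, window: Rect, visited: List[str]) -> List[str]:
    """Return the ids of leaf entries whose MBR intersects the window."""
    visited.append(node.name)
    if not node.children:
        return [oid for oid, box in node.entries if intersects(box, window)]
    found = []
    for child in node.children:
        if intersects(child.mbr, window):     # prune subtrees outside W
            found.extend(search(child, window, visited))
    return found


leaf_b = RNode("B", (0, 0, 5, 5),
               entries=[("D", (0, 0, 2, 2)), ("E", (3, 3, 5, 5))])
leaf_c = RNode("C", (10, 8, 16, 12),
               entries=[("F", (10, 10, 12, 12)), ("G", (14, 8, 16, 9))])
root = RNode("A", (0, 0, 16, 12), children=[leaf_b, leaf_c])

visited = []
candidates = search(root, (4, 4, 11, 11), visited)
print(candidates, visited)
```

A window far from both leaves stops at the root, so no leaf page is read at all.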
Strength of R-tree
- For point and non-point objects
- Good for non-uniform distributions
- Paged tree
- Hierarchical structure, but balanced
- Less dead space than decomposition methods
Weak Points of R-tree : Overlapping Area
- Overlapping : false matching
- False matching : visit to an unnecessary node
- Performance degradation
[Figure: R-tree with overlapping MBRs A to M and a query region that falls in the overlap]
Weak Points of R-tree : Dead Space
- At least one visit to this node (K), even though there is nothing in it for the query
[Figure: query region falling in the dead space of node K]
Weak Points of R-tree : Bad Split
[Figure: good split vs. bad split]
1. Make them as COMPACT as possible
2. Preserve spatial proximity as much as possible
Improvement of R-tree
- Minimize overlapping area and dead space, or make it more COMPACT
- Preserve spatial proximity
- Two approaches
  - Packing (or bulk loading)
  - Good split or insertion strategies
R*-tree : An Improvement of R-tree
- Re-insertion strategy on overflow
[Figure: a newly inserted object causes an overflow; an object of the overflowing node is deleted and re-inserted]
R*-tree : An Improvement of R-tree
- Re-insertion strategy on overflow
[Figure: after the object is re-inserted, the nodes are more compact]
R*-tree : An Improvement of R-tree
- Compact
  - Small overlapping area
  - Small sum of MBR areas or perimeters
  - Small dead space
- Stable : not much affected by the order of insertions
- The most widely used spatial indexing method
Packing R-tree : Improvement of R-tree
- Preprocessing to make the R-tree more compact, instead of sequential insertions
  - Hilbert R-tree
  - STR (Sort-Tile Recursive)
  - Uniformization
Hilbert Packing
- Hilbert curve : a space-filling curve
- Linearize spatial objects by their Peano key
[Figure: N-order, Hilbert and column-wise curves]
Hilbert Packing
- Example : Bf = 3
  - Sort objects by Hilbert key
  - Pack them in a round-robin way
- Maximizes storage utilization
- Minimum dead space and sum of MBR areas
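A sketch of the sort-then-pack idea for point data follows. `hilbert_key` is the widely used iterative conversion from cell coordinates to the distance along the curve, for an n x n grid with n a power of two. Filling each leaf with the next Bf objects in key order is an assumption about how the packing proceeds, and `hilbert_pack` is a hypothetical name.

```python
def hilbert_key(n: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along the Hilbert curve of an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_pack(points, n, bf):
    """Sort points by Hilbert key, then fill leaves with bf points each."""
    ordered = sorted(points, key=lambda p: hilbert_key(n, p[0], p[1]))
    return [ordered[i:i + bf] for i in range(0, len(ordered), bf)]


# Hypothetical points on a 4 x 4 grid, packed with Bf = 3.
points = [(3, 0), (0, 0), (1, 3), (1, 1), (2, 2), (0, 3), (3, 3)]
leaves = hilbert_pack(points, n=4, bf=3)
for leaf in leaves:
    print(leaf)
```

Every leaf except possibly the last is full, which is where the high storage utilization comes from; neighbours along the curve are neighbours in space, which keeps the leaf MBRs small.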
STR (Sort-Tile Recursive)
- Basic idea : "tile" the data space using vertical slices
  - r : number of rectangles
  - n : blocking factor
  - P (number of leaf node pages) = ⌈r / n⌉
- Example : suppose r = 25, n = 3
  - nTile = 9, nV = 3, nH = 3
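The tiling can be sketched for point data as below, assuming the usual STR recipe: with P = ⌈r / n⌉ leaf pages, cut the x-sorted data into ⌈√P⌉ vertical slices and pack each slice in y order. Only the leaf level is built, and the name `str_pack` is hypothetical.

```python
import math


def str_pack(points, n):
    """Group points into leaf pages of at most n points using STR tiling."""
    r = len(points)
    p = math.ceil(r / n)              # number of leaf pages
    s = math.ceil(math.sqrt(p))       # number of vertical slices
    by_x = sorted(points, key=lambda pt: pt[0])
    slice_size = s * n                # points per vertical slice
    leaves = []
    for i in range(0, r, slice_size):
        vslice = sorted(by_x[i:i + slice_size], key=lambda pt: pt[1])
        for j in range(0, len(vslice), n):
            leaves.append(vslice[j:j + n])
    return leaves


# The slide's sizes: r = 25 objects, n = 3 per page.
points = [(x, y) for x in range(5) for y in range(5)]
leaves = str_pack(points, n=3)
print(len(leaves))
```

With r = 25 and n = 3 this produces 9 tiles arranged as 3 vertical slices of 3 pages each, matching the numbers in the example.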
Comparison : Hilbert Packing vs. STR
[Figure: leaf MBRs produced by Hilbert Packing (HP) and STR, for large objects and for points]
Uniformization
- Non-uniform distribution
  - Negative effect on performance
  - But data in real applications are non-uniform
- Uniformization technique
  - Step 1 : transform non-uniform data to uniform data by STR
  - Step 2 : apply R-tree (or Fixed Grid)
  - Step 3 : transform the query region
- Strength
  - High storage utilization
  - Very simple, with good performance
Uniformization
[Figure: non-equi-width partition vs. equi-width partition]
1. Area of each cell : identical
2. Number of objects within each cell : almost identical
Uniformization : Example
[Figure: original data, uniformized by Delaunay triangulation, and uniformized by STR]
Uniformization : Example
[Figure: original data and the same data uniformized by STR]
Query Processing by R-tree : Nearest Neighbor
[Figure: query point and searching space; minimum and 2nd distances in 2-D]
Query Processing by R-tree : Nearest Neighbor
[Figure: branching into and pruning of R-tree nodes, based on the minimum distance]
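Branching and pruning by minimum distance can be sketched as follows. This is a depth-first branch-and-bound over a small hand-built tree of points; the `NNNode` class, the names `mindist` and `nearest`, and all coordinates are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class NNNode:
    name: str
    mbr: Rect
    children: List["NNNode"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)   # leaf contents


def mindist(q: Point, r: Rect) -> float:
    """Smallest possible distance from q to anything inside rectangle r."""
    dx = max(r[0] - q[0], 0, q[0] - r[2])
    dy = max(r[1] - q[1], 0, q[1] - r[3])
    return math.hypot(dx, dy)


def nearest(node, q, best, visited):
    """Return (distance, point) of the nearest neighbour of q found so far."""
    visited.append(node.name)
    if not node.children:
        for p in node.points:
            d = math.dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Branching: try the most promising child first.
    for child in sorted(node.children, key=lambda c: mindist(q, c.mbr)):
        # Pruning: nothing in this child can beat the current best.
        if mindist(q, child.mbr) >= best[0]:
            break
        best = nearest(child, q, best, visited)
    return best


leaf1 = NNNode("L1", (0, 0, 4, 4), points=[(1, 1), (3, 4)])
leaf2 = NNNode("L2", (10, 10, 14, 14), points=[(10, 10), (14, 12)])
root = NNNode("R", (0, 0, 14, 14), children=[leaf1, leaf2])

visited = []
dist, point = nearest(root, (2, 2), (math.inf, None), visited)
print(point, round(dist, 3), visited)
```

Leaf L2 is never read: its minimum distance to the query point is already larger than the best distance found in L1.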
Transformation to Higher Space
- Transformation to a higher dimension
  - Transform non-point objects to point objects
  - Spatial indexing methods that are applicable only to point objects (e.g. Grid File) can then be reused for non-point objects
- Example
[Figure: 1-D intervals A, B, C plotted as points in (Min, Max) space; A is at (Amin, Amax)]
Corner Transformation
- From 2-D to 4-D
  1. Simplification by MBR
  2. MBR ((Xmin, Ymin), (Xmax, Ymax)) to point (Xmin, Ymin, Xmax, Ymax)
Query Processing for Corner Transformation : 1-D Example
- Query : find the objects contained in window W
- Regions of the (Min, Max) space, for an object A = (Amin, Amax)
  - Region I : Wmax < Amin
  - Region II : W [?] A
  - Region III : Amax < Wmin
  - Region IV : Amin < Wmin, Amax < Wmax
  - Region V : Wmin < Amin, Wmax < Amax
  - Region VI : A [?] W
[Figure: (Min, Max) plane divided into regions I to VI around the window W]
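In the 1-D case the transformation and the resulting queries can be sketched as below. Each interval becomes the point (min, max), and a window query becomes a range condition on that point. The function names and the sample intervals are assumptions; the region numbering of the slide is not used.

```python
def corner_transform(interval):
    """1-D object [amin, amax] -> 2-D point (amin, amax)."""
    amin, amax = interval
    return (amin, amax)


def contained_in(points, window):
    """Objects A with Wmin <= Amin and Amax <= Wmax (A lies inside W)."""
    wmin, wmax = window
    return [name for name, (amin, amax) in points.items()
            if wmin <= amin and amax <= wmax]


def intersecting(points, window):
    """Objects A that are not disjoint from W (neither Wmax < Amin nor Amax < Wmin)."""
    wmin, wmax = window
    return [name for name, (amin, amax) in points.items()
            if amin <= wmax and wmin <= amax]


# Hypothetical intervals and window.
intervals = {"A": (2, 5), "B": (4, 9), "C": (0, 10), "D": (12, 15)}
points = {name: corner_transform(iv) for name, iv in intervals.items()}
window = (1, 6)

print(contained_in(points, window))
print(intersecting(points, window))
```

Both conditions are axis-parallel ranges in the transformed space, so a point index such as a grid file can answer them; the same idea maps a 2-D MBR to a 4-D point.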
Transformation to Lower Dimension : Linear Quadtree
1. Simplification of geometry
2. Compute the Peano key with the lower-left corner
3. If necessary, divide it and give a Peano key to each piece
4. Define the size of each piece according to the number of quadrants
5. Insert them into a B-tree
6. Query processing by the B-tree
[Figure: decomposed pieces labelled (22, 0), (28, 1), (23, 0), (0, 2)]