Access Methods for Advanced Database Applications
Applications Geographic Information Systems / spatial DB; text databases; XML databases; data warehouses; high-dimensional databases (image, scientific); time series; sequence databases (genomic databases); main-memory database systems
Why New Indexes? A most effective mechanism to prune the search; order-of-magnitude difference between I/O and CPU cost; increasing data size; increasing complexity of data and search
Memory System [Figure labels: CPU die, registers, L1 cache, L2 cache, main memory, hard disk]
Memory Hierarchy
Improvement in Performance CPU (60%/yr) DRAM (10%/yr)
Design Principles Simple in design; efficient in disk access/CPU time (not necessarily contradicting simplicity!); ease of integration into existing DBMS. Built on top of a mature index (e.g. B+-tree or R-tree)? Reuse all the well-tested concurrency control etc.
Spatial Databases Spatial Objects: –Points: spatial locations, e.g. feature vectors –Lines: sets of points, e.g. roads, coastlines –Polygons: sets of points, e.g. buildings, lakes Data Types: –Point: a spatial data object with no extension (no size or volume) –Region: a spatial object with a location and a boundary that defines its extension
Spatial Queries Range queries: “Find all cities within 50 km of Madras?” Nearest neighbor queries: “Find the 5 cities that are nearest to Madras?” “Find the 10 images most similar to this image?” Spatial join queries: “Find pairs of cities within 200 km of each other?”
More Examples Range Query: “Find me data points that satisfy the conditions x1 < A1 < y1, x2 < A2 < y2, …?” Spatial Query: “Find me buildings that are adjacent to the Railway Stations?” Nearest Neighbour Query: “Find me the nearest fire station to Clementi Ave. 3?”
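The query types above can be answered without any index by scanning every point, which is the baseline an access method tries to beat. The following is a minimal sketch under that assumption; the function names and the tuple representation of points are illustrative and not from the slides.

```python
import heapq
import math
from itertools import combinations

def range_query(points, low, high):
    """Points p with low[i] < p[i] < high[i] in every dimension (strict, as on the slide)."""
    return [p for p in points
            if all(l < v < h for v, l, h in zip(p, low, high))]

def nearest(points, q, k):
    """The k points closest to q by Euclidean distance, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

def distance_join(points, d):
    """All unordered pairs of points within distance d of each other."""
    return [(a, b) for a, b in combinations(points, 2) if math.dist(a, b) <= d]

points = [(0, 0), (1, 1), (5, 5), (9, 9)]
in_range = range_query(points, (0, 0), (6, 6))
two_nearest = nearest(points, (4, 4), 2)
close_pairs = distance_join(points, 2)
```

Every call touches all points (the join touches all pairs), so the cost grows with the data size regardless of how selective the query is.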
Applications Geographical Information Systems (GIS): dealing extensively with spatial data, e.g. map systems, resource management systems. Computer-aided design and manufacturing (CAD/CAM): dealing mainly with surface data, e.g. design systems. Multimedia databases: storing and manipulating characteristics of multimedia (MM) objects.
Representation of Spatial Objects Testing on real objects is expensive Minimum Bounding Box/Rectangle How to test if 2-d rectangles intersect?
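One common answer to the question above: two axis-aligned rectangles intersect exactly when their extents overlap on every axis. The sketch below assumes each minimum bounding rectangle is stored as its lower-left and upper-right corners and treats touching edges as intersecting; the `Rect` type and field names are illustrative.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    # The x-extents must overlap and the y-extents must overlap.
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)

overlapping = intersects(Rect(0, 0, 2, 2), Rect(1, 1, 3, 3))
disjoint = intersects(Rect(0, 0, 1, 1), Rect(2, 2, 3, 3))
touching = intersects(Rect(0, 0, 1, 1), Rect(1, 0, 2, 1))
```

The test costs four comparisons, which is why it is used as a cheap filter before the expensive test on the real objects.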
Approaches to Multi-Dimensional Indexing Data Partitioning –R-tree, R*-tree, X-tree, Skd-tree, SS-tree, TV-tree, M-tree Space Partitioning –Buddy-tree, R+-tree, Grid File, KDB-tree Mapping R-trees
R-trees [Figure labels: A, B, AB]
Range Query Insert –Node splitting –Optimization: coverage, overlap Delete Variants: R+-tree, R*-tree, buddy-tree
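A range query on an R-tree descends only into nodes whose bounding rectangle overlaps the query window. The sketch below illustrates that pruning on a hand-built two-level tree; it is not a full R-tree (no insert, split, or delete), and the `Node` class, its fields, and the box layout `(xmin, ymin, xmax, ymax)` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    mbr: Box                                              # minimum bounding rectangle
    children: List["Node"] = field(default_factory=list)  # empty for a data entry
    label: str = ""                                       # name of the data entry

def range_search(node: Node, query: Box) -> List[str]:
    """Labels of all data entries whose MBR overlaps the query window."""
    if not overlaps(node.mbr, query):
        return []                      # prune the whole subtree
    if not node.children:
        return [node.label]
    hits: List[str] = []
    for child in node.children:
        hits.extend(range_search(child, query))
    return hits

a = Node((0, 0, 2, 2), label="A")
b = Node((5, 5, 7, 7), label="B")
root = Node((0, 0, 7, 7), children=[a, b])   # bounds both A and B

hits_a = range_search(root, (1, 1, 3, 3))
hits_none = range_search(root, (3, 3, 4, 4))
```

The second query overlaps the root's rectangle but neither child, which shows why large coverage (dead space) costs extra node visits.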
Space Filling Curves Assumption: attribute values can be represented with some fixed # of bits, k Space domain on each dimension: 2^k values Linearize the domain Each point can be represented by a single one-dimensional value
Z-ordering
Z-ordering The z-value is obtained by interleaving the bits. E.g. X=01, Y=11: z-value = 0111 = 7 Clustering effect on X-Y; z-values can be indexed using B+-trees Range queries: problematic?
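The bit interleaving can be sketched as follows, assuming two dimensions, non-negative integer coordinates of `bits` bits each, and the X bit placed before the Y bit at every position (the order that reproduces the slide's example). The function name `z_value` is illustrative.

```python
def z_value(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y, most significant first, x before y."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)   # next bit of x
        z = (z << 1) | ((y >> i) & 1)   # next bit of y
    return z

example = z_value(0b01, 0b11, bits=2)   # X=01, Y=11 from the slide

# z-values of all cells of a 4x4 grid, listed row by row (y = 0..3)
grid = [[z_value(x, y, bits=2) for x in range(4)] for y in range(4)]
```

In `grid`, horizontally adjacent cells such as (x=1, y=0) and (x=2, y=0) receive z-values 2 and 8, so neighbours in space are not always neighbours on the curve. That is one reason a rectangular range query may map to several disjoint z-value intervals.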
Hilbert Curve
Grid Files Based on extendible hashing Design principle: any point query can be answered in at most 2 disk accesses. Two structures: a k-dimensional array (the directory) and k one-dimensional arrays (the linear scales)
Extendible Hashing Situation: Hash Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets? –Reading and writing all pages is expensive! –Idea: Use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed! –Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! –Trick lies in how hash function is adjusted!
Example Directory is an array of size 4. To find the bucket for r, take the last `global depth' # bits of h(r); we denote r by h(r). –If h(r) = 5 = binary 101, it is in the bucket pointed to by 01. Insert: If the bucket is full, split it (allocate new page, re-distribute). If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.) [Figure: directory with entries 00, 01, 10, 11 pointing to the data pages. Bucket A (00): 4* 12* 32* 16*; Bucket B (01): 1* 5* 21* 13*; Bucket C (10): 10*; Bucket D (11): 15* 7* 19*.]
Insert h(r)=20 (Causes Doubling) [Figure: before the insert, global depth is 2 and every bucket has local depth 2. Bucket A is full, so it splits into A (32* 16*) and Bucket A2, the `split image' of Bucket A (4* 12* 20*), both with local depth 3. Buckets B (1* 5* 21* 13*), C (10*) and D (15* 7* 19*) are unchanged. The directory doubles to global depth 3.]
Points to Note 20 = binary 10100. Last 2 bits (00) tell us r belongs in A or A2. Last 3 bits needed to tell which. –Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to. –Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket. When does bucket split cause directory doubling? –Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and `fixing' pointer to split image page. (Use of least significant bits enables efficient doubling via copying of directory!)
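The lookup, split, and doubling rules can be put together in a small in-memory sketch. It assumes keys are already hash values (r denoted by h(r), as on the slides), 4 entries per bucket, and an initial global depth of 2; the class and method names are illustrative. It does not handle more than one bucket's worth of entries with identical hash values, which would need overflow pages.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=4, global_depth=2):
        self.capacity = capacity
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, h):
        # Last `global depth' bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def lookup(self, h):
        return h in self.directory[self._slot(h)].keys

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        # Bucket is full. Double the directory only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # doubling via copying
            self.global_depth += 1
        # Split the bucket: one more bit now distinguishes it from its split image.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image      # `fix' pointers to the split image
        # Re-distribute the old entries, then retry the insert.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)
        self.insert(h)

# Entries from the slide's example, then the insert that causes doubling.
index = ExtendibleHash()
for h in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    index.insert(h)
index.insert(20)
```

After the last insert the directory has 8 entries, slot 000 holds 32 and 16, slot 100 holds 4, 12 and 20, and slots 001 and 101 still share Bucket B because its local depth (2) is below the global depth (3).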
Directory Doubling Why use least significant bits in directory? Allows for doubling via copying! [Figure: directory doubling for the entry 6* (6 = binary 110), using least significant vs. most significant bits]
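The difference between the two choices can be shown on a directory represented as a plain list of bucket names. This is a sketch with illustrative function names: with least significant bits, new slot i points where old slot i mod 2^d pointed, so the new directory is the old one followed by a copy of itself; with most significant bits, new slot i points where old slot i div 2 pointed, so every existing entry has to move.

```python
def double_lsb(directory):
    """Doubling when slots are indexed by the least significant bits."""
    return directory + directory

def double_msb(directory):
    """Doubling when slots are indexed by the most significant bits."""
    return [bucket for bucket in directory for _ in range(2)]

old = ["A", "B", "C", "D"]
lsb = double_lsb(old)
msb = double_msb(old)
```

With `double_lsb` the first half of the directory is untouched, which is what makes doubling a simple copy.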
Comments on Extendible Hashing If directory fits in memory, equality search answered with one disk access; else two. –A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory. –Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large. –Multiple entries with same hash value cause problems! Delete: If removal of data entry makes bucket empty, it can be merged with its `split image'. If each directory element points to same bucket as its split image, can halve directory.
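The figures in the sizing example can be checked with a few lines of arithmetic. The sketch assumes 100MB means 10^8 bytes, a 4K page means 4096 bytes, every page is full, and there is one directory element per data page; the variable names are illustrative.

```python
FILE_BYTES = 100 * 10**6      # 100MB file
RECORD_BYTES = 100            # 100 bytes per record
PAGE_BYTES = 4 * 1024         # 4K pages

records = FILE_BYTES // RECORD_BYTES            # data entries in the file
records_per_page = PAGE_BYTES // RECORD_BYTES   # whole records that fit on a page
pages = records // records_per_page             # data pages = directory elements
```

This gives 1,000,000 records at 40 records per page, hence 25,000 pages and the same number of directory elements.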
Summary on Extendible Hashing Hash-based indexes: best for equality searches, cannot support range searches. Static Hashing can lead to long overflow chains. Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) –Directory to keep track of buckets, doubles periodically. –Can get large with skewed data; additional I/O if this does not fit in main memory.
Grid Files
Scales, Directory, Bucket Data structures: –Linear scales –Directory: an array whose elements are in one-to-one correspondence with the grid cells; each entry points to a data bucket –Data buckets
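A point lookup with these three structures can be sketched as: search each linear scale for the interval containing the coordinate, use the interval numbers to index the directory, then read the one bucket it names. The two-dimensional layout, the scale boundaries, and the bucket names below are made up for illustration, and a point on a boundary is assigned to the upper interval by assumption.

```python
from bisect import bisect_right

def bucket_for(x, y, x_scale, y_scale, directory):
    """Bucket id for point (x, y); scales are sorted lists of split boundaries."""
    ix = bisect_right(x_scale, x)   # interval number along x
    iy = bisect_right(y_scale, y)   # interval number along y
    return directory[ix][iy]

x_scale = [10, 20]                  # x split into 3 intervals
y_scale = [50]                      # y split into 2 intervals
directory = [["A", "B"],            # directory[ix][iy] -> data bucket
             ["A", "C"],
             ["D", "D"]]            # several cells may share one bucket

b1 = bucket_for(5, 10, x_scale, y_scale, directory)
b2 = bucket_for(15, 60, x_scale, y_scale, directory)
b3 = bucket_for(25, 60, x_scale, y_scale, directory)
```

If the scales are kept in memory, the lookup needs one access for the directory cell and one for the data bucket, which matches the two-disk-access design principle.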
Splitting and Merging
Grid Files... Repetitive splitting by halving Merging based on buddy system Regions are represented as (cx, cy, dx, dy) –point queries: cx-dx <= qx <= cx+dx and cy-dy <= qy <= cy+dy
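The region test and a split by halving can be sketched directly from the (cx, cy, dx, dy) representation, reading (cx, cy) as the region's center and (dx, dy) as its half-extents, which is what the inequalities imply. The function names and the choice to split along x are illustrative.

```python
def contains(region, qx, qy):
    """Point-query test for a region given as (cx, cy, dx, dy)."""
    cx, cy, dx, dy = region
    return cx - dx <= qx <= cx + dx and cy - dy <= qy <= cy + dy

def split_x(region):
    """Halve a region along x; the two halves are buddies of each other."""
    cx, cy, dx, dy = region
    half = dx / 2
    return (cx - half, cy, half, dy), (cx + half, cy, half, dy)

region = (4, 4, 4, 4)               # covers [0, 8] x [0, 8]
inside = contains(region, 0, 8)
outside = contains(region, 9, 4)
left, right = split_x(region)
```

Merging under the buddy system reverses `split_x`: only two regions produced by the same split are candidates to be merged back.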
Grid Files... [Figure: regions A to F, with a query point (qx, qy) and a region's center (cx, cy) and half-extents (dx, dy)]