Multidimensional Indices
  Instructor: Randal Burns
  Lecture for 29 November 2005
  Computer Science, Johns Hopkins University
1-D and 2-D Indexing
  Index structures we know so far are one-dimensional
    – Even when indexing on multiple attributes
    – The attributes are either:
        ordered – a 1-dimensional binary relation
        hashed – placed into a 1-d hash space
  Not all data are one-dimensional
    – Substructure within data items
    – Need to be looked up on several fields
Dimensionality in Std. DBs
  What is the dimensionality of data in the relational model?
    The arity of a relation; data is inherently multidimensional in DBs
  Do relational DBs support multidimensional queries using only 1-d indices? Several techniques
    – Multiple indices – can look up on different attributes, but still only one at a time
    – General queries – can conduct any query; this is the power of the relational model!
  What is the outstanding problem?
    Indices optimize queries. While there is support for multi-d data, the indices are not “tuned” for these queries
Overview of Techniques
  Multidimensional hash tables
    – Grid files
    – Partitioned hash functions
  Hierarchical indices
    – Multiple-key indices
  Multidimensional trees
    – Kd-trees
    – Quad trees
    – R-trees
Applications: Geographic Data
  Geographic information systems – map
  Circuit design – placement of components
  Queries
    – Partial match – equality on some dimensions; find all objects that match, whatever their values in the other dimensions
    – Range – find objects within given ranges in the dimensions
    – Nearest neighbor – find objects close to a point or specified object
    – Where-am-I queries – reverse mapping of a point to an object, e.g. mouse click to button
Applications: Data Cubes
  View all data as high-dimensional
    – Consider a sale: day and time, store, item, cost
    – Creates a 4-d grid
  Information in this grid can be clustered
    – Decision support
    – Data mining
  Look for trends in data
    – Example: determine what products sell in what stores and bind it to demographic/political/cultural data
Multidimensional Queries in SQL
  SQL support for a nearest-neighbor query
    – Relation POINTS { float x, float y }
  Find the nearest point to point (10.0, 20.0)
    SELECT * FROM POINTS p
    WHERE NOT EXISTS (
      SELECT * FROM POINTS q
      WHERE (q.x-10.0)*(q.x-10.0) + (q.y-20.0)*(q.y-20.0) <
            (p.x-10.0)*(p.x-10.0) + (p.y-20.0)*(p.y-20.0)
    );
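The nearest-neighbor query above can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module; the four sample points are invented for illustration and are not from the lecture. With no suitable index, the subquery has to be checked against every candidate row, which is the tuning problem the earlier slides point at.

```python
import sqlite3

# In-memory database with the POINTS relation from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POINTS (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO POINTS VALUES (?, ?)",
    [(0.0, 0.0), (9.0, 21.0), (30.0, 5.0), (12.0, 26.0)],  # made-up sample data
)

# p is a nearest point if no q is strictly closer to (10.0, 20.0).
NEAREST = """
SELECT p.x, p.y FROM POINTS p
WHERE NOT EXISTS (
  SELECT * FROM POINTS q
  WHERE (q.x-10.0)*(q.x-10.0) + (q.y-20.0)*(q.y-20.0) <
        (p.x-10.0)*(p.x-10.0) + (p.y-20.0)*(p.y-20.0)
)
"""
nearest = conn.execute(NEAREST).fetchall()
print(nearest)  # [(9.0, 21.0)]
```

Ties are all returned, since the comparison is strict.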
Multidimensional Queries in SQL
  SQL support for a point-in-rectangle query
    – Relation RECT { id, xll, yll, xur, yur }
    SELECT id FROM RECT
    WHERE xll <= 10.0 AND yll <= 20.0 AND xur >= 10.0 AND yur >= 20.0
Grid Files
  Partition each dimension into ranges
    – Create a bucket (block) for each combination of ranges, one range per dimension
    – Buckets are n-dimensional now
  Lookup – index in both dimensions
  Insert – reverse lookup and insert
    – Complexities come if out of space
        Chain blocks in a grid bucket
        Reorganize grid lines/add new grid lines
    – Same skew problems as with range partitioning, just in multiple dimensions now
Grid File Support for Queries
  Partial match: scoped to buckets in the specified dimensions
  Range: scoped to buckets in ranges
  Nearest neighbor: need to consider grid boundaries (draw), but scoped to feasible buckets
  Where-am-I: no, this represents data points not data objects
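The grid-file lookup and range scoping described above can be sketched in a few lines. This is a toy in-memory version under stated assumptions: two dimensions, fixed grid lines chosen up front, and unbounded Python lists standing in for disk blocks (so no overflow chaining or grid reorganization). The class name GridFile and its methods are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

class GridFile:
    """Toy 2-d grid file: fixed grid lines, one in-memory list per bucket."""

    def __init__(self, x_lines, y_lines):
        self.x_lines = sorted(x_lines)   # boundaries between x ranges
        self.y_lines = sorted(y_lines)   # boundaries between y ranges
        self.buckets = defaultdict(list)

    def _cell(self, x, y):
        # Index each dimension separately to find the bucket coordinates.
        return bisect_right(self.x_lines, x), bisect_right(self.y_lines, y)

    def insert(self, x, y):
        self.buckets[self._cell(x, y)].append((x, y))

    def lookup(self, x, y):
        return (x, y) in self.buckets.get(self._cell(x, y), [])

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        # Only buckets whose ranges overlap the query rectangle are visited.
        i_lo, j_lo = self._cell(x_lo, y_lo)
        i_hi, j_hi = self._cell(x_hi, y_hi)
        hits = []
        for i in range(i_lo, i_hi + 1):
            for j in range(j_lo, j_hi + 1):
                for x, y in self.buckets.get((i, j), []):
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                        hits.append((x, y))
        return sorted(hits)

grid = GridFile(x_lines=[10, 20], y_lines=[10, 20])   # a 3 x 3 grid
for p in [(5, 5), (12, 18), (15, 25), (25, 8)]:
    grid.insert(*p)
print(grid.range_query(11, 26, 0, 19))  # [(12, 18), (25, 8)]
```

A partial-match query is the same loop with one dimension's range left open, so it visits a whole row or column of buckets.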
Partitioned Hashing
  For a series of hash attributes A1, A2, A3, …, An compute a function h = h1(A1), h2(A2), …, hn(An)
    – The bucket number is the concatenation of the per-attribute hash bits
  Queries that specify only some dimensions are scoped to suitable buckets
  If one specifies all dimensions except A2 and A4, with 3 bits per attribute
    Look in buckets 101XXX010XXX1100…..
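To make the bit-concatenation idea concrete, here is a minimal sketch. It assumes integer attributes, 3 bits per attribute, and a toy hash (the value modulo 8); the function names are hypothetical. Unspecified attributes behave like the X positions in the bucket pattern: every bit combination has to be probed.

```python
from itertools import product

BITS = 3  # assumed: 3 hash bits per attribute

def h(value):
    """Toy per-attribute hash for integers: keep the low BITS bits."""
    return value % (1 << BITS)

def concat(parts):
    """Concatenate per-attribute hash values into one bucket number."""
    b = 0
    for part in parts:
        b = (b << BITS) | part
    return b

def bucket_id(values):
    """Bucket for a fully specified tuple."""
    return concat(h(v) for v in values)

def candidate_buckets(partial):
    """Buckets to probe for a partial match; None marks an unspecified attribute."""
    choices = [range(1 << BITS) if v is None else [h(v)] for v in partial]
    return [concat(combo) for combo in product(*choices)]

print(bucket_id((5, 2, 7)))                     # 343, i.e. bits 101 010 111
print(len(candidate_buckets((5, None, 7))))     # 8 buckets of the form 101 XXX 111
```

Because hashing destroys order, nothing similar works for range or nearest-neighbor queries: nearby values land in unrelated buckets.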
Part. Hash. Query Support
  Partial match: scoped to buckets in the specified dimensions
  Range: useless
  Nearest neighbor: useless
  Where-am-I: useless
  Relation between grid files and partitioned hashing is similar to that between range and hash partitioning
    – Skew and generality
Multiple Key Indexes
  Tree-like multi-dimensional structure
  Figure 5.11
  Partial match: scoped when the higher dimensions are specified, otherwise bad news
  Range: very effective
  Nearest neighbor: reasonably efficient when built on top of a range query
    – E.g. find all neighbors less than distance d and compute their distance
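The range query and the range-based nearest-neighbor search can be sketched as follows. This is a simplified two-level structure, assuming two numeric attributes and sorted in-memory lists in place of real index blocks; MultiKeyIndex and nearest are hypothetical names.

```python
from bisect import bisect_left, bisect_right, insort
from math import hypot

class MultiKeyIndex:
    """Toy two-level index: sorted x keys, each pointing to a sorted list of y values."""

    def __init__(self):
        self.xs = []   # root level: sorted distinct x values
        self.ys = {}   # second level: x -> sorted list of y values

    def insert(self, x, y):
        if x not in self.ys:
            insort(self.xs, x)
            self.ys[x] = []
        insort(self.ys[x], y)

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        hits = []
        # Narrow on x first, then on y inside each matching sub-index.
        for x in self.xs[bisect_left(self.xs, x_lo):bisect_right(self.xs, x_hi)]:
            col = self.ys[x]
            for y in col[bisect_left(col, y_lo):bisect_right(col, y_hi)]:
                hits.append((x, y))
        return hits

def nearest(index, px, py, d):
    """Nearest point within distance d of (px, py), or None if there is none."""
    box = index.range_query(px - d, px + d, py - d, py + d)
    scored = [(hypot(x - px, y - py), (x, y)) for x, y in box]
    scored = [s for s in scored if s[0] <= d]
    return min(scored)[1] if scored else None

idx = MultiKeyIndex()
for p in [(1, 1), (4, 5), (4, 9), (8, 2)]:
    idx.insert(*p)
print(idx.range_query(2, 8, 0, 6))   # [(4, 5), (8, 2)]
print(nearest(idx, 5, 5, 2))         # (4, 5)
```

If nearest returns None, the caller would retry with a larger d. A partial match on y alone has to visit every x entry, which is the "bad news" case.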
Kd-trees
  Binary search tree in which each level alternates the attribute it splits on
    – Leaves occur when only a block’s worth of tuples remains
  Figure 5.13 – example with a block size of 2
Kd-trees Query Support
  Partial-match: only on specified attributes
  Range queries – when a range straddles a branch, must explore both sides
    – But this is what they are good for
  Nearest neighbor – same approach as multiple-key indexes
  Compared with multiple-key indexes
    – Kd-trees might (depending on data) provide better scoping by alternating between dimensions
    – Gains are specious given the increased complexity
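A small sketch may help show how a range query descends a kd-tree and when it must explore both sides of a split. It assumes 2-d points, a median split, and a leaf capacity of 2; the node layout (a tuple for interior nodes, a list for leaves) and the sample points are illustrative choices, not the lecture's.

```python
from math import inf

BLOCK = 2  # assumed leaf capacity (tuples per block)

def build(points, depth=0):
    """Return a leaf (list of points) or an interior node (axis, split, left, right)."""
    if len(points) <= BLOCK:
        return list(points)
    axis = depth % 2                       # alternate x, y, x, y, ...
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    split = pts[mid][axis]                 # left values <= split <= right values
    return (axis, split, build(pts[:mid], depth + 1), build(pts[mid:], depth + 1))

def range_query(node, lo, hi):
    """All points p with lo[i] <= p[i] <= hi[i] in both dimensions."""
    if isinstance(node, list):             # leaf: check each tuple in the block
        return [p for p in node if all(lo[i] <= p[i] <= hi[i] for i in range(2))]
    axis, split, left, right = node
    hits = []
    if lo[axis] <= split:                  # range reaches into the left side
        hits += range_query(left, lo, hi)
    if hi[axis] >= split:                  # range reaches into the right side
        hits += range_query(right, lo, hi)
    return hits

points = [(1, 1), (2, 7), (3, 4), (5, 2), (6, 8), (7, 5), (8, 3), (9, 9)]
tree = build(points)
print(sorted(range_query(tree, (2, 2), (7, 6))))   # [(3, 4), (5, 2), (7, 5)]
print(range_query(tree, (5, -inf), (5, inf)))      # partial match on x = 5: [(5, 2)]
```

A partial match is just a range query with the unspecified attribute left unbounded, so levels that split on that attribute give no pruning.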
Quad Trees
  Each interior node divides its square region into four equal quadrants, one child per quadrant
  Figure 5.17
Quad Trees Query Support
  Partial-match: on all attributes
  Range queries – yes, but must visit all overlapping quads
  Nearest neighbor – only in so far as range queries
  Has more in common with grid files
  Problems
    – with knowing domains a priori
    – skew in data leads to regions of very different sizes and many empty regions
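The following sketch illustrates the quad-tree points above: the domain must be fixed up front, and a range query visits every quadrant that overlaps the query rectangle. It assumes 2-d points inside a half-open root square, a leaf capacity of 2, and a minimum region size of 1 so that duplicate points cannot force endless splitting; the class name Quad is hypothetical.

```python
CAP = 2  # assumed leaf capacity

class Quad:
    """Toy point quad tree over the square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.points = []   # used while this node is a leaf
        self.kids = None   # four children once the node has split

    def _child(self, x, y):
        half = self.size / 2
        i = (1 if x >= self.x0 + half else 0) + (2 if y >= self.y0 + half else 0)
        return self.kids[i]

    def insert(self, x, y):
        if self.kids is not None:
            self._child(x, y).insert(x, y)
            return
        self.points.append((x, y))
        if len(self.points) > CAP and self.size > 1:
            half = self.size / 2
            # Child order matches _child: index = dx + 2 * dy.
            self.kids = [Quad(self.x0 + dx * half, self.y0 + dy * half, half)
                         for dy in (0, 1) for dx in (0, 1)]
            pts, self.points = self.points, []
            for px, py in pts:
                self._child(px, py).insert(px, py)

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        # Skip quadrants that do not overlap the query rectangle.
        if (x_hi < self.x0 or x_lo >= self.x0 + self.size or
                y_hi < self.y0 or y_lo >= self.y0 + self.size):
            return []
        if self.kids is None:
            return [(x, y) for x, y in self.points
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
        hits = []
        for kid in self.kids:
            hits += kid.range_query(x_lo, x_hi, y_lo, y_hi)
        return hits

root = Quad(0, 0, 16)   # the domain has to be known a priori
data = [(1, 1), (2, 3), (9, 2), (12, 12), (3, 14), (10, 11)]
for p in data:
    root.insert(*p)
print(sorted(root.range_query(0, 10, 0, 4)))   # [(1, 1), (2, 3), (9, 2)]
```

With skewed data, a few quadrants keep splitting while their siblings stay empty, which is the region-size problem noted above.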
Region Trees (R-Trees)
  Represent objects, not just points
  Good for spatial data
  Capture the spirit of B-Trees for multi-dimensional data
    – B-tree divides a line (1-d space) into intervals
    – R-tree divides a space (n-d space) into regions; generally use simple shapes, like rectangles
    – Regions may overlap, but should do so minimally
    – Each object should be contained entirely within a single region
  Develop Figures 5.20 and 5.21
    – Add another house: Figure 5.22
R-Trees Query Support
  Partial-match: on all attributes, but complicated by overlap
  Range queries – yes, but complicated by overlap
  Nearest neighbor – only in so far as range queries
  Where-am-I – yes, can represent objects in R-tree regions
  Complexities
    – Managing shapes, limiting overlap, preserving containment property
  Overlap is required for the containment property to serve the where-am-I query
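A where-am-I query on an R-tree can be sketched as a descent into every region whose bounding rectangle contains the query point. The sketch below assumes rectangular objects and a hand-built two-level tree; it leaves out insertion and node splitting entirely, and the object names and coordinates are invented for illustration. Note that the two leaf regions overlap, so a point in the overlap forces both subtrees to be searched.

```python
from dataclasses import dataclass, field

@dataclass
class Rect:
    xll: float
    yll: float
    xur: float
    yur: float

    def contains_point(self, x, y):
        return self.xll <= x <= self.xur and self.yll <= y <= self.yur

def bounding(rects):
    """Smallest rectangle that contains every rectangle in rects."""
    return Rect(min(r.xll for r in rects), min(r.yll for r in rects),
                max(r.xur for r in rects), max(r.yur for r in rects))

@dataclass
class Node:
    mbr: Rect                                      # region covered by this node
    children: list = field(default_factory=list)   # sub-regions (interior node)
    objects: list = field(default_factory=list)    # (name, Rect) pairs (leaf node)

def leaf(objects):
    return Node(bounding([r for _, r in objects]), objects=list(objects))

def interior(children):
    return Node(bounding([c.mbr for c in children]), children=list(children))

def where_am_i(node, x, y):
    """Names of all objects whose rectangle contains the point (x, y)."""
    if not node.mbr.contains_point(x, y):
        return []                                  # whole region can be skipped
    hits = [name for name, r in node.objects if r.contains_point(x, y)]
    for child in node.children:                    # overlapping regions: try each
        hits += where_am_i(child, x, y)
    return hits

a = leaf([("house", Rect(1, 1, 3, 3)), ("road", Rect(0, 4, 10, 5))])
b = leaf([("school", Rect(6, 2, 9, 6)), ("pool", Rect(7, 7, 8, 8))])
root = interior([a, b])
print(where_am_i(root, 7, 4.5))   # ['road', 'school']
print(where_am_i(root, 2, 2))     # ['house']
```

Each object sits entirely inside one leaf region, which is the containment property; keeping it forces the road's region to stretch across the school's.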