Chapter 5 Multidimensional Indexes. A one-dimensional index can be used to support a multidimensional query by combining the fields into a single search key: F1 = 'abcd' and F2 = 123 give the key 'abcd#123'.

Presentation transcript:

Applications Needing Multiple Dimensions: Geographic Information Systems; Data Cubes.

Geographic Information Systems: In a GIS, data are stored in a two-dimensional space such as a map.

Typical Queries of GIS: partial-match queries; range queries; nearest-neighbor queries; where-am-I queries.

Data Cubes: Data with multiple properties can be seen as existing in a high-dimensional space. Multidimensional data is gathered by many corporations for decision-support applications.

An Example of a Data Cube: A chain store may record each sale made, including: the day and time; the store at which the sale was made; the item purchased; the color of the item; the size of the item; other properties. Example query: give the sales of pink shirts for each store and each month of 1998.

Multidimensional Queries in SQL: Multidimensional data can be stored in a conventional relational database and queried in SQL.

Finding the nearest points to (10.0, 20.0): Store points in the relation Points(x, y), with x and y representing the x- and y-coordinates.
SELECT * FROM Points p
WHERE NOT EXISTS (
  SELECT * FROM Points q
  WHERE (q.x-10)*(q.x-10) + (q.y-20)*(q.y-20) < (p.x-10)*(p.x-10) + (p.y-20)*(p.y-20)
);

Finding the rectangles that contain (10.0, 20.0): Rectangles(id, xll, yll, xur, yur), where (xll, yll) is the lower-left corner and (xur, yur) is the upper-right corner.
SELECT id FROM Rectangles
WHERE xll <= 10 AND yll <= 20 AND xur >= 10 AND yur >= 20;

Summarizing the sales of pink shirts: Sales(day, store, item, color, size)
SELECT day, store, COUNT(*) AS totalSales
FROM Sales
WHERE item = 'shirt' AND color = 'pink'
GROUP BY day, store;
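The three queries above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the rows inserted below are made-up sample data, not data from the slides, and an ORDER BY is added so that the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample data, chosen only to exercise the queries.
cur.execute("CREATE TABLE Points (x REAL, y REAL)")
cur.executemany("INSERT INTO Points VALUES (?, ?)",
                [(9.0, 21.0), (10.5, 20.5), (30.0, 5.0)])
cur.execute("CREATE TABLE Rectangles (id INTEGER, xll REAL, yll REAL, xur REAL, yur REAL)")
cur.executemany("INSERT INTO Rectangles VALUES (?, ?, ?, ?, ?)",
                [(1, 0, 0, 15, 25), (2, 12, 0, 40, 40), (3, 5, 15, 10, 20)])
cur.execute("CREATE TABLE Sales (day TEXT, store TEXT, item TEXT, color TEXT, size TEXT)")
cur.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?)",
                [("1998-01-05", "A", "shirt", "pink", "M"),
                 ("1998-01-05", "A", "shirt", "pink", "L"),
                 ("1998-01-05", "B", "shirt", "blue", "M"),
                 ("1998-01-06", "B", "shirt", "pink", "S")])

# Nearest point(s) to (10, 20): no other point is strictly closer.
nearest = cur.execute("""
    SELECT * FROM Points p
    WHERE NOT EXISTS (
        SELECT * FROM Points q
        WHERE (q.x-10)*(q.x-10) + (q.y-20)*(q.y-20)
            < (p.x-10)*(p.x-10) + (p.y-20)*(p.y-20))
""").fetchall()

# Rectangles that contain (10, 20).
containing = cur.execute("""
    SELECT id FROM Rectangles
    WHERE xll <= 10 AND yll <= 20 AND xur >= 10 AND yur >= 20
    ORDER BY id
""").fetchall()

# Sales of pink shirts per day and store.
pink = cur.execute("""
    SELECT day, store, COUNT(*) AS totalSales
    FROM Sales
    WHERE item = 'shirt' AND color = 'pink'
    GROUP BY day, store
    ORDER BY day, store
""").fetchall()

print(nearest, containing, pink)
```

Without a multidimensional index, the nearest-point query compares every point with every other point, which is the cost the rest of the chapter tries to avoid.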

Executing Range Queries Using Conventional Indexes: Given ranges in all dimensions, suppose we build a secondary B+-tree index for each dimension. Using the B+-tree for each dimension, we get pointers to all of the records in the range for that dimension. We then intersect these pointer sets to get the final range query result.

The disk I/O for a range query includes the I/O needed: to find the way down the B-trees; to examine the leaf nodes of each B-tree; to retrieve all the matching records.

Example: Suppose a leaf node holds 200 key-pointer pairs and a block holds 100 records. The range query asks for the points in the square of side 100 surrounding the center of the space, so we access the 100,000 pointers in the range in either dimension. Disk I/O: 2 × (100,000/200 + 1) = 1,002 for the two B-trees, plus the number of data blocks containing the desired points (at worst 10,000). Little help: the indexes save little compared with looking at every block of the data file.
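The disk I/O estimate for this range query can be checked with a few lines of arithmetic. The function name and parameters below are illustrative; the numbers are the ones given on the slide.

```python
def range_query_io(pointers_per_dim, pairs_per_leaf, dims, worst_case_data_blocks):
    """Estimate disk I/O for a range query answered with one B-tree per dimension."""
    leaves = pointers_per_dim // pairs_per_leaf   # leaf blocks scanned in each B-tree
    index_io = dims * (leaves + 1)                # +1 for one interior node per B-tree
    return index_io, index_io + worst_case_data_blocks

index_io, worst_total = range_query_io(100_000, 200, 2, 10_000)
print(index_io, worst_total)
```

The index part is small; the total is dominated by fetching the matching records, which may sit on almost as many blocks as there are matches.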

Executing Nearest-Neighbor Queries Using Conventional Indexes: 1. Pick a range in each dimension. 2. Ask the range query. 3. Select the point closest to the target within that range.

Two things that could go wrong: (1) There is no point within distance d of the given point, so we repeat the entire process with a higher value of d. (2) The closest point in the range is at distance d' > d from the target, so a closer point may lie outside the range; we repeat the search with d' in place of d. [Figure: the target, the closest point in the range, and a possible closer point outside the range.]
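The retry logic for the two failure cases can be sketched as follows. The range query here is a plain linear filter standing in for the B-tree lookups, and doubling d when the range is empty is an assumption; the slides only say to use a higher value of d.

```python
import math

def range_query(points, x0, y0, d):
    """Stand-in for the index lookups: points in the square of half-side d."""
    return [p for p in points if abs(p[0] - x0) <= d and abs(p[1] - y0) <= d]

def nearest_neighbor(points, x0, y0, d=1.0):
    if not points:
        return None
    while True:
        candidates = range_query(points, x0, y0, d)
        if not candidates:
            d *= 2                      # case 1: nothing in range, widen and retry
            continue
        best = min(candidates, key=lambda p: math.dist(p, (x0, y0)))
        d_best = math.dist(best, (x0, y0))
        if d_best > d:
            d = d_best                  # case 2: a closer point may lie outside the square
            continue
        return best

pts = [(10.9, 20.9), (10.0, 21.2), (50.0, 50.0)]   # hypothetical points
result = nearest_neighbor(pts, 10.0, 20.0, d=1.0)
print(result)
```

In the sample data the first range finds (10.9, 20.9) near a corner of the square, at distance greater than d, so the search is repeated and finds the truly closest point just outside the original range.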

Disk I/O to find the nearest neighbor to (10.0, 20.0): Pick d = 1. Examine the B-tree for the x-coordinate with the range query 9.0 <= x <= 11.0 (that is, 10.0 - d <= x <= 10.0 + d). This gets about 2,000 points, so we traverse at least 10 leaves, most likely 11, plus one disk I/O for an intermediate node. Another 12 disk I/O's for the y-coordinate. One more disk I/O to retrieve the desired record. A total of 25 disk I/O's. Significantly more disk I/O's if the search has to be repeated.

Multidimensional Index Structures
1. Hash-table-like approaches: (1) Grid Files; (2) Partitioned Hash Functions
2. Tree-like approaches: (1) Multiple-Key Indexes; (2) kd-Trees; (3) Quad Trees; (4) R-Trees
3. Bitmap Indexes

Grid Index. [Figure: a grid with the Key 1 values V1, V2, ..., Vn along one axis and the Key 2 values X1, X2, ..., Xm along the other; each cell leads to the records with that pair of values, e.g. key1 = V3, key2 = X2.]

Customers who bought gold jewelry, as (age, salary in $K) points: (25,60), (45,60), (50,75), (50,100), (50,120), (70,110), (85,140), (30,260), (25,400), (45,350), (50,275), (60,260). [Figure: the points plotted with Age and Salary axes; grid lines at salary 90K and 225K.]

How is a Grid Index stored on disk? Like an array: the entries X1, X2, X3, X4 for V1, then those for V2, then those for V3. Problem: we need regularity so that we can compute the position of an entry.

Solution: Use indirection. The grid contains only pointers to buckets; the records are stored in the buckets. [Figure: grid rows V1 to V4 and columns X1 to X3, with cells pointing to buckets.]

The grid file representing database of customers

Lookup in a Grid File: The positions of the point in each of the dimensions together determine the bucket.

Insertion Into Grid Files: Look up the bucket for the new record and place the record in that bucket. If there is no room, there are two general approaches: (1) add overflow blocks to the bucket; (2) reorganize the structure by adding or moving the grid lines.
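Lookup and insertion with overflow blocks, approach (1), might look like the sketch below. The class and method names are hypothetical. The salary grid lines 90 and 225 come from the customer figure; the age grid lines 40 and 55 and the block capacity of 2 are assumptions made for the example. A point that lies exactly on a grid line is placed in the cell above the line.

```python
from bisect import bisect_right

class GridFile:
    def __init__(self, x_lines, y_lines, block_capacity=2):
        self.x_lines = x_lines          # grid lines in the first dimension (age)
        self.y_lines = y_lines          # grid lines in the second dimension (salary)
        self.capacity = block_capacity
        self.buckets = {}               # cell -> list of blocks; extra blocks are overflow

    def cell(self, x, y):
        """The position in each dimension together determines the bucket."""
        return bisect_right(self.x_lines, x), bisect_right(self.y_lines, y)

    def lookup(self, x, y):
        blocks = self.buckets.get(self.cell(x, y), [])
        return [p for block in blocks for p in block if p == (x, y)]

    def insert(self, x, y):
        blocks = self.buckets.setdefault(self.cell(x, y), [[]])
        if len(blocks[-1]) >= self.capacity:
            blocks.append([])           # no room: add an overflow block
        blocks[-1].append((x, y))

customers = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
grid = GridFile(x_lines=[40, 55], y_lines=[90, 225])
for age, salary in customers:
    grid.insert(age, salary)
grid.insert(52, 200)                    # lands in a full bucket, so it overflows
print(grid.cell(52, 200), grid.buckets[(1, 1)])
```

Approach (2), adding or moving grid lines, avoids the overflow chain but requires redistributing the points of every bucket that the new line cuts through.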

Insertion of the point (52,200) followed by splitting of buckets. [Figure: the customer points with Age and Salary axes; salary grid lines at 90K, 130K and 225K after the split.]

Performance of Grid Files.
Lookup of specific points: 1 disk I/O to read the bucket. Insertion/deletion: 2 disk I/O's (+1 if an overflow block is created).
Partial-match queries: look at all the buckets in a row or column of the bucket matrix.
Range queries: look at all the buckets that cover the range defined by the query.
Nearest-neighbor queries: it is not easy to put an upper bound on how costly the search is.

Partitioned hash function. Idea: hash Key1 with h1 and Key2 with h2; the two hash values together form the bucket number.

Example hash values, with buckets numbered 000 to 111: h1(toy)=0, h1(sales)=1, h1(art)=[value missing]; h2(10k)=01, h2(20k)=11, h2(30k)=01, h2(40k)=00. Ex: Insert

Same hash values. Find Emp. with Dept. = Sales AND Sal = 40k.

Same hash values. Find Emp. with Sal = 30k. [Figure marks the buckets to look in.]

Same hash values. Find Emp. with Dept. = Sales. [Figure marks the buckets to look in.]
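The three lookups can be sketched in a few lines. The hash values used below are an assumption: they are the values read off the example once the bucket labels 000 to 111 are separated from them (h1 gives one bit, h2 gives two bits), and h1(art) is left out because its value is not available. The sketch also assumes the bucket number is the h1 bit followed by the h2 bits.

```python
# Assumed hash values (see the note above); not a general-purpose hash function.
H1 = {"toy": "0", "sales": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}

def bucket(dept, sal):
    """Bucket number = h1 bits followed by h2 bits."""
    return H1[dept] + H2[sal]

def candidate_buckets(dept=None, sal=None):
    """Buckets to search when only some of the attributes are specified."""
    prefixes = [H1[dept]] if dept is not None else ["0", "1"]
    suffixes = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return sorted(p + s for p in prefixes for s in suffixes)

print(candidate_buckets("sales", "40k"))   # both attributes: one bucket
print(candidate_buckets(sal="30k"))        # salary only: two buckets
print(candidate_buckets(dept="sales"))     # department only: four buckets
```

A partial-match query fixes some bits of the bucket number and leaves the others free, which is why partitioned hashing suits such queries while giving no help with ranges.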

Comparison of Grid Files and Partitioned Hashing: Grid files are good at nearest-neighbor queries and range queries. Partitioned hashing is good at partial-match queries.

Tree-Like Structures for Multidimensional Data: multiple-key indexes; kd-trees; quad trees; R-trees.

Multi-Key Index. Motivation: find records where DEPT = "Toy" AND SAL > 50k.

Strategy I: Use one index, say on Dept. Get all Dept = "Toy" records and check their salary.

Strategy II: Use 2 indexes; manipulate pointers. Get the pointers for Dept = "Toy" from one index and the pointers for Sal > 50k from the other, then intersect them.

Strategy III: Multiple-Key Index. One idea: an index I1 on the first attribute, whose entries lead to indexes (I2, I3, ...) on the second attribute.

Example. Record: Name=Joe, DEPT=Sales, SAL=15k. Dept index: Art, Sales, Toy. Salary indexes: one with keys 10k, 15k, 17k, 21k and one with keys 12k, 15k, 19k.

Performance of Multiple-Key Indexes. Partial-match queries: quite efficient when the first attribute is specified. Range queries: handled quite well. Nearest-neighbor queries: use the same strategy as with the other index structures.

Partial-Match Queries: If the first attribute is specified, the access is quite efficient. If only the second attribute is specified, the access is time-consuming, because every subindex must be searched.

Range Queries: Use a range query on the first attribute to find all of the relevant subindexes. Then search each of these subindexes, using the range specified for the second attribute.
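A multiple-key index and its range query can be sketched with a dictionary of sorted lists. The class name is hypothetical, salaries are plain numbers in thousands, and every record other than Joe's is invented for the example. A real implementation would use B-trees rather than in-memory lists.

```python
from bisect import bisect_left, bisect_right

class MultiKeyIndex:
    def __init__(self):
        # first-attribute value -> subindex: list of (second value, record), sorted
        self.first = {}

    def insert(self, k1, k2, record):
        sub = self.first.setdefault(k1, [])
        pos = bisect_right([k for k, _ in sub], k2)
        sub.insert(pos, (k2, record))

    def range_query(self, lo1, hi1, lo2, hi2):
        out = []
        for k1 in sorted(self.first):              # range on the first attribute
            if lo1 <= k1 <= hi1:
                sub = self.first[k1]               # then search each subindex
                keys = [k for k, _ in sub]
                out.extend(r for _, r in sub[bisect_left(keys, lo2):bisect_right(keys, hi2)])
        return out

idx = MultiKeyIndex()
idx.insert("Sales", 15, "Joe")
idx.insert("Sales", 12, "Ann")     # hypothetical record
idx.insert("Toy", 60, "Bob")       # hypothetical record
idx.insert("Toy", 40, "Cy")        # hypothetical record
idx.insert("Art", 10, "Di")        # hypothetical record

toy_over_50k = idx.range_query("Toy", "Toy", 51, float("inf"))
print(toy_over_50k)
```

A query that gives only a salary range has to visit every subindex, which is the time-consuming case noted for partial-match queries on the second attribute.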

Nearest-Neighbor Queries: Pick a distance d. Ask the range query x0-d <= x <= x0+d and y0-d <= y <= y0+d. Find the closest point within this range. If there are no points within the range, or the distance from (x0,y0) to the closest point is greater than d, increase the range and search again.

kd-Trees (k-dimensional search trees): A generalization of the binary search tree to multidimensional data. Interior nodes have an associated attribute A and a dividing value V. The attributes rotate at different levels of the tree. Leaves are blocks holding data records.

A kd-tree example. [Figure: the root splits on Salary 150; its children split on Age 60 and Age 47; lower levels split on Salary and Age again; the leaves hold the customer points, such as (45,350).]

Tree after insertion of (35,500). [Figure: the same kd-tree after (35,500) has been inserted.]

Complex Queries on a kd-Tree. Partial-match queries: ask for all points with age = 50. Range queries: ask for all points with ages 35 to 55 and salaries $100K to $200K. Nearest-neighbor queries: use the same approach as discussed before.

Partial-Match Queries (ask for all points with age = 50): Explore both ways at a level with an unspecified attribute. Go one way at a level with the specified attribute.

Range Queries (ask for all points with ages 35 to 55 and salaries $100K to $200K): If the range straddles the splitting value, explore both children. Otherwise, move to only one child.
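Both query types can be sketched on a small kd-tree over the customer points. This is a simplified illustration: the tree is bulk-built by median splits rather than by repeated insertion, leaves hold at most two points (an assumed block size), and the split values therefore differ from those in the example figure. A partial-match query is run as a range query with an unbounded range on the unspecified attribute.

```python
import math

ORDER = (1, 0)   # points are (age, salary); split on salary at the root, then age

class Leaf:
    def __init__(self, points):
        self.points = points

class Interior:
    def __init__(self, attr, value, low, high):
        self.attr, self.value, self.low, self.high = attr, value, low, high

def build(points, depth=0, block_size=2):
    if len(points) <= block_size:
        return Leaf(list(points))
    attr = ORDER[depth % 2]
    pts = sorted(points, key=lambda p: p[attr])
    value = pts[len(pts) // 2][attr]              # dividing value V
    low = [p for p in pts if p[attr] < value]
    high = [p for p in pts if p[attr] >= value]
    if not low:                                   # cannot split here; keep an oversized leaf
        return Leaf(pts)
    return Interior(attr, value,
                    build(low, depth + 1, block_size),
                    build(high, depth + 1, block_size))

def range_query(node, lo, hi):
    if isinstance(node, Leaf):
        return [p for p in node.points
                if all(lo[a] <= p[a] <= hi[a] for a in range(2))]
    out = []
    if lo[node.attr] < node.value:                # range reaches below the split
        out += range_query(node.low, lo, hi)
    if hi[node.attr] >= node.value:               # range reaches the split or above
        out += range_query(node.high, lo, hi)
    return out

customers = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = build(customers)
in_range = sorted(range_query(tree, (35, 100), (55, 200)))
age_50 = sorted(range_query(tree, (50, -math.inf), (50, math.inf)))
print(in_range, age_50)
```

When the range straddles the dividing value both conditions hold and both children are explored; otherwise only one branch is followed.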

Nearest-Neighbor Queries Treat them as range queries Repeat with a larger range if necessary

Problem: (1) long paths: log2 n for a kd-tree with n leaves; (2) unused space: interior nodes hold little information. Two approaches to improve: multiway branches at interior nodes; group interior nodes into blocks.

Multiway Branches at Interior Nodes Interior nodes with many key-pointer pairs Keeping distribution and balance as we do for B-tree

Group Interior Nodes Into Blocks Packing many interior nodes into a single block. Including in one block a node and its descendants for some number of levels

Quad Trees Data points are contained in a square region. If data points in a square can fit in a block, the square will be a leaf of the tree. Otherwise, the square will be an interior node, with children corresponding to its four quadrants.
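The splitting rule can be sketched as follows. The class is hypothetical: it uses a rectangular region given by a corner, width and height so that age and salary can have different scales, a block capacity of two points is assumed, and the points are assumed to be distinct (many identical points would split forever).

```python
class QuadTree:
    def __init__(self, x0, y0, width, height, capacity=2):
        self.x0, self.y0, self.w, self.h = x0, y0, width, height
        self.capacity = capacity
        self.points = []        # used while this node is a leaf
        self.children = None    # four quadrants once this node is interior

    def _child_for(self, p):
        dx = 1 if p[0] >= self.x0 + self.w / 2 else 0
        dy = 1 if p[1] >= self.y0 + self.h / 2 else 0
        return self.children[dy * 2 + dx]

    def _split(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [QuadTree(self.x0 + dx * hw, self.y0 + dy * hh, hw, hh, self.capacity)
                         for dy in (0, 1) for dx in (0, 1)]
        pts, self.points = self.points, []
        for p in pts:                       # redistribute points to the quadrants
            self._child_for(p).insert(p)

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            self._split()                   # no longer fits in a block

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(c.count() for c in self.children)

qt = QuadTree(0, 0, 100, 400)               # age 0..100, salary 0..400 (assumed bounds)
for p in [(25, 60), (45, 60), (50, 75)]:
    qt.insert(p)
print(qt.children[0].points, qt.children[1].points)
```

Unlike a kd-tree, the split position is always the middle of the region, so the shape of the tree depends on how the data is distributed rather than on the order of insertion.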

Data organized in a quad tree. [Figure: the customer points with Age and Salary axes, with the region divided into quadrants.]

A quad tree

R-Trees ( Region Tree ) The R-tree node represents a data region which has subregions as its children. The data region can be of any shape. The subregions do not cover the entire region. The subregions are allowed to overlap.

The region of an R-tree node and subregions of its children

"Where-am-I" Query: Start at the root. Examine the subregions at the root to see whether they contain the point P. If no subregion contains P, then P is not in any data region. If at least one subregion contains P, recursively search each such subregion for P until reaching the leaves.
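The recursive search can be sketched as below. The node layout and names are assumptions for illustration: every entry pairs a bounding rectangle (xll, yll, xur, yur) with either a child node or, at a leaf, the name of a data region. The regions A, B and C are made up.

```python
def contains(rect, p):
    xll, yll, xur, yur = rect
    return xll <= p[0] <= xur and yll <= p[1] <= yur

class RNode:
    def __init__(self, entries, is_leaf):
        # leaf: (rect, region name); interior: (rect, child RNode)
        self.entries = entries
        self.is_leaf = is_leaf

def where_am_i(node, p):
    """Return the names of all data regions that contain point p."""
    found = []
    for rect, item in node.entries:
        if contains(rect, p):
            if node.is_leaf:
                found.append(item)
            else:
                found.extend(where_am_i(item, p))   # subregions may overlap: search each
    return found

leaf1 = RNode([((0, 0, 10, 10), "A"), ((5, 5, 20, 20), "B")], is_leaf=True)
leaf2 = RNode([((15, 15, 30, 30), "C")], is_leaf=True)
root = RNode([((0, 0, 20, 20), leaf1), ((15, 15, 30, 30), leaf2)], is_leaf=False)

print(where_am_i(root, (7, 7)), where_am_i(root, (17, 17)), where_am_i(root, (50, 50)))
```

Because subregions are allowed to overlap, the point (17, 17) is found through both children of the root, so the search cannot stop at the first matching subregion.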

Insert a new region Suppose that leaves have room for six regions.

Expand a region: Expanding the lower subregion adds 1000 square units; expanding the upper subregion adds 1200 square units, so the lower one is preferred.

Bitmap Indexes: 1. A bitmap index for a field F is a collection of bit-vectors of length n (n: number of records). 2. One bit-vector corresponds to each possible value that may appear in the field F. 3. The vector for value v has 1 in position i if the ith record has v in field F, and it has 0 there if not.

An Example of a Bitmap Index: Suppose a file has six records with two fields f and g: (30,foo), (30,bar), (40,baz), (50,foo), (40,bar), (30,baz).
f: 30: 110001; 40: 001010; 50: 000100
g: foo: 100100; bar: 010010; baz: 001001
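The bit-vectors for the six-record example follow directly from the definition and can be generated with a short function. The function name is illustrative, and the vectors are kept as strings purely for readability.

```python
def build_bitmap_index(values):
    """One bit-vector per distinct value; bit i is 1 if record i has that value."""
    n = len(values)
    index = {}
    for i, v in enumerate(values):
        index.setdefault(v, [0] * n)[i] = 1
    return {v: "".join(map(str, bits)) for v, bits in index.items()}

records = [(30, "foo"), (30, "bar"), (40, "baz"), (50, "foo"), (40, "bar"), (30, "baz")]
f_index = build_bitmap_index([f for f, _ in records])
g_index = build_bitmap_index([g for _, g in records])
print(f_index)
print(g_index)
```

Each record sets exactly one bit per field, so the vectors for one field never have a 1 in the same position.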

Partial-match queries by bitmap indexes: Movie(title, year, length, studioName)
SELECT title FROM Movie
WHERE studioName = 'Disney' AND year = 1995;
Take the bitwise AND of the bit-vector for year = 1995 and the bit-vector for studioName = 'Disney'.

Range queries by bitmap indexes: Records: 1:(25,60) 2:(45,60) 3:(50,75) 4:(50,100) 5:(50,120) 6:(70,140) 7:(85,140) 8:(45,350). Find all records with an age in the range 45-55 and a salary in the range 100-200, using the bitmap indexes as follows.
Age: 25: 10000000; 45: 01000001; 50: 00111000; 70: 00000100; 85: 00000010
Salary: 60: 11000000; 75: 00100000; 100: 00010000; 120: 00001000; 140: 00000110; 350: 00000001
Age 45-55: 01000001 OR 00111000 = 01111001
Salary 100-200: 00010000 OR 00001000 OR 00000110 = 00011110
01111001 AND 00011110 = 00011000, so records 4 and 5 qualify.
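The same range query can be reproduced in code: OR together the vectors whose value lies in each range, then AND the two results. The helper names are hypothetical; the records are the eight listed on the slide.

```python
def bitmap_index(values):
    index = {}
    for i, v in enumerate(values):
        index.setdefault(v, [0] * len(values))[i] = 1
    return index

def or_range(index, lo, hi, n):
    """Bitwise OR of the vectors for all values v with lo <= v <= hi."""
    out = [0] * n
    for v, bits in index.items():
        if lo <= v <= hi:
            out = [a | b for a, b in zip(out, bits)]
    return out

records = [(25, 60), (45, 60), (50, 75), (50, 100),
           (50, 120), (70, 140), (85, 140), (45, 350)]
n = len(records)
age_idx = bitmap_index([age for age, _ in records])
sal_idx = bitmap_index([sal for _, sal in records])

ages = or_range(age_idx, 45, 55, n)
sals = or_range(sal_idx, 100, 200, n)
both = [a & b for a, b in zip(ages, sals)]
matches = [records[i] for i, bit in enumerate(both) if bit]
print("".join(map(str, both)), matches)
```

Only the records whose bit survives the final AND need to be fetched from the data file.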

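Compressed Bitmaps: Run-length encoding (a run is a sequence of i 0's followed by a 1). To encode a run, let j be the number of bits in the binary representation of i (about log2 i). Write j in unary as j-1 1's and a single 0, followed by i in binary. Concatenate the codes for each run together. Examples: i=0: 00; i=1: 01; i=13: j=4, giving 1110 followed by 1101, that is 11101101.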
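The encoding rule can be sketched as two small functions. One assumption is made beyond the slide: zeros after the last 1 are not encoded, since they can be restored when the number of records is known. The sample bit-vector is made up.

```python
def encode_run(i):
    """Code for a run of i zeros followed by a 1."""
    binary = bin(i)[2:]                 # i in binary; 0 is written as the single bit '0'
    j = len(binary)                     # number of bits needed for i
    return "1" * (j - 1) + "0" + binary # j in unary, then i in binary

def encode(bits):
    """Run-length encode a bit-vector given as a string of '0' and '1'."""
    codes, run = [], 0
    for bit in bits:
        if bit == "0":
            run += 1
        else:
            codes.append(encode_run(run))
            run = 0
    return "".join(codes)               # trailing zeros are dropped (assumption)

sample = "0" * 13 + "1" + "1" + "0001"  # runs of length 13, 0 and 3
print(encode_run(0), encode_run(1), encode_run(13), encode(sample))
```

The unary prefix tells the decoder how many bits of binary follow, which is what allows the codes to be concatenated without separators.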

Encode and Decode. Encode age 25: runs (0,7) → [encoded bits missing]. Decode: [encoded bits missing] → runs [missing], 0, 3 → [bit-vector missing].

To perform bitwise AND or OR on encoded bit-vectors: Decode one run at a time. Determine where the next 1 is in each operand bit-vector. If OR, produce a 1 at that position of the output. If AND, produce a 1 if and only if both operands have their next 1 at the same position.
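A sketch of this procedure: each operand is decoded one run at a time into the positions of its 1's, and the two position streams are merged. The function names are hypothetical, positions are 1-based, and the result is returned as a list of positions rather than as a re-encoded vector. The sample codes are the encodings of 10000000, 00000001 and 01000001 under the run-length scheme described earlier.

```python
def decode_runs(code):
    """Yield the run lengths stored in an encoded bit-vector, one at a time."""
    pos = 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":         # unary part: j-1 ones
            j += 1
            pos += 1
        pos += 1                        # skip the 0 that ends the unary part
        yield int(code[pos:pos + j], 2) # the next j bits are the run length
        pos += j

def one_positions(code):
    """Yield the 1-based positions of the 1's."""
    pos = 0
    for run in decode_runs(code):
        pos += run + 1
        yield pos

def combine(a, b, op):
    ia, ib = one_positions(a), one_positions(b)
    pa, pb = next(ia, None), next(ib, None)
    out = []
    while pa is not None or pb is not None:
        if pb is None or (pa is not None and pa < pb):
            if op == "OR":
                out.append(pa)
            pa = next(ia, None)
        elif pa is None or pb < pa:
            if op == "OR":
                out.append(pb)
            pb = next(ib, None)
        else:                           # both operands have a 1 at the same position
            out.append(pa)
            pa, pb = next(ia, None), next(ib, None)
    return out

v1 = "00"          # encodes 10000000
v2 = "110111"      # encodes 00000001
v3 = "01110101"    # encodes 01000001
print(combine(v1, v2, "OR"), combine(v3, v2, "AND"))
```

Neither operand is ever expanded into a full bit-vector, so the work is proportional to the number of 1's rather than to the number of records.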

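[Worked example: the OR of the encoded bit-vector for 25 and a second encoded bit-vector. First runs: a 1 in position 1 and a 1 in position 8. Second run (7): a 1 in position 9. The operand codes and the result are missing.]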

Managing Bitmap Indexes: finding bit-vectors; finding records; handling modifications to the data file.

Finding Bit-Vectors: Use any secondary index with the field value as search key, such as a B-tree, a hash table, or an indexed-sequential file.

Finding Records: Use a secondary index on the data file, whose search key is the number of the record.

Handling Modifications to the Data File: Record numbers must remain fixed once assigned. Changes to the data file require the bitmap index to change as well.

Deleting record i: Leave a "tombstone" in the data file. In the bit-vector for the value of record i, change position i from 1 to 0.

Inserting a new record: Assign the next available record number to the new record. Modify the bit-vector for the value of the new record by appending a 1 at the end. If the value did not appear before, add a new bit-vector for it, and insert the new bit-vector and its corresponding value into the secondary index.

Modifying the value of record i from v to w: Change the bit-vector for v in position i from 1 to 0. Change the bit-vector for w in position i from 0 to 1, or create a bit-vector for w if w is a new value.
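The three maintenance operations can be sketched on an uncompressed in-memory index. The class is hypothetical, and a plain dictionary stands in for the secondary index on values. Because the vectors are uncompressed here, an insertion also appends a 0 to every other vector to keep the lengths equal, which is an assumption beyond the steps listed above. Record numbers are 1-based.

```python
class BitmapIndex:
    def __init__(self):
        self.n = 0               # number of record numbers assigned so far
        self.vectors = {}        # value -> list of bits (stands in for the secondary index)

    def insert(self, value):
        """Assign the next record number and append a 1 to the value's vector."""
        if value not in self.vectors:
            self.vectors[value] = [0] * self.n      # new value: add a new bit-vector
        for v, bits in self.vectors.items():
            bits.append(1 if v == value else 0)
        self.n += 1
        return self.n                               # the new record's number

    def delete(self, i):
        """Record i becomes a tombstone: its number is never reused."""
        for bits in self.vectors.values():
            bits[i - 1] = 0

    def modify(self, i, new_value):
        """Change record i from its old value to new_value."""
        self.delete(i)                              # old vector: 1 -> 0 in position i
        if new_value not in self.vectors:
            self.vectors[new_value] = [0] * self.n  # w is a new value
        self.vectors[new_value][i - 1] = 1          # new vector: 0 -> 1 in position i

idx = BitmapIndex()
for value in (30, 30, 40):
    idx.insert(value)
idx.delete(2)
idx.modify(1, 50)
new_number = idx.insert(40)
print(new_number, idx.vectors)
```

Keeping record numbers fixed is what makes deletion cheap: no bit in any vector has to shift position.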

Conclusion: Multidimensional Data; Grid Files; Partitioned Hash Tables; Multiple-Key Indexes; kd-Trees; Quad Trees; R-Trees; Bitmap Indexes.

Exercises Ex 4.1.2, Ex 4.2.6, Ex 4.3.1, Ex Ex 5.1.3, Ex 5.2.7, Ex 5.3.2, Ex 5.4.2