Chapter 5 Multidimensional Indexes. One dimensional index can be used to support multidimensional query. F1=‘abcd’ F2= 123‘abcd#123’


1 Chapter 5 Multidimensional Indexes

2 A one-dimensional index can be used to support a multidimensional query by combining the fields into one search key: F1='abcd' and F2=123 become 'abcd#123'.

3 Applications Needing Multiple Dimensions: Geographic Information Systems; Data Cubes.

4 Geographic Information Systems: In a GIS, data are stored in a two-dimensional space such as a map.

5 Typical Queries of GIS: Partial-match queries; Range queries; Nearest-neighbor queries; Where-am-I queries.

6 Data Cubes: Data with multiple properties can be seen as existing in a high-dimensional space. Multidimensional data is gathered by many corporations for decision-support applications.

7 An Example of a Data Cube. A chain store may record each sale made, including the day and time, the store at which the sale was made, the item purchased, the color of the item, the size of the item, and other properties. Example query: give the sales of pink shirts for each store and each month of 1998.

8 Multidimensional Queries in SQL: Multidimensional data can be stored in a conventional relational database and queried with SQL.

9 Finding the nearest points to (10.0, 20.0). Store the points in the relation Points(x, y), with x and y representing the x- and y-coordinates:
    SELECT * FROM Points p
    WHERE NOT EXISTS (
        SELECT * FROM Points q
        WHERE (q.x-10)*(q.x-10) + (q.y-20)*(q.y-20) <
              (p.x-10)*(p.x-10) + (p.y-20)*(p.y-20)
    );

10 Finding the rectangles that contain (10.0, 20.0). Rectangles(id, xll, yll, xur, yur) stores each rectangle by its lower-left and upper-right corners:
    SELECT id FROM Rectangles
    WHERE xll <= 10 AND yll <= 20 AND xur >= 10 AND yur >= 20;

11 Summarizing the sales of pink shirts. Sales(day, store, item, color, size):
    SELECT day, store, COUNT(*) AS totalSales
    FROM Sales
    WHERE item = 'shirt' AND color = 'pink'
    GROUP BY day, store;

12 Executing Range Queries Using Conventional Indexes: Given ranges in all dimensions, suppose we build a secondary B+ tree index for each dimension. Using the B+ tree for each dimension, we can get pointers to all of the records in the range for that dimension. We intersect these sets of pointers to get the final range query result.

13 The disk I/O for a range query includes the I/O needed: (1) to find the way down the B-trees; (2) to examine the leaf nodes of each B-tree; (3) to retrieve all the matching records.

14 Range query asking for the points in the square of side 100 surrounding the center of the space. Suppose a leaf node holds 200 key-pointer pairs and a block holds 100 records. We access the 100,000 pointers in either dimension. Disk I/O: 2 × (100,000/200 + 1) + the number of data blocks containing the desired points (at worst 10,000). The indexes are of little help: at worst we look at every block of the data file. [Figure: the space with side 1000 and the central query square with side 100, annotated with the counts 10,000 and 100,000]
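The cost estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted on the slide; the variable names are illustrative, and reading the "+1" as one intermediate B-tree node per index is an assumption based on slide 17.

```python
import math

pointers_per_dimension = 100_000   # pointers matching the range in one dimension
pairs_per_leaf = 200               # key-pointer pairs held by one B-tree leaf
worst_case_data_blocks = 10_000    # data blocks holding the desired points, at worst

# Leaves scanned in one B-tree, plus one I/O assumed to be an intermediate node.
io_per_index = math.ceil(pointers_per_dimension / pairs_per_leaf) + 1

# Two indexes (one per dimension), then the data blocks themselves.
index_io = 2 * io_per_index
total_io = index_io + worst_case_data_blocks

print(index_io, total_io)
```

The data blocks dominate the total, which is why the slide concludes that the two B-trees give little help here.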

15 Executing Nearest-Neighbor Queries Using Conventional Indexes: 1. Pick a range in each dimension. 2. Ask the range query. 3. Select the point closest to the target within that range.

16 Two things that could go wrong: (1) There are no points within distance d of the given point, so the entire process must be repeated with a higher value of d. (2) The distance d' from the target to the closest point found is greater than d, so the search must be repeated with d' in place of d. [Figure: the closest point in range and a possible closer point]

17 Disk I/O to find the nearest neighbor to (10.0, 20.0): Pick d = 1. Examine the B-tree for the x-coordinate with the range query 9 <= x <= 11 (that is, 10.0-d <= x <= 10.0+d). This gets about 2,000 points, so we traverse at least 10 leaves, most likely 11, plus one disk I/O for an intermediate node. Another 12 disk I/O's are needed for the y-coordinate, and one more disk I/O retrieves the desired record, for a total of 25 disk I/O's. Significantly more disk I/O's.

18 Multidimensional Index Structures: 1. Hash-table-like approaches: (1) Grid Files, (2) Partitioned Hash Functions. 2. Tree-like approaches: (1) Multiple-Key Indexes, (2) kd-Trees, (3) Quad Trees, (4) R-Trees. 3. Bitmap Indexes.

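19 Grid Index [Figure: a grid with the Key 1 values V1, V2, ..., Vn along one side and the Key 2 values X1, X2, ..., Xm along the other; each cell leads to the records with that pair of values, e.g. key1=V3, key2=X2]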

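20 Customers who bought gold jewelry, as (age, salary in $K) points: (25,60), (45,60), (50,75), (50,100), (50,120), (70,110), (85,140), (30,260), (25,400), (45,350), (50,275), (60,260). [Figure: the points plotted with Age from 0 to 100 (grid lines at 40 and 55) and Salary from 0 to 500K (grid lines at 90K and 225K)]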

21 How is a Grid Index stored on disk? Like an array: the entries for X1, X2, X3, X4 are stored for V1, then for V2, then for V3. Problem: we need regularity so that we can compute the position of an entry.

22 Solution: Use Indirection. The grid contains only pointers to buckets. [Figure: a grid over V1 to V4 and X1 to X3 whose cells point to buckets]

23 The grid file representing database of customers

24 Lookup in a Grid File: The positions of the point in each of the dimensions together determine the bucket.
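A minimal sketch of this lookup, using the grid lines and customer points from the gold-jewelry example. The in-memory `buckets` dictionary is a stand-in for disk buckets, and treating a point that lies exactly on a grid line as belonging to the higher stripe is an assumption of the sketch.

```python
from bisect import bisect_right

# Grid lines from the customer example: age at 40 and 55, salary (in $K) at 90 and 225.
AGE_LINES = [40, 55]
SALARY_LINES = [90, 225]

def cell_of(age, salary):
    """Return the (age stripe, salary stripe) indexes of the grid cell holding a point."""
    return bisect_right(AGE_LINES, age), bisect_right(SALARY_LINES, salary)

# Hypothetical in-memory buckets: one list of points per grid cell.
points = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
buckets = {}
for p in points:
    buckets.setdefault(cell_of(*p), []).append(p)

def lookup(age, salary):
    """Find a point by visiting only the bucket that its coordinates select."""
    return (age, salary) in buckets.get(cell_of(age, salary), [])

print(cell_of(50, 120), lookup(50, 120), lookup(52, 200))
```

Only one bucket is examined per lookup, which matches the single disk I/O quoted for specific-point lookups.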

25 Insertion Into Grid Files: Look up the bucket for the record and place the new record in that bucket. If there is no room, there are two general approaches: (1) Add overflow blocks to the bucket. (2) Reorganize the structure by adding or moving the grid lines.

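26 Insertion of the point (52,200) followed by splitting of buckets [Figure: the customer points plotted with Age from 0 to 100 (grid lines at 40 and 55) and Salary from 0 to 500K (grid lines at 90K, 130K and 225K), including the new point]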

27 Performance of Grid Files. Lookup of specific points: 1 disk I/O to read; 2 disk I/O's for insertion or deletion (+1 if an overflow block is created). Partial-match queries: look at all the buckets in a row or column of the bucket matrix. Range queries: look at all the buckets that cover the range defined by the query. Nearest-neighbor queries: it is not easy to put an upper bound on how costly the search is.

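28 Partitioned hash function. Idea: apply h1 to Key1 and h2 to Key2 and concatenate the resulting bits into one bucket address (the slide shows the bit strings 010110 and 1110010).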

29 h1(toy)=0, h1(sales)=1, h1(art)=1; h2(10k)=01, h2(20k)=11, h2(30k)=01, h2(40k)=00. Buckets are numbered 000, 001, 010, 011, 100, 101, 110, 111. Ex: Insert

30 Same hash values and buckets. Find Emp. with Dept. = Sales AND Sal = 40k

31 Same hash values and buckets. Find Emp. with Sal = 30k (the figure marks the buckets to look in)

32 Same hash values and buckets. Find Emp. with Dept. = Sales (the figure marks the buckets to look in)
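The three lookups on these slides can be sketched as follows. The sketch assumes the slide values read as h1(toy)=0, h1(sales)=1, h1(art)=1 and h2(10k)=01, h2(20k)=11, h2(30k)=01, h2(40k)=00, and that a bucket number is the h1 bit followed by the two h2 bits; the function names are illustrative.

```python
# Hash bits assumed from the slides; a real system would compute them with hash functions.
H1 = {"toy": "0", "sales": "1", "art": "1"}                  # 1 bit from Dept
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}    # 2 bits from Sal

def bucket(dept, sal):
    """Concatenate the bits from both attributes into one 3-bit bucket number."""
    return H1[dept] + H2[sal]

def candidate_buckets(dept=None, sal=None):
    """List every bucket that could hold a match when only some attributes are given."""
    prefixes = [H1[dept]] if dept is not None else ["0", "1"]
    suffixes = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return [p + s for p in prefixes for s in suffixes]

print(bucket("sales", "40k"))            # both attributes known: one bucket
print(candidate_buckets(sal="30k"))      # only Sal known: two buckets
print(candidate_buckets(dept="sales"))   # only Dept known: four buckets
```

A partial-match query fixes some of the address bits and must try every combination of the remaining ones, so the more bits an unspecified attribute contributes, the more buckets are visited.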

33 Comparison of Grid Files and Partitioned Hashing Grid files are good at nearest-neighbor queries or range queries. Partitioned hashing is good at partial match queries.

34 Tree-Like Structure for Multidimensional Data Multiple-key indexes Kd-trees Quad trees R-trees

35 Multi-key Index. Motivation: find records where DEPT = "Toy" AND SAL > 50k

36 Strategy I: Use one index, say on Dept. Get all Dept = "Toy" records and check their salary.

37 Strategy II: Use 2 indexes and manipulate pointers: get the pointers for Dept = "Toy" and the pointers for Sal > 50k, then intersect the two sets.

38 Strategy III: Multiple-Key Index. One idea: an index I1 on the first attribute whose entries lead to indexes I2, I3, ... on the second attribute.

39 Example Record: Name=Joe, DEPT=Sales, SAL=15k. [Figure: a Dept index with entries Art, Sales and Toy leading to Salary indexes with entries 10k, 15k, 17k, 21k and 12k, 15k, 19k]

40 Performance of Multiple-Key Indexes. Partial-match queries: quite efficient when the first attribute is specified. Range queries: work quite well. Nearest-neighbor queries: use the same strategy as the other index structures.

41 Partial-Match Queries If the first attribute is specified, the access is quite efficient. If the second attribute is specified, the access is time-consuming.

42 Range Queries: Use a range query on the first attribute to find all of the relevant subindexes. Search each of these subindexes, using the range specified for the second attribute ……

43 Nearest-Neighbor Queries: Pick a distance d. Ask the range query x0-d <= x <= x0+d and y0-d <= y <= y0+d. Find the closest point within this range. If there are no points within the range, or the distance from (x0,y0) to the closest point is greater than d, increase the range and search again.
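The repeat-until-safe loop described here can be sketched as below. The `range_query` function is a stand-in that scans a list instead of using an index, and doubling d when the range is empty is an assumption; the slides only say to use a higher value.

```python
import math

def range_query(points, x0, y0, d):
    """Stand-in for an index range query: points with |x-x0| <= d and |y-y0| <= d."""
    return [(x, y) for (x, y) in points if abs(x - x0) <= d and abs(y - y0) <= d]

def nearest_neighbor(points, x0, y0, d=1.0):
    """Grow the search square until the closest point found lies within distance d."""
    if not points:
        return None
    while True:
        candidates = range_query(points, x0, y0, d)
        if candidates:
            best = min(candidates, key=lambda p: math.dist(p, (x0, y0)))
            d_best = math.dist(best, (x0, y0))
            if d_best <= d:
                return best      # no point outside the square can be closer
            d = d_best           # a closer point may sit outside the square: retry with d'
        else:
            d *= 2               # empty range: retry with a larger d (doubling is assumed)

print(nearest_neighbor([(10.9, 20.9), (10, 21.2)], 10.0, 20.0))
```

In the usage line, the first square (d = 1) contains only (10.9, 20.9), which is about 1.27 away, so the search is repeated with that distance and then finds the closer point (10, 21.2).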

44 kd-Trees (k-dimensional search trees): A generalization of the binary search tree to multidimensional data. Interior nodes have an associated attribute A and a dividing value V. The attributes rotate at different levels of the tree. Leaves are blocks holding data records.

45 A kd-tree example [Figure: a kd-tree with interior nodes Salary 150, Age 60, Age 47, Salary 80, Age 38 and Salary 300, and leaves holding the points (25,60), (45,60), (50,75), (50,100), (50,120), (70,110), (85,140), (30,260), (25,400), (45,350), (50,275), (60,260)]

46 Tree after insertion of (35,500) [Figure: the example kd-tree, with interior nodes Salary 150, Age 60, Age 47, Salary 80, Age 38 and Salary 300, after the insertion]

47 Complex Queries on kd-trees. Partial-match queries: ask for all points with age = 50. Range queries: ask for all points with ages 35 to 55 and salaries $100K to $200K. Nearest-neighbor queries: use the same approach as discussed before.

48 Partial-Match Queries ( ask for all points with age = 50) Explore both ways at the level with the unknown attribute. Go one way at the level with the specified attribute.

49 Range Queries ( ask for all points with ages 35 to 55 and salaries $100K to $200K) If the range straddles the splitting value, explore the two children Otherwise, move to only one child.
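The two traversal rules (explore both children when the range straddles the splitting value, otherwise only one) can be sketched on the customer points. This is a hedged illustration: `build` chooses its own median splits and a block size of 2, so the resulting tree is not the one drawn on the slides.

```python
POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build(points, depth=0, block_size=2):
    """Build a kd-tree: leaves are lists of points, interior nodes are (axis, value, left, right)."""
    if len(points) <= block_size:
        return points
    axis = depth % 2                                  # rotate age (0) and salary (1) by level
    value = sorted(p[axis] for p in points)[len(points) // 2]
    left = [p for p in points if p[axis] < value]
    right = [p for p in points if p[axis] >= value]
    if not left:                                      # nothing falls below the value: keep one block
        return points
    return (axis, value,
            build(left, depth + 1, block_size),
            build(right, depth + 1, block_size))

def range_query(node, lo, hi):
    """Return the points p with lo[i] <= p[i] <= hi[i] on both attributes."""
    if isinstance(node, list):                        # leaf block: check each record
        return [p for p in node if all(lo[i] <= p[i] <= hi[i] for i in range(2))]
    axis, value, left, right = node
    found = []
    if lo[axis] < value:                              # range reaches below the splitting value
        found += range_query(left, lo, hi)
    if hi[axis] >= value:                             # range reaches the splitting value or above
        found += range_query(right, lo, hi)
    return found

tree = build(POINTS)
INF = float("inf")
in_range = range_query(tree, (35, 100), (55, 200))    # ages 35-55, salaries 100-200
age_50 = range_query(tree, (50, -INF), (50, INF))     # partial match: age = 50, any salary
print(sorted(in_range), sorted(age_50))
```

The partial-match query is expressed as a range query whose unspecified attribute is unbounded, so both children are explored at salary levels and only one at age levels.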

50 Nearest-Neighbor Queries Treat them as range queries Repeat with a larger range if necessary

51 Problems: (1) long paths: log2 n for a kd-tree with n leaves; (2) unused space: interior nodes hold little information. Two approaches to improve: Multiway Branches at Interior Nodes; Group Interior Nodes Into Blocks.

52 Multiway Branches at Interior Nodes Interior nodes with many key-pointer pairs Keeping distribution and balance as we do for B-tree

53 Group Interior Nodes Into Blocks Packing many interior nodes into a single block. Including in one block a node and its descendants for some number of levels

54 Quad Trees Data points are contained in a square region. If data points in a square can fit in a block, the square will be a leaf of the tree. Otherwise, the square will be an interior node, with children corresponding to its four quadrants.
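The splitting rule can be sketched as a recursive build. The sample points, the block capacity of 2, the quadrant names and the depth limit (which stops endless splitting on duplicate points) are all assumptions made for the illustration.

```python
def quadrant(x, y, x0, y0, half):
    """Name the quadrant of the square with lower-left corner (x0, y0) that holds (x, y)."""
    return ("N" if y >= y0 + half else "S") + ("E" if x >= x0 + half else "W")

def build_quad(points, x0, y0, size, capacity=2, max_depth=16):
    """Return a leaf (list of points) or an interior node (dict of four quadrants)."""
    if len(points) <= capacity or max_depth == 0:
        return list(points)                      # the points fit in one block: leaf
    half = size / 2
    groups = {"SW": [], "SE": [], "NW": [], "NE": []}
    for (x, y) in points:
        groups[quadrant(x, y, x0, y0, half)].append((x, y))
    corners = {"SW": (x0, y0), "SE": (x0 + half, y0),
               "NW": (x0, y0 + half), "NE": (x0 + half, y0 + half)}
    return {name: build_quad(groups[name], corners[name][0], corners[name][1],
                             half, capacity, max_depth - 1)
            for name in groups}

def find(node, x, y, x0, y0, size):
    """Descend to the leaf whose square contains (x, y) and report whether it is stored."""
    while isinstance(node, dict):
        half = size / 2
        name = quadrant(x, y, x0, y0, half)
        if "E" in name:
            x0 += half
        if "N" in name:
            y0 += half
        node, size = node[name], half
    return (x, y) in node

sample = [(10, 10), (20, 80), (60, 60), (70, 70), (90, 90), (80, 20)]
tree = build_quad(sample, 0, 0, 100)
print(tree["NE"]["SW"], find(tree, 70, 70, 0, 0, 100))
```

Only the crowded north-east quadrant is split a second time; the other three quadrants already fit in a block and stay as leaves.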

55 Data organized in a quad tree [Figure: points plotted with Age (0 to 100) and Salary axes]

56 A quad tree

57 R-Trees ( Region Tree ) The R-tree node represents a data region which has subregions as its children. The data region can be of any shape. The subregions do not cover the entire region. The subregions are allowed to overlap.

58 The region of an R-tree node and subregions of its children

59 "Where-am-I" Query: Start at the root. Examine the subregions at the root to see whether they contain point P. If there are zero such regions, P is not in any data region. If at least one interior region contains P, recursively search for P in each such region until reaching the leaves.
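A sketch of this search, assuming rectangular regions stored as (xlo, ylo, xhi, yhi) and nodes stored as dictionaries. The region names and coordinates are hypothetical and are not taken from the slides.

```python
def contains(rect, p):
    """True if point p lies inside the rectangle (xlo, ylo, xhi, yhi), borders included."""
    xlo, ylo, xhi, yhi = rect
    return xlo <= p[0] <= xhi and ylo <= p[1] <= yhi

def where_am_i(node, p):
    """Return the names of all data regions that contain point p."""
    hits = []
    for rect, child in node["entries"]:
        if not contains(rect, p):
            continue
        if node["leaf"]:
            hits.append(child)             # at a leaf, child is the name of a data region
        else:
            hits += where_am_i(child, p)   # subregions may overlap, so follow every match
    return hits

# Hypothetical two-level tree: each root entry bounds the regions stored in its leaf.
leaf1 = {"leaf": True, "entries": [((0, 0, 40, 40), "school"), ((30, 30, 60, 50), "park")]}
leaf2 = {"leaf": True, "entries": [((70, 10, 90, 30), "house")]}
root = {"leaf": False, "entries": [((0, 0, 60, 50), leaf1), ((70, 10, 90, 30), leaf2)]}

print(where_am_i(root, (35, 35)), where_am_i(root, (65, 5)))
```

Because subregions do not cover the whole region, a point such as (65, 5) falls in no subregion at the root and the search stops immediately.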

60 Insert a new region Suppose that leaves have room for six regions.

61

62 Expand a region: Expanding the lower subregion increases its area by 1000 units; expanding the upper subregion increases its area by 1200 units.

63 Bitmap Indexes: 1. A bitmap index for a field F is a collection of bit-vectors of length n (n: number of records). 2. One bit-vector corresponds to each possible value that may appear in the field F. 3. The vector for value v has 1 in position i if the ith record has v in field F, and 0 there if not.

64 An Example of a Bitmap Index: Suppose a file has six records with two fields f and g: (30,foo), (30,bar), (40,baz), (50,foo), (40,bar), (30,baz).
    f:  30: 110001    40: 001010    50: 000100
    g:  foo: 100100   bar: 010010   baz: 001001

65 Partial-match queries by bitmap indexes. Movie(title, year, length, studioName):
    SELECT title FROM Movie
    WHERE studioName = 'Disney' AND year = 1995;
Answer with the bitwise AND of the bit-vector for year = 1995 and the bit-vector for studioName = 'Disney'.

66 Range queries by bitmap indexes. Records: 1:(25,60) 2:(45,60) 3:(50,75) 4:(50,100) 5:(50,120) 6:(70,140) 7:(85,140) 8:(45,350). Find all records with an age in the range 45-55 and a salary in the range 100-200, using bitmap indexes as follows.
    Age:    25: 10000000   45: 01000001   50: 00111000   70: 00000100   85: 00000010
    Salary: 60: 11000000   75: 00100000   100: 00010000   120: 00001000   140: 00000110   350: 00000001
    Ages 45 and 50:             01000001 OR 00111000 = 01111001
    Salaries 100, 120 and 140:  00010000 OR 00001000 OR 00000110 = 00011110
    Both conditions:            01111001 AND 00011110 = 00011000 (records 4 and 5)
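The same computation can be reproduced in a short sketch that builds the bit-vectors from the eight records and combines them with OR and AND. Storing bit-vectors as strings and converting them to integers for the bitwise operations is a choice made for readability, not something the slides prescribe.

```python
RECORDS = [(25, 60), (45, 60), (50, 75), (50, 100),
           (50, 120), (70, 140), (85, 140), (45, 350)]   # (age, salary) as on the slide

def build_bitmap_index(records, field):
    """Map each distinct value of one field to a bit-vector written as a string of 0s and 1s."""
    index = {}
    for i, rec in enumerate(records):
        bits = index.setdefault(rec[field], ["0"] * len(records))
        bits[i] = "1"
    return {value: "".join(bits) for value, bits in index.items()}

def or_range(index, low, high):
    """OR together the bit-vectors of every value in [low, high]; the result is an integer."""
    result = 0
    for value, bits in index.items():
        if low <= value <= high:
            result |= int(bits, 2)
    return result

age_index = build_bitmap_index(RECORDS, 0)
salary_index = build_bitmap_index(RECORDS, 1)

answer = or_range(age_index, 45, 55) & or_range(salary_index, 100, 200)
answer_bits = format(answer, f"0{len(RECORDS)}b")
matching = [i + 1 for i, b in enumerate(answer_bits) if b == "1"]   # 1-based record numbers
print(answer_bits, matching)
```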

67 Compressed Bitmaps: Run-length encoding (run: a sequence of i 0's followed by a 1). Let j be the number of bits in the binary representation of i (about log2 i). Represent j by j-1 1's and a single 0, followed by i in binary. Concatenate the codes for each run together. Examples: i=0: 00; i=1: 01; i=13: 1110 1101.

68 Encode and Decode. Encode the bit-vector for age 25: 100000001000 has runs (0,7), giving 00 110111. Decode 11101101001011: the runs are 13, 0, 3, giving 0000000000000110001.
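Both directions of the run-length scheme can be sketched in a few lines; the function names are illustrative. As in the slide's example, 0's after the last 1 are not encoded, so a decoded vector may be shorter than the original.

```python
def encode(bits):
    """Run-length encode a bit-vector; 0s after the last 1 are dropped."""
    out, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:
            binary = format(run, "b")                           # i in binary, j bits long
            out.append("1" * (len(binary) - 1) + "0" + binary)  # j in unary, then i
            run = 0
    return "".join(out)

def decode_runs(code):
    """Recover the run lengths i from an encoded string."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":              # count the unary prefix: j-1 ones
            j += 1
            pos += 1
        pos += 1                             # skip the single 0 that ends the prefix
        runs.append(int(code[pos:pos + j], 2))
        pos += j
    return runs

def decode(code):
    """Rebuild the bit-vector up to and including its last 1."""
    return "".join("0" * i + "1" for i in decode_runs(code))

print(encode("100000001000"))
print(decode_runs("11101101001011"), decode("11101101001011"))
```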

69 To perform bitwise AND or OR on encoded bit-vectors: Decode one run at a time. Determine where the next 1 is in each operand bit-vector. If OR, produce a 1 at that position of the output; if AND, produce a 1 if and only if both operands have their next 1 at the same position.

70 OR of the encoded bit-vectors 25: 00110111 and 30: 110111. First run: 0 for 25 (1 in position 1) and 7 for 30 (1 in position 8). Second run of 25: 7 (1 in position 9). Result: 100000011

71 Managing Bitmap Indexes Finding Bit-Vectors Finding Records Handling Modifications to the Data File

72 Finding Bit-Vectors Use any secondary index with the field value as search key, such as B-tree, hash table or indexed-sequential files.

73 Finding Records Use a secondary index on the data file, whose search key is the number of the record.

74 Handling Modifications to the Data file Record numbers must remain fixed once assigned Changes to the data file require the bitmap index to change as well

75 Deleting Record i: Leave a "tombstone" in the data file. Change position i of the bit-vector for the deleted record's value from 1 to 0.

76 Inserting a New Record: Assign the next available record number to the new record. Modify the bit-vector for the value of the new record by appending a 1 at the end. If the value did not appear before, add a new bit-vector for it, and insert the new bit-vector and its corresponding value into the secondary index.

77 Modifying the value of record i from v to w: Change the bit-vector for v in position i from 1 to 0. Change the bit-vector for w in position i from 0 to 1, or create a bit-vector for w if w is a new value.

78 Conclusion: Multidimensional Data; Grid Files; Partitioned Hash Tables; Multiple-Key Indexes; kd-Trees; Quad Trees; R-Trees; Bitmap Indexes.

79 Exercises Ex 4.1.2, Ex 4.2.6, Ex 4.3.1, Ex 4.4.6 Ex 5.1.3, Ex 5.2.7, Ex 5.3.2, Ex 5.4.2

