Chapter 5. Multidimensional Indexes

Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ. Chapter 5

One Dimensional Indexes
One search key
A composite search key (F1, F2, …, Fk) is still one-dimensional: it imposes a single sort order on the records

Multidimensional Index Schemes
Here, we talk about:
  Grid files
  Partitioned hash functions
  Multiple-key indexes
  kd-trees
  Quad trees
  R-trees
  Bitmap indexes

Applications of Multidimensional Indexes (1)
Geographic Information Systems
  2-dimensional space
  Objects: points, shapes, and so on
Query types
  Partial-match queries (all points with specified values in a subset of the dimensions)
  Range queries (all points within a range in each dimension)
  Nearest-neighbor queries (closest points to a given point)
  Where-am-I queries (regions containing a given point)

Some objects in 2-dimensional space
[Figure: a 2-dimensional map, with axes running to 100, showing road1, road2, house1, house2, pipeline, and school]

Applications of Multidimensional Indexes (2)
Data cubes (data warehouses, decision-support systems)
Example: a chain store may record each sale made, including
  The day and time
  The store at which the sale was made
  The item purchased
  The color of the item
  The size of the item
Queries typically group the data along some of the dimensions and summarize the groups by an aggregation

Multidimensional Queries in SQL (1)
Points in two-dimensional space: Points(x, y)
Find the nearest points to the point (10.0, 20.0), i.e., the points p for which no point q is strictly closer
e.g.,
  SELECT *
  FROM Points p
  WHERE NOT EXISTS (
    SELECT *
    FROM Points q
    WHERE (q.x - 10.0)*(q.x - 10.0) + (q.y - 20.0)*(q.y - 20.0)
        < (p.x - 10.0)*(p.x - 10.0) + (p.y - 20.0)*(p.y - 20.0)
  );

Rectangles in two-dimensional space: Rectangles(id, xll, yll, xur, yur)
  (xll, yll) is the lower-left corner and (xur, yur) the upper-right corner
Find the rectangles containing the point (10.0, 20.0)
e.g.,
  SELECT id
  FROM Rectangles
  WHERE xll <= 10.0 AND yll <= 20.0
    AND xur >= 10.0 AND yur >= 20.0;

Schema: Sales(day, store, item, color, size)
Query: summarize the sales of pink shirts by day and store
e.g.,
  SELECT day, store, COUNT(*) AS totalSales
  FROM Sales
  WHERE item = 'shirt' AND color = 'pink'
  GROUP BY day, store;

Range Queries with B-tree (1)
Given
  Ranges in 2 dimensions
  A secondary index (B-tree) on each of the dimensions, x and y
Steps
  Use the B-tree for x to get pointers to the records in the x-range (count the I/Os)
  Use the B-tree for y to get pointers to the records in the y-range (count the I/Os)
  Intersect the two sets of pointers (total the I/Os)

Example 5.5
Assumptions
  1,000,000 points
  x- and y-coordinates range over 0 - 1000
  100 point records fit on a block
  About 200 key-pointer pairs fit in a B-tree leaf
Question
  Imagine we want the points in the square of side 100 surrounding the center of the space (450 ≤ x ≤ 550, 450 ≤ y ≤ 550)
  Do indexes help?

With indexes
  Using the B-tree for x: 100,000 points
    The roots of the B-trees are already kept in memory
    One intermediate-level node (1 I/O)
    All the leaves that contain the desired pointers (100,000 / 200 = 500 I/Os)
  Using the B-tree for y: 100,000 points
    Same as above
  Total disk I/Os for the two indexes: 1002
Without indexes
  10,000 (= 1,000,000 / 100) I/Os are required to scan the whole file
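The arithmetic of Example 5.5 can be checked with a few lines of Python. This is only a sketch of the counting argument: it assumes the points are uniformly distributed, so that a range covering one tenth of a dimension selects one tenth of the points, and it assumes the pointers fill whole leaves. The constant and function names are made up for illustration.

```python
import math

POINTS = 1_000_000
RECORDS_PER_BLOCK = 100
PAIRS_PER_LEAF = 200
SPACE_SIDE = 1000   # coordinates range over 0 - 1000
RANGE_SIDE = 100    # the query square has side 100


def btree_range_ios(matching_pointers):
    """I/Os for one B-tree range scan: 1 intermediate node plus the leaves.

    The root is assumed to be in memory already, as on the slide.
    """
    return 1 + math.ceil(matching_pointers / PAIRS_PER_LEAF)


# Assumption: uniform data, so 100/1000 of the points match in each dimension.
matching_per_dimension = POINTS * RANGE_SIDE // SPACE_SIDE   # 100,000

index_ios = 2 * btree_range_ios(matching_per_dimension)      # x index + y index
scan_ios = POINTS // RECORDS_PER_BLOCK                       # read every block

print(index_ios, scan_ios)
```

Note that the index figure counts only the cost of collecting pointers; fetching the records they point to costs additional I/Os.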

Nearest-Neighbor Queries with Conventional Indexes
Possible steps
  Determine a distance d
  Find all points inside the rectangle that extends d in each direction from the given point
  Find the one nearest to the given point
What could go wrong
  There is no point within the selected rectangle
  The closest point within the rectangle might not be the closest point overall
[Figure: the query point, the distance d, the closest point in range, and a possible closer point outside the range]
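One way to handle both failure cases is to repeat the range search with a larger d. The sketch below is an illustration, not the method from the lecture: the range search is a plain list scan standing in for the two B-tree lookups, and the policy of doubling d when the range is empty is an assumption.

```python
import math


def in_square(points, center, d):
    """Stand-in for the index lookups: points within d of center in x and in y."""
    cx, cy = center
    return [(x, y) for x, y in points if abs(x - cx) <= d and abs(y - cy) <= d]


def nearest_neighbor(points, center, d):
    if not points or d <= 0:
        raise ValueError("need at least one point and a positive d")
    while True:
        candidates = in_square(points, center, d)
        if not candidates:
            d *= 2          # problem 1: empty range, so widen it (assumed policy)
            continue
        best = min(candidates, key=lambda p: math.dist(p, center))
        best_dist = math.dist(best, center)
        if best_dist <= d:
            return best     # nothing outside the square can be closer than d
        d = best_dist       # problem 2: a closer point may lie outside; re-search
```

The second search always terminates, because the previous candidate lies inside the enlarged square and is no farther away than the new d.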

Example 5.6 (1)
Assumptions
  Same data and indexes as in Example 5.5
  We want the nearest neighbor to point P = (10, 20)
  We pick d = 1
  There is at least one point in the range

Example 5.6 (2)
Total I/Os
  Using the B-tree for x: 2,000 points (9 ≤ x ≤ 11)
    One intermediate-level node (1 I/O)
    All the leaves that contain the desired pointers (2,000 / 200 = 10 I/Os, or 11 if the range straddles a leaf boundary)
  Using the B-tree for y: 2,000 points (19 ≤ y ≤ 21)
  I/Os for the x-coordinate: about 12
  I/Os for the y-coordinate: about 12
  One more disk I/O to retrieve the desired record (1 I/O)
  Total I/Os are about 25

Multidimensional Index Structures (1)
Hash-table-like approaches
  Grid files
  Partitioned hash functions
Tree-like approaches
  Multiple-key indexes
  kd-trees, R-trees, quad trees

Multidimensional Index Structures (2)
Hash-table-like approaches lose:
  The guarantee that the answer to a query is always in one bucket
Tree-like approaches lose at least one of:
  The guaranteed balance of the tree
  The correspondence between tree nodes and disk blocks
  Modification speed

Grid Files (1)
In each dimension, grid lines partition the space into stripes
Considerations
  The number of lines in each dimension
  Points falling on a grid line

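Grid Files (2)
[Figure: a grid with the key 2 values X1, X2, …, Xn along one axis and the key 1 values V1, V2, …, Vn along the other; the cell for (V3, X2) points to the records with key1 = V3, key2 = X2]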

Grid Files (3)
Can quickly find records with
  key 1 = Vi ∧ key 2 = Xj
  key 1 = Vi
  key 2 = Xj
And also ranges, e.g., key 1 ≥ Vi ∧ key 2 < Xj

Grid Files: Example
Who buys gold jewelry?
Attributes: age and salary (in $K)
Tuples: twelve customers
  (25, 60) (45, 60) (50, 75) (50, 100) (50, 120) (70, 110)
  (85, 140) (30, 260) (25, 400) (45, 350) (50, 275) (60, 260)
[Figure: the twelve points plotted by age (0 to 100) and salary (0 to 500K), with grid lines at ages 40 and 55 and at salaries 90K and 225K]

Grid Files: Use Indirection
[Figure: a grid array indexed by V1, V2, … and X1, X2, X3; the grid contains only pointers to buckets, and the records are stored in the buckets]

Use Indirection
The grid can be regular without wasting space
We do pay the price of the indirection

Use Indirection: Example
Salary \ Age   0-40                 40-55                55+
225+           (30, 260) (25, 400)  (45, 350) (50, 275)  (60, 260)
90-225         (empty)              (50, 100) (50, 120)  (70, 110) (85, 140)
0-90           (25, 60)             (45, 60) (50, 75)    (empty)
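The lookup path through such a grid can be sketched in Python as follows. This is a minimal in-memory illustration, not a disk-based grid file: each grid cell gets its own bucket (no bucket sharing), a point that falls exactly on a grid line is assigned to the upper stripe (an assumed convention), and all names are hypothetical. The grid lines and data are those of the gold-jewelry example.

```python
from bisect import bisect_right
from collections import defaultdict

AGE_LINES = [40, 55]        # grid lines in the age dimension
SALARY_LINES = [90, 225]    # grid lines in the salary dimension (in $K)

CUSTOMERS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]


def cell(age, salary):
    """Map a point to its (age stripe, salary stripe) using the linear scales."""
    return bisect_right(AGE_LINES, age), bisect_right(SALARY_LINES, salary)


# The "grid" maps each cell to a bucket; here a bucket is just a list.
grid = defaultdict(list)
for point in CUSTOMERS:
    grid[cell(*point)].append(point)


def lookup(age, salary):
    """Point lookup: examine exactly one bucket."""
    return (age, salary) in grid.get(cell(age, salary), [])


def range_query(age_lo, age_hi, sal_lo, sal_hi):
    """Range query: examine every bucket whose cell overlaps the range."""
    (i_lo, j_lo), (i_hi, j_hi) = cell(age_lo, sal_lo), cell(age_hi, sal_hi)
    hits = []
    for i in range(i_lo, i_hi + 1):
        for j in range(j_lo, j_hi + 1):
            hits += [(a, s) for a, s in grid.get((i, j), [])
                     if age_lo <= a <= age_hi and sal_lo <= s <= sal_hi]
    return sorted(hits)
```

For instance, `range_query(45, 55, 100, 200)` touches only the central cell and returns the two customers aged 50 with salaries 100 and 120.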

Insertion into Grid Files
If there is room in the bucket, no problem (just insert the newcomer)
When there is no room in the bucket, either
  Add overflow blocks to the bucket, or
  Reorganize the structure by adding or moving grid lines
Reorganization is not simple to implement!

Insertion Example (1)
Suppose we insert a customer who is 52 years old with an income of $200K
There are three ways to split the central bucket (see next slide)
  A vertical line (such as age = 51)
  A horizontal line that separates the point with salary = 200 from the other two points (say salary = 130)
  A horizontal line that separates the point with salary = 100 from the other two points (say salary = 115)
Deciding which one is best is not simple!

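Insertion Example (2)
[Figure: the gold-jewelry grid after the split, with grid lines at ages 40 and 55 and at salaries 90K, 130K, and 225K; the axes run to age 100 and salary 500K]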

Indexing Grid on Value Ranges
With many stripes, we need to create an index for each dimension
Linear scale (department): 1 = Toy, 2 = Sales, 3 = Personnel
Linear scale (salary): 1 = 0-20K, 2 = 20K-50K, 3 = 50K-∞

Performance of Grid Files (1)
If the data is well distributed and the data file is not too large, then we may assume:
  Few buckets, so the bucket matrix fits in memory
  Indexes on grid lines fit in memory
  Only a few overflow blocks
Lookup of specific points
  One disk I/O to read the bucket
Insertion and deletion
  Need one more disk I/O to write (plus overflow handling)
Partial-match queries
  Need to look at all the buckets in a row or column of the bucket matrix

Performance of Grid Files (2)
Range queries
  All buckets on the border and in the interior of the range need a disk I/O
Nearest-neighbor queries
  Given a point P
  Search the bucket to which P belongs
  Find a candidate point Q
  Search the adjacent buckets that could hold a point closer than Q

Summary of Grid Files
Good for multiple-key search
Nothing is free: space and management overhead
Need partitioning ranges that evenly split the keys

Partitioned Hash Functions (1)
A hash function h that produces k-bit values is a list of hash functions (h1, h2, …, hn)
hi applies to a value for the ith attribute and produces a sequence of ki bits
k1 + k2 + … + kn = k
[Figure: h1 applied to Key1 and h2 applied to Key2; the two results are concatenated to form the bucket number]

Example (1)
h1(toy) = 0, h1(sales) = 1, h1(art) = 1
h2 is defined on the salaries 10k, 20k, 30k, and 40k
Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>
[Figure: a hash table with buckets 000, 001, 010, … showing where <Fred>, <Joe>, and <Sally> are placed]

Example (2)
h1(toy) = 0, h1(sales) = 1, h1(art) = 1
h2 is defined on the salaries 10k, 20k, 30k, and 40k
Find employees with Dept = Sales ∧ Sal = 40k
[Figure: a hash table with buckets 000, 001, 010, 011, … holding <Fred>, <Joe><Jan>, <Mary>, <Sally>, <Tom><Bill>, and <Andy>]

Example (3)
h1(toy) = 0, h1(sales) = 1, h1(art) = 1
h2 is defined on the salaries 10k, 20k, 30k, and 40k
Find employees with Sal = 30k
[Figure: the same hash table, holding <Fred>, <Joe><Jan>, <Mary>, <Sally>, <Tom><Bill>, and <Andy>, with the buckets to examine marked "Look here"]

“gold jewelry” Example
[Figure: a partitioned hash table with buckets 000 through 111 holding the twelve gold-jewelry points: (30, 260), (50, 120), (50, 100), (60, 260), (70, 110), (50, 75), (50, 275), (25, 60), (45, 60), (25, 400), (85, 140), (45, 350)]
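A partitioned hash table over the gold-jewelry data can be sketched as below. The two hash functions are assumptions chosen for illustration (age mod 2 for one bit, salary mod 4 for two bits); the slide does not state which functions it uses. The function names are hypothetical.

```python
from collections import defaultdict

CUSTOMERS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]


def h_age(age):
    """Assumed hash for age: 1 bit (age mod 2)."""
    return format(age % 2, "01b")


def h_salary(salary):
    """Assumed hash for salary: 2 bits (salary mod 4)."""
    return format(salary % 4, "02b")


def bucket(age, salary):
    """Bucket number = bits from h_age followed by bits from h_salary."""
    return h_age(age) + h_salary(salary)


buckets = defaultdict(list)
for age, salary in CUSTOMERS:
    buckets[bucket(age, salary)].append((age, salary))


def partial_match_age(age):
    """Only age is known: fix its bit and try every value of the salary bits."""
    prefix = h_age(age)
    hits = []
    for suffix in ("00", "01", "10", "11"):
        hits += [p for p in buckets.get(prefix + suffix, []) if p[0] == age]
    return hits
```

With these assumed functions a partial-match query on age examines four of the eight buckets, while a range query gains nothing, because nearby salaries land in unrelated buckets.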

Partitioned Hash Functions
Does not work well for nearest-neighbor or range queries (since the physical distance between points is not reflected in the bucket numbers)
Good for partial-match queries
A well-chosen hash function randomizes the buckets, so buckets tend to be equally occupied

Grid Files vs. Partitioned Hashing
Empty buckets
  A well-chosen hash function keeps their number small
  In a grid file, the more dimensions, the more empty buckets
Partial-match queries
  Partitioned hashing is better
Range queries and nearest-neighbor queries
  Grid files are better

Tree-Like Structures for Multidimensional Data
Usefulness
  Range queries and nearest-neighbor queries
Types
  Multiple-key indexes
  kd-trees
  Quad trees
  R-trees
The first three are intended for sets of points, while the last is used for sets of regions

Multi-key Index
A multi-key index is an index of indexes, that is, a tree in which the nodes at each level are indexes for one attribute

Multi-key Index: Strategy I
Consider: find records where DEPT = "Toy" AND SAL > 50K
Use one index, say I1 on Dept
Get all Dept = "Toy" records and check their salary

Multi-key Index: Strategy II
Use two indexes and manipulate pointers: intersect the pointers for Dept = "Toy" with the pointers for Sal > 50K

Multi-key Index: Strategy III
Multiple-key index
One idea: [Figure: an index I1 on the first attribute whose entries point to indexes I2, I3, … on the second attribute]

Multi-key Index: Strategy III
Example
[Figure: a Dept index with entries Art, Sales, and Toy, each pointing to a salary index; the salary values shown include 10k, 12k, 15k, 17k, 19k, and 21k]
Example record: Name=Joe, DEPT=Sales, SAL=15k

Example 5.13 (gold jewelry)
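[Figure: a multiple-key index for the gold-jewelry data; the root index on age has the keys 25, 30, 45, 50, 60, 70, and 85, and each key points to an index on the salaries for that age, e.g., age 50 points to 75, 100, 120, and 275]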

Performance of Multiple-key Indexes
Partial-match queries
  If the first attribute is specified, access is efficient
  Otherwise, we must search every subindex, a potentially time-consuming process
Range queries
  Work well if the individual indexes support range queries on their attributes
Nearest-neighbor queries
  Involve the same steps as before: pick a distance d, run a range query, and widen the range if necessary
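The behavior described above can be seen in a small in-memory sketch. It assumes a two-level index over the gold-jewelry data with age as the first attribute and salary as the second; sorted Python lists with binary search stand in for the per-level indexes, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

CUSTOMERS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

# Root index: age -> sorted list of salaries (the subindex for that age).
index = {}
for age, salary in CUSTOMERS:
    index.setdefault(age, []).append(salary)
for salaries in index.values():
    salaries.sort()
ages = sorted(index)   # the sorted keys of the root index


def range_query(age_lo, age_hi, sal_lo, sal_hi):
    """Range on both attributes: a range scan in the root, then in each subindex."""
    hits = []
    for age in ages[bisect_left(ages, age_lo):bisect_right(ages, age_hi)]:
        sub = index[age]
        for s in sub[bisect_left(sub, sal_lo):bisect_right(sub, sal_hi)]:
            hits.append((age, s))
    return hits


def match_salary_only(salary):
    """First attribute unspecified: every subindex has to be searched."""
    hits = []
    for age in ages:
        sub = index[age]
        i = bisect_left(sub, salary)
        if i < len(sub) and sub[i] == salary:
            hits.append((age, salary))
    return hits
```

`match_salary_only` visits all seven subindexes, which is the potentially time-consuming case named above.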

kd-Trees (1)
Classical kd-tree
  Main-memory data structure
  Generalizes the binary search tree to multidimensional data
  Interior nodes have an associated attribute a and a value V
Two modifications to fit a block model
  Interior nodes hold only an attribute and its dividing value, not records
  Leaves are blocks

kd-Trees (2): Example 5.15
Given the "gold-jewelry" example
  Blocks hold only two records
  The interior nodes are ovals with an attribute, either age or salary, and a value
  The root splits by salary
  At the second level, the split is by age

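kd-Trees (3)
Salary 150
  salary < 150: Age 60
    age < 60: Salary 80
      salary < 80: Age 38
        age < 38: (25, 60)
        age ≥ 38: (45, 60), (50, 75)
      salary ≥ 80: (50, 100), (50, 120)
    age ≥ 60: (70, 110), (85, 140)
  salary ≥ 150: Age 47
    age < 47: Salary 300
      salary < 300: (30, 260)
      salary ≥ 300: (25, 400), (45, 350)
    age ≥ 47: (50, 275), (60, 260)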
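A range query on this kd-tree can be sketched as follows. The tree is written out by hand as nested tuples from the node labels and leaf contents on the slide; the convention that values below the split go to the first child and all others to the second is an assumption, and the tuple layout and function name are hypothetical.

```python
AGE, SALARY = 0, 1

# Interior node: (attribute, split value, low child, high child).
# Leaf: a list of (age, salary) records, i.e., one block of at most two records.
TREE = (SALARY, 150,
        (AGE, 60,
         (SALARY, 80,
          (AGE, 38, [(25, 60)], [(45, 60), (50, 75)]),
          [(50, 100), (50, 120)]),
         [(70, 110), (85, 140)]),
        (AGE, 47,
         (SALARY, 300, [(30, 260)], [(25, 400), (45, 350)]),
         [(50, 275), (60, 260)]))


def range_query(node, lo, hi):
    """Return the records p with lo[i] <= p[i] <= hi[i] in both dimensions."""
    if isinstance(node, list):                      # leaf block: filter records
        return [p for p in node
                if all(lo[i] <= p[i] <= hi[i] for i in (AGE, SALARY))]
    attr, split, low, high = node
    hits = []
    if lo[attr] < split:                            # range reaches below the split
        hits += range_query(low, lo, hi)
    if hi[attr] >= split:                           # range reaches the split or above
        hits += range_query(high, lo, hi)
    return hits
```

A query such as `range_query(TREE, (45, 100), (55, 200))` descends into both children of the root, since the salary range straddles 150, but prunes most subtrees below that.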

Quad Trees
Interior node
  Corresponds to a square region
Leaf
  The # of points in its square is no larger than what will fit in a block
Children
  In k dimensions, an interior node has 2^k children
  We are constrained to split at the center of a quad-tree region, which may or may not divide the points in that region evenly

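Quad Tree Example (1)
Gold-jewelry data
Only two records can fit in a block
The NW and SE quadrants have more than 2 points, so both are split into subquadrants
[Figure: the twelve points in the square with age from 0 to 100 and salary from 0 to 400K, divided into quadrants, with the NW and SE quadrants divided again]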

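Quad Tree Example (2)
Root, center (50, 200)
  SW: (25, 60), (45, 60)
  NE: (50, 275), (60, 260)
  SE: interior node, center (75, 100)
    SW: (50, 75), (50, 100)
    NE: (85, 140)
    NW: (50, 120), (70, 110)
  NW: interior node, center (25, 300)
    SE: (30, 260)
    NE: (25, 400), (45, 350)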
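The construction can be sketched in Python as below, under stated assumptions: the region is age 0 to 100 and salary 0 to 400 (in $K), a block holds two records, and a point exactly on a dividing line goes east if it is on the vertical line and south if it is on the horizontal line. That tie-breaking rule is chosen only so that the result groups the points as in this example; the function name and dictionary layout are hypothetical.

```python
CAPACITY = 2   # records per block

CUSTOMERS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]


def build(points, x_lo, x_hi, y_lo, y_hi):
    """Return a leaf (list of points) or an interior node (dict with 4 children)."""
    if len(points) <= CAPACITY:
        return points
    cx, cy = (x_lo + x_hi) / 2, (y_lo + y_hi) / 2   # always split at the center
    quads = {"SW": [], "SE": [], "NW": [], "NE": []}
    for x, y in points:
        # Assumed tie-breaking: x == cx goes east, y == cy goes south.
        quads[("N" if y > cy else "S") + ("E" if x >= cx else "W")].append((x, y))
    return {
        "center": (cx, cy),
        "SW": build(quads["SW"], x_lo, cx, y_lo, cy),
        "SE": build(quads["SE"], cx, x_hi, y_lo, cy),
        "NW": build(quads["NW"], x_lo, cx, cy, y_hi),
        "NE": build(quads["NE"], cx, x_hi, cy, y_hi),
    }


tree = build(CUSTOMERS, 0, 100, 0, 400)
```

The sketch would recurse without end if more than two records shared the same coordinates; a real implementation needs overflow handling for that case.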

R-Trees (1)
R-tree (region tree)
  Represents data that consists of n-dimensional regions, which we call "data regions"
  An interior node of an R-tree corresponds to some "interior region" (or simply, region)
  The region can be of any shape; in practice it is usually a rectangle
Useful queries
  "Where-am-I" queries

R-Trees (2)
Subregions
  Do not need to cover the entire region
  Are allowed to overlap

Operation Example (1)
pop is newly inserted, so the leaf is split
  Leaf region ((0,0), (60,50)): road1, road2, house1
  Leaf region ((20,20), (100,80)): school, house2, pipeline, pop
[Figure: the map with road1, road2, house1, house2, pipeline, school, and pop, with the two leaf regions outlined]

Operation Example (2)
Now, house3 is inserted
  Neither leaf region wholly contains it, so a region must be expanded
  Strategy: expand regions as little as possible
  Split nodes if necessary
[Figure: the map with road1, road2, house1, house2, pipeline, school, pop, and house3]
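The "expand as little as possible" rule and a where-am-I lookup can be sketched as follows. The two leaf regions are the ones given in the previous example; the coordinates of house3 are hypothetical, since the slide does not give them, and measuring expansion by added area is one common choice rather than something the slide prescribes. All function names are made up.

```python
# A rectangle is ((x_low, y_low), (x_high, y_high)).
LEAVES = {
    "leaf1": ((0, 0), (60, 50)),
    "leaf2": ((20, 20), (100, 80)),
}
HOUSE3 = ((70, 5), (80, 15))   # hypothetical coordinates, for illustration only


def area(r):
    (x1, y1), (x2, y2) = r
    return (x2 - x1) * (y2 - y1)


def enlarge(r, s):
    """Smallest rectangle that contains both r and s."""
    (rx1, ry1), (rx2, ry2) = r
    (sx1, sy1), (sx2, sy2) = s
    return ((min(rx1, sx1), min(ry1, sy1)), (max(rx2, sx2), max(ry2, sy2)))


def added_area(region, rect):
    return area(enlarge(region, rect)) - area(region)


def choose_leaf(leaves, rect):
    """Pick the leaf whose region needs the least added area to hold rect."""
    return min(leaves, key=lambda name: added_area(leaves[name], rect))


def where_am_i(leaves, point):
    """Where-am-I: every leaf region containing the point (regions may overlap)."""
    x, y = point
    return [name for name, ((x1, y1), (x2, y2)) in leaves.items()
            if x1 <= x <= x2 and y1 <= y <= y2]
```

With the assumed coordinates, expanding the first region adds 1000 units of area and expanding the second adds 1200, so `choose_leaf(LEAVES, HOUSE3)` selects the first.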

Bitmap Indexes Example
Six records
  1. (30, foo), 2. (30, bar), 3. (40, baz), 4. (50, foo), 5. (40, bar), 6. (30, baz)
One vector for each value, one bit for each record
  Value  Vector
  foo    100100
  bar    010010
  baz    001001
Space requirements
  Total # of bits = # of records × # of values
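Building such an index takes only a few lines. The sketch below represents each bit-vector as a string of '0' and '1' characters for readability; a real index would pack the bits. The function name is hypothetical.

```python
RECORDS = [(30, "foo"), (30, "bar"), (40, "baz"),
           (50, "foo"), (40, "bar"), (30, "baz")]


def build_bitmap_index(records, field):
    """Return {value: bit-vector}; bit i is '1' iff record i+1 has that value."""
    index = {}
    for pos, record in enumerate(records):
        bits = index.setdefault(record[field], ["0"] * len(records))
        bits[pos] = "1"
    return {value: "".join(bits) for value, bits in index.items()}


string_index = build_bitmap_index(RECORDS, 1)   # index on the second field
number_index = build_bitmap_index(RECORDS, 0)   # index on the first field
```

Each index holds (# of records) × (# of values) bits: 6 × 3 = 18 bits for either field here.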

Bitmap Indexes
A bitmap index for a field F is a collection of bit-vectors of length n, one for each possible value that may appear in the field F
Can use bitwise operations
Big space requirements → compression
Usefulness
  Provided we can find the ith record easily for any i, partial-match queries are answered efficiently

Example 5.21
Schema and query
  Movie(title, year, length, studioName)
  SELECT title
  FROM Movie
  WHERE studioName = 'Disney' AND year = 1995;
Bitmap indexes on both studioName and year
Take the bitwise AND of the vectors for year = 1995 and studioName = 'Disney'
The positions of the 1's give the tuple numbers, from which we find the tuples

Range Queries with Bitmaps (1)
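"Gold jewelry" example (records numbered in the order the twelve customers were listed)
Index for age
  25: 100000001000
  30: 000000010000
  45: 010000000100
  50: 001110000010
  60: 000000000001
  70: 000001000000
  85: 000000100000
Index for salary
  60: 110000000000
  75: 001000000000
  100: 000100000000
  110: 000001000000
  120: 000010000000
  140: 000000100000
  260: 000000010001
  275: 000000000010
  350: 000000000100
  400: 000000001000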

Range Queries with Bitmaps (2)
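Find the jewelry buyers with age and salary in given ranges
Find the bit-vectors for the ages in the range
  Two values (45 and 50); take their bitwise OR
  010000000100 OR 001110000010 = 011110000110
Find the bit-vectors for the salaries in the range
  Four values (100, 110, 120, and 140); take the bitwise OR of all of them
  We get 000111100000
The last step is a bitwise AND
  011110000110 AND 000111100000 = 000110000000
So, the fourth and fifth records are in the desired range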
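The OR-then-AND procedure can be sketched in Python, using integers as bit-vectors. The ranges used here, age 45 to 55 and salary 100 to 200, are an assumption: they are one choice that selects exactly two age values and four salary values as in this example. Records are numbered from 1 in the order the twelve customers are listed, and the function names are hypothetical.

```python
from functools import reduce

CUSTOMERS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
N = len(CUSTOMERS)


def build_index(field):
    """{value: int bit-vector}; record 1 is the most significant of the N bits."""
    index = {}
    for pos, record in enumerate(CUSTOMERS):
        index[record[field]] = index.get(record[field], 0) | (1 << (N - 1 - pos))
    return index


def or_range(index, lo, hi):
    """Bitwise OR of the vectors of every indexed value within [lo, hi]."""
    return reduce(lambda a, b: a | b,
                  (vec for value, vec in index.items() if lo <= value <= hi), 0)


age_index, salary_index = build_index(0), build_index(1)

ages = or_range(age_index, 45, 55)            # assumed age range
salaries = or_range(salary_index, 100, 200)   # assumed salary range
result = ages & salaries

matches = [pos + 1 for pos in range(N) if (result >> (N - 1 - pos)) & 1]
print(format(result, f"0{N}b"), matches)
```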

Compressed Bitmaps
General statistics
  Total # of records: n
  Total # of different values: m
  Block size: 4096 bytes (32768 bits)
  Total bits: mn
  Total blocks: mn / 32768
  The probability that any bit is 1: 1/m

Compressed Bitmaps
Run-length encoding
  If 1's are rare, we have an opportunity to encode the bit-vector compactly
  A run is a sequence of i 0's followed by a 1; it is represented by some suitable binary encoding of the integer i
  The codes for the runs are concatenated
  The code must also tell us how long each run's encoding is (see next slide)

Binary numbers won't serve as a run-length encoding
000101: two runs (3 and 1; the codes 11 and 1 produce 111)
010001: two runs (1 and 3; the codes 1 and 11 produce 111)
010101: three runs (1, 1, and 1; the codes 1, 1, and 1 produce 111)
All three bit-vectors produce 111, so decoding is impossible

Compressed Bitmaps: Another Scheme
First determine how many bits the binary representation of i has (call it j)
  j is approximately log2 i
Code for a run: (j - 1) 1's + a single 0 + i in binary
Examples
  i = 13
    Then j = 4
    Binary representation of i: 1101
    Unary prefix for j: 1110
    Code for the run: 11101101
  If i = 1, the code for the run is 01
  If i = 0, the code for the run is 00
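The encoding rule can be written directly in Python. This is a small sketch with bit strings for readability; the function names are hypothetical.

```python
def encode_run(i):
    """Code for a run of i 0's followed by a 1: (j - 1) 1's, a 0, then i in binary."""
    binary = format(i, "b")          # "0" for i = 0, so j = 1 in that case
    j = len(binary)
    return "1" * (j - 1) + "0" + binary


def encode_vector(bits):
    """Run-length encode a bit string; trailing 0's after the last 1 are dropped."""
    codes, run = [], 0
    for bit in bits:
        if bit == "1":
            codes.append(encode_run(run))
            run = 0
        else:
            run += 1
    return "".join(codes)


print(encode_run(13), encode_run(1), encode_run(0))
```

The unary prefix is what makes the code decodable: it announces how many bits of binary follow, so the boundary between one run's code and the next is never ambiguous.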

Decoding Example (1)
The sequence to decode: 11101101001011
Step 1
  Finding the first 0 at the 4th bit, so j = 4
  11101101 (1110 + 1101) is the first run, which is 13
Step 2
  Remaining sequence: 001011
  Finding the first 0 at the first bit, so j = 1
  00 (0 + 0) is the second run, which means 0

Decoding Example (2)
Step 3
  Remaining sequence: 1011
  Finding the first 0 at the second bit, so j = 2
  1011 (10 + 11) is the third run, which means 3
Result
  The entire sequence of run lengths is thus 13, 0, 3, which means 0000000000000110001
Any trailing 0's are not recovered (WHY ???)
  We can deduce the # of trailing 0's if we know the # of records
  But we do not need them: a 0 only says that the record is not in the set
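The three decoding steps generalize to a short loop. The sketch below assumes the input is a well-formed code string (it does not validate its input), and the function names are hypothetical.

```python
def decode_runs(code):
    """Split an encoded bit string into its run lengths."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":      # count the unary prefix: (j - 1) 1's
            j += 1
            pos += 1
        pos += 1                     # skip the single 0 that ends the prefix
        runs.append(int(code[pos:pos + j], 2))   # the next j bits are i in binary
        pos += j
    return runs


def runs_to_bits(runs, n=None):
    """Rebuild the bit-vector; pad with trailing 0's if the record count n is known."""
    bits = "".join("0" * run + "1" for run in runs)
    return bits if n is None else bits.ljust(n, "0")


print(decode_runs("11101101001011"))
```

Passing the number of records as `n` restores the trailing 0's, for example `runs_to_bits([0, 7], 12)` for a 12-record file.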

Encoding Example (gold-jewelry)
Indexes for the first three ages, 25, 30, and 45, are
  100000001000, 000000010000, 010000000100
The run-length sequences for the indexes
  (0, 7), (7), (1, 7)
Encoded bit sequences
  00110111, 110111, 01110111

Operating on Run-Length-Encoded Bit-Vectors
Step 1: Decode the encoded bit-vectors
Step 2: Operate on the original bit-vectors
We do not have to do the decoding all at once

Operation Example (OR operation)
In the "gold-jewelry" example
  Two encoded bit-vectors, for ages 25 and 30: 00110111 and 110111
Steps
  Decode the first run of each: 0 and 7, respectively
  The first 1's in the original bit-vectors are at positions 1 and 8
  Generate a 1 in position 1 of the output
  Next, decode the next run for age 25, since it may produce another 1 before position 8
  Repeat the steps above
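The idea of decoding one run at a time while computing the OR can be sketched with a Python generator. The output here is the list of positions that hold a 1 in the result, which is one possible representation; a fuller implementation would re-encode them. Well-formed input is assumed and the function names are hypothetical.

```python
def one_positions(code):
    """Yield the 1-based positions of the 1 bits, decoding a single run per step."""
    pos, last = 0, 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":      # unary prefix gives the length j
            j += 1
            pos += 1
        pos += 1                     # skip the 0 that ends the prefix
        run = int(code[pos:pos + j], 2)
        pos += j
        last += run + 1              # run 0's, then the 1 itself
        yield last


def or_encoded(a, b):
    """Positions of the 1's in (a OR b), advancing whichever vector is behind."""
    ia, ib = one_positions(a), one_positions(b)
    pa, pb = next(ia, None), next(ib, None)
    out = []
    while pa is not None or pb is not None:
        if pb is None or (pa is not None and pa < pb):
            out.append(pa)
            pa = next(ia, None)
        elif pa is None or pb < pa:
            out.append(pb)
            pb = next(ib, None)
        else:                        # both vectors have a 1 at the same position
            out.append(pa)
            pa, pb = next(ia, None), next(ib, None)
    return out


print(or_encoded("00110111", "110111"))   # ages 25 and 30
```

For ages 25 and 30 the 1's of the result are at positions 1, 8, and 9: customers 1 and 9 are 25 years old and customer 8 is 30.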

Managing Bitmap Indexes
Three important issues
  Finding bit-vectors
  Finding records
  Handling modifications to the data file

How to Find Bit-Vectors
Think of each bit-vector as a record
  Its key is the value corresponding to this bit-vector
  A secondary index technique (e.g., a B-tree) can be used efficiently
  A B-tree easily supports range queries
How to store bit-vectors
  Treat them as variable-length records

How to Find Records
Think of the kth record as having search-key value k (although this key does not actually appear in the record)

Handling Modifications to the Data File
Two aspects of the problem
  Record numbers must remain fixed once assigned
  Changes to the data file require the bitmap index to change as well
Deletion, insertion, and modification
  Read the textbook for details
